I am using REAPR software to assess correctness of my de novo genome assembly from HiSeq4000 Illumina reads - this software simply compares theoretical predictions of uniform coverage to actual observed coverage based on mapping component reads back to the inferred assembly. The coverage Vs GC plot from this mapping is shown at this link - https://ibb.co/k3Efvb
This means there is GC-bias in coverage exists, but is very minimal because the slope (solid black line) is almost zero, correct? I find it surprising that average fragment coverage here is < 5X, whereas crude per-nucleotide coverage, based on Lander/Waterman equation C = LN/G =~ 90X!
If indeed there is some GC bias contributing to coverage depth differences, then is it beyond what is to be expected of my data (haploid, fungal spores) or any data of this type? I ask because there are ~ 90K errors reported by REAPR where observed fragment coverage is ZERO, where the assembly should be, but is NOT broken or gapped at these positions! These are obvious mis-joints, right?
Is it known how and why GC-content dependent coverage bias can result in mis-assemblies during de novo assembly? I've tried a5, a5miseq, ABySS, SPAdes with different pre-processing variations - but to no avail, REAPR always reports about 90K errors.
Related to this potential GC-bias introduced-error, are there other software such as deep Tools' computeGCBias and correctGCBias and GCcorrect, for my purposes? And what is the general consensus about the need and efficacy of such GC bias correction steps / tools - yes, no or dependent on data?
And finally, if it is recommended, at what stage of my pre-processing is this suggested with respects to other steps such as adapter-removal, quality trimming, length filtering, filtering tile, Error Correction etc? THANKS!