Question: Assess and correct any GC bias in Illumina HiSeq 4000 data
Anand Rao wrote, 21 months ago (United States):

I am using REAPR software to assess the correctness of my de novo genome assembly from HiSeq 4000 Illumina reads - this software simply compares theoretical predictions of uniform coverage against the coverage actually observed when the component reads are mapped back to the inferred assembly. The coverage vs. GC plot from this mapping is at this link: https://ibb.co/k3Efvb

This means GC bias in coverage exists, but it is minimal, because the slope (solid black line) is almost zero - correct? I find it surprising that the average fragment coverage here is < 5X, whereas crude per-nucleotide coverage, based on the Lander/Waterman equation C = LN/G, is ~90X!
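The Lander/Waterman back-of-the-envelope figure is easy to check directly. The read length, read count, and genome size below are hypothetical placeholders (the post does not give them), chosen only to show how a ~90X estimate arises; note also that REAPR's "fragment coverage" counts only properly paired, uniquely mapped fragments, which may partly explain why it sits far below raw per-nucleotide coverage:

```python
# Lander/Waterman expected per-nucleotide coverage: C = L * N / G
# L, N, G below are hypothetical example values, NOT the poster's actual numbers.
def lander_waterman_coverage(read_length: int, num_reads: int, genome_size: int) -> float:
    """Expected per-nucleotide coverage C = L*N/G."""
    return read_length * num_reads / genome_size

# Example: 150 bp HiSeq reads, 24M reads, 40 Mb fungal genome
C = lander_waterman_coverage(read_length=150, num_reads=24_000_000, genome_size=40_000_000)
print(f"{C:.1f}X")  # 90.0X
```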

If there is indeed some GC bias contributing to coverage-depth differences, is it beyond what is expected for my data (haploid fungal spores), or for any data of this type? I ask because REAPR reports ~90K errors where observed fragment coverage is ZERO, yet the assembly is NOT broken or gapped at those positions! These are obvious mis-joins, right?

Is it known how and why GC-content-dependent coverage bias can result in mis-assemblies during de novo assembly? I've tried a5, a5miseq, ABySS, and SPAdes with different pre-processing variations, but to no avail - REAPR always reports about 90K errors.

Related to this potential GC-bias-introduced error: are there other software tools, such as deepTools' computeGCBias and correctGCBias, or GCcorrect, suited to my purposes? And what is the general consensus on the need for, and efficacy of, such GC-bias correction steps/tools - yes, no, or dependent on the data?

And finally, if correction is recommended, at what stage of my pre-processing is it suggested, with respect to other steps such as adapter removal, quality trimming, length filtering, tile filtering, error correction, etc.? THANKS!

Brian Bushnell wrote, 21 months ago (Walnut Creek, USA):

For my suggested order of preprocessing prior to assembly, see BBMap's example shellscript in /bbmap/pipelines/assemblyPipeline.sh.
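The general shape of that pipeline can be sketched as follows. This is NOT the literal contents of assemblyPipeline.sh - file names are hypothetical placeholders, and the canonical steps and flags are in /bbmap/pipelines/assemblyPipeline.sh - but it illustrates the usual order: adapter trimming, then quality/length filtering, then error correction, then assembly:

```shell
# Sketch only; consult /bbmap/pipelines/assemblyPipeline.sh for the real steps.
# Input/output file names here are hypothetical placeholders.

# 1. Adapter trimming (right-end kmer trimming against BBDuk's bundled adapters)
bbduk.sh in1=r1.fq.gz in2=r2.fq.gz out1=trim1.fq.gz out2=trim2.fq.gz \
    ref=adapters ktrim=r k=23 mink=11 hdist=1 tbo tpe

# 2. Quality trimming and length filtering
bbduk.sh in1=trim1.fq.gz in2=trim2.fq.gz out1=filt1.fq.gz out2=filt2.fq.gz \
    qtrim=rl trimq=10 minlen=70

# 3. Kmer-based error correction
tadpole.sh in=filt1.fq.gz in2=filt2.fq.gz out=ecc1.fq.gz out2=ecc2.fq.gz mode=correct

# 4. Assembly (e.g. SPAdes)
spades.py -1 ecc1.fq.gz -2 ecc2.fq.gz -o spades_out
```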

I don't think you can usefully correct GC bias. It's simply there, and in some cases it causes holes in your assembly due to insufficient coverage. A co-worker of mine developed some tools to analyze and correct GC bias, but they did not improve assemblies, presumably because all they can do is downsample, and what's really needed is more coverage in the underrepresented areas.
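The "all they can do is downsample" point can be made concrete with a toy example (the per-bin coverages are entirely made up): a downsampling corrector can flatten the coverage-vs-GC profile only by discarding reads from overrepresented bins, so the corrected depth everywhere drops to that of the worst bin, and the high-GC hole is never filled:

```python
# Hypothetical mean coverage per GC-content bin (made-up numbers for illustration).
observed = {"0-20%": 110.0, "20-40%": 95.0, "40-60%": 90.0, "60-80%": 40.0, "80-100%": 12.0}

# A downsampling-style corrector equalizes bins by discarding reads from
# overrepresented bins: scale each bin down to the worst bin's coverage.
target = min(observed.values())
corrected = {gc: cov * (target / cov) for gc, cov in observed.items()}

# The bias is gone, but so is depth: every bin now sits at 12X, and the
# high-GC region still lacks the extra reads an assembler would need.
print(f"corrected coverage everywhere: {target}X")
```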


Indeed, Brian, I have tried your assemblyPipeline.sh steps, in addition to other combinations of assemblers and pre-processing steps.

Each time, I obtain REAPR results indicating mis-assemblies and mis-joins that should really be gaps - on the order of 80-90K for each de novo assembly protocol I tested: a5, a5miseq, SPAdes, ABySS, and your pipeline with more involved pre-processing.

So my problem appears to be the reverse of what should be expected? Rather than holes in my assembly, several of these expected holes are incorrectly joined together, and REAPR is calling out these errors? That is very strange, if I am not mistaken!

- Anand Rao, 21 months ago

Well, there are two possibilities: either REAPR is wrong, or preprocessing is causing more misassemblies. Either is possible. In general, overly aggressive preprocessing can cause more misassemblies while simultaneously increasing contiguity, and finding the right balance can be difficult. The steps I listed tend to be beneficial for bacterial isolates and metagenomes. I have not tested them on haploid fungi, but... how do you sequence a fungal spore, anyway? Does it have just one copy of DNA that needs to be heavily amplified with MDA, or do you culture it, or what? Some preprocessing does not play well with heavy amplification.

When qualifying a pipeline, I generally recommend starting with data for which you know the correct answer - e.g., first sequence something with a finished reference and optimize the pipeline against it, then run your real sample through the optimized process and hope for the best, since it is very difficult to confidently call misassemblies without a reference.

- Brian Bushnell, 20 months ago
Powered by Biostar version 2.3.0