Question: Assess and correct any GC bias in Illumina HiSeq 4000 data
gravatar for Anand Rao
3.0 years ago by
Anand Rao320
United States
Anand Rao320 wrote:

I am using REAPR software to assess correctness of my de novo genome assembly from HiSeq4000 Illumina reads - this software simply compares theoretical predictions of uniform coverage to actual observed coverage based on mapping component reads back to the inferred assembly. The coverage Vs GC plot from this mapping is shown at this link -

This means there is GC-bias in coverage exists, but is very minimal because the slope (solid black line) is almost zero, correct? I find it surprising that average fragment coverage here is < 5X, whereas crude per-nucleotide coverage, based on Lander/Waterman equation C = LN/G =~ 90X!

If indeed there is some GC bias contributing to coverage depth differences, then is it beyond what is to be expected of my data (haploid, fungal spores) or any data of this type? I ask because there are ~ 90K errors reported by REAPR where observed fragment coverage is ZERO, where the assembly should be, but is NOT broken or gapped at these positions! These are obvious mis-joints, right?

Is it known how and why GC-content dependent coverage bias can result in mis-assemblies during de novo assembly? I've tried a5, a5miseq, ABySS, SPAdes with different pre-processing variations - but to no avail, REAPR always reports about 90K errors.

Related to this potential GC-bias introduced-error, are there other software such as deep Tools' computeGCBias and correctGCBias and GCcorrect, for my purposes? And what is the general consensus about the need and efficacy of such GC bias correction steps / tools - yes, no or dependent on data?

And finally, if it is recommended, at what stage of my pre-processing is this suggested with respects to other steps such as adapter-removal, quality trimming, length filtering, filtering tile, Error Correction etc? THANKS!

ADD COMMENTlink modified 3.0 years ago by Brian Bushnell17k • written 3.0 years ago by Anand Rao320
gravatar for Brian Bushnell
3.0 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

For my suggested order of preprocessing prior to assembly, I suggest you look at BBMap's example shellscript in /bbmap/pipelines/

I don't think you can usefully correct GC bias. It's just there and can cause holes in your assembly due to insufficient coverage in some cases. A co-worker of mine developed some tools to analyze and correct GC-bias, but it did not improve assemblies, presumably because all they can do is downsample and what's really needed is more coverage in the underrepresented areas.

ADD COMMENTlink written 3.0 years ago by Brian Bushnell17k

Indeed, Brian, I have tried your steps apart from other combinations of (assemblers+pre-processing steps).

Each time, I obtain REAPR results that indicate mis-assemblies and mis-joints that should really be gaps - on the order of 80-90K, for each of my tested de novo assembly protocol - with a5, a5miseq, SPAdes, ABySS, and your pipeline with more involved pre-processing.

So my problem appears to be the reverse of what should be expected? Rather than holes in my assembly, several of these expected holes are joined together incorrectly, and REAPR is calling out these errors? That is very strange, if I am not mistaken!

ADD REPLYlink written 3.0 years ago by Anand Rao320

Well, there are two possibilities... either REAPER is wrong, or preprocessing is causing more misassemblies. Either is possible. In general overly-aggressive preprocessing can cause more misassemblies, while simultaneously increasing continuity, and finding the right balance can be difficult. The steps I listed I tend to be beneficial in bacterial isolates and metagenomes. I have not tested them on haploid fungi, but... how do you sequence a fungal spore, anyway? Does it have just one copy of DNA that needs to be heavily amplified with MDA, or do you culture it, or what? Some preprocessing does not play well with heavy amplification.

I generally recommend when qualifying a pipeline, to start with data for which you know the correct answer - e.g., first sequence something with a finished reference, optimize the pipeline, then run your real sample through the optimized process and hope for the best since it is very difficult to confidently call misassemblies without a reference.

ADD REPLYlink written 3.0 years ago by Brian Bushnell17k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1541 users visited in the last hour