Question: How do people know to use at least 30X coverage in WGS?
3.5 years ago by
United States
DVA530 wrote:


I have heard many times, from different sources, that if I'm doing WGS for SNV detection, I had better have >=30X coverage after removing duplicates. Out of curiosity, how did scientists arrive at this coverage?

Did some studies compare the results for one sample at 100X coverage (or some coverage deep enough to serve as a standard) with 30X for the same individual, and conclude that 30X does just as good a job? Thanks a lot.

sequencing coverage wgs • 5.7k views
modified 3.4 years ago by Brian Bushnell 17k • written 3.5 years ago by DVA 530

What Is Considered A Good Coverage Depth In Exon Capture Seq

written 3.5 years ago by genomax 80k

Thank you very much.

written 3.5 years ago by DVA 530

In my opinion it really depends on what your research question is. If it is disease- or clinic-related, you want to be sure that a variant is really there, and you don't want the hassle of validating variants with Sanger sequencing, so 30X is relatively good coverage. Usually a heterozygosity rate of <75% is used, so that would mean that at least 7 reads are needed to call a variant in a 30x-covered piece of genome... For a longer discussion, see also this post: What Is Considered A Good Coverage Depth In Exon Capture Seq

written 3.5 years ago by Floris Brenk 900
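
As a rough check on the "7 reads at 30x" figure in the reply above, here is a minimal sketch. It assumes the number of variant-supporting reads at a truly heterozygous site follows Binomial(depth, 0.5), and it ignores sequencing error, mapping bias, and site-to-site variation in depth. The function name `prob_at_least` and the depths tried are illustrative choices, not something from the thread.

```python
from math import comb


def prob_at_least(k_min: int, depth: int, p: float = 0.5) -> float:
    """Return P(X >= k_min) for X ~ Binomial(depth, p)."""
    return sum(
        comb(depth, k) * p**k * (1 - p) ** (depth - k)
        for k in range(k_min, depth + 1)
    )


if __name__ == "__main__":
    # Chance that a true heterozygous site shows at least 7 variant reads
    # at several local depths (assumed values, for illustration only).
    for depth in (10, 15, 20, 30):
        print(f"{depth}x: {prob_at_least(7, depth):.4f}")
```

Under these assumptions a 7-read threshold is almost always met at a local depth of 30 but is often missed at a local depth of 10, which is one way to see why the extra margin matters.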

Thanks a lot for the reply:)

written 3.5 years ago by DVA 530

Here's a more recent analysis of sensitivity vs. read depth for WGS and WXS:

written 3.5 years ago by harold.smith.tarheel 4.5k

Thank you for the information

written 3.5 years ago by DVA 530

Another reference, about advised coverage in exome sequencing:

We estimate a local read depth of 13X is required to detect the alleles and genotype of a heterozygous SNV 95% of the time, but only 3X for a homozygous SNV. At a mean on-target read depth of 20X, commonly used for rare disease exome sequencing studies, we predict 5-15% of heterozygous and 1-4% of homozygous SNVs in the targeted regions will be missed.

But, as others have said, it really depends on what you are doing: de novo sequencing or resequencing, short or long reads, CNV detection or SNP detection, research or diagnostics, and so on.

written 3.5 years ago by WouterDeCoster 43k
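
The quoted passage separates local depth from mean depth, and that distinction can be sketched numerically. The following is only a toy model: it assumes local depth is Poisson-distributed around the mean (real data are usually more dispersed), that each read at a heterozygous site carries the variant with probability 0.5, and that a call needs at least `min_alt_reads` variant reads. The threshold of 3 is hypothetical, so the output is not expected to reproduce the 13X or 5-15% figures from the quoted study.

```python
from math import comb, exp, sqrt


def binom_tail(k_min: int, n: int, p: float = 0.5) -> float:
    """P(X >= k_min) for X ~ Binomial(n, p), computed via the complement."""
    below = sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(min(k_min, n + 1))
    )
    return max(0.0, 1.0 - below)


def het_sensitivity(mean_depth: float, min_alt_reads: int = 3) -> float:
    """Expected fraction of heterozygous sites with enough variant reads,
    assuming local depth ~ Poisson(mean_depth)."""
    # Truncate the Poisson sum far out in the upper tail.
    max_depth = int(mean_depth + 10 * sqrt(mean_depth) + 20)
    pmf = exp(-mean_depth)  # P(depth = 0)
    total = 0.0
    for depth in range(max_depth + 1):
        total += pmf * binom_tail(min_alt_reads, depth)
        pmf *= mean_depth / (depth + 1)  # advance to P(depth + 1)
    return total


if __name__ == "__main__":
    for mean_depth in (10, 20, 30):
        print(f"mean {mean_depth}x: {het_sensitivity(mean_depth):.5f}")
```

The point of the sketch is that sensitivity at a given mean depth is an average over many local depths, so some sites fall below the local depth needed even when the mean looks comfortable.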

I do SNP detection. Thanks a lot for the information.

written 3.5 years ago by DVA 530
3.4 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

The necessary coverage depends on the platform and run mode, too. Illumina's newer NextSeq platform, for example, has much lower quality and much less accurate quality scores than their top-quality MiSeq platform, as well as shorter reads. All three of those factors influence how much coverage is needed to accurately call variants. WGS needs lower coverage than exon-capture, though, because it has less bias. Using a NextSeq instead of a HiSeq/MiSeq might double your coverage target; and exon-capture might triple it.

Additionally, Illumina's newer software versions with quantized quality scores are simply not very good for calling variants, which would again increase the necessary coverage for a given confidence level. It's possible to recalibrate the quality scores which will restore the full quality-score range and thus make it possible to more-accurately distinguish SNVs from sequencing error, reducing the necessary coverage, but it's better to just select a platform that does not quantize quality scores in the first place. The newer 2-dye chemistries also seem to decrease quality, and patterned flow-cells decrease average insert size (longer inserts help resolve repeats), so the newer platforms with 2-dye chemistry or patterned flow-cells need more coverage for accurate variant calling.
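
To illustrate what quantization does to the information in quality scores, here is a small sketch. The bin table is hypothetical and chosen only for illustration; it is not any vendor's actual binning scheme. The Phred relation (error probability = 10^(-Q/10)) is the standard definition.

```python
# Hypothetical bins as (lowest Q, highest Q, reported Q).
# These are illustrative values, not a real instrument's table.
HYPOTHETICAL_BINS = [(0, 9, 6), (10, 19, 15), (20, 29, 25), (30, 41, 37)]


def quantize(q: int) -> int:
    """Map a full-resolution quality score to its bin's reported value."""
    for low, high, reported in HYPOTHETICAL_BINS:
        if low <= q <= high:
            return reported
    raise ValueError(f"quality {q} is outside the binned range")


def phred_to_error(q: float) -> float:
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    for q in (20, 29):
        print(f"Q{q}: error {phred_to_error(q):.5f}, reported as Q{quantize(q)}")
```

With these assumed bins, Q20 and Q29 are reported identically even though their error probabilities differ by nearly a factor of eight, which is the kind of resolution a variant caller loses.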

I'm currently evaluating some NextSeq data from a fungus with 120x coverage. Some of the SNPs are present in 97% of reads; it's pretty obvious they are real. Some are present in 1 read only; they appear to be sequencing errors. Some are present in around 25% of reads, with a kind of low average quality score. I'm really not sure about those - are they real? Sequencing error? A collapsed 4-copy repeat in the assembly? If this were MiSeq or HiSeq 2500 data, it would be obvious. But with current NextSeq data, the lowest possible quality score is 14, which indicates over 95% confidence that the call is correct. I have no idea what they are. Other variants are scattered across the whole coverage scale, between 2x and 120x; with inaccurate calls and quality scores, it's impossible to accurately call any variants or their ploidy unless you do massive oversequencing, and 30x would absolutely not be sufficient for a haploid, let alone a diploid.

written 3.4 years ago by Brian Bushnell 17k
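
The numbers in the answer above (Q14, 120x, a variant in about 25% of reads) can be put side by side with a short calculation. This is a sketch under a strong assumption: errors are independent between reads and every base has exactly the error probability implied by Q14. It also uses the total error rate as a generous upper bound, although only errors to the same base would mimic a variant.

```python
from math import comb


def phred_to_error(q: float) -> float:
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def binom_tail(k_min: int, n: int, p: float) -> float:
    """P(X >= k_min) for X ~ Binomial(n, p), summed directly over the tail."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(k_min, n + 1)
    )


if __name__ == "__main__":
    depth = 120
    error = phred_to_error(14)  # about 0.04, i.e. roughly 96% confidence
    print(f"Q14 error probability: {error:.4f}")
    print(f"expected erroneous reads at {depth}x: {depth * error:.1f}")
    # 25% of 120 reads is 30 reads carrying the same non-reference base.
    print(f"P(>= 30 errors by chance): {binom_tail(30, depth, error):.2e}")
```

If errors really were independent, about 5 of 120 reads would be wrong at such a site and 30 would be vanishingly unlikely, so a 25% signal with low quality scores suggests something other than random error, such as a real variant, a collapsed repeat, or errors that are correlated across reads.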

Illumina has been talking about quality binning for over six years. I have seen multiple lines of evidence from Illumina, Broad, Sanger and my own experiments that quality binning has little to do with the quality of variant calling for human data. I rarely work with NextSeq data. I did hear complaints about its data quality from time to time, but I also know people can make variant calls of acceptable quality.

Some are present in around 25% of reads, with a kind of low average quality score.

They may be caused by systematic sequencing errors. HiSeq X10 is getting worse in this respect. NextSeq may be even worse. A good heuristic is to ignore low-quality bases (e.g. below Q20). Correlated errors tend to have lower base quality. GATK folks told me this ~8 years ago and I think they are right.

modified 3.4 years ago • written 3.4 years ago by lh3 32k
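
The "ignore bases below Q20" heuristic from the reply above can be sketched as a filter applied before allele counting. The pileup here is made-up data represented as (base, quality) pairs, and `allele_counts` is a hypothetical helper, not a function from any variant caller.

```python
from collections import Counter


def allele_counts(pileup, min_base_quality: int = 20) -> Counter:
    """Count bases at one site, skipping bases below the quality cutoff."""
    return Counter(
        base for base, quality in pileup if quality >= min_base_quality
    )


if __name__ == "__main__":
    # Made-up site: 9 high-quality reference bases, 3 low-quality variant bases.
    pileup = [("A", 36)] * 9 + [("G", 14)] * 3
    print("unfiltered:", dict(allele_counts(pileup, min_base_quality=0)))
    print("Q20 filter:", dict(allele_counts(pileup)))
```

In this toy site the apparent 25% variant disappears once low-quality bases are ignored, which is the behaviour the heuristic relies on when correlated errors tend to carry lower base qualities.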