Question

How do people know to use at least 30X coverage in WGS?

3

7.5 years ago

DVA ▴ 630

Hello,

I have heard many times, from different sources, that if I'm doing WGS for SNV detection, I should have >=30X coverage after removing duplicates. Out of curiosity, how did scientists arrive at this number?

Did some studies compare the results from one sample at 100X coverage (or some coverage deep enough to serve as a standard) with 30X coverage of the same individual, and conclude that 30X does just as good a job? Thanks a lot.

sequencing wgs coverage • 9.6k views

updated 7.4 years ago by Brian Bushnell 20k • written 7.5 years ago by DVA ▴ 630

2

What Is Considered A Good Coverage Depth In Exon Capture Seq
https://www.ncbi.nlm.nih.gov/pubmed/18987734

7.5 years ago by GenoMax 141k

0

Thank you very much.

7.5 years ago by DVA ▴ 630

1

In my opinion, it really depends on what your research question is. If it is disease- or clinically related, you want to be sure that a variant is really there, and you don't want the hassle of validating variants with Sanger sequencing, so 30X is relatively good coverage. Usually a heterozygosity cutoff of <75% is used (i.e. the variant allele is seen in about a quarter of the reads or more), which means that at least 7 reads are needed to call a variant in a 30x-covered piece of genome (30 × 0.25 ≈ 7.5). For a longer discussion, see also this post: What Is Considered A Good Coverage Depth In Exon Capture Seq
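To put the 7-read figure in context, here is a rough sketch of how often a true heterozygous site would yield at least 7 variant-supporting reads at different depths. It assumes an idealized binomial model (each read carries the variant allele with probability 0.5, no sequencing error, no mapping bias); the function name and the fixed requirement of 7 reads are only for illustration, not what any particular variant caller does.

```python
import math


def prob_at_least_k_alt_reads(depth: int, k: int, alt_fraction: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(depth, alt_fraction).

    Assumption: reads sample the two alleles independently and without error.
    """
    return sum(
        math.comb(depth, i) * alt_fraction**i * (1 - alt_fraction) ** (depth - i)
        for i in range(k, depth + 1)
    )


if __name__ == "__main__":
    # Fixed requirement of 7 supporting reads, evaluated at several local depths.
    for depth in (10, 15, 20, 30):
        p = prob_at_least_k_alt_reads(depth, k=7)
        print(f"{depth}x: P(>= 7 variant reads at a het site) = {p:.4f}")
```

Under these assumptions the probability is close to 1 at 30x but drops off quickly at lower depths, which is one way to see why a mean of about 30x leaves some headroom for regions that end up covered less deeply.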

7.5 years ago by Floris Brenk ★ 1.0k

0

Thanks a lot for the reply:)

7.5 years ago by DVA ▴ 630

0

Here's a more recent analysis of sensitivity vs. read depth for WGS and WXS: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-247

7.5 years ago by harold.smith.tarheel ★ 4.9k

0

Thank you for the information

7.5 years ago by DVA ▴ 630

0

Another reference, on recommended coverage in exome sequencing: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-195

We estimate a local read depth of 13X is required to detect the alleles and genotype of a heterozygous SNV 95% of the time, but only 3X for a homozygous SNV. At a mean on-target read depth of 20X, commonly used for rare disease exome sequencing studies, we predict 5-15% of heterozygous and 1-4% of homozygous SNVs in the targeted regions will be missed.

But, as others have said, it really depends on what you are doing: de novo sequencing or resequencing, short or long reads, CNV detection or SNP detection, research or diagnostics, ...
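As a rough illustration of how a local depth requirement (such as the 13X quoted above) relates to the mean depth of a run, here is a minimal sketch that assumes local depth follows a Poisson distribution around the mean. That is an idealization: real coverage, especially in exome capture, is more uneven than Poisson, so this should give an optimistic lower bound on the fraction of under-covered sites rather than reproduce the paper's numbers. The function names are my own.

```python
import math


def poisson_cdf(k: int, mean: float) -> float:
    """P(X <= k) for X ~ Poisson(mean)."""
    return sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k + 1))


def fraction_below_depth(mean_depth: float, required_depth: int) -> float:
    """Fraction of sites whose local depth falls below required_depth.

    Assumption: local depth is Poisson-distributed around mean_depth.
    """
    return poisson_cdf(required_depth - 1, mean_depth)


if __name__ == "__main__":
    for mean_depth in (15, 20, 30, 40):
        frac = fraction_below_depth(mean_depth, required_depth=13)
        print(f"mean {mean_depth}x: {frac:.2%} of sites below 13x (Poisson model)")
```

Even in this best case, a few percent of sites fall below 13x at a mean of 20x, and the fraction shrinks rapidly as the mean approaches 30x.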

7.5 years ago by WouterDeCoster 47k

0

I do SNP detection. Thanks a lot for the information.

7.5 years ago by DVA ▴ 630

score 4 · Answer 1 · 2016-11-09

The necessary coverage depends on the platform and run mode, too. Illumina's newer NextSeq platform, for example, has much lower quality and much less accurate quality scores than their top-quality MiSeq platform, as well as shorter reads. All three of those factors influence how much coverage is needed to accurately call variants. WGS needs lower coverage than exon-capture, though, because it has less bias. Using a NextSeq instead of a HiSeq/MiSeq might double your coverage target; and exon-capture might triple it.

Additionally, Illumina's newer software versions with quantized quality scores are simply not very good for calling variants, which again increases the coverage needed for a given confidence level. It's possible to recalibrate the quality scores, which restores the full quality-score range and thus makes it possible to distinguish SNVs from sequencing error more accurately, reducing the necessary coverage, but it's better to just select a platform that does not quantize quality scores in the first place. The newer 2-dye chemistries also seem to decrease quality, and patterned flow-cells decrease average insert size (longer inserts help resolve repeats), so the newer platforms with 2-dye chemistry or patterned flow-cells need more coverage for accurate variant calling.
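To make the quantization point concrete, here is a small sketch of what binning does to quality scores. The bin edges and reported values below are HYPOTHETICAL, chosen only for illustration, and are not meant to reproduce any vendor's actual scheme; the Phred-to-error conversion itself is the standard one.

```python
import bisect

# HYPOTHETICAL bins, for illustration only (not any vendor's actual scheme).
BIN_EDGES = [15, 25, 35]        # boundaries between consecutive bins
BIN_VALUES = [14, 21, 32, 36]   # the single score reported for each bin


def quantize(q: int) -> int:
    """Map a full-resolution Phred score to its (hypothetical) binned value."""
    return BIN_VALUES[bisect.bisect_right(BIN_EDGES, q)]


def phred_to_error(q: float) -> float:
    """Standard Phred definition: error probability = 10^(-Q/10)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    for q in (16, 24):
        print(
            f"Q{q}: true error {phred_to_error(q):.4f} "
            f"-> reported as Q{quantize(q)} (error {phred_to_error(quantize(q)):.4f})"
        )
```

In this toy scheme, Q16 and Q24 bases are reported with the same score even though their error probabilities differ by more than a factor of six, and that lost distinction is what a variant caller would otherwise use to weigh the evidence.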

I'm currently evaluating some NextSeq data from a fungus with 120x coverage. Some of the SNPs are present in 97% of reads; it's pretty obvious they are real. Some are present in 1 read only; they appear to be sequencing errors. Some are present in around 25% of reads, with fairly low average quality scores. I'm really not sure about those: are they real? Sequencing errors? A collapsed 4-copy repeat in the assembly? If this were MiSeq or HiSeq 2500 data, it would be obvious. But with current NextSeq data, the lowest possible quality score is 14, which indicates over 95% confidence that the call is correct. I have no idea what they are. Other variants are scattered across the whole coverage scale, between 2x and 120x; with inaccurate calls and quality scores, it's impossible to accurately call variants or their ploidy unless you do massive oversequencing, and 30x would absolutely not be sufficient for a haploid, let alone a diploid.
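For reference, here is a quick sketch of the arithmetic behind the Q14 figure, plus what purely random errors would look like at 120x if the quality scores were accurate. The assumptions are mine: errors are independent between reads, every base has Q14, and an error is equally likely to produce any of the three other bases.

```python
import math


def phred_to_error(q: float) -> float:
    """Standard Phred definition: error probability = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def prob_at_least_k_errors(depth: int, k: int, per_read_error: float) -> float:
    """P(X >= k) for X ~ Binomial(depth, per_read_error).

    Assumption: errors are independent from read to read.
    """
    return sum(
        math.comb(depth, i) * per_read_error**i * (1 - per_read_error) ** (depth - i)
        for i in range(k, depth + 1)
    )


if __name__ == "__main__":
    p_err = phred_to_error(14)   # about 0.04, i.e. roughly 96% confidence
    p_same_base = p_err / 3      # chance of one specific wrong base (assumed uniform)
    print(f"Q14 error probability: {p_err:.4f}")
    print(f"P(>= 1 of 120 reads shows a given wrong base): "
          f"{prob_at_least_k_errors(120, 1, p_same_base):.3f}")
    print(f"P(>= 30 of 120 reads show it): "
          f"{prob_at_least_k_errors(120, 30, p_same_base):.3e}")
```

Under those assumptions a single discordant read is entirely expected, while 25% of reads agreeing on the same wrong base essentially never happens by chance. So the 25% sites would have to be something other than independent random error (a real variant, systematic error, or a collapsed repeat), and with unreliable quality scores there is not much left to tell those apart.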