3.4 years ago by
Walnut Creek, USA
The necessary coverage also depends on the platform and run mode. Illumina's newer NextSeq platform, for example, has much lower quality and much less accurate quality scores than their top-quality MiSeq platform, as well as shorter reads. All three of those factors influence how much coverage is needed to accurately call variants. WGS needs lower coverage than exon-capture, though, because it has less bias. As a rough guide, using a NextSeq instead of a HiSeq/MiSeq might double your coverage target, and exon-capture might triple it.
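To make that arithmetic concrete, here is a minimal sketch. The function name `coverage_target` is hypothetical, the 30x baseline is only an example, and combining the two factors by multiplication is my own assumption. The 2x and 3x figures are rough rules of thumb, not measured values.

```python
def coverage_target(base_coverage, nextseq=False, exon_capture=False):
    """Scale a baseline coverage target by rough platform/library factors.

    The factors are rules of thumb, and multiplying them together is an
    assumption made for illustration.
    """
    factor = 1.0
    if nextseq:
        factor *= 2.0  # NextSeq instead of HiSeq/MiSeq "might double" the target
    if exon_capture:
        factor *= 3.0  # exon-capture instead of WGS "might triple" it
    return base_coverage * factor


print(coverage_target(30))                                   # HiSeq/MiSeq WGS
print(coverage_target(30, nextseq=True))                     # NextSeq WGS
print(coverage_target(30, nextseq=True, exon_capture=True))  # NextSeq exon-capture
```

Under those assumptions, a 30x baseline becomes 60x on a NextSeq and 180x for NextSeq exon-capture.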
Additionally, Illumina's newer software versions with quantized quality scores are simply not very good for calling variants, which again increases the necessary coverage for a given confidence level. It's possible to recalibrate the quality scores, which restores the full quality-score range and thus makes it possible to more accurately distinguish SNVs from sequencing error, reducing the necessary coverage. But it's better to just select a platform that does not quantize quality scores in the first place. The newer 2-dye chemistries also seem to decrease quality, and patterned flow-cells decrease average insert size (longer inserts help resolve repeats), so the newer platforms with 2-dye chemistry or patterned flow-cells need more coverage for accurate variant calling.
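A toy sketch of why quantization loses information. The bin values in `BINS` are hypothetical and chosen only for illustration (they are not an official Illumina table), and the round-down rule in `quantize` is likewise an assumption about how binning might work.

```python
import bisect

# Hypothetical bin values, for illustration only; not an official Illumina table.
BINS = [14, 21, 27, 32, 36]


def quantize(q):
    """Report the highest bin value <= q, or the lowest bin if q is below every bin."""
    i = bisect.bisect_right(BINS, q) - 1
    return BINS[max(i, 0)]


# Very different underlying scores can collapse to the same reported score.
for true_q in (5, 13, 20, 30, 40):
    print(true_q, "->", quantize(true_q))
```

With these made-up bins, underlying scores of 5, 13 and 20 are all reported as 14, so a variant caller can no longer tell a poor base call from a mediocre one.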
I'm currently evaluating some NextSeq data from a fungus with 120x coverage. Some of the SNPs are present in 97% of reads; it's pretty obvious they are real. Some are present in 1 read only; they appear to be sequencing error. Some are present in around 25% of reads, with a fairly low average quality score. I'm really not sure about those - are they real? Sequencing error? A collapsed 4-copy repeat in the assembly? If this were MiSeq or HiSeq 2500 data, it would be obvious. But with current NextSeq data, the lowest possible quality score is 14, which indicates over 95% confidence that the call is correct. I have no idea what they are. Other variants are scattered across the whole coverage scale, between 2x and 120x. With inaccurate calls and quality scores, it's impossible to accurately call any variants or their ploidy unless you do massive oversequencing, and 30x would absolutely not be sufficient for a haploid, let alone a diploid.
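For reference, the two numbers in that example can be checked with the standard Phred definition (error probability = 10^(-Q/10)). The function names are hypothetical, and the repeat calculation assumes a haploid genome, equal coverage of every copy, and a variant in exactly one copy.

```python
def phred_to_accuracy(q):
    """Nominal probability that a base call with Phred score q is correct."""
    return 1 - 10 ** (-q / 10)


def collapsed_repeat_fraction(copies, variant_copies=1):
    """Expected fraction of reads showing a variant when `copies` repeat copies
    are collapsed into one locus in the assembly (assumes equal coverage per copy)."""
    return variant_copies / copies


print(round(phred_to_accuracy(14), 3))  # 0.96
print(collapsed_repeat_fraction(4))     # 0.25
```

So a score of 14 nominally claims about 96% confidence, and a variant in one copy of a collapsed 4-copy repeat would be expected in about 25% of reads, which is why that explanation is on the list.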