Correlation between coverage and variant calling
2.3 years ago
vinayjrao ▴ 210

Hi,

I analyzed a few human Exome-Seq data sets whose fastq files were around 2.5 Gb each. I am now analyzing another set, where the fastq files are around 900 Mb each. After aligning both to hg38 and following the same pipeline, variant calling (GATK HaplotypeCaller) gave approximately 300,000 variants in the older data set (2.5 Gb fastq), but only around 100,000 variants in the current data set (900 Mb fastq).

I understand that a higher coverage would have given the software more confidence to identify variants, but is it expected that the number of variants drops roughly threefold (i.e. linearly) when the number of reads drops roughly threefold, or is there something that I'm missing?

Thank you.

Exome-Seq SNP variant_calling • 545 views
2.3 years ago

"I understand that a higher coverage would have given the software more confidence to identify variants"

Not necessarily.

Anecdotal evidence consistently suggests a relationship between the read depth at the position where a variant is being called and the false-positive and false-negative rates of variant calling.

To keep this short, you can break this dependency by randomly sampling reads from your main aligned and QC'd BAM, and then re-calling variants on each random subset. At the end, you take the consensus calls. I elaborate on this here: A: Best tool for variant calling
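As a rough illustration of the consensus step only, here is a minimal Python sketch. It assumes the variants called on each random subset have already been parsed into sets of (chrom, pos, ref, alt) tuples; the function name `consensus_calls`, the `min_fraction` parameter, and the toy data are hypothetical and not taken from any particular pipeline. The subsampling itself would be done beforehand on the BAM (for example with something like `samtools view -s`), followed by a separate variant calling run per subset.

```python
from collections import Counter


def consensus_calls(call_sets, min_fraction=0.5):
    """Return variants present in at least min_fraction of the call sets.

    call_sets: list of iterables of hashable variant keys,
               e.g. (chrom, pos, ref, alt) tuples (an assumed representation).
    """
    if not call_sets:
        return set()
    # Count in how many subsets each variant was called (once per subset).
    counts = Counter(v for calls in call_sets for v in set(calls))
    threshold = min_fraction * len(call_sets)
    return {v for v, n in counts.items() if n >= threshold}


# Toy example: calls from three hypothetical random subsets.
subset_calls = [
    {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
    {("chr1", 100, "A", "G"), ("chr2", 50, "G", "A")},
    {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
]
print(sorted(consensus_calls(subset_calls, min_fraction=0.6)))
```

With `min_fraction=0.6`, a variant has to appear in at least two of the three subsets to be kept, so the call seen in only one subset is dropped. How strict the threshold should be is a judgment call that depends on the data.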

Kevin


I would argue that it strongly depends on the variant caller. If you use tools like VarScan2, which use a statistical framework to calculate the probability that a certain genotype is present, then decreasing read numbers will reduce power, and therefore subsampling would reduce the number and confidence of variants. Not sure how GATK calls variants, though.
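To make the power argument concrete, here is a toy Python sketch. It is not how VarScan2 or GATK actually decide on a call; it simply assumes a heterozygous site where each read supports the alternate allele with probability 0.5, and a hypothetical rule that at least `min_alt_reads` supporting reads are needed (both values are assumptions for illustration).

```python
from math import comb


def detection_probability(depth, min_alt_reads=3, alt_fraction=0.5):
    """P(at least min_alt_reads alt-supporting reads) under Binomial(depth, alt_fraction).

    Toy model only: real callers use far richer likelihood models.
    """
    # Probability of seeing fewer than min_alt_reads alt reads.
    p_fewer = sum(
        comb(depth, k) * alt_fraction**k * (1 - alt_fraction) ** (depth - k)
        for k in range(min(min_alt_reads, depth + 1))
    )
    return 1.0 - p_fewer


for depth in (5, 10, 30, 90):
    print(depth, round(detection_probability(depth), 4))
```

Under this simplified model the detection probability saturates quickly as depth grows, so the number of detectable variants would not be expected to scale linearly with the number of reads. Whether a threefold drop appears in practice depends on how depth is distributed across the targets and on the caller.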


That too - yes (i.e. the variant caller)


Thank you both for the insight. However, it is still not clear to me whether a decrease in fastq file size will linearly reduce the number of variants.

2.3 years ago
harish ▴ 410

Also, there are quite a few relevant blog posts about this from Brad Chapman on bcbio, and a lovely article by Heng Li.

Production scripts on variant filtering in bcbio toolkit: https://github.com/bcbio/bcbio-nextgen/blob/98c12fdaa8ce6ab9c6c1fdfb4db39df9c7b548ff/bcbio/variation/vfilter.py#L120

While some of these posts might be older, they still offer immense value.