Correlation between coverage and variant calling
2.3 years ago
vinayjrao ▴ 210

Hi,

I analyzed a few human Exome-Seq data sets and noticed that their fastq files were around 2.5 Gb each. I am now analyzing another set, where the fastq files are around 900 Mb. After aligning both to hg38 and following the same pipeline, I noticed upon variant calling (GATK HaplotypeCaller) that the number of variants in the older data set (2.5 Gb fastq) was approximately 300,000, while in the current data set (900 Mb fastq) there are only around 100,000 variants.

I understand that a higher coverage would have given the software more confidence to identify variants, but is it plausible for the number of variants to drop roughly 3-fold, i.e. linearly, when the number of reads drops roughly 3-fold, or is there something that I'm missing?

Thank you.

Exome-Seq SNP variant_calling • 545 views
2.3 years ago

"I understand that a higher coverage would have given the software more confidence to identify variants"

Not necessarily.

Anecdotal evidence consistently suggests a relationship between the read depth at the position where a variant is being called and the false-positive and false-negative rates of variant calling.

To keep this short, you can break this dependency by randomly subsampling the reads from your main aligned and QC'd BAM, and then re-calling variants on each random subset. At the end, you take the consensus calls. I elaborate on this here: A: Best tool for variant calling
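As a rough illustration of the final consensus step only, here is a minimal Python sketch. It assumes the variants from each random subset have already been called and parsed into sets of (chrom, pos, ref, alt) tuples; the function name `consensus_calls`, the `min_count` threshold, and the toy data are all hypothetical and not taken from the linked answer.

```python
from collections import Counter


def consensus_calls(call_sets, min_count):
    """Return the variants that appear in at least `min_count` call sets.

    Each call set is an iterable of hashable variant keys, here
    (chrom, pos, ref, alt) tuples. `min_count` is an illustrative
    threshold, not a recommended value.
    """
    # Count in how many subsets each variant was called (once per subset).
    counts = Counter(variant for calls in call_sets for variant in set(calls))
    return {variant for variant, n in counts.items() if n >= min_count}


# Toy input: calls from three hypothetical random subsets of one BAM.
subset_calls = [
    {("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")},
    {("chr1", 1000, "A", "G")},
    {("chr1", 1000, "A", "G"), ("chr3", 42, "G", "A")},
]

# Keep variants supported by at least 2 of the 3 subsets.
consensus = consensus_calls(subset_calls, min_count=2)
print(sorted(consensus))
```

How strict the threshold should be (majority, all subsets, or something else) is a design choice that this sketch leaves open.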

Kevin


I would argue that it strongly depends on the variant caller. If you use tools like VarScan2, which use a statistical framework to calculate the probability that a certain genotype is present, then decreasing read numbers will reduce power, and therefore subsampling would reduce the number and confidence of variants. Not sure how GATK calls variants though.
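To make the power argument concrete, here is a toy Python sketch. It is not how VarScan2 or GATK compute anything: it assumes a deliberately simplified caller that reports a heterozygous site whenever at least `min_alt_reads` reads carry the alternate allele, with alt-supporting reads following a binomial distribution. The function name and all parameter values are hypothetical.

```python
from math import comb


def detection_power(depth, min_alt_reads=3, alt_fraction=0.5):
    """P(X >= min_alt_reads) for X ~ Binomial(depth, alt_fraction).

    Toy model only: a site counts as 'detected' when at least
    `min_alt_reads` of the `depth` reads support the alt allele.
    """
    if depth < min_alt_reads:
        return 0.0
    # Probability of seeing fewer alt reads than the threshold.
    p_below = sum(
        comb(depth, k) * alt_fraction**k * (1 - alt_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_below


# Compare detection probability across a few illustrative depths.
for depth in (5, 10, 30, 90):
    print(depth, round(detection_power(depth), 4))
```

Under this simplified model, power climbs steeply at low depth and then saturates, so the number of detected sites would not be expected to scale linearly with the number of reads. Real callers use richer likelihood models, so this only illustrates the general shape.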


That too - yes (i.e. the variant caller)


Thank you both for the insight. However, it is still not clear to me whether a decrease in fastq file size will linearly reduce the number of variants.

2.3 years ago
harish ▴ 410

Also, there are quite a few relevant blog posts about this from Brad Chapman on bcbio, and a lovely article by Heng Li.

Heng Li's article: https://academic.oup.com/bioinformatics/article/30/20/2843/2422145

Bcbio posts: https://bcbio.wordpress.com/2013/10/21/updated-comparison-of-variant-detection-methods-ensemble-freebayes-and-minimal-bam-preparation-pipelines/

Production scripts on variant filtering in bcbio toolkit: https://github.com/bcbio/bcbio-nextgen/blob/98c12fdaa8ce6ab9c6c1fdfb4db39df9c7b548ff/bcbio/variation/vfilter.py#L120

While some of these posts might be older, they still offer immense value.
