Question

What are acceptable SNP QUAL and GQ thresholds to filter on?

9.0 years ago

jgbradley1 ▴ 110

I have been trying to work out acceptable thresholds for filtering SNPs. I'm using GATK HaplotypeCaller, and the distribution of QUAL scores that I see looks like this:

[histogram of SNP QUAL scores]

The huge peak of SNPs at QUAL ~1000 looks odd to me, but I have seen this same distribution in both a human and dog sample, so I don't think it is a sample-specific error. Can someone explain why the peak is there or what is going on?

SNP qual • 5.6k views

Ram · Answer 1 · 2015-05-07

Normally, when you see that in a histogram, it's because all values above the upper limit of the x-axis were lumped into the last bin. So presumably anything with an actual value over 1000 is being binned at 1000, which means the peak is not real, just an artifact of binning.
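To illustrate the binning effect described above, here is a minimal sketch. The QUAL values are made up, and `histogram` is a hypothetical helper written for this example, not part of GATK or any plotting library.

```python
from collections import Counter

def histogram(values, bin_width, max_value=None):
    """Count values per bin; if max_value is set, clip larger values to it first."""
    counts = Counter()
    for v in values:
        if max_value is not None and v > max_value:
            v = max_value  # everything above the axis limit lands in the last bin
        counts[int(v // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))

# Made-up QUAL values, purely illustrative
quals = [50, 120, 480, 950, 1500, 2200, 3100, 4800, 7600]

print(histogram(quals, bin_width=500))                  # long tail spread over many bins
print(histogram(quals, bin_width=500, max_value=1000))  # tail piled up in the 1000 bin
```

With clipping, the five values above 1000 all fall into the final bin and look like a peak, even though the underlying distribution has a long, thin tail.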

You can't determine a cutoff just from a graph like that; to determine one empirically, you need to analyze versus a gold standard or use a trio so you can verify which ones might be correct on the basis of inheritance, etc.
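As a rough sketch of what the gold-standard comparison involves: once each call has been labelled as present or absent in the truth set, precision and sensitivity can be computed for any candidate threshold. The function name, the input layout, and the numbers below are all assumptions made for illustration.

```python
def precision_sensitivity(calls, n_truth, threshold):
    """calls: list of (qual, in_truth_set) pairs; n_truth: variants in the gold standard."""
    kept = [in_truth for qual, in_truth in calls if qual >= threshold]
    tp = sum(kept)            # kept calls that match the gold standard
    fp = len(kept) - tp       # kept calls that do not
    precision = tp / (tp + fp) if kept else float("nan")
    sensitivity = tp / n_truth
    return precision, sensitivity

# Made-up calls: (QUAL, whether the call is in the gold standard)
calls = [(20, False), (35, True), (60, False), (150, True), (400, True), (900, True)]
n_truth = 5  # the gold standard contains one variant that was never called

for t in (0, 50, 100):
    p, s = precision_sensitivity(calls, n_truth, t)
    print(f"QUAL >= {t}: precision={p:.2f}, sensitivity={s:.2f}")
```

Raising the threshold trades sensitivity for precision, which is why the cutoff cannot be read off the QUAL histogram alone.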

Ram · Answer 2 · 2015-05-08

Following on a little from Brian's answer, the threshold you choose depends on where you want to make your precision/sensitivity trade-off. In an ideal world you could work out the expected precision and sensitivity directly from the probabilities represented by the QUAL or GQ scores, but in practice those scores are not well calibrated. If you do have a gold-standard call set for your sample, you can use RTG Tools (free) or RTG Core (free for non-commercial use) from our website, which make it very easy to run the comparison and see the effects of different thresholds:

rtg vcfeval -t ref -b gold.vcf.gz -c calls.vcf.gz -o eval-GQ # GQ is the default score field
rtg vcfeval -t ref -b gold.vcf.gz -c calls.vcf.gz -f QUAL -o eval-QUAL
rtg vcfeval -t ref -b gold.vcf.gz -c calls.vcf.gz -f INFO=VQSLOD -o eval-VQSLOD # if you have run VQSR
rtg rocplot eval-*/weighted_roc.tsv.gz

The last command brings up a GUI showing the ROC curves side by side for comparison; by moving a slider you can see the effect of applying a threshold on your sensitivity and precision.
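For reference, this is what "working out expected precision directly from the scores" would look like if the scores were perfectly calibrated. QUAL and GQ are Phred-scaled, so a score Q corresponds to an error probability of 10^(-Q/10). The helper names below are hypothetical, and the calibration assumption is exactly the one that tends not to hold in practice.

```python
def phred_to_error_prob(q):
    """Convert a Phred-scaled score to the probability that the call is wrong."""
    return 10 ** (-q / 10)

def expected_precision(quals):
    """Expected fraction of correct calls, ASSUMING the scores are perfectly calibrated."""
    errors = [phred_to_error_prob(q) for q in quals]
    return 1 - sum(errors) / len(errors)

# Made-up scores for calls that pass some threshold
passing = [30, 40, 50, 1000]
print(phred_to_error_prob(30))
print(expected_precision(passing))
```

Comparing this theoretical figure with the precision measured against a gold standard is one way to see how far off the calibration is.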

Ram · Answer 3 · 2015-05-07

9.0 years ago

Dan D 7.4k

Brian is spot-on about the quality scores mostly being over 1000. Per his suggestion of studying the quality scores in a platinum set, I pulled the quality scores from the Stanford NA12878 Platinum VCF file. The minimum quality score is 1000.1
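A check like this can be sketched in a few lines of Python. The snippet below is an illustration rather than the exact method used here: `min_qual` is a hypothetical helper, and the VCF records are made up. It relies only on QUAL being the sixth tab-separated column of a VCF data line.

```python
import io

def min_qual(vcf_lines):
    """Return the smallest QUAL value in an iterable of VCF lines, or None if there is none."""
    quals = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 5 and fields[5] != ".":  # "." means QUAL is missing
            quals.append(float(fields[5]))
    return min(quals) if quals else None

# Made-up records, purely illustrative
example_vcf = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t1000\t.\tA\tG\t1523.77\tPASS\t.\n"
    "chr1\t2000\t.\tC\tT\t1000.1\tPASS\t.\n"
    "chr1\t3000\t.\tG\tA\t.\tPASS\t.\n"
)
print(min_qual(example_vcf))
```

For a real bgzipped file, the same function can be fed a handle from `gzip.open(path, "rt")`, where `path` is whatever VCF you want to inspect.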
