Question: What are acceptable SNP QUAL and GQ thresholds to filter on?
jgbradley1 (United States) wrote, 5.2 years ago:

I have been trying to work out acceptable thresholds for filtering SNPs. I'm using GATK HaplotypeCaller, and the distribution of QUAL scores that I see looks like this:

[Figure not preserved: histogram of SNP QUAL scores, with a large peak at QUAL ~1000]

The huge peak of SNPs at QUAL ~1000 looks odd to me, but I have seen this same distribution in both a human and dog sample, so I don't think it is a sample-specific error. Can someone explain why the peak is there or what is going on?

Brian Bushnell (Walnut Creek, USA) wrote, 5.2 years ago:

Normally, when you see that in a histogram, it's because all values above the top of the histogram's range were lumped together into the last bin. So presumably anything with an actual value over 1000 is being binned at 1000, and the peak is not real, just an artifact of binning.
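
To make the binning effect concrete, here is a minimal Python sketch. The QUAL values, the bin width of 100, and the cap of 1000 are all made up for illustration; this is not how any particular plotting tool is known to work, only one way such a spike could arise.

```python
from collections import Counter

def clipped_histogram(values, bin_width=100, cap=1000):
    """Count values per bin, lumping everything above `cap` into the bin at `cap`."""
    counts = Counter()
    for v in values:
        clipped = min(v, cap)  # values above the cap are all treated as `cap`
        counts[int(clipped // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))

# Hypothetical QUAL values: half of them lie above the assumed cap of 1000.
quals = [50, 120, 480, 950, 1000.1, 1500, 2300, 8000]
hist = clipped_histogram(quals)
for bin_start, count in hist.items():
    print(bin_start, "#" * count)
```

The last bin collects every value above the cap, so it towers over the others even though the underlying values are spread over a wide range.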

You can't determine a cutoff just from a graph like that; to determine one empirically, you need to analyze versus a gold standard or use a trio so you can verify which ones might be correct on the basis of inheritance, etc.
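
As a rough illustration of the trio idea, the sketch below checks whether a child's genotype could have been inherited from the parents' genotypes. It assumes diploid autosomal sites, no de novo mutations, and genotypes given as pairs of allele indices (0 = reference, 1 = first alternate); the function name is hypothetical.

```python
def is_mendelian_consistent(child, mother, father):
    """True if one child allele can come from the mother and the other from the father."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)

# Child is 0/1, mother 0/0, father 1/1: consistent.
print(is_mendelian_consistent((0, 1), (0, 0), (1, 1)))
# Child is 1/1 but the mother carries no alternate allele: inconsistent,
# so at least one of the three genotype calls is probably wrong.
print(is_mendelian_consistent((1, 1), (0, 0), (0, 1)))
```

Counting how the fraction of inconsistent sites changes as you raise a QUAL or GQ cutoff gives an empirical handle on the threshold when no gold standard is available.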

Len Trigg (New Zealand) wrote, 5.2 years ago:

Following on a little from Brian's answer, the threshold you would choose depends on where you want to make your precision/sensitivity tradeoff. In an ideal world you could work out expected precision/sensitivity directly from the probabilities represented by the QUAL or GQ scores, but in practice those scores are not well calibrated. If you do have a gold-standard call-set for your sample, you can use RTG Tools (free) or RTG Core (free for non-commercial use) from our website. It makes running the comparison and seeing the effects of different thresholds very easy:

rtg vcfeval -t ref -b gold.vcf.gz -c calls.vcf.gz -o eval-GQ
rtg vcfeval -t ref -b gold.vcf.gz -c calls.vcf.gz -f QUAL -o eval-QUAL
rtg vcfeval -t ref -b gold.vcf.gz -c calls.vcf.gz -f INFO=VQSLOD -o eval-VQSLOD # if you have run VQSR
rtg rocplot eval-*/weighted_roc.tsv.gz

The last command brings up a GUI containing the ROC curves for comparison; using a slider, you can see the effect of applying a threshold on your sensitivity and precision.
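
For readers who want to see what the tradeoff amounts to, here is a minimal Python sketch. It is not RTG's implementation; the function names, the scores, and the true/false labels are made up, and it assumes each call has already been marked as matching the gold standard or not. It also shows the standard Phred conversion, which is what the scores would mean if they were well calibrated.

```python
def phred_to_error_prob(q):
    """Phred-scaled score -> probability that the call is wrong (if calibrated)."""
    return 10 ** (-q / 10)

def precision_sensitivity(calls, n_truth, threshold):
    """calls: list of (score, matches_gold); n_truth: number of variants in the gold set."""
    kept = [is_tp for score, is_tp in calls if score >= threshold]
    tp = sum(kept)
    # Convention assumed here: precision is 1.0 when nothing passes the filter.
    precision = tp / len(kept) if kept else 1.0
    sensitivity = tp / n_truth
    return precision, sensitivity

# Hypothetical calls: (score, does the call match the gold standard?)
calls = [(10, False), (20, True), (30, False), (40, True), (60, True), (99, True)]
n_truth = 5  # assumed size of the gold set, including one variant that was never called

for threshold in (0, 25, 35):
    p, s = precision_sensitivity(calls, n_truth, threshold)
    print(threshold, round(p, 2), round(s, 2))
```

Raising the threshold removes false positives, which raises precision, but it also discards some true calls, which lowers sensitivity. Where to stop depends on which of the two matters more for your analysis.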

Dan D (Tennessee) wrote, 5.2 years ago:

Brian is spot-on about the quality scores mostly being over 1000. Per his suggestion of studying the quality scores in a platinum set, I pulled the quality scores from the Stanford NA12878 Platinum VCF file. The minimum quality score is 1000.1
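
For anyone who wants to repeat this kind of check on their own file, a minimal Python sketch follows. It relies only on the VCF layout (tab-separated records, QUAL in the sixth column, "." for a missing value); the function name and the example records are made up and are not taken from the Platinum file.

```python
def qual_values(vcf_lines):
    """Yield the QUAL value of every record, skipping headers and missing values."""
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank line or header
        qual = line.rstrip("\n").split("\t")[5]  # QUAL is the 6th VCF column
        if qual != ".":
            yield float(qual)

# Hypothetical records for illustration only.
vcf_text = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t1000\t.\tA\tG\t1200.5\tPASS\t.\n",
    "chr1\t2000\t.\tC\tT\t2543.77\tPASS\t.\n",
    "chr1\t3000\t.\tG\tA\t.\tPASS\t.\n",
]
quals = list(qual_values(vcf_text))
print(min(quals), max(quals), len(quals))
```

For a real bgzipped file, passing `gzip.open(path, "rt")` from the standard library in place of the list should work, since it yields text lines in the same way.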
