Question: (samtools) Filtering and interpreting >100 QUAL scores
0
gravatar for umn_bist
21 months ago by
umn_bist240
umn_bist240 wrote:

I am in the process of filtering/selecting variants via samtools/bcftools for sample RNA sequence. As the title indicates I am getting variant calls (from my pileup) that have QUAL scores >100. I understand that there is not a theoretical limit to phred-scaled quality score but I wanted to make sure I am interpreting it correctly.

Question 1. I understand that QUAL column in a vcf file denotes Phred-scaled probability of all samples being homozygous reference and MQ (mapping quality score) quantify the probability that a read is misplaced. Does this mean that a LOW QUAL and HIGH MQ scores mean the alternative sequence is likely a true variant?

Question 2. I am wondering how you guys have dealt with QUAL scores >100 for annotating purposes?

EXAMPLE:

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT  
chr1 150898915   .      C      T    113.066    PASS    DP=17;VDB=0.915697;SGB=-0.636426;MQ0F=0;AF1=1;AC1=2;DP4=0,0,7,0;MQ=50;FQ=-47.986    GT:PL    1/1:146,21,0

chr1 150898931   .      T      C    207.999    PASS    DP=31;VDB=0.000846843;SGB=-0.692067;MQSB=0.944444;MQ0F=0;AF1=1;AC1=2;DP4=0,0,18,2;MQ=49;FQ=-86.9865    GT:PL    1/1:241,60,0

 

ADD COMMENTlink modified 21 months ago by Brice Sarver2.3k • written 21 months ago by umn_bist240
1

How did you call variants?

ADD REPLYlink written 21 months ago by Brice Sarver2.3k

I called the variant using samtools mpileup and bcftool call options.

samtools mpileup -uf </path/to/ref.fasta> </path/to/sorted.bam> -l </path/to/region.bed> | bcftools call -vc - > </path/to/pileup.vcf>

Thank you for your help.

ADD REPLYlink modified 21 months ago • written 21 months ago by umn_bist240
1

samtools/bcftools > v1.0 or < v1.0?

ADD REPLYlink written 21 months ago by Brice Sarver2.3k

This was done with samtools/bcftools v1.3.

ADD REPLYlink written 21 months ago by umn_bist240
4
gravatar for Brice Sarver
21 months ago by
Brice Sarver2.3k
United States
Brice Sarver2.3k wrote:

You are invoking the old variant calling method with -c. So, unless I'm misunderstanding something, the QUAL field is still Phred-scaled probability of there being a non-homozygous reference call at that position. It also appears that this number takes other factors into account and can be quite large. Here's a comment from the GATK forums from a couple of years ago outlining this within their framework:

You may be thinking of Phred-scaled base qualities, which do indeed tend to be capped at certain values by convention. The various likelihood metrics emitted by GATK, which are also Phred-scaled, can easily go up into the thousands. The important thing to understand is that they are relative, not absolute values. So it does depend very much on the particular dataset you are working with.

The QUAL value reflects how confident we are that a site displays some kind of variation considering the amount of data available (=depth of coverage at the site) (because we are more confident when we have more observations to rely on), the quality of the mapping of the reads and alignment of the bases (because if we are not sure the bases observed really belong there, they do not contribute much to our confidence), and the quality of the base calls (because if they look like machine errors, they also do not contribute much to our confidence). Filtering on the QUAL value should be done by adjusting the emit and call thresholds at the calling step.

GQ is a very different metric; it's not about whether the site displays variation. GQ tells you whether, given a site for which there is variation in the population, each sample has been assigned the correct genotype.

Regarding filtering, have a look at our Best Practices document; the very last section gives some base recommendations for hard-filtering if you cannot perform variant recalibration (VQSR).

Variant qualities, like GQ etc., take a lot of this information into account. If you are dealing with large datasets, you can be a bit more liberal on your filtering. But, in your case, a high mapping quality just means that a read has a higher probability of being appropriately placed, but a low QUAL means that there is not a lot of confidence that there is variation there for a variety of possible reasons.

I realize that there have been a couple of other interpretations floating around online, so I would love it if someone would correct me if I'm interpreting this incorrectly. I think what I'm saying is at least consistent with the latest VCF specification.

ADD COMMENTlink modified 21 months ago • written 21 months ago by Brice Sarver2.3k

Thank you for your reply. After more reading I agree with your interpretation. The QUAL score is largely the assertion that there exists a variant. Here is a coment from GATK forums:

The Phred-scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. Because the Phred scale is -10 * log(1-p), a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance (see the FAQ article for a detailed explanation). These values can grow very large when a large amount of data is used for variant calling, so QUAL is not often a very useful property for evaluating the quality of a variant call. See our documentation on filtering variants for more information on this topic.

From samtools documentation, MQ (mapping quality) denotes confidence that the ALT read has been appropriately placed. As you hinted I think filtering against GQ and MQ score may be preferable. I am wondering, based on my example vcf output, is my calling option specifically omitting GQ values from my vcf file? Thank you again for your help.

ADD REPLYlink modified 21 months ago • written 21 months ago by umn_bist240
1

It doesn't look like your VCF has that in the output, but it does have the normalized, Phred-scaled likelihood:

##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">

Most studies no longer use the mpileup -> bcftools method of calling variants; the GATK's HaplotypeCaller (or UnifiedGenotyper, if speed is an issue) or FreeBayes are improvements on the original approach and more accurately call haplotypes. I'd recommend using one of these and following established practices for mapping and calling. This way, most of the filtering information you see online will make more sense.

ADD REPLYlink written 21 months ago by Brice Sarver2.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1585 users visited in the last hour