4.5 years ago by
You are invoking the old variant calling method with -c. So, unless I'm misunderstanding something, the QUAL field is still the Phred-scaled probability that the site is actually homozygous reference, i.e. that the ALT call is wrong, so a higher QUAL means more confidence that some variation is present. This number also appears to take other factors into account, and it can be quite large. Here's a comment from the GATK forums from a couple of years ago outlining this within their framework:
You may be thinking of Phred-scaled base qualities, which do indeed tend to be capped at certain values by convention. The various likelihood metrics emitted by GATK, which are also Phred-scaled, can easily go up into the thousands. The important thing to understand is that they are relative, not absolute values. So it does depend very much on the particular dataset you are working with.
The QUAL value reflects how confident we are that a site displays some kind of variation considering the amount of data available (=depth of coverage at the site) (because we are more confident when we have more observations to rely on), the quality of the mapping of the reads and alignment of the bases (because if we are not sure the bases observed really belong there, they do not contribute much to our confidence), and the quality of the base calls (because if they look like machine errors, they also do not contribute much to our confidence). Filtering on the QUAL value should be done by adjusting the emit and call thresholds at the calling step.
GQ is a very different metric; it's not about whether the site displays variation. GQ tells you whether, given a site for which there is variation in the population, each sample has been assigned the correct genotype.
Regarding filtering, have a look at our Best Practices document; the very last section gives some base recommendations for hard-filtering if you cannot perform variant recalibration (VQSR).
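To make the QUAL/GQ distinction in the quoted comment concrete, here is a minimal sketch in plain Python that pulls both values out of a single VCF data line. It assumes the standard tab-separated VCF column order (QUAL in the sixth column, FORMAT in the ninth, samples after that); the function name `parse_vcf_record` and the example record are made up for illustration and do not come from any real dataset or library.

```python
def parse_vcf_record(line):
    """Return (site-level QUAL, list of per-sample GQ values) for one VCF data line.

    Hypothetical helper for illustration; missing values ('.') become None.
    """
    fields = line.rstrip("\n").split("\t")

    # QUAL is a single value for the whole site (6th column).
    qual = None if fields[5] == "." else float(fields[5])

    # GQ is stored per sample, keyed by the FORMAT column (9th column).
    format_keys = fields[8].split(":")
    gqs = []
    for sample in fields[9:]:
        values = dict(zip(format_keys, sample.split(":")))
        gq = values.get("GQ", ".")
        gqs.append(None if gq == "." else int(gq))
    return qual, gqs


# Made-up record: one site, two samples.
record = "chr1\t12345\t.\tA\tG\t1523.7\tPASS\tDP=80\tGT:GQ\t0/1:99\t0/0:12"
qual, gqs = parse_vcf_record(record)
print(qual, gqs)
```

In this invented record the site as a whole has a large QUAL, while the second sample's genotype has a low GQ: the site can be confidently variant even when an individual genotype assignment is not.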
Quality scores such as QUAL and GQ take a lot of this information into account. If you are dealing with large datasets, you can be a bit more liberal with your filtering. In your case, though, a high mapping quality only means that a read is more likely to have been placed correctly, whereas a low QUAL means that, for any of a variety of possible reasons, there is not much confidence that there is variation at that site.
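As a rough illustration of what "Phred-scaled" means for these numbers, the sketch below converts between a Phred score and the underlying error probability using the usual definition Q = -10 * log10(p). The function names are my own, and this shows only the scaling itself, not how any particular caller arrives at its QUAL.

```python
import math


def phred_to_error_prob(q):
    """Probability that the call is wrong, given a Phred-scaled score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred-scaled score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# A low QUAL leaves a substantial chance that the site is not variant,
# while a QUAL in the thousands corresponds to a vanishingly small one.
for q in (3, 20, 1500):
    print(q, phred_to_error_prob(q))
```

Under this definition, a QUAL of 20 corresponds to a 1 in 100 chance that the call is wrong, which is why scores in the thousands are best read as relative rankings rather than as literal probabilities.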
I realize that there have been a couple of other interpretations floating around online, so I would love it if someone would correct me if I'm interpreting this incorrectly. I think what I'm saying is at least consistent with the latest VCF specification.