GATK MQ-scores differ depending on reference genome
23 months ago
axejen ▴ 10

Hi,

I am dealing with an issue that I can't wrap my head around, and would greatly appreciate any input. I am working on variant calling in a set of primate species from the Cercopithecus genus. I have mapped to two separate reference genomes: the rhesus macaque (MMul), which is a fairly distant outgroup, and Chlorocebus sabaeus (ChlSab), which is much closer. When I follow the GATK best practices workflow, I notice that the "standard" filtration settings in VariantFiltration, specifically the MQ threshold of 40 (MQ40), remove a massive number of sites in the ChlSab variants (~45 %), but very few in the MMul set (~3 %). The distributions of the variants' MQ scores look very different depending on the reference genome (see plots). When I randomly inspect some of these filtered genotypes in IGV, they look fine to my eye. The MQ is calculated as the root mean square of the mapping qualities of the reads covering a variant site, and I'm having a hard time coming up with an explanation for this large discrepancy between reference genomes.
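To make the root mean square calculation concrete, here is a minimal Python sketch. The function name `rms_mapping_quality` and the read pileups are hypothetical and invented for illustration, and the maximum MAPQ of 60 is an assumption that depends on the aligner. It only shows how a group of low-MAPQ reads at a site can pull MQ below 40 even when the remaining reads map confidently.

```python
import math


def rms_mapping_quality(mapqs):
    """Root mean square of per-read mapping qualities at one site."""
    if not mapqs:
        raise ValueError("need at least one read")
    return math.sqrt(sum(q * q for q in mapqs) / len(mapqs))


# Hypothetical pileups of 10 reads each (values invented for illustration).
all_confident = [60] * 10               # every read maps confidently
mostly_ambiguous = [60] * 4 + [0] * 6   # six reads have MAPQ 0

print(round(rms_mapping_quality(all_confident), 1))     # 60.0
print(round(rms_mapping_quality(mostly_ambiguous), 1))  # 37.9, below the MQ40 cutoff
```

Because the qualities are squared before averaging, the high-MAPQ reads dominate the value, so a site only falls below 40 when a substantial fraction of its reads map ambiguously.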

Clearly, I cannot simply use the "standard" cutoff at 40, but given the MQ distribution I would more or less need to remove this filter altogether to avoid discarding too many seemingly good variants. I'm not comfortable doing that without understanding the cause, though.

Has anybody come across something like this before, or does anybody have any ideas about why this may happen and how to deal with it?

Thanks,
Axel

[Plot: Variants called against the Chlorocebus sabaeus reference]
[Plot: Variants called against the Macaca mulatta reference]

23 months ago
lethalfang ▴ 140

What's the MQ distribution for all the reads, not just those at the variant sites? MQ measures how confident the aligner is that a read comes from this part of the genome and not somewhere else. It's very much a function of the reference genome: if other regions of the genome are similar to this region, then the MQ for reads mapped to this region will generally be lower.
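One rough way to look at the MAPQ distribution of all reads is to tally column 5 of the SAM records for each alignment. The sketch below is only an illustration: the function name `mapq_histogram` and the example records are made up, and it assumes plain-text, tab-separated SAM lines as input (for example, the output of `samtools view`).

```python
from collections import Counter


def mapq_histogram(sam_lines):
    """Count mapped reads per MAPQ value from SAM-formatted text lines."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:
            continue  # skip unmapped reads
        counts[int(fields[4])] += 1  # MAPQ is the fifth SAM column
    return counts


# Hypothetical SAM records, invented for illustration only.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t120\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tchr1\t140\t0\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]

for mapq, n in sorted(mapq_histogram(example).items()):
    print(mapq, n)
```

Running the same tally on the alignments against each reference and comparing the two histograms would show whether the low MQ values are specific to variant sites or a general property of the mapping.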
