Compute Mapping quality for invariant sites
4.1 years ago
elcortegano ▴ 200

Hi, I am having some issues with the VCF files generated by the GATK caller: they do not report a mapping quality (MQ) value for many positions, especially invariant sites.

Since the BAM files have a mapping quality score for every read, I assume there is a way to get that value for every position without using GATK. What are some alternatives? And when multiple samples are used, should the MQ at each position simply be the average across samples?

In case you are wondering how I am using GATK, the relevant commands are below:

java -jar gatk HaplotypeCaller -I file.bam -O file.g.vcf -R reference.fa -ploidy 1 -ERC BP_RESOLUTION    
# The above is done for different input files
java -jar gatk CombineGVCFs -R reference.fa -O combined.g.vcf --variant file1.g.vcf --variant file2.g.vcf ...
java -jar gatk GenotypeGVCFs -R reference.fa -V combined.g.vcf -O variants.vcf -ploidy 1 -all-sites

For some reason, this results in many MQ values being absent from the final VCF file (and in many QUAL values being reported as Infinity).

next-gen mapping-quality • 1.1k views
4.1 years ago
elcortegano ▴ 200

In the end, what worked for me was switching software. I am now using freebayes, which does provide mapping qualities for both variant and invariant sites (e.g. with the --report-monomorphic option).

4.1 years ago

A read's mapping quality is "how likely is it that this read's origin has been correctly determined?". A base's quality within a read is "how likely is it that this base is what we say it is?". Quality for a variant is "how likely is it that this site is not homozygous reference?", which takes into account both mapping quality and individual base qualities.
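
Both read-level scores are Phred-scaled, i.e. Q = -10 * log10(P(error)), so a mapping quality of 20 corresponds to a 1 in 100 chance that the read is misplaced. A minimal Python sketch of the conversion, in case it helps to reason about thresholds (the function names are made up for illustration and do not come from any library):

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred-scaled score (e.g. MAPQ or base quality) to an error probability."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability to a Phred-scaled score."""
    if p <= 0:
        # An error probability of zero has no finite Phred score.
        return math.inf
    return -10 * math.log10(p)

# Error probabilities for a few common MAPQ values
for q in (10, 20, 30, 60):
    print(q, phred_to_error_prob(q))
```

Note that an error probability of zero has no finite score on this scale, which is why the sketch returns infinity in that case.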

I don't think it makes sense to worry about the mapping quality alone at a variant locus. And I don't think it makes sense for there to be a quality score for a locus that is homozygous reference.


It makes sense if you need to consider all variant and invariant sites but want to restrict the analyses to well-aligned sites. We are working with mutation accumulation data, so, as you can imagine, most of the genome is invariant, but we still need to know which fraction of it has high enough quality to be included in the computation of mutation rates.
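
To make the idea concrete, here is a rough Python sketch of a per-position mapping quality computed straight from the reads. It assumes the alignments have already been reduced to (start, end, MAPQ) tuples with 0-based, half-open coordinates (for instance after parsing the BAM with a library of your choice), and it ignores CIGAR details such as deletions and soft clips. It uses the root mean square of the MAPQs of the overlapping reads, since the MQ annotation in GATK VCFs is described as "RMS Mapping Quality". The function names and the threshold of 30 are only illustrative. For several samples, one option is to pool the reads before taking the RMS, which is not the same as averaging per-sample values.

```python
import math

def rms_mapq_per_position(reads, region_length):
    """Return the RMS MAPQ at each position (None where no read overlaps).

    reads: iterable of (start, end, mapq) with 0-based, half-open intervals.
    """
    sum_sq = [0.0] * region_length
    depth = [0] * region_length
    for start, end, mapq in reads:
        # Clip each read to the region before accumulating.
        for pos in range(max(start, 0), min(end, region_length)):
            sum_sq[pos] += mapq ** 2
            depth[pos] += 1
    return [math.sqrt(s / d) if d else None for s, d in zip(sum_sq, depth)]

def fraction_well_mapped(rms_values, min_mq=30):
    """Fraction of positions with RMS MAPQ >= min_mq; uncovered sites count as failing."""
    if not rms_values:
        return 0.0
    passing = sum(1 for v in rms_values if v is not None and v >= min_mq)
    return passing / len(rms_values)

# Toy example: three reads (possibly pooled from several samples) over a 6-bp region
reads = [(0, 4, 60), (2, 6, 20), (3, 5, 60)]
mq = rms_mapq_per_position(reads, 6)
print(mq)
print(fraction_well_mapped(mq))
```

The resulting fraction can then serve as the callable portion of the genome when computing mutation rates, with the threshold chosen to suit the data.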
