I am trying to calculate the RMS (root mean square) MQ (mapping quality) from a BAM file for every site, including sites with no variation. I visited GATK's blog and noted that they say:
Root Mean Square of the mapping quality of reads across all samples. This annotation provides an estimation of the overall mapping quality of reads supporting a variant call, averaged over all samples in a cohort.
The raw data format for this annotation consists of a list of two entries: the sum of the squared mapping qualities and the number of reads across variant (not homRef) genotypes
So if I want to calculate the RMS MQ for a site with no variation (homRef), can I use the following formula?
RMS_MQ = sqrt(sum(MQ_i^2)/N) .......... (1)
where MQ_i is the mapping quality of read i covering the site, and N is the total number of reads covering the site.
If this is correct, I also noticed this section in the GATK documentation:
Statistical notes: The root mean square is equivalent to the mean of the mapping qualities plus the standard deviation of the mapping qualities.
RMS_MQ = mean(MQ_i) + sd(MQ_i) ........... (2)
Is this correct? I noticed that the output of formula (2) is not equal to the output of formula (1).
I think formula (1) is correct for most situations (for example, in earlier versions of GATK). However, the way GATK calculates this annotation has changed: the latest version seems to calculate the RMS mapping quality only for reads that support the existence of a variant. I am not sure, but you can check further at https://gatk.broadinstitute.org/hc/en-us/articles/360037591751-RMSMappingQuality
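On why formulas (1) and (2) disagree: the general identity is RMS^2 = mean^2 + sd^2 (with the population standard deviation), so the RMS equals sqrt(mean^2 + sd^2), not mean + sd. The two only coincide when the mean or the standard deviation is zero. Here is a minimal Python sketch to check this on a toy example. The function names and the list of mapping qualities are made up for illustration and are not taken from GATK; getting the per-site MQ values out of the BAM file is left out.

```python
import math
from statistics import fmean, pstdev


def rms_mq(mapqs):
    """Formula (1): sqrt(sum(MQ_i^2) / N) over all reads covering a site."""
    if not mapqs:
        raise ValueError("no reads cover this site")
    return math.sqrt(sum(q * q for q in mapqs) / len(mapqs))


def mean_plus_sd(mapqs):
    """Formula (2) as quoted: mean(MQ_i) + sd(MQ_i)."""
    return fmean(mapqs) + pstdev(mapqs)


def rms_from_mean_sd(mapqs):
    """Identity: RMS = sqrt(mean^2 + sd^2), using the population sd."""
    return math.hypot(fmean(mapqs), pstdev(mapqs))


# Hypothetical mapping qualities of five reads covering one site.
mapqs = [60, 60, 40, 20, 0]

print("formula (1):         ", rms_mq(mapqs))
print("formula (2):         ", mean_plus_sd(mapqs))
print("sqrt(mean^2 + sd^2): ", rms_from_mean_sd(mapqs))
```

For this toy input, formula (1) and sqrt(mean^2 + sd^2) agree up to floating-point rounding, while formula (2) gives a larger value. Since the raw annotation format quoted above stores the sum of squared mapping qualities and the read count, formula (1) is the one that can be recovered directly from those two numbers.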