Bam-Readcount: How Is "Average Mismatch Quality Sum" Calculated?
1
4
Entering edit mode
11.2 years ago
mrfox ▴ 40

Hi all,

I have a question on how bam-readcount calculates the "average mismatch quality sum":

Assume there are 3 aligned reads (with length 100) at a locus, read1 has 3 mismatches, the base qualities are 2,2,2 respectively. read2 has 2 mismatches, the base qualities are 10,10 and read3 is a perfect alignment with no mismatch.

Could you please describe exactly how to calculate the avg_sum_mismatch_qualities of the 3 reads?

Thank you.

quality • 4.2k views
6
Entering edit mode
11.2 years ago
tgi.tabbott ▴ 230

Hi Hao,

Before we talk about the average mismatch quality sum, it would be good to define mismatch quality sum (let's call this mmQS for brevity). This property is associated with each read, not each position, and it is calculated almost exactly how you would expect. I would have guessed that it simply adds up the qualities of all mismatching bases in a given read, and this guess is very close to the truth. The only thing that is special is that adjacent mismatches are combined into a single event, and we define the quality for this combined event to be the minimum quality over the run of mismatching bases.

In your example, read1 has 3 mismatches each with quality two, so the possible values for mmQS are:

• 2 (if all mismatches are adjacent)
• 4 (if 2 mismatches are adjacent)
• 6 (if no mismatches are adjacent)

For read2, we can get mmQS=10 or 20 by similar logic. Read3 of course has mmQS=0 since it has no mismatches.

The "average mismatch quality sum" values in the output are specific not only to a particular position, but also to a particular nucleotide. For a given position P and nucleotide B, the average mmQS is the simple arithmetic mean of the mmQS values of all reads that reported base B at position P.

As you can see, the value in question can vary a bit depending on where the mismatches are and which reads report which bases at a particular position. Since your example is small, I've enumerated the possibilities that can happen for a fixed position and nucleotide combination (P,B) while making no assumption about the adjacency of mismatches below. The possibilities are of the form:

{set of reads reporting base B at position P} => {set of possible values for average mismatch quality sum}.

• the empty set (not very interesting)
• {read1} => {2/1, 4/1, 6/1}
• {read1, read2} => {12/2, 14/2, 16/2, 22/2, 24/2, 26/2} # { (x+y)/2 | x \in {2,4,6}, y \in {10, 20} }

In case you would like to check your understanding, a concrete example using imaginary 5bp reads similar to the three you suggested follows. The mismatching bases are assumed to have the same qualities as you specified (2 for read1, 10 for read2), and are given in lowercase to make them easy to spot. Assume that these reads all start at position 1 on some sequence.

  ref: ACGTA
read1: AgcTt  <- gc are combined into a single event; mmQS = min(2, 2) + 2 = 4
read2: tCGTt  <- mismatches are not adjacent, mmQS = 10 + 10 = 20
read3: ACGTA  <- mmQS = 0

Resulting average mmQS values:
POS  BASE  Average.mmQS
1    A     2
1    T     20

2    C     10
2    G     4

3    C     4
3    G     10

4    T     8

5    A     0
5    T     12


Hope this helps.

0
Entering edit mode

HI , is there some paper to explain that ?