Bam-Readcount And Quality Score Encoding
11.2 years ago

I'm curious whether bam-readcount is sensitive to the quality score encoding (phred+33 vs. phred+64). I've run into a situation in which bam-readcount calculates the average quality score at a given position, but if you look at the actual pileup file at that position, the average quality score is very different (for example, bam-readcount calculates the average quality as 12, whereas the BAM file shows quality scores >70). Could the quality score encoding cause this discrepancy?
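For background (a minimal sketch, not tied to bam-readcount's internals): the two encodings store the same Phred value with different ASCII offsets, so decoding with the wrong offset shifts every quality by 31.

```python
# Decode an ASCII base-quality character under the two common conventions.
# phred+33 (Sanger / Illumina 1.8+) and phred+64 (older Illumina pipelines)
# differ only in the offset subtracted from the character's ASCII code.
def decode_qual(char, offset):
    return ord(char) - offset

# The same character decodes very differently under each offset:
print(decode_qual("I", 33))  # 40
print(decode_qual("I", 64))  # 9
```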

ernfrid

bam-readcount should have no inherent notion of quality encoding and this result is unexpected.

This could be a bug, but I'd like to verify a few things first. To be clear, bam-readcount calculates the average base quality, average mapping quality, etc. per observed base call. Are you comparing a single type of read between the pileup and the bam-readcount output (i.e. reads containing a G to reads containing a G at a given position)?


I'm not entirely sure what you mean by a single type of read, so I'll give a real-world example. At position 15546 in the yeast genome, this strain has ostensibly acquired a heterozygous mutation (ref C, var T).

Running bam-readcount I get the following output: ref|NC_001133| 15546 C 85 =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 A:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 C:36:59.67:66.78:37.00:17:19:0.64:0.00:12.03:17:0.62:101.00:0.58 G:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 T:49:59.35:61.53:37.00:26:23:0.68:0.01:74.08:26:0.51:101.00:0.53 N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00

From my understanding of the output and my reading of your source code (limited as my C++ is), the average quality score for C is 12.03 and for T is 74.08. However, if you look at the pileup file (or in IGV, etc.), the average quality scores of both base calls at position 15546 are very similar (C: 62.4; T: 61.18). It may be that I don't understand how the average base quality is calculated in your code, but it looks to me like it follows this logic: if the depth at a particular position is > 0, take the qualities of a particular base call, sum them, then divide by the number of times that base call is observed. Hope that clarifies things.
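The averaging described above can be sketched as follows (the (base, quality) pairs are hypothetical stand-ins for one pileup column, not real data from this position):

```python
# Naive per-allele average base quality, as described in the post: for each
# allele observed at a position, sum the qualities of the calls for that
# allele and divide by how many times the allele was seen.
from collections import defaultdict

def per_allele_avg_quality(calls):
    sums, counts = defaultdict(int), defaultdict(int)
    for base, qual in calls:
        sums[base] += qual
        counts[base] += 1
    return {b: sums[b] / counts[b] for b in sums}

# Hypothetical pileup column: two C calls and three T calls.
column = [("C", 60), ("C", 64), ("T", 62), ("T", 60), ("T", 62)]
print(per_allele_avg_quality(column))  # C -> 62.0, T -> ~61.3
```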

ernfrid

That clarifies things immensely. The column you are looking at is not the average base quality. For your C bases, the fields are:

C -> base

36 -> number of reads with this base

59.67 -> average mapping quality score

66.78 -> average base quality score

37.00 -> average single ended mapping quality

17 -> number on the plus/forward strand

19 -> number on the minus/reverse strand

0.64 -> average position on the read as a fraction (respects clipping)

0.00 -> average number of mismatches on these reads per base

12.03 -> average sum of the base qualities of mismatches in the reads

17 -> number of reads with q2 runs

0.62 -> average distance of position (as fraction of unclipped read length) to the start of the q2 run

101 -> average clipped read length of reads

0.58 -> average distance to the 3' end of the read (as fraction of unclipped read length)
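A minimal parsing sketch for one of these colon-delimited per-base strings (the field names below paraphrase the list above; they are not taken from the bam-readcount source):

```python
# Split a bam-readcount per-base metrics string into named fields.
# Field names paraphrase the list in this answer and are assumptions,
# not identifiers from the bam-readcount code itself.
FIELDS = [
    "base", "count", "avg_mapping_quality", "avg_base_quality",
    "avg_se_mapping_quality", "num_plus_strand", "num_minus_strand",
    "avg_pos_as_fraction", "avg_num_mismatches_as_fraction",
    "avg_sum_mismatch_qualities", "num_q2_containing_reads",
    "avg_distance_to_q2_start_in_q2_reads", "avg_clipped_length",
    "avg_distance_to_effective_3p_end",
]

def parse_base_metrics(s):
    return dict(zip(FIELDS, s.split(":")))

m = parse_base_metrics(
    "C:36:59.67:66.78:37.00:17:19:0.64:0.00:12.03:17:0.62:101.00:0.58"
)
print(m["avg_base_quality"])            # 66.78 -- the actual average base quality
print(m["avg_sum_mismatch_qualities"])  # 12.03 -- the column discussed above
```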


Ah, OK. That answers that question, but raises another one. What is the definition of the average sum of the base qualities of mismatches in the reads? More specifically, how is it calculated? In the example above, it's not clear to me how the reference base call C at position 15546 comes out to 12.03, while the variant T at the same position comes out to 74.08.

ernfrid

Each read is scanned, and for each base that doesn't match the reference sequence, its base quality is added to a per-read sum. That sum is then averaged across all of the reads supporting a given call. Typically there are errors toward the ends of reads, and these are lower quality. The reference base in your example, C, is not a mismatch, so it isn't included in the sum; what you're seeing for the reference-supporting reads are other mismatches that are either infrequent, low quality, or both. Your variant-supporting reads each contain a high-quality mismatch (the variant itself), so the number is higher, and any additional mismatches in those reads add to it. Hope that helps and makes sense.
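A sketch of the logic described above (this is not the actual C++ from bam-readcount; the aligned-tuple input is a simplification of real BAM records):

```python
# For each read supporting a given allele, sum the base qualities at
# positions where the read disagrees with the reference, then average
# that per-read sum across the supporting reads.
# `reads` is assumed to be a list of (read_bases, read_quals, ref_bases)
# tuples, all aligned and of equal length -- a simplification.
def avg_sum_mismatch_qualities(reads):
    per_read_sums = [
        sum(q for b, q, r in zip(bases, quals, ref) if b != r)
        for bases, quals, ref in reads
    ]
    return sum(per_read_sums) / len(per_read_sums) if per_read_sums else 0.0

# A reference-supporting read with no errors contributes 0; a
# variant-supporting read contributes the (high) quality of the variant
# base itself, plus any other mismatches it carries.
ref_read = ("ACGC", [30, 30, 30, 35], "ACGC")  # no mismatches -> 0
var_read = ("ACGT", [30, 30, 30, 37], "ACGC")  # variant T, q37 -> 37
print(avg_sum_mismatch_qualities([ref_read]))  # 0.0
print(avg_sum_mismatch_qualities([var_read]))  # 37.0
```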
