Question

Read To Consensus Quality Scores

3

Entering edit mode

13.5 years ago

Lee Katz ★ 3.1k

Hi, what is the actual formula for finding the quality score in a consensus sequence base?

I'll give a hypothetical example: in one base in an assembly the constituent read quality scores are 10, 30, 40 (or, error rates of 10^-2, 10^-4, 10^-5). If the base calls all agree with each other, then what is the consensus base's quality score?

Another example: the read bases are A, A, C with respective quality scores of 30, 40, and 10. What is the consensus base's quality score?

quality quality assembly • 7.3k views

ADD COMMENT • link updated 13.5 years ago by lh3 33k • written 13.5 years ago by Lee Katz ★ 3.1k

0

Entering edit mode

In the second example, I am assuming that A is the correct base.

ADD REPLY • link 13.5 years ago by Lee Katz ★ 3.1k

Ram · Answer 1 · 2010-11-04

I think the short answer is: it depends on the software and its heuristic.

Here are a few resources that might be useful:

Adjust quality scores from alignment and improve sequencing accuracy; this is a rather mathematical treatment, describing the use of quality scores to find consensus.
Phrap documentation; Phrap is the assembler that old-timers like me used to use, before the days of next-generation sequencing. The documentation contains a description of how adjusted consensus scores are obtained from Phred scores and other information. Search the page for the text that begins "Phrap adjusted quality values/error probabilities: Phrap computes adjusted quality values for each read on the basis of read-read confirmation information, as follows."
GAP 4 documentation; this section contains a description of the consensus algorithms in another "old-school" assembler, GAP4.
This page appears to be part of a thesis describing some software named MIRA and has a (mathematical) section titled "Consensus and consensus quality computation".

Also try Google using "consensus quality" algorithm (include the quotes); there seem to be quite a few useful hits.

Ram · Answer 2 · 2010-11-04

Michael is right in that the pure probabilistic math is not straightforward. Multiple reads are not completely independent, but each additional data point does tell you more about your probability of being correct.

A heuristic approach we used at my last job was to add probabilities, discounting each same-strand read. If you have 3 forward strand calls of 30 and 1 reverse strand call of 20, the overall probability of that base would be:

30 + 30/2 + 30/3 + 20

The Python implementation is in the 'KeithBaseCaller' class. This was used with Sanger sequencing and the heuristic is biased toward SNP false positives -- it avoids calling a potentially mismatched base as reference unless there are twice as many reference calls to SNP calls. Hopefully this is a useful starting point for your work.

Ram · Answer 3 · 2010-11-04

2

Entering edit mode

13.5 years ago

Michael 54k

I just gave this some theoretical thoguht, I didn't look at how software does it or if there are papers out there. My 50cent is there cannot be a single best solution. Let's assume for a moment that the score values want to tell you something important. They are still arbitrary scores and have no probabilistic interpretation. We can than simply treat them as observations of a random variable, say X_i, where i is the consensus base position, and we have x_1 ... x_n observations, then I would say, theoretically, the best consensus score would be E(X_i) the expected value of X_i, estimated from the x_1 ... x_n.

But: You do not know anything about the distribution of X, it might be on an interval scale, but that's all, so it is hard to estimate the Expected value, e.g. the arithmetic mean will be a very bad estimator, if the distribution isn't central or asymmetric.

So here are several valid options you might want to test:

Ignore combined scores: If the estimate is bad anyway, and there is no good interpretation of them, why bother? (my approach)
Set to a constant: Same argument as above but some software might want scores
Use a quantil-based estimate which doesn't make assumption about the distribution:
median
minimum, maximum score (of only the bases which are equal to the consensus?)
mode (value with highest frequency)
try to find out, which distribution the scores follow and model them

Comments appreciated.

Edit: I was a bit wrong with "no probabilistic intepretation". If they are phred scores http://www.phrap.com/phred/ they can have one: "The quality scores are logarithmically linked to error probabilities, as shown in the following table..."

if that is the case and the qualities are independ, the values would be additive as well the error probabilities can be simply multiplied.

ADD COMMENT • link 13.5 years ago by Michael 54k

1

Entering edit mode

nope, probabilities should be multiplied if events are idependent: P(A,B) = P(A) * P(B), thus log(P(A,B)) = log(P(A)) + log(P(B))

I don't know how to really apply this to data, but I think last link in neil's post points into the right direction.... need to do more reading sorry ;)

ADD REPLY • link 13.5 years ago by Michael 54k

0

Entering edit mode

Yes I think this is where I want to go. How do I add these error probabilities? How are incorrect calls' qualities used? Are they subtracted, and how? I don't think that phred scores themselves are additive, but rather their error probabilities must be.

ADD REPLY • link 13.5 years ago by Lee Katz ★ 3.1k

0

Entering edit mode

This makes sense, thanks! I made a spreadsheet to confirm. I'm not sure if it will paste right. Base calls that disagree with the consensus are given a negative phred score.

phred=-10(log(P))

phred   P   
10  0.1 
30  0.001   
40  0.0001  
20  0.01    
-20 100 
-40 10000

phred
total       0.0001  40

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 13.5 years ago by Lee Katz ★ 3.1k

0

Entering edit mode

Well, it didn't paste right, but I multiplied each probability as derived from the phred score and transformed it back to phred. The quality score seems reasonable.

ADD REPLY • link 13.5 years ago by Lee Katz ★ 3.1k

score 0 · Answer 4 · 2010-11-04

0

Entering edit mode

13.5 years ago

lh3 33k

For the purpose of SNP calling, you should read the SOAPsnp paper [PMID:19420381]. It describes a well accepted model, although improvements have been also made later.

Most methods used by de novo assemblers assume your sample being sequenced is haploid. Such methods are not useful for SNP calling given a diploid (e.g. human) sample. I think assemblers should really consider to use more sophisticated models to construct the consensus. We are frequently dealing with diploid samples after all.

ADD COMMENT • link 13.5 years ago by lh3 33k

0

Entering edit mode

This is a good answer, but probably for a different question.

ADD REPLY • link 13.5 years ago by Lee Katz ★ 3.1k

0

Entering edit mode

Actually I think the current methods used for de novo assembly is a little crude in comparison to the recent ones developed for SNP calling. The question is the same. Why not use more sophisticated methods?

ADD REPLY • link 13.5 years ago by lh3 33k