Question: Read To Consensus Quality Scores
3
Lee Katz wrote:

Hi, what is the actual formula for finding the quality score in a consensus sequence base?

I'll give a hypothetical example: at one base in an assembly, the constituent read quality scores are 10, 30, 40 (i.e., error rates of 10^-1, 10^-3, 10^-4). If the base calls all agree with each other, then what is the consensus base's quality score?

Another example: the read bases are A, A, C with respective quality scores of 30, 40, and 10. What is the consensus base's quality score?


In the second example, I am assuming that A is the correct base.

6
Neilfws wrote:

I think the short answer is: it depends on the software and its heuristics.

Here are a few resources that might be useful:

• Adjust quality scores from alignment and improve sequencing accuracy; this is a rather mathematical treatment, describing the use of quality scores to find consensus.

• Phrap documentation; Phrap is the assembler that old-timers like me used to use, before the days of next-generation sequencing. The documentation contains a description of how adjusted consensus scores are obtained from Phred scores and other information. Search the page for the text that begins "Phrap adjusted quality values/error probabilities: Phrap computes adjusted quality values for each read on the basis of read-read confirmation information, as follows."

• GAP 4 documentation; this section contains a description of the consensus algorithms in another "old-school" assembler, GAP4.

• This page appears to be part of a thesis describing some software named MIRA and has a (mathematical) section titled "Consensus and consensus quality computation".

Also try Google using "consensus quality" algorithm (include the quotes); there seem to be quite a few useful hits.

This brought me back on track; Phrap/Phred is where these scores originally come from.

4

Michael is right in that the pure probabilistic math is not straightforward. Multiple reads are not completely independent, but each additional data point does tell you more about your probability of being correct.

A heuristic approach we used at my last job was to add quality scores, discounting each additional same-strand read. If you have 3 forward-strand calls of 30 and 1 reverse-strand call of 20, the overall score for that base would be:

```
30 + 30/2 + 30/3 + 20
```

The Python implementation is in the 'KeithBaseCaller' class. This was used with Sanger sequencing, and the heuristic is biased toward SNP false positives -- it avoids calling a potentially mismatched base as reference unless there are twice as many reference calls as SNP calls. Hopefully this is a useful starting point for your work.
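As I read the heuristic above, a minimal sketch might look like this (the function name and input format are my own invention, not the original 'KeithBaseCaller' code):

```python
from collections import defaultdict

def consensus_score(calls):
    """Sum per-read quality scores, discounting the i-th call on the
    same strand by a factor of i+1 (1, 1/2, 1/3, ...).

    `calls` is a list of (strand, quality) pairs, e.g. ('+', 30).
    """
    per_strand = defaultdict(list)
    for strand, qual in calls:
        per_strand[strand].append(qual)
    total = 0.0
    for quals in per_strand.values():
        # The highest-quality call on each strand counts in full;
        # additional same-strand calls are discounted harmonically.
        for i, q in enumerate(sorted(quals, reverse=True)):
            total += q / (i + 1)
    return total

# Three forward-strand calls of 30 and one reverse-strand call of 20:
# 30 + 30/2 + 30/3 + 20 = 75
print(consensus_score([('+', 30), ('+', 30), ('+', 30), ('-', 20)]))
```

The same-strand discount reflects that reads from one strand share systematic errors, so a second confirmation from the same strand is worth less than one from the opposite strand.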

Well, I was wrong actually, at least a little bit.

2
Michael Dondrup wrote:

I just gave this some theoretical thought; I didn't look at how software does it or whether there are papers out there. My 50 cents: there cannot be a single best solution. Let's assume for a moment that the score values want to tell you something important. They are still arbitrary scores and have no probabilistic interpretation. We can then simply treat them as observations of a random variable, say X_i, where i is the consensus base position. Given observations x_1 ... x_n, I would say, theoretically, the best consensus score would be E(X_i), the expected value of X_i, estimated from x_1 ... x_n.

But: you do not know anything about the distribution of X. It might be on an interval scale, but that's all, so the expected value is hard to estimate; e.g. the arithmetic mean will be a very bad estimator if the distribution is asymmetric or not centered.

So here are several valid options you might want to test:

• Ignore combined scores: If the estimate is bad anyway, and there is no good interpretation of them, why bother? (my approach)
• Set to a constant: same argument as above, but some software may require scores
• Use a quantile-based estimate, which makes no assumptions about the distribution:
• median
• minimum or maximum score (of only the bases that agree with the consensus?)
• mode (the value with the highest frequency)
• Try to find out which distribution the scores follow and model it
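The quantile-based options above can be tried directly with Python's statistics module. Here is a small sketch using the hypothetical scores from the question, plus a duplicate 30 so the mode is well defined:

```python
import statistics

# Per-read qualities observed at one consensus position (hypothetical)
scores = [10, 30, 30, 40]

print(statistics.median(scores))   # distribution-free central estimate
print(min(scores), max(scores))    # pessimistic / optimistic bounds
print(statistics.mode(scores))     # most frequent value
```

None of these requires knowing the score distribution, which is the point of the argument above; they differ only in how conservative the resulting consensus score is.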

Edit: I was a bit wrong with "no probabilistic interpretation". If they are phred scores http://www.phrap.com/phred/ they can have one: "The quality scores are logarithmically linked to error probabilities, as shown in the following table..."

If that is the case and the qualities are independent, the phred values are additive, since the error probabilities can simply be multiplied.

1

Nope, probabilities should be multiplied if events are independent: P(A,B) = P(A) * P(B), thus log(P(A,B)) = log(P(A)) + log(P(B))

I don't know how to really apply this to data, but I think the last link in Neil's post points in the right direction... need to do more reading, sorry ;)

Yes I think this is where I want to go. How do I add these error probabilities? How are incorrect calls' qualities used? Are they subtracted, and how? I don't think that phred scores themselves are additive, but rather their error probabilities must be.

This makes sense, thanks! I made a spreadsheet to confirm. Base calls that disagree with the consensus are given a negative phred score: phred = -10 * log10(P)

```
phred   P
10      0.1
30      0.001
40      0.0001
20      0.01
-20     100
-40     10000

total   0.0001  -> phred 40
```

I multiplied each probability as derived from the phred score and transformed the product back to phred. The resulting quality score (40) seems reasonable.
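The spreadsheet calculation above can be reproduced in a few lines of Python (the function name is mine; the method is exactly the one described: multiply error probabilities, with disagreeing calls contributing a negative phred, then convert back):

```python
import math

def combine_phred(agreeing, disagreeing):
    """Combine per-read phred scores into a consensus phred score by
    multiplying error probabilities (phred = -10 * log10(P)).
    Calls that disagree with the consensus contribute a negative
    phred score, i.e. their error probability is inverted.
    """
    p = 1.0
    for q in agreeing:
        p *= 10 ** (-q / 10)   # e.g. Q30 -> P = 0.001
    for q in disagreeing:
        p *= 10 ** (q / 10)    # e.g. Q20 against -> factor 100
    return -10 * math.log10(p)

# Agreeing calls Q10, Q30, Q40, Q20; disagreeing calls Q20, Q40:
# P = 0.1 * 0.001 * 0.0001 * 0.01 * 100 * 10000 = 1e-4 -> Q40
print(combine_phred([10, 30, 40, 20], [20, 40]))
```

Because the probabilities are multiplied, this is equivalent to simply adding the phred scores (10 + 30 + 40 + 20 - 20 - 40 = 40), which is why the log scale is convenient here. Note this treats the reads as fully independent, which, as discussed above, they are not.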

0
lh3 wrote:

For the purpose of SNP calling, you should read the SOAPsnp paper [PMID:19420381]. It describes a well-accepted model, although improvements have also been made since.

Most methods used by de novo assemblers assume the sample being sequenced is haploid. Such methods are not useful for SNP calling on a diploid (e.g. human) sample. I think assemblers should really consider using more sophisticated models to construct the consensus; we are frequently dealing with diploid samples, after all.

This is a good answer, but probably for a different question.