Question: Read To Consensus Quality Scores
Lee Katz (Atlanta, GA) wrote, 8.7 years ago:

Hi, what is the actual formula for finding the quality score in a consensus sequence base?

I'll give a hypothetical example: at one base in an assembly, the constituent read quality scores are 10, 30, and 40 (that is, error rates of 10^-1, 10^-3, and 10^-4). If the base calls all agree with each other, then what is the consensus base's quality score?

Another example: the read bases are A, A, C with respective quality scores of 30, 40, and 10. What is the consensus base's quality score?

assembly quality • 4.6k views
written 8.7 years ago by Lee Katz

In the second example, I am assuming that A is the correct base.

written 8.7 years ago by Lee Katz
Neilfws (Sydney, Australia) wrote, 8.7 years ago:

I think the short answer is: it depends on the software and its heuristics.

Here are a few resources that might be useful:

  • Adjust quality scores from alignment and improve sequencing accuracy; this is a rather mathematical treatment, describing the use of quality scores to find consensus.

  • Phrap documentation; Phrap is the assembler that old-timers like me used to use, before the days of next-generation sequencing. The documentation contains a description of how adjusted consensus scores are obtained from Phred scores and other information. Search the page for the text that begins "Phrap adjusted quality values/error probabilities: Phrap computes adjusted quality values for each read on the basis of read-read confirmation information, as follows."

  • GAP 4 documentation; this section contains a description of the consensus algorithms in another "old-school" assembler, GAP4.

  • This page appears to be part of a thesis describing some software named MIRA and has a (mathematical) section titled "Consensus and consensus quality computation".

Also try Google using "consensus quality" algorithm (include the quotes); there seem to be quite a few useful hits.

written 8.7 years ago by Neilfws

This brought me back on track; Phrap/Phred is where these scores originally come from.

written 8.7 years ago by Michael Dondrup
Brad Chapman (Boston, MA) wrote, 8.7 years ago:

Michael is right in that the pure probabilistic math is not straightforward. Multiple reads are not completely independent, but each additional data point does tell you more about your probability of being correct.

A heuristic approach we used at my last job was to add quality scores, discounting each additional same-strand read. If you have 3 forward-strand calls of 30 and 1 reverse-strand call of 20, the overall quality score of that base would be:

30 + 30/2 + 30/3 + 20

The Python implementation is in the 'KeithBaseCaller' class. This was used with Sanger sequencing, and the heuristic is biased toward SNP false positives: it avoids calling a potentially mismatched base as reference unless there are twice as many reference calls as SNP calls. Hopefully this is a useful starting point for your work.
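(Editorial sketch: the KeithBaseCaller source itself is not reproduced in this thread, so the function and variable names below are invented; this only illustrates the same-strand discounting described above.)

```python
def consensus_quality(calls):
    """Sum quality scores, discounting repeated same-strand evidence.

    calls: list of (strand, quality) tuples, e.g. ('+', 30).
    The n-th call on a given strand (best quality first) contributes
    quality / n, so redundant same-strand reads count for less.
    """
    total = 0.0
    for strand in set(s for s, _ in calls):
        quals = sorted((q for s, q in calls if s == strand), reverse=True)
        total += sum(q / (i + 1) for i, q in enumerate(quals))
    return total

# Three forward-strand calls of 30 and one reverse-strand call of 20:
# 30 + 30/2 + 30/3 + 20 = 75
print(consensus_quality([('+', 30), ('+', 30), ('+', 30), ('-', 20)]))
```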

written 8.7 years ago by Brad Chapman

Well, I was actually wrong, at least a little bit.

written 8.7 years ago by Michael Dondrup
Michael Dondrup (Bergen, Norway) wrote, 8.7 years ago:

I just gave this some theoretical thought; I didn't look at how software does it or whether there are papers out there. My 50 cents: there cannot be a single best solution. Let's assume for a moment that the score values want to tell you something important. They are still arbitrary scores and have no probabilistic interpretation. We can then simply treat them as observations of a random variable, say X_i, where i is the consensus base position. Given observations x_1 ... x_n, I would say that, theoretically, the best consensus score would be E(X_i), the expected value of X_i, estimated from x_1 ... x_n.

But: you do not know anything about the distribution of X. It might be on an interval scale, but that's all, so it is hard to estimate the expected value; e.g. the arithmetic mean will be a very bad estimator if the distribution is asymmetric or has no central tendency.

So here are several valid options you might want to test:

  • Ignore combined scores: if the estimate is bad anyway, and there is no good interpretation of them, why bother? (my approach)
  • Set to a constant: same argument as above, but some software might want scores.
  • Use a quantile-based estimate which doesn't make assumptions about the distribution:
      • median
      • minimum or maximum score (of only the bases which agree with the consensus?)
      • mode (value with the highest frequency)
  • Try to find out which distribution the scores follow and model them.

Comments appreciated.

Edit: I was a bit wrong about "no probabilistic interpretation". If they are Phred scores they can have one: "The quality scores are logarithmically linked to error probabilities, as shown in the following table..."

If that is the case and the qualities are independent, the values would be additive, since the error probabilities can simply be multiplied.

modified 8.7 years ago • written 8.7 years ago by Michael Dondrup

Nope, probabilities should be multiplied if events are independent: P(A,B) = P(A) * P(B), thus log(P(A,B)) = log(P(A)) + log(P(B)).

I don't know how to really apply this to data, but I think the last link in Neilfws' post points in the right direction... need to do more reading, sorry ;)

written 8.7 years ago by Michael Dondrup

Yes, I think this is where I want to go. How do I add these error probabilities? How are the qualities of incorrect calls used? Are they subtracted, and how? I don't think that Phred scores themselves are additive, but rather their error probabilities must be.

written 8.7 years ago by Lee Katz

This makes sense, thanks! I made a spreadsheet to confirm (I'm not sure if it will paste right). Base calls that disagree with the consensus are given a negative Phred score: phred = -10 * log10(P).

phred   P
10      0.1
30      0.001
40      0.0001
20      0.01
-20     100
-40     10000

total   0.0001   (phred 40)

written 8.7 years ago by Lee Katz

Well, it didn't paste right, but I multiplied each probability derived from the Phred score and transformed the product back to Phred. The quality score seems reasonable.

written 8.7 years ago by Lee Katz
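(Editorial sketch: the Phred arithmetic discussed in the comments above could be written as follows. This is only the simple independence model from this thread, not any particular assembler's method, and the function names are invented.)

```python
import math

def phred_to_p(q):
    """Phred score -> error probability: P = 10^(-q/10)."""
    return 10 ** (-q / 10)

def p_to_phred(p):
    """Error probability -> Phred score: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def consensus_phred(calls, consensus_base):
    """calls: list of (base, phred) tuples. Calls disagreeing with the
    consensus contribute a negated Phred score, i.e. the reciprocal of
    their error probability is multiplied in."""
    p = 1.0
    for base, q in calls:
        p *= phred_to_p(q if base == consensus_base else -q)
    return p_to_phred(p)

# The spreadsheet example: agreeing scores 10, 30, 40, 20 and
# disagreeing scores 20, 40 -> 10 + 30 + 40 + 20 - 20 - 40 = 40
calls = [('A', 10), ('A', 30), ('A', 40), ('A', 20), ('C', 20), ('C', 40)]
print(consensus_phred(calls, 'A'))  # approximately 40
```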
lh3 (United States) wrote, 8.7 years ago:

For the purpose of SNP calling, you should read the SOAPsnp paper [PMID:19420381]. It describes a well-accepted model, although improvements have also been made since.

Most methods used by de novo assemblers assume the sample being sequenced is haploid. Such methods are not useful for SNP calling given a diploid (e.g. human) sample. I think assemblers should really consider using more sophisticated models to construct the consensus; we are frequently dealing with diploid samples, after all.

modified 8.7 years ago • written 8.7 years ago by lh3

This is a good answer, but probably for a different question.

written 8.7 years ago by Lee Katz

Actually, I think the current methods used for de novo assembly are a little crude in comparison to the recent ones developed for SNP calling. The question is the same: why not use more sophisticated methods?

written 8.7 years ago by lh3