Question

How Log-Odds Scores For Amino Acid Substitutions Matrices Should Be Calculated?

10

Entering edit mode

12.4 years ago

Jan Kosinski ★ 1.6k

I got puzzled by some inconsistencies between published equations for calculating log-odds scores for substitutions matrices such as Blosum62.

In the seminal paper deriving the Blosum62 matrix the odds ratio used to calculate the log-odds score is: qij/eij where qij is observed probability of occurrence of the ij pair, and eij is expected probability of occurrence of the ij pair. According to this work, eij = PiPj for i = j and PiPj + PjPi = 2PiPj. This is derived from basic theory of probability.

In many other works, however, starting I think from Altschul 1991 the equation used for odds ratio is slightly different: qij/PiPj, that is the 2 in the denominator disappeared. This new equation has been derived from the statistical framework that Karlin and Altschul have developed.

What puzzles me that although both equations are similar, the Altschul ratio is exactly twice the value of Henikoff's ratio.

I know that log-odds are rescaled before putting to the substitution table by a factor of 1/lambda, but missing log(1/2) is not rescaling, it means that the final rescaled log-odds value from Altschul is bigger than Henikoff's by x/lambda (x depends on the base of the logarithm, which does not matter and is different in different works).

And this is the Altschul ratio that is more widely used in the following years up to now. The Henikoff's equation is, however, still used in some of the papers that derive purpose-specific substitution tables (e.g. in 'Structure-derived substitution matrices for alignment of distantly related sequences' and 'AN AMINO ACID SUBSTITUTION MATRIX FOR PROTEIN CONFORMATION IDENTIFICATION').

What's even more puzzling, that several articles claim that Blosum62 has been derived from Altschul-like odds ratio, forgetting the "2" in the denominator:

Where did the BLOSUM62 alignment score matrix come from?

http://en.wikipedia.org/wiki/BLOSUM

Is everything OK here? Do the ratios mean the same thing? Isn't it wrong saying that Blosum62 has been derived from Altschul-like ratio?

sequence sequence alignment scoring • 13k views

ADD COMMENT • link updated 10.1 years ago by Biostar 20 • written 12.4 years ago by Jan Kosinski ★ 1.6k

Ram · Answer 1 · 2011-12-01

2

Entering edit mode

12.4 years ago

Damian Kao 16k

2PiPj for when i is not equal j doesn't seem right to me even though it is in the paper.

In the paper, he gives an example where there are 9 'A' residues and 1 'S' residue in a column of 10 sequences.

He calculates there are 45 possible pairs in the 10 sequences (10 + 9 + 8 + 7 + 6...). He then says observed probability (Qij) of A-S or S-A to be 9 / 45 and A-A is 36/45.

If he calculated 45 possible pairs for his Qij, then order of the sequence in the pair does not matter. Seq1 vs Seq10 is same as Seq10 vs Seq1.

He then goes on to calculate the expected probabilities (Pi and Pj) which is 9/10 for 'A' and 1/10 for 'S'.

He says when i = j, the expected probability (Eij) is Pi * Pj. When i != j, expected probability is Pi * Pj * 2. In his example, to get Eij for A-S, you would: 9/10 * 1/10 * 2. Basically, the probability of A-S + the probability of S-A.

So for calculating the expected probabilities, the order suddenly matters now?

The Pi*Pj without multiplying by two, described by [?]NCBI Blast[?] seems more correct to me.

On a semi-related note, you might find this interesting: http://www.nature.com/nbt/journal/v26/n3/full/nbt0308-274.html

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 12.4 years ago by Damian Kao 16k

0

Entering edit mode

Isn't that Eij = PiPj + PjPi = 2PiPj actually means that the order does not matter? It is essentially equivalent to rolling a loaded dice twice ... http://www.math.hawaii.edu/~ramsey/Probability/TwoDice.html (Our case is the last paragraph). In our case, we have two events in which the outcome "i aligned with j in any order" pair can occur (i,j) and (j,i). The probability of occurring of this outcome is equal to probability of seeing (i,j) or (j,i), that is PiPj + PjPi...

ADD REPLY • link 12.4 years ago by Jan Kosinski ★ 1.6k

0

Entering edit mode

I see your point. You are saying by calculating both reciprocal combinations (A-S and S-A), we are accounting for both orders, making the order not matter. But aren't we giving importance to order just by claiming A-S and S-A to be separate events? Don't we need to also multiply A-A by 2? A-A from Seq1 vs Seq10 would be considered a separate event from A-A from Seq10 vs Seq1. Honestly, I am not completely sure. But it's an interesting topic. Thanks for bringing this up.

ADD REPLY • link 12.4 years ago by Damian Kao 16k

0

Entering edit mode

You cannot distinguish between A-A from Seq1 vs Seq10 and A-A from Seq10 vs Seq1, this is a single event. Like when rolling a dice twice there is only one event of rolling (5,5) in the sample space.

ADD REPLY • link 12.4 years ago by Jan Kosinski ★ 1.6k

0

Entering edit mode

And to be consistent with proper vocabulary, I think have misused "outcome" instead of "event" and vice versa. So A-A, A-S, S-A are outcomes and the event is "i aligned with j in any order". I had my Math lessons in Polish, sorry ;-)

ADD REPLY • link 12.4 years ago by Jan Kosinski ★ 1.6k

0

Entering edit mode

I guess the dice analogy isn't very relevant to our situation. We are really talking about two separate types of combinations. There are combinations in terms of the residues and combinations in terms of the sequences. The dice analogy is looking at combination of the resulting dice rolls. We should really roll 10 separate dice once instead of rolling 1 dice 10 times.

ADD REPLY • link 12.4 years ago by Damian Kao 16k

0

Entering edit mode

From the dice example: "When rolling two dice, distinguish between them in some way: a first one and second one, a left and a right, a red and a green, etc.". So in that example they also have "combinations in terms of the residues" (residues = dice integers) and "combinations in terms of the sequences" (sequences = differently colored separate dice). In our example of alignment of aa pairs we in fact roll two separate differently colored dice once. The dice have 20 faces, each face is aa type, and each face is loaded so to make a face occur with Pi.

ADD REPLY • link 12.4 years ago by Jan Kosinski ★ 1.6k

score 1 · Answer 2 · 2011-12-09

Everything depends if the order of selection is taken into account.

If not then:[?] a pair (ij) is the same event that a pair(ji), and in such a case the factor 2 must be used.

Whiles, the order of selection is important then:[?] a pair (ij) is NOT identical (is a separate event) to a pair(ji), and in such a case:

Pij=pipj[?] [?]and:[?][?] Pji=pjpi

So,[?]

In these two papers a pair (ij) might have been defined in a different ways