I got puzzled by some inconsistencies between published equations for calculating log-odds scores for substitutions matrices such as Blosum62.

In the seminal paper deriving the Blosum62 matrix the odds ratio used to calculate the log-odds score is: qij/eij where qij is observed probability of occurrence of the ij pair, and eij is expected probability of occurrence of the ij pair. According to this work, eij = PiPj for i = j and PiPj + PjPi = 2PiPj. This is derived from basic theory of probability.

In many other works, however, starting I think from Altschul 1991 the equation used for odds ratio is slightly different: qij/PiPj, that is the 2 in the denominator disappeared. This new equation has been derived from the statistical framework that Karlin and Altschul have developed.

What puzzles me that although both equations are similar, the Altschul ratio is exactly twice the value of Henikoff's ratio.

I know that log-odds are rescaled before putting to the substitution table by a factor of 1/lambda, but missing log(1/2) is not rescaling, it means that the final rescaled log-odds value from Altschul is bigger than Henikoff's by x/lambda (x depends on the base of the logarithm, which does not matter and is different in different works).

And this is the Altschul ratio that is more widely used in the following years up to now. The Henikoff's equation is, however, still used in some of the papers that derive purpose-specific substitution tables (e.g. in 'Structure-derived substitution matrices for alignment of distantly related sequences' and 'AN AMINO ACID SUBSTITUTION MATRIX FOR PROTEIN CONFORMATION IDENTIFICATION').

What's even more puzzling, that several articles claim that Blosum62 has been derived from Altschul-like odds ratio, forgetting the "2" in the denominator:

Where did the BLOSUM62 alignment score matrix come from?

http://en.wikipedia.org/wiki/BLOSUM

Is everything OK here? Do the ratios mean the same thing? Isn't it wrong saying that Blosum62 has been derived from Altschul-like ratio?

Isn't that Eij = PiPj + PjPi = 2PiPj actually means that the order does not matter? It is essentially equivalent to rolling a loaded dice twice ... http://www.math.hawaii.edu/~ramsey/Probability/TwoDice.html (Our case is the last paragraph). In our case, we have two events in which the outcome "i aligned with j in any order" pair can occur (i,j) and (j,i). The probability of occurring of this outcome is equal to probability of seeing (i,j) or (j,i), that is PiPj + PjPi...

I see your point. You are saying by calculating both reciprocal combinations (A-S and S-A), we are accounting for both orders, making the order not matter. But aren't we giving importance to order just by claiming A-S and S-A to be separate events? Don't we need to also multiply A-A by 2? A-A from Seq1 vs Seq10 would be considered a separate event from A-A from Seq10 vs Seq1. Honestly, I am not completely sure. But it's an interesting topic. Thanks for bringing this up.

You cannot distinguish between A-A from Seq1 vs Seq10 and A-A from Seq10 vs Seq1, this is a single event. Like when rolling a dice twice there is only one event of rolling (5,5) in the sample space.

And to be consistent with proper vocabulary, I think have misused "outcome" instead of "event" and vice versa. So A-A, A-S, S-A are outcomes and the event is "i aligned with j in any order". I had my Math lessons in Polish, sorry ;-)

I guess the dice analogy isn't very relevant to our situation. We are really talking about two separate types of combinations. There are combinations in terms of the residues and combinations in terms of the sequences. The dice analogy is looking at combination of the resulting dice rolls. We should really roll 10 separate dice once instead of rolling 1 dice 10 times.

From the dice example: "When rolling two dice, distinguish between them in some way: a first one and second one, a left and a right, a red and a green, etc.". So in that example they also have "combinations in terms of the residues" (residues = dice integers) and "combinations in terms of the sequences" (sequences = differently colored separate dice). In our example of alignment of aa pairs we in fact roll two separate differently colored dice once. The dice have 20 faces, each face is aa type, and each face is loaded so to make a face occur with Pi.