Question: How Log-Odds Scores For Amino Acid Substitutions Matrices Should Be Calculated?
gravatar for Jan Kosinski
9.3 years ago by
Jan Kosinski1.6k
Jan Kosinski1.6k wrote:

I got puzzled by some inconsistencies between published equations for calculating log-odds scores for substitutions matrices such as Blosum62.

In the seminal paper deriving the Blosum62 matrix the odds ratio used to calculate the log-odds score is: qij/eij where qij is observed probability of occurrence of the ij pair, and eij is expected probability of occurrence of the ij pair. According to this work, eij = PiPj for i = j and PiPj + PjPi = 2PiPj. This is derived from basic theory of probability.

In many other works, however, starting I think from Altschul 1991 the equation used for odds ratio is slightly different: qij/PiPj, that is the 2 in the denominator disappeared. This new equation has been derived from the statistical framework that Karlin and Altschul have developed.

What puzzles me that although both equations are similar, the Altschul ratio is exactly twice the value of Henikoff's ratio.

I know that log-odds are rescaled before putting to the substitution table by a factor of 1/lambda, but missing log(1/2) is not rescaling, it means that the final rescaled log-odds value from Altschul is bigger than Henikoff's by x/lambda (x depends on the base of the logarithm, which does not matter and is different in different works).

And this is the Altschul ratio that is more widely used in the following years up to now. The Henikoff's equation is, however, still used in some of the papers that derive purpose-specific substitution tables (e.g. in 'Structure-derived substitution matrices for alignment of distantly related sequences' and 'AN AMINO ACID SUBSTITUTION MATRIX FOR PROTEIN CONFORMATION IDENTIFICATION').

What's even more puzzling, that several articles claim that Blosum62 has been derived from Altschul-like odds ratio, forgetting the "2" in the denominator:

Where did the BLOSUM62 alignment score matrix come from?

Is everything OK here? Do the ratios mean the same thing? Isn't it wrong saying that Blosum62 has been derived from Altschul-like ratio?

sequence alignment scoring • 8.5k views
ADD COMMENTlink modified 6.9 years ago by Biostar ♦♦ 20 • written 9.3 years ago by Jan Kosinski1.6k
gravatar for Damian Kao
9.2 years ago by
Damian Kao15k
Damian Kao15k wrote:

2PiPj for when i is not equal j doesn't seem right to me even though it is in the paper.

In the paper, he gives an example where there are 9 'A' residues and 1 'S' residue in a column of 10 sequences.

He calculates there are 45 possible pairs in the 10 sequences (10 + 9 + 8 + 7 + 6...). He then says observed probability (Qij) of A-S or S-A to be 9 / 45 and A-A is 36/45.

If he calculated 45 possible pairs for his Qij, then order of the sequence in the pair does not matter. Seq1 vs Seq10 is same as Seq10 vs Seq1.

He then goes on to calculate the expected probabilities (Pi and Pj) which is 9/10 for 'A' and 1/10 for 'S'.

He says when i = j, the expected probability (Eij) is Pi * Pj. When i != j, expected probability is Pi * Pj * 2. In his example, to get Eij for A-S, you would: 9/10 * 1/10 * 2. Basically, the probability of A-S + the probability of S-A.

So for calculating the expected probabilities, the order suddenly matters now?

The Pi*Pj without multiplying by two, described by [?]NCBI Blast[?] seems more correct to me.

On a semi-related note, you might find this interesting:

ADD COMMENTlink modified 17 months ago by Ram32k • written 9.2 years ago by Damian Kao15k

Isn't that Eij = PiPj + PjPi = 2PiPj actually means that the order does not matter? It is essentially equivalent to rolling a loaded dice twice ... (Our case is the last paragraph). In our case, we have two events in which the outcome "i aligned with j in any order" pair can occur (i,j) and (j,i). The probability of occurring of this outcome is equal to probability of seeing (i,j) or (j,i), that is PiPj + PjPi...

ADD REPLYlink written 9.2 years ago by Jan Kosinski1.6k

I see your point. You are saying by calculating both reciprocal combinations (A-S and S-A), we are accounting for both orders, making the order not matter. But aren't we giving importance to order just by claiming A-S and S-A to be separate events? Don't we need to also multiply A-A by 2? A-A from Seq1 vs Seq10 would be considered a separate event from A-A from Seq10 vs Seq1. Honestly, I am not completely sure. But it's an interesting topic. Thanks for bringing this up.

ADD REPLYlink written 9.2 years ago by Damian Kao15k

You cannot distinguish between A-A from Seq1 vs Seq10 and A-A from Seq10 vs Seq1, this is a single event. Like when rolling a dice twice there is only one event of rolling (5,5) in the sample space.

ADD REPLYlink written 9.2 years ago by Jan Kosinski1.6k

And to be consistent with proper vocabulary, I think have misused "outcome" instead of "event" and vice versa. So A-A, A-S, S-A are outcomes and the event is "i aligned with j in any order". I had my Math lessons in Polish, sorry ;-)

ADD REPLYlink written 9.2 years ago by Jan Kosinski1.6k

I guess the dice analogy isn't very relevant to our situation. We are really talking about two separate types of combinations. There are combinations in terms of the residues and combinations in terms of the sequences. The dice analogy is looking at combination of the resulting dice rolls. We should really roll 10 separate dice once instead of rolling 1 dice 10 times.

ADD REPLYlink written 9.2 years ago by Damian Kao15k

From the dice example: "When rolling two dice, distinguish between them in some way: a first one and second one, a left and a right, a red and a green, etc.". So in that example they also have "combinations in terms of the residues" (residues = dice integers) and "combinations in terms of the sequences" (sequences = differently colored separate dice). In our example of alignment of aa pairs we in fact roll two separate differently colored dice once. The dice have 20 faces, each face is aa type, and each face is loaded so to make a face occur with Pi.

ADD REPLYlink written 9.2 years ago by Jan Kosinski1.6k
gravatar for Marcinpawlowski
9.2 years ago by
Marcinpawlowski10 wrote:

Everything depends if the order of selection is taken into account.

If not then:[?] a pair (ij) is the same event that a pair(ji), and in such a case the factor 2 must be used.

Whiles, the order of selection is important then:[?] a pair (ij) is NOT identical (is a separate event) to a pair(ji), and in such a case:

Pij=pipj[?] [?]and:[?][?] Pji=pjpi


In these two papers a pair (ij) might have been defined in a different ways

ADD COMMENTlink modified 9.2 years ago • written 9.2 years ago by Marcinpawlowski10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1343 users visited in the last hour