Entropy From A Multiple Sequence Alignment With Gaps
10.5 years ago

Shannon's entropy is a quantitative measure of uncertainty in a data set. In most instances it is possible to calculate entropy scores for a multiple sequence alignment (MSA). What I would like to understand is how to handle gaps and how to correct for them in the MSA. Does it matter? (I think it does.) How do you interpret the scores in the presence of gaps?

10.5 years ago
Bilouweb ★ 1.1k

When you calculate Shannon's entropy, you consider an alphabet of 21 symbols (20 amino acids and a gap symbol). The problem is that a column full of gaps then looks perfectly conserved (its entropy is zero).
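To make the problem concrete, here is a minimal sketch of per-column Shannon entropy in which the gap is treated as an ordinary 21st symbol. The function name `column_entropy` and the use of `-` as the gap character are assumptions for illustration, not taken from any particular tool.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of one non-empty alignment column.

    The gap character '-' is counted like any other symbol, which is
    the naive 21-symbol treatment described above.
    """
    n = len(column)
    counts = Counter(column)
    # Sum of -p * log2(p) over the symbols actually observed.
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


all_gaps = "----"
all_alanine = "AAAA"
mixed = "AC-G"

print(column_entropy(all_gaps))     # same value as the all-alanine column
print(column_entropy(all_alanine))
print(column_entropy(mixed))
```

A column of only gaps and a column of only alanine get the same entropy, so entropy alone cannot tell a conserved residue from a conserved gap.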

I found a good way to take gaps into account in William Valdar's paper, Scoring Residue Conservation.

I calculate the entropy with a function that takes into account sequence weights and amino acid frequencies (t(x), where x is a column). Then I calculate the proportion of gaps in the column (g(x)), and finally my score is S = (1 - t(x)) * (1 - g(x)).
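A rough sketch of that score, under some loud assumptions: t(x) is taken here to be the plain, unweighted Shannon entropy of the non-gap residues, divided by log2(20) so it lies in [0, 1]. The sequence weighting and background frequencies mentioned above are left out. The names `normalized_entropy`, `gap_fraction` and `conservation_score` are hypothetical.

```python
import math
from collections import Counter

GAP = "-"  # assumed gap character


def normalized_entropy(column, alphabet_size=20):
    """t(x): entropy of the non-gap residues, scaled to [0, 1].

    Simplification: no sequence weights and no background frequencies.
    """
    residues = [c for c in column if c != GAP]
    if not residues:
        return 0.0  # nothing to measure; the gap term handles this case
    n = len(residues)
    counts = Counter(residues)
    h = sum(-(c / n) * math.log2(c / n) for c in counts.values())
    return h / math.log2(alphabet_size)


def gap_fraction(column):
    """g(x): proportion of gap characters in a non-empty column."""
    return column.count(GAP) / len(column)


def conservation_score(column):
    """S = (1 - t(x)) * (1 - g(x)); higher means more conserved."""
    return (1 - normalized_entropy(column)) * (1 - gap_fraction(column))


for col in ("AAAA", "AA--", "----", "ACDE"):
    print(col, round(conservation_score(col), 3))
```

The gap term pulls the score of a gappy column towards 0 even when the remaining residues agree, so an all-gap column scores 0 instead of looking perfectly conserved.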


With the function from Valdar, you can also take into account the stereochemical nature of amino acids.

10.5 years ago
Dave Lunt ★ 2.0k

There is quite a nice description on page 119 of the BioEdit documentation PDF. In short, you can either define how many character states are possible at that position, or work from the number of observed character states. In either case gaps are fine; the second approach also deals with ambiguity codes etc.

In terms of interpreting the scores (from the same pdf)...

"An entropy plot can give an idea of the amount of variability through a column in an alignment. It is a measure of the lack of “information content” at each position in the alignment. More accurately, it is a measure of the lack of predictability for an alignment position. If there are x sequences in an alignment (say x = 40 sequences) of DNA sequences, and at position y (say y = position 5) there is an ‘A’ in all sequences, we can assume we have a lot of information for position 5 and chances are if we had to guess at the base at position 5 of another homologous sequence, we would be correct to guess ‘A’. We have maximum “information” for position 5, and the entropy is 0. Now, if there are four possibilities for each position (A, G, C or T) and each occurs at position 5 with a frequency of 0.25 (equally probable), then our information content (how well we could predict the position for a new incoming sequence) has been reduced to 0, and the entropy is at maximum variability."
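The two extremes in that quote can be checked with a few lines of Python. This is only a sketch: the "information content" is taken here as log2(number of states) minus the entropy, with no small-sample correction, which is an assumption and not necessarily what BioEdit computes. The function names are made up for illustration.

```python
import math
from collections import Counter


def entropy_bits(column):
    """Shannon entropy (bits) of a non-empty alignment column."""
    n = len(column)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(column).values())


def information_bits(column, n_states=4):
    """Assumed definition: maximum possible entropy minus observed entropy."""
    return math.log2(n_states) - entropy_bits(column)


conserved = "A" * 40   # 40 sequences, all 'A' at this position
mixed = "ACGT" * 10    # each base at frequency 0.25

print(entropy_bits(conserved), information_bits(conserved))
print(entropy_bits(mixed), information_bits(mixed))
```

The fully conserved column has entropy 0 and the maximum 2 bits of information; the evenly mixed column has the maximum entropy of 2 bits and no information.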


Sorry, I was not clear. What I meant is interpreting the scores in the presence of gaps.