Question: python calculation of protein multiple sequence alignment
0
ttubiana20 wrote:

Dear all :-)

I am trying to compute the conservation score of each position of a protein multiple sequence alignment. I already used the Shannon entropy, but I am not satisfied with it since it is not similarity-based but identity only. So I thought that maybe it could be a good idea to use a substitution matrix. I tried to implement two methods:

1. Protein–Protein Interfaces: Analysis of Amino Acid Conservation in Homodimers (https://doi.org/10.1002/1097-0134(20010101)42:1%3C108::AID-PROT110%3E3.0.CO;2-O)
2. the "sum-of-pairs" method from AL2CO (https://doi.org/10.1093/bioinformatics/17.8.700)

The first method gives me wrong results (maybe because I used BLOSUM62 instead of PET91 used in the article...). The second method (AL2CO) doesn't give me satisfying results.

In practice, I would like a score in [0,1] with some sensitivity to sequence redundancy. I have a workflow in python that process my alignment and calculate properties, so I try as much as possible to avoid external tools...

Do you have some bits of advice or maybe a hidden magick package that I didn't found :-)?

Best,
Thibault.

alignment python conservation • 106 views
modified 27 days ago by Macspider3.2k • written 28 days ago by ttubiana20
4
Macspider3.2k wrote:

I think what you're looking for is contained in `scipy.spatial.distance`:

https://docs.scipy.org/doc/scipy/reference/spatial.distance.html

Among these measurements, I would suggest you to have a look at the Jaccard distance. It fits your requirement of the [0,1] range, and it's pretty easy to calculate. A Jaccard distance is calculated as the intersection over the union of two sets.

Since you're calculating distance between protein sequences in an MSA, you could consider each protein of a pairwise comparison as a set of ordered characters. The intersection of the two sets is the equal characters (i.e. the equal amino acids at the same position) between the two sets. The union is the total length of the pairwise alignment, gaps included.

EDIT:

One clarification: this calculation for J gives you a value that is closer to 1 the closer sequences are. So if you are measuring distance, you will have to do 1-J.

Thank you very much for your answer :-) I will have a look to it and try to implement it.
I still need a "similarity" based score, but I didn't think of Distance scipy package. I can maybe find a way to use one of the methods 🤔 combine to a BLOSUM matrix...

My second "easy" option is to use a 10 group classification for amino acids as described here: http://thegrantlab.org/bio3d/reference/entropy.html. `Hydrophobic/Aliphatic [V,I,L,M], Aromatic [F,W,Y], Ser/Thr [S,T], Polar [N,Q], Positive [H,K,R], Negative [D,E], Tiny [A,G], Proline [P], Cysteine [C], and Gaps [-,X].`

Best, Thibault.