python calculation of protein multiple sequence alignment
2
0
Entering edit mode
2.3 years ago
ttubiana ▴ 30

Dear all :-)

I am trying to compute the conservation score of each position of a protein multiple sequence alignment. I already used the Shannon entropy, but I am not satisfied with it since it is not similarity-based but identity only. So I thought that maybe it could be a good idea to use a substitution matrix. I tried to implement two methods:

1. Proteinâ€“Protein Interfaces: Analysis of Amino Acid Conservation in Homodimers (https://doi.org/10.1002/1097-0134(20010101)42:1%3C108::AID-PROT110%3E3.0.CO;2-O)
2. the "sum-of-pairs" method from AL2CO (https://doi.org/10.1093/bioinformatics/17.8.700)

The first method gives me wrong results (maybe because I used BLOSUM62 instead of PET91 used in the article...). The second method (AL2CO) doesn't give me satisfying results.

In practice, I would like a score in [0,1] with some sensitivity to sequence redundancy. I have a workflow in python that process my alignment and calculate properties, so I try as much as possible to avoid external tools...

Do you have some bits of advice or maybe a hidden magick package that I didn't found :-)?

Best,
Thibault.

python alignment conservation • 2.5k views
4
Entering edit mode
2.3 years ago
Macspider ★ 3.6k

I think what you're looking for is contained in scipy.spatial.distance:

https://docs.scipy.org/doc/scipy/reference/spatial.distance.html

Among these measurements, I would suggest you to have a look at the Jaccard distance. It fits your requirement of the [0,1] range, and it's pretty easy to calculate. A Jaccard distance is calculated as the intersection over the union of two sets.

Since you're calculating distance between protein sequences in an MSA, you could consider each protein of a pairwise comparison as a set of ordered characters. The intersection of the two sets is the equal characters (i.e. the equal amino acids at the same position) between the two sets. The union is the total length of the pairwise alignment, gaps included.

EDIT:

One clarification: this calculation for J gives you a value that is closer to 1 the closer sequences are. So if you are measuring distance, you will have to do 1-J.

0
Entering edit mode

Thank you very much for your answer :-) I will have a look to it and try to implement it.
I still need a "similarity" based score, but I didn't think of Distance scipy package. I can maybe find a way to use one of the methods đź¤” combine to a BLOSUM matrix...

My second "easy" option is to use a 10 group classification for amino acids as described here: http://thegrantlab.org/bio3d/reference/entropy.html. Hydrophobic/Aliphatic [V,I,L,M], Aromatic [F,W,Y], Ser/Thr [S,T], Polar [N,Q], Positive [H,K,R], Negative [D,E], Tiny [A,G], Proline [P], Cysteine [C], and Gaps [-,X].

Best, Thibault.

1
Entering edit mode
17 months ago

Hi,

maybe this may help

https://github.com/Cantalapiedra/msa_conservation_index

It is similarity based and yields values between 0 and 1.

Check

https://github.com/Cantalapiedra/msa_conservation_index#how-it-works

Best

0
Entering edit mode

Hey cool script, thanks for making it !