Similarity Score Of Multiple Sequence Alignment
10.8 years ago
Ananth ▴ 70

Hello,

I have a file with protein sequences for which I would like to know the similarity score of the multiple sequence alignment.

I have aligned these sequences using ClustalW, but all I get is the pairwise identity score.

I am not looking for a pairwise identity or similarity score, but for a similarity score for the whole multiple sequence alignment, so that I can conclude that "this group of sequences are x% similar with each other".

Is there any tool that gives such a measure of similarity for a set of sequences? Or any method for calculating this score?

Thank you, Ananth

multiple • 20k views

The similarity score depends on the substitution matrix used. So you should never say "this group of sequences are x% similar with each other" but rather "this group of sequences are x% similar with each other given this specific substitution matrix". Also, check that you are doing a global alignment and not a local one.
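To illustrate why the matrix has to be stated, here is a minimal sketch in which the same aligned pair gets two different "% similar" values under two scoring schemes. Both schemes (`identity_only` and the residue `GROUPS`) are hypothetical stand-ins invented for this example, not a real substitution matrix such as BLOSUM or PAM.

```python
def percent_similar(seq_a, seq_b, similar):
    """Percent of aligned, non-gap positions that `similar` counts as similar."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    hits = sum(1 for a, b in pairs if similar(a, b))
    return 100.0 * hits / len(pairs)


def identity_only(a, b):
    # Hypothetical scheme 1: only identical residues count as similar.
    return a == b


# Hypothetical scheme 2: residues in the same rough group count as similar.
# These groups are an assumption made for illustration only.
GROUPS = [set("ILVM"), set("KRH"), set("DE"), set("ST"), set("FYW")]


def same_group(a, b):
    return a == b or any(a in g and b in g for g in GROUPS)


seq_a = "MKVLDE"
seq_b = "MRILES"
print(round(percent_similar(seq_a, seq_b, identity_only), 1))  # 2 of 6 positions
print(round(percent_similar(seq_a, seq_b, same_group), 1))     # 5 of 6 positions
```

The alignment is unchanged between the two calls; only the definition of "similar" differs, which is why a reported percentage is only meaningful together with the matrix it was computed with.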


Thank you Giovanni,

As you correctly pointed out, the score is relative to a specific substitution matrix and a global alignment. Given those, is there a way to calculate this similarity score for an MSA?


How do I run MstatX?

I tried this command, but it does not work:

Mstat -m test.fa -s trident

Could you please give me an example command?


./mstatx -i test.fa -s trident -g

10.8 years ago
Bilouweb ★ 1.1k

I have made a tool to derive statistics from a multiple alignment. It gives a score for each column of the multiple alignment, given a substitution matrix. It is available on GitHub: MstatX (use the -s trident option).

Hope it helps. If you need any help, just ask.

EDIT: The question "How to measure the conservation (or similarity) in a multiple alignment" is quite difficult, as discussed in these questions: "Conservation Score Of Amino Acid Positions In Human Proteins" and "Entropy From A Multiple Sequence Alignment With Gaps".

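A first measure can be calculated with the following algorithm (the well-known sum of pairs):

Msa msa;                                  // msa[c][r]: residue in column c, row (sequence) r
float total = 0.0;
int nb_pairs = nb_row * (nb_row - 1) / 2; // number of sequence pairs per column
for (int c = 0; c < nb_column; ++c) {
    float sum = 0.0;
    // Add up the scores of every pair of residues in column c.
    for (int r = 0; r < nb_row - 1; ++r) {
        for (int s = r + 1; s < nb_row; ++s) {
            sum += similarity_score(msa[c][r], msa[c][s]);
        }
    }
    total += sum / nb_pairs;              // mean pairwise score of column c
}
total /= nb_column;                       // mean over all columns


Here similarity_score returns the score of a residue pair in your scoring (substitution) matrix.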
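For readers who want to try the sum-of-pairs idea directly, here is a minimal runnable sketch in Python. It assumes the alignment is a list of equal-length strings (one per sequence, so rows rather than columns come first). `toy_score` is a hypothetical placeholder for a real substitution matrix, and scoring every pair that involves a gap as 0 is an assumption of this sketch, not something stated above.

```python
from itertools import combinations


def toy_score(a, b):
    """Hypothetical scoring function standing in for a real substitution matrix."""
    if a == "-" or b == "-":
        return 0.0  # assumption: any pair involving a gap scores 0
    return 1.0 if a == b else 0.0


def sum_of_pairs(msa, score=toy_score):
    """Mean over columns of the mean pairwise score within each column."""
    n_rows = len(msa)
    if n_rows < 2:
        raise ValueError("need at least two aligned sequences")
    n_cols = len(msa[0])
    if n_cols == 0 or any(len(row) != n_cols for row in msa):
        raise ValueError("sequences must be non-empty and of equal length")
    n_pairs = n_rows * (n_rows - 1) // 2
    total = 0.0
    for c in range(n_cols):
        column = [row[c] for row in msa]
        col_sum = sum(score(a, b) for a, b in combinations(column, 2))
        total += col_sum / n_pairs  # mean pairwise score of this column
    return total / n_cols           # mean over all columns


alignment = ["MKV-L", "MKVAL", "MRVAL"]
print(round(sum_of_pairs(alignment), 3))
```

With an identity-style score like `toy_score`, the result is an average pairwise identity over the alignment. Passing a function that looks up values in a real matrix changes the scale of the result, so it is then no longer a percentage.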


Thanks bilouweb! It was helpful. :)


Is it possible for MstatX to output a final MSA score? When I ran it, I could only find ways to output per-column scores. Thank you for the software package!


Thanks for using MstatX! I can add a total score computed as the mean of all column scores. I will also add a DNA matrix for multiple alignments of DNA.


Thanks! I think it would be helpful to have a total score too, similar to the one that Clustal or MUSCLE would output.


What is the difference between the wentropy and trident statistics?

10.8 years ago

I think the answer is "no", because I cannot think of a meaningful way to define the % identity of a multiple sequence alignment.

If one defines it as the fraction of aligned positions that are identical across all sequences, the % identity will automatically get lower the more sequences you have in the alignment. It would thus not be comparable between different alignments.
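A small sketch of this effect, using made-up toy sequences: each of the later sequences is 90% identical to the first one, yet the fraction of columns that are identical across all sequences keeps dropping as sequences are added. The function name and the data are invented for illustration.

```python
def fully_conserved_fraction(msa):
    """Fraction of columns in which every sequence has the same character."""
    n_cols = len(msa[0])
    conserved = sum(
        1 for c in range(n_cols) if len({row[c] for row in msa}) == 1
    )
    return conserved / n_cols


# Toy data: every later sequence differs from the first at exactly one position.
seqs = [
    "ACDEFGHIKL",
    "ACDEFGHIKV",
    "ACDEYGHIKL",
    "SCDEFGHIKL",
]

for n in range(2, len(seqs) + 1):
    print(n, "sequences:", fully_conserved_fraction(seqs[:n]))
```

Because each new sequence can only remove columns from the fully conserved set, never add them, this measure cannot increase as the alignment grows.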

10.8 years ago

It depends on what you mean by 'measure of similarity'. A PAM value, if it is a protein alignment? Global % identity?

Look at Sean Eddy's tools. alistat (built from the SQUID package) might meet your needs. It is also installed as part of the HMMER package.


Thank you Alastair

For a pairwise sequence alignment, ClustalW reports the sequence identity as a score showing the percentage identity shared between the two sequences.

By "measure of similarity" I meant: instead of a score for two sequences, can we have a score that gives an idea of the similarity across the whole multiple sequence alignment?