Make matrix of protein pairwise identities/similarities from multiple protein sequences
4
1
4.6 years ago
al-ash ▴ 190

Is there an already existing tool to generate a matrix of pairwise protein identities/similarities for an input which consists of multiple protein sequences?

I did not find a working solution for macOS/Unix (MatGAT does not work for me, because I only managed to find executables for Windows).

I'm aware that one solution is to run pairwise alignments for all pairwise combinations of proteins in the input file, parse the results, and arrange them into a table. I'm trying to avoid this for now, as writing such a script would take me a lot of time with my current skills.

UPDATE: To be more specific, I'm looking for % protein sequence identities from global sequence alignment (such as the % similarities/identities retrieved by https://www.ebi.ac.uk/Tools/psa/emboss_needle/).
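For readers who do want to script the all-against-all approach described above, here is a minimal pure-Python sketch. It is not what EMBOSS needle does internally: it assumes a toy scoring scheme (match +1, mismatch -1, linear gap -2) instead of a substitution matrix such as BLOSUM62 with affine gaps, and it assumes identity is counted as identical columns divided by alignment length including gaps. The function names `global_identity` and `identity_matrix` are made up for this example.

```python
from itertools import combinations


def global_identity(a, b, match=1, mismatch=-1, gap=-2):
    """Percent identity from a Needleman-Wunsch global alignment.

    Assumption: toy scoring (match/mismatch/linear gap), not BLOSUM62.
    Identity = identical columns / alignment length (gaps included).
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back one optimal path, counting columns and identical pairs.
    i, j = n, m
    identical = columns = 0
    while i > 0 or j > 0:
        columns += 1
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                identical += a[i - 1] == b[j - 1]
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
    return 100.0 * identical / columns if columns else 0.0


def identity_matrix(seqs):
    """seqs: dict of name -> sequence. Returns (names, nested dict of % identity)."""
    names = list(seqs)
    matrix = {x: {x: 100.0} for x in names}
    for x, y in combinations(names, 2):
        matrix[x][y] = matrix[y][x] = global_identity(seqs[x], seqs[y])
    return names, matrix
```

The dynamic programming is quadratic in sequence length for every pair, so this is only practical for small inputs; for real data a dedicated aligner is the better choice.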

pairwise protein identity similarity matrix • 9.7k views
1
4.6 years ago
Bill Pearson ★ 1.0k

Phylip uses its own special interleaved sequence alignment format, which is definitely neither FASTA format nor CLUSTAL format, but you can find programs that will convert between them. Phylip format is well known and quite old (1980s).

The advantage of Phylip's protdist over Clustal's output is that it gives corrected (scaled) protein distances, not raw similarities/distances. As protein identity goes down (below 50% identity, which is very high for proteins), the distances go up exponentially, so that a 50% identical sequence might have a distance of PAM70, while a 30% identical sequence could be PAM160, and 20% identity PAM250. protdist does the conversion from observed protein distance to corrected evolutionary distance, using one of several evolutionary models.
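To illustrate the kind of correction involved, here is a small sketch using Kimura's empirical formula, d = -ln(1 - p - 0.2 p^2), where p is the observed fraction of differing residues. This is only one simple model, chosen here for illustration; protdist offers several models, and the function name `kimura_pam` is made up for this example. The result is scaled by 100 to express it in PAM-like units (substitutions per 100 residues).

```python
import math


def kimura_pam(percent_identity):
    """Corrected distance (substitutions per 100 residues) from % identity.

    Assumption: Kimura's formula d = -ln(1 - p - 0.2 * p**2), used here
    as a stand-in for the more detailed models a tool like protdist offers.
    """
    p = 1.0 - percent_identity / 100.0  # observed fraction of differences
    inner = 1.0 - p - 0.2 * p * p
    if inner <= 0:
        # The formula is undefined once sequences are too divergent.
        return math.inf
    return -100.0 * math.log(inner)


for pid in (50, 30, 20):
    print(pid, round(kimura_pam(pid)))
```

With this formula, 50%, 30% and 20% identity come out at roughly 80, 160 and 263, which is in the same range as the figures quoted above and shows how quickly the corrected distance grows as identity drops.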

1
3.9 years ago
al-ash ▴ 190

I ended up with the following command-line solution using Clustal Omega, which converts the distance matrix to a percent identity matrix:

clustalo-1.2.4-Ubuntu-x86_64 --full --percent-id --distmat-out=output.distmat -i input.aa.fa
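The resulting file can be turned into a spreadsheet-friendly table with a few lines of Python. The sketch below assumes the file starts with a line holding the number of sequences, followed by one row per sequence consisting of the sequence name and then whitespace-separated values; please check this against your own output.distmat, since the layout is an assumption here. The sample values and the function names `parse_matrix` and `to_tsv` are invented for illustration.

```python
def parse_matrix(text):
    """Parse a square matrix: first line = N, then N rows of 'name v1 ... vN'.

    Assumption: sequence names contain no whitespace.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0])
    names, rows = [], []
    for ln in lines[1:n + 1]:
        fields = ln.split()
        names.append(fields[0])
        rows.append([float(v) for v in fields[1:]])
    if len(rows) != n or any(len(r) != n for r in rows):
        raise ValueError("matrix is not N x N as declared on the first line")
    return names, rows


def to_tsv(names, rows):
    """Render the matrix as tab-separated text with a header row."""
    out = ["\t".join([""] + names)]
    for name, row in zip(names, rows):
        out.append("\t".join([name] + [f"{v:.2f}" for v in row]))
    return "\n".join(out)


# Made-up sample in the assumed layout; replace with open("output.distmat").read()
SAMPLE = """3
seqA 100.000000 62.500000 40.000000
seqB 62.500000 100.000000 55.000000
seqC 40.000000 55.000000 100.000000
"""

names, rows = parse_matrix(SAMPLE)
print(to_tsv(names, rows))
```

The tab-separated output can be opened directly in a spreadsheet or loaded into R or pandas for plotting.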

0

What is a good threshold on percent identity (produced by Clustal Omega) to tell that two sequences are similar? What is the minimum identity that indicates a good match? How do you interpret the numbers? Thank you!

1

There is no magic number. It is context and question dependent, and different for protein and DNA. You have to decide what 'similarity' means in the context of your underlying question.

0

Thank you!

I was looking to solve a similar problem: make a matrix table of percent identity/percent matching for every pairwise comparison of 189 peptide sequences, WITHOUT/BEFORE any multiple sequence alignment (MSA).

The command line that you provided above worked well, thank you.

I used the Windows 64-bit precompiled binary of Clustal Omega downloaded from here: http://www.clustal.org/omega/

This readme webpage also has complementary details regarding the command subcomponents: https://github.com/hybsearch/clustalo/blob/master/README

"In order to produce a multiple alignment Clustal-Omega requires a guide tree which defines the order in which sequences/profiles are aligned. A guide tree in turn is constructed, based on a distance matrix. Conventionally, this distance matrix is comprised of all the pair-wise distances of the sequences. The distance measure Clustal-Omega uses for pair-wise distances of un-aligned sequences is the k-tuple measure [4], which was also implemented in Clustal 1.83 and ClustalW2 [5,6]..." etc.

--full

Use full distance matrix for guide-tree calculation (slow; mBed is default)

--percent-id

convert distances into percent identities (default no)

0
4.6 years ago
Joe 20k

Are you looking for something like a Position Specific Score Matrix? If so, Biopython can already build this for you.

http://biopython.org/DIST/docs/api/Bio.Align.AlignInfo.PSSM-class.html

0
4.6 years ago
Bill Pearson ★ 1.0k

The Phylip program package (http://evolution.genetics.washington.edu/phylip/getme-new1.html), which uses an unfortunate format for multiple sequence alignment, includes "protdist", which does exactly what you want, and converts from observed distance to evolutionary distance.

0

Not having used Phylip before, I'm a bit confused by their documentation. According to http://evolution.genetics.washington.edu/phylip/doc/protdist.html the "program uses protein sequences", which suggests to me that the input is multi-FASTA, but from what you wrote it seems that the input is actually a multiple alignment (?). I'm also not sure that the output can be % identities and/or similarities (please see my updated question; I was apparently not clear enough).

1

Clustal can report pairwise identities, I believe, but it won’t write you a matrix; you’d still have to parse that out yourself.

1

You are right! Clustal Omega (https://www.ebi.ac.uk/Tools/msa/clustalo/) directly gives a sequence percent identity matrix (Result Summary -> Percent Identity Matrix in the web interface).

0

I take it all back then! I guess I was half right!