Question

Basic question about protein distance with multiple sequence alignment

8.0 years ago

bds1217 • 0

I have what is probably a very basic question about how substitution models play into obtaining distance measures between protein sequences.

To briefly summarize what I am trying to achieve: I have a large data set of sequences for a single, highly polymorphic protein across multiple samples. I am trying to cluster these various protein sequences into a set of clusters which I can then use to compare the similarity of the different samples. My current approach is to align all of the protein sequences (using Clustal or MUSCLE), obtain a distance matrix, generate clusters from that matrix, and then restructure the data based on those clusters.

While I don't necessarily need to generate a dendrogram of these sequences, I have been told that testing multiple substitution models after alignment, using something like ProtTest, is helpful for obtaining the most accurate distance between sequences.

So this is my confusion. As far as I can tell, alignment algorithms like Clustal use a substitution matrix to generate the distance matrix that the alignment is based on (I believe Clustal Omega uses a Gonnet matrix). This alignment is then fed to ProtTest to test different substitution models. So my question is: how does the choice of substitution matrix in the alignment step affect the modeling of substitution matrices after the alignment is completed? Is this even the correct way to go about finding the optimal distance measure between protein sequences?

Thanks!

alignment phylogenetics clustering • 2.4k views

updated 8.0 years ago by Jean-Karim Heriche • written 8.0 years ago by bds1217

score 1 · Answer 1 · 2016-04-22

Alignment algorithms optimize a scoring function that depends on the substitution matrix. So what you can do after obtaining an alignment is test different models/substitution matrices to see which one best fits your alignment. This is usually done when building a phylogenetic tree from an alignment.
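
To illustrate what "best fits" can mean in practice: model-selection tools generally compare the likelihood of the alignment under each candidate model, penalized by the number of free parameters, for example with the Akaike information criterion (AIC = 2k - 2 ln L, lower is better). The sketch below only shows that ranking step. The model names and numbers are placeholders I made up for illustration, not results from any real alignment, and the function names are hypothetical.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: AIC = 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def rank_models(fits):
    """Rank candidate models by AIC, best first.

    fits maps a model name to (log_likelihood, number_of_free_parameters).
    """
    scored = [(name, aic(lnl, k)) for name, (lnl, k) in fits.items()]
    return sorted(scored, key=lambda item: item[1])


# Placeholder values only (NOT real results): in a real analysis these
# would come from fitting each substitution model to your alignment.
example_fits = {
    "model_A": (-1050.0, 10),
    "model_B": (-1040.0, 30),
}

ranking = rank_models(example_fits)
for name, score in ranking:
    print(name, score)
```

Here the second model has the higher likelihood but is penalized for its extra parameters, so the ranking is not simply "highest likelihood wins".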

In your case, you want to measure similarity between samples and cluster them using protein sequences. What you should consider is what makes two samples similar for you. Is it the number of mutations? Where they occur in the sequence? Or how mutations can derive from each other? The first case is trivial. The second could be approached by subdividing the proteins into domains. The third requires a phylogenetic approach, but even then you could probably dispense with some of the complexity of a phylogenetic analysis by making simplifying assumptions. For example, you could assume that only one substitution occurs per site. If there are few deletions, you could also align without gaps, which would ensure that the scores are true distance measures (i.e. metrics).
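
As a minimal sketch of the simplest case above (counting differences between sequences that are already aligned without gaps), the following computes the fraction of differing sites for every pair and groups sequences whose distance falls under a cutoff. The sequences, the 0.25 threshold, and the function names are all made up for illustration, and the single-linkage grouping is just one simple choice of clustering, not a recommendation over other methods.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of sites that differ between two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs):
    """Return a nested dict dist[name1][name2] of pairwise p-distances."""
    names = list(seqs)
    dist = {n: {m: 0.0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        d = p_distance(seqs[n], seqs[m])
        dist[n][m] = dist[m][n] = d
    return dist


def single_linkage_clusters(dist, threshold):
    """Group sequences linked by chains of pairwise distances <= threshold."""
    unassigned = set(dist)
    clusters = []
    while unassigned:
        seed = unassigned.pop()
        cluster = {seed}
        frontier = [seed]
        while frontier:
            current = frontier.pop()
            neighbours = {n for n in unassigned if dist[current][n] <= threshold}
            unassigned -= neighbours
            cluster |= neighbours
            frontier.extend(neighbours)
        clusters.append(sorted(cluster))
    return sorted(clusters)


# Toy sequences invented for this example (not real data).
toy = {
    "s1": "MKTAYIAK",
    "s2": "MKTAYIAR",
    "s3": "MRSAWLGR",
    "s4": "MRSAWLGK",
}

dist = distance_matrix(toy)
clusters = single_linkage_clusters(dist, threshold=0.25)
print(clusters)
```

Because this distance is just a normalized count of mismatches over a fixed set of columns, it is symmetric and satisfies the triangle inequality, which is the sense in which ungapped scores behave as a metric.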