Basic question about protein distance with multiple sequence alignment
8.0 years ago
bds1217 • 0

I have what is probably a very basic question about how substitution models play into obtaining distance measures between protein sequences.

To briefly summarize what I am trying to achieve: I have a large data set of sequences for a single, highly polymorphic protein across multiple samples. I am trying to cluster these various protein sequences into a set of clusters which I can then use to compare the similarity of the different samples. My current approach is to align all of the protein sequences (using Clustal or MUSCLE), obtain a distance matrix, generate clusters from that matrix, and then restructure the data based on those clusters.
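For concreteness, the align / distance matrix / cluster steps described above might be sketched as follows. This is only a minimal illustration, not the actual pipeline: the four aligned sequences are made up, the distance is a plain p-distance (fraction of differing residues, skipping gapped columns) rather than a model-corrected distance, and the function names and the 0.3 threshold are hypothetical choices.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 1.0  # assumption: treat pairs with no comparable columns as maximally distant
    return sum(x != y for x, y in compared) / len(compared)


def distance_matrix(aligned):
    """Symmetric nested dict of pairwise p-distances, keyed by sequence name."""
    names = list(aligned)
    dist = {n: {n: 0.0} for n in names}
    for m, n in combinations(names, 2):
        dist[m][n] = dist[n][m] = p_distance(aligned[m], aligned[n])
    return dist


def single_linkage_clusters(dist, threshold):
    """Group sequences linked by chains of pairwise distances <= threshold."""
    unassigned = set(dist)
    clusters = []
    while unassigned:
        seed = unassigned.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            cur = frontier.pop()
            near = {n for n in unassigned if dist[cur][n] <= threshold}
            unassigned -= near
            cluster |= near
            frontier.extend(near)
        clusters.append(sorted(cluster))
    return sorted(clusters)


# Made-up aligned sequences (as they might come out of Clustal or MUSCLE).
aligned = {
    "s1": "MKT-LLVA",
    "s2": "MKTALLVA",
    "s3": "MRTALIVA",
    "s4": "QWPGHHCE",
}
dist = distance_matrix(aligned)
clusters = single_linkage_clusters(dist, threshold=0.3)
print(clusters)
```

The substitution model only enters at the `p_distance` step: swapping it for a model-based distance would change the matrix, and possibly the clusters, while the rest stays the same.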

While I don't necessarily need to generate a dendrogram of these sequences, I have been told that testing multiple substitution models after alignment, using something like ProtTest, is helpful for obtaining the most accurate distance between sequences.

So this is my confusion. As far as I can tell, alignment algorithms like Clustal use a substitution matrix to generate the distance matrix that is the basis of the alignment (which, in the case of ClustalOmega, I believe is a Gonnet matrix). This alignment is then fed to ProtTest to model different substitution matrices. So my question is: how does the use of a particular substitution matrix for the alignment algorithm affect the modeling of substitution matrices after the alignment is completed? Is this even the correct way to go about finding the optimal distance measure between protein sequences?

Thanks!

alignment phylogenetics clustering • 2.4k views
8.0 years ago

Alignment algorithms optimize a scoring function that depends on the substitution matrix. So what you can do after obtaining an alignment is test different models/substitution matrices to see which one best fits your alignment. This is the usual approach when building a phylogenetic tree from an alignment.
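To illustrate how the scoring function depends on the substitution matrix, here is a small sketch that scores two candidate alignments of the same pair of sequences under two toy matrices. Everything in it is hypothetical: the three-letter alphabet, the matrix values, and the linear gap penalty are invented for the example and are not the Gonnet, BLOSUM, or any other real matrix.

```python
def alignment_score(a, b, matrix, gap=-1):
    """Sum of per-column scores for two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap  # hypothetical linear gap penalty
        else:
            score += matrix[frozenset((x, y))]  # symmetric lookup
    return score


# Toy matrices (made-up values). Both reward identities equally;
# only the "lenient" one rewards the conservative L<->I substitution.
toy_strict = {
    frozenset("A"): 4, frozenset("I"): 4, frozenset("L"): 4,
    frozenset("LI"): -2, frozenset("LA"): -2, frozenset("IA"): -2,
}
toy_lenient = dict(toy_strict)
toy_lenient[frozenset("LI")] = 2

# Two candidate alignments of the sequences ALI and AIL.
ungapped = ("ALI", "AIL")
gapped = ("ALI-", "A-IL")

for name, matrix in [("strict", toy_strict), ("lenient", toy_lenient)]:
    u = alignment_score(*ungapped, matrix)
    g = alignment_score(*gapped, matrix)
    print(name, "ungapped:", u, "gapped:", g)
```

Under the strict toy matrix the gapped alignment scores higher, while under the lenient one the ungapped alignment does, so the matrix chosen for alignment can change which alignment a later model test gets to see.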

In your case, you want to measure similarity between samples and cluster them using protein sequences. What you should consider first is what makes two samples similar for your purposes. Is it the number of mutations? Where they occur in the sequence? Or how the mutations can derive from each other? The first case is trivial; the second could be approached by subdividing the proteins into domains; the third is the one that requires a phylogenetic approach, but even then you could probably dispense with some of the complexity of a phylogenetic analysis by making simplifying assumptions. For example, you could assume that only one substitution occurs per site. If there are few deletions, you could also align without gaps, which would ensure that the scores are true distance measures (i.e. metrics).
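As a rough sketch of the simplest case (counting mutations on ungapped, equal-length sequences), the Hamming distance below counts differing positions, and the helper checks the triangle inequality on a handful of made-up sequences. The sequences and function names are hypothetical, and the check is only a spot test on this small set, not a proof.

```python
from itertools import permutations


def hamming(a, b):
    """Number of positions at which two equal-length, ungapped sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def satisfies_triangle_inequality(seqs, d):
    """Check d(a, c) <= d(a, b) + d(b, c) for every ordered triple of sequences."""
    return all(d(a, c) <= d(a, b) + d(b, c) for a, b, c in permutations(seqs, 3))


# Made-up ungapped sequences of equal length.
seqs = ["MKTALLVA", "MRTALIVA", "MRTGLIVE", "MKTALLVE"]

print(hamming(seqs[0], seqs[1]))
print(satisfies_triangle_inequality(seqs, hamming))
```

Counting each differing site once corresponds to the "one substitution per site" simplification; a model-based distance would instead correct for multiple substitutions at the same site.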


what you can do after obtaining an alignment is test different models/substitution matrices to see which one best fits your alignment

This is what I have been doing so far. Through some empirical testing, though, I have noticed that the substitution matrix Clustal uses can affect which model best fits the resulting alignment. For instance, with one data set, if I use the default Gonnet matrix in ClustalOmega, ProtTest returns a JTT matrix as the best fit for the resulting alignment. However, if I make ClustalOmega use a BLOSUM matrix, ProtTest might tell me a WAG matrix is the best fit for the alignment. So how much attention do you pay to which substitution model you use for the initial alignment? Is the default model generally fine for Clustal or MUSCLE? Or are there particular situations where you choose a particular substitution model within Clustal?

Thanks!
