I have what is probably a very basic question about how substitution models play into obtaining distance measures between protein sequences.
To briefly summarize what I am trying to achieve: I have a large data set of sequences for a single, highly polymorphic protein across multiple samples. I am trying to group these protein sequences into clusters that I can then use to compare the similarity of the different samples. My current approach is to align all of the protein sequences (using Clustal or MUSCLE), obtain a distance matrix from the alignment, generate clusters from that matrix, and then restructure the data based on those clusters.
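To make the distance-matrix and clustering steps concrete, here is a minimal sketch of what I mean. It is only an illustration under stated assumptions: the toy alignment is made up, the distance is a plain p-distance (fraction of differing residues, with no substitution model applied), and the clustering is simple single-linkage with an arbitrary threshold of 0.25. The function names are hypothetical, not from any particular package.

```python
from itertools import combinations

# Hypothetical toy alignment (equal-length rows, '-' marks a gap).
# In practice these rows would come from the Clustal or MUSCLE output.
alignment = {
    "s1": "MKTAYIAKQR",
    "s2": "MKTAYIAKQK",
    "s3": "MKSAYLAKQR",
    "s4": "MRPGF-VHEL",
    "s5": "MRPGFAVHEL",
}

def p_distance(a, b):
    """Fraction of differing residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0  # assumption: treat pairs with no shared columns as maximally distant
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(aln):
    """Return sorted names and a symmetric dict of pairwise p-distances."""
    names = sorted(aln)
    dist = {}
    for n1, n2 in combinations(names, 2):
        dist[(n1, n2)] = dist[(n2, n1)] = p_distance(aln[n1], aln[n2])
    return names, dist

def single_linkage_clusters(names, dist, threshold):
    """Group sequences connected by a chain of pairwise distances <= threshold."""
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the cluster representative, shortening the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for n1, n2 in combinations(names, 2):
        if dist[(n1, n2)] <= threshold:
            parent[find(n1)] = find(n2)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(clusters.values())

names, dist = distance_matrix(alignment)
clusters = single_linkage_clusters(names, dist, threshold=0.25)
print(clusters)
```

The part I am unsure about is the `p_distance` step: as I understand it, a model-based distance would replace that function, while the clustering step would stay the same.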
While I don't necessarily need to generate a dendrogram of these sequences, I have been told that testing multiple substitution models after alignment, using something like ProtTest, is helpful for obtaining the most accurate distance between sequences.
So this is my confusion. As far as I can tell, alignment algorithms like Clustal use a substitution matrix to generate a distance matrix that is the basis of the alignment (which I believe in the case of Clustal Omega is a Gonnet matrix). This alignment is then fed to ProtTest to evaluate different substitution models. So my question is: how does the use of a particular substitution matrix in the alignment algorithm affect the fitting of substitution models after the alignment is completed? Is this even the correct way to go about finding the optimal distance measure between protein sequences?
Thanks!
This is what I have been doing so far. Through some empirical testing, though, I have noticed that the substitution matrix Clustal uses can affect which model best fits the resulting alignment. For instance, with one data set, if I use the default Gonnet matrix in Clustal Omega, ProtTest returns JTT as the best-fitting model for the resulting alignment. However, if I make Clustal Omega use a BLOSUM matrix, ProtTest might tell me that WAG is the best fit. So how much attention do you pay to which substitution matrix you use for the initial alignment? Is the default generally fine for Clustal or MUSCLE? Or are there particular situations where you would choose a specific substitution matrix within Clustal?
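One rough way to quantify how much the alignment choice matters for my purposes would be to compare the pairwise distances from the two alignments directly. The sketch below is only an illustration: `aln_a` and `aln_b` are made-up alignments of the same three sequences that differ in where one gap is placed (they are not real Clustal output), and the distance is again a plain p-distance rather than a model-based one.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of differing residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs) if pairs else 1.0

def pairwise(aln):
    """Pairwise p-distances keyed by (name1, name2) in sorted-name order."""
    return {(m, n): p_distance(aln[m], aln[n])
            for m, n in combinations(sorted(aln), 2)}

# Hypothetical: the same three sequences aligned two ways.
# Only the gap position in s3 differs between the two alignments.
aln_a = {"s1": "MKTAYIAK", "s2": "MKTAYLAK", "s3": "MK-AYIAR"}
aln_b = {"s1": "MKTAYIAK", "s2": "MKTAYLAK", "s3": "MKA-YIAR"}

dist_a = pairwise(aln_a)
dist_b = pairwise(aln_b)

# Absolute change in each pairwise distance between the two alignments.
shifts = {pair: abs(dist_a[pair] - dist_b[pair]) for pair in dist_a}
max_shift = max(shifts.values())

for pair, shift in sorted(shifts.items()):
    print(pair, round(dist_a[pair], 3), round(dist_b[pair], 3), round(shift, 3))
print("largest change in any pairwise distance:", round(max_shift, 3))
```

If the largest change is small relative to the clustering threshold, the clusters presumably would not change much either way, but I do not know whether that is a sound way to judge it.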
Thanks!