I have what is probably a very basic question about how substitution models play into obtaining distance measures between protein sequences.
To briefly summarize what I am trying to achieve: I have a large data set of sequences for a single, highly polymorphic protein across multiple samples. I want to group these protein sequences into clusters, which I can then use to compare the samples with one another. My current approach is to align all of the protein sequences (using Clustal or MUSCLE), obtain a distance matrix from the alignment, generate clusters from that matrix, and then restructure the data based on those clusters.
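To make the pipeline concrete, here is a minimal Python sketch of the distance and clustering steps as I understand them. It assumes the sequences are already aligned, and everything specific in it is an illustrative assumption rather than my actual setup: the toy sequences are made up, the distance is a plain p-distance (fraction of differing residues, ignoring gapped columns) with no substitution model at all, and the clustering is a simple single-linkage grouping at an arbitrary threshold of 0.3.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = differing = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x != y:
            differing += 1
    # With no comparable columns, treat the pair as maximally distant (assumption).
    return differing / compared if compared else 1.0


def distance_matrix(aligned):
    """Symmetric nested-dict matrix of pairwise p-distances."""
    names = list(aligned)
    dist = {n: {m: 0.0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        dist[n][m] = dist[m][n] = p_distance(aligned[n], aligned[m])
    return dist


def single_linkage_clusters(dist, threshold):
    """Group sequences connected by any chain of pairs with distance <= threshold."""
    parent = {n: n for n in dist}

    def find(n):
        # Union-find root lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for n, m in combinations(dist, 2):
        if dist[n][m] <= threshold:
            parent[find(n)] = find(m)

    clusters = {}
    for n in dist:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical aligned sequences, standing in for Clustal/MUSCLE output.
aligned = {
    "s1": "MKT-LLVA",
    "s2": "MKTALLVA",
    "s3": "MRTALIVA",
    "s4": "QQSGWWPP",
}
dist = distance_matrix(aligned)
clusters = single_linkage_clusters(dist, threshold=0.3)
print(clusters)
```

In the real analysis, the p-distance step is where a model-based distance would go instead, which is why the choice of substitution model matters to me.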
While I don't necessarily need to generate a dendrogram of these sequences, I have been told that testing multiple substitution models after alignment, using something like ProtTest, is helpful for obtaining the most accurate distance between sequences.
So this is my confusion. As far as I can tell, alignment algorithms like Clustal use a substitution matrix to generate a distance matrix that is the basis of the alignment (in the case of Clustal Omega, I believe this is a Gonnet matrix). The resulting alignment is then fed to ProtTest to evaluate different substitution models. So my question is: how does the substitution matrix used by the alignment algorithm affect the fitting of substitution models after the alignment is completed? Is this even the correct way to go about finding the optimal distance measure between protein sequences?