I have generated a large FASTA file in which each entry is built from 2 protein domains taken out of a longer protein sequence. To be clear, I started with a protein sequence like AAAAAAABBBBBBCCCCCDDDDDDEEEEEE and created a new sequence, BBBBBBDDDDDD.
Using MEGA, I aligned all my sequences with MUSCLE and then generated a Maximum Likelihood tree. Many of the sequences were redundant or very similar. What I would like to do is group these sequences into clusters based on how different they are in sequence. So if I have ~300 sequences, I would like to group them into ~30 clusters. How would I go about doing this? How can I get a measure of how different the sequences are in absolute terms, rather than only the grouping given by the tree, where all I get is a value for branch length?
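To make concrete what I mean by an "absolute" measure, here is a rough sketch of the kind of thing I have in mind. It is only my own illustration, not something MEGA produces: it assumes the distance is the fraction of differing residues between two aligned sequences (a p-distance, skipping gapped columns), and that the groups come from a simple average-linkage merge until a target number of clusters is left. The function names and the toy sequences are made up for the example.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing residues, over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0  # assumption: no shared columns counts as maximally different
    return sum(x != y for x, y in pairs) / len(pairs)


def cluster(aligned, n_clusters):
    """Average-linkage agglomerative clustering down to n_clusters groups."""
    names = list(aligned)
    # Pairwise distance for every pair of sequence names.
    dist = {
        frozenset((p, q)): p_distance(aligned[p], aligned[q])
        for p, q in combinations(names, 2)
    }
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        # Mean distance between all members of the two clusters.
        total = sum(dist[frozenset((p, q))] for p in c1 for q in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the two closest clusters (i < j, so deleting j is safe).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy aligned sequences (hypothetical); real input would come from the MUSCLE alignment.
aligned = {
    "seq1": "BBBBBBDDDDDD",
    "seq2": "BBBBBBDDDDDE",
    "seq3": "CCCCCCEEEEEE",
    "seq4": "CCCCCCEEEEE-",
}
groups = cluster(aligned, 2)
print(groups)
```

With ~300 sequences I would ask for 30 groups instead of 2. This naive loop recomputes linkages on every merge, so I assume a library implementation of hierarchical clustering would be the better choice for real data.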
Thanks so much!