4.5 years ago
Chirag Parsania
Hi,
I have aligned a set of ~3000 protein sequences. Next, I want to generate a phylogenetic tree. Before that, I want to check the quality of the alignment. Is there a quick way to get a summary of the aligned sequences? Also, can anyone suggest a tool for processing the aligned sequences before I do the phylogenetic analysis? Currently, I am using trimAl to trim the alignment.
Thanks, Chirag.
Do you mean a multiple sequence alignment of 3000 proteins? That sounds like a huge number.
Yes. I want to generate statistics from it: for example, the distribution of gaps, the number of conserved columns, etc.
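For the kind of statistics mentioned here (gap distribution, number of conserved columns), a rough sketch in plain Python might look like the following. It assumes the alignment is in aligned FASTA format with `-` or `.` as gap characters; the function names `read_aligned_fasta` and `alignment_summary` are made up for illustration and do not come from any particular tool. A "conserved" column is taken here to mean a gap-free column with a single residue type, which is only one possible definition.

```python
def read_aligned_fasta(text):
    """Parse aligned FASTA text into {sequence_id: aligned_sequence}."""
    seqs = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            name = line[1:].split()[0]
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)
    return {key: "".join(parts) for key, parts in seqs.items()}


def alignment_summary(seqs, gap_chars="-."):
    """Return basic per-column and per-sequence gap statistics."""
    rows = list(seqs.values())
    if not rows or not rows[0]:
        raise ValueError("empty alignment")
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("sequences differ in length; this is not an alignment")

    gap_fraction_per_column = []
    conserved_columns = 0
    for column in zip(*rows):
        gaps = sum(ch in gap_chars for ch in column)
        gap_fraction_per_column.append(gaps / len(column))
        residues = {ch.upper() for ch in column if ch not in gap_chars}
        # Assumed definition: no gaps and exactly one residue type.
        if gaps == 0 and len(residues) == 1:
            conserved_columns += 1

    gap_fraction_per_sequence = {
        name: sum(ch in gap_chars for ch in seq) / length
        for name, seq in seqs.items()
    }
    return {
        "n_sequences": len(rows),
        "alignment_length": length,
        "conserved_columns": conserved_columns,
        "gap_fraction_per_column": gap_fraction_per_column,
        "gap_fraction_per_sequence": gap_fraction_per_sequence,
    }
```

To use it, read the alignment file into a string, pass it to `read_aligned_fasta`, and hand the result to `alignment_summary`. Sequences with a very high gap fraction are natural candidates for a closer look before tree building.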
That's a huge number. Unless the sequences are almost identical, I expect this alignment to be misleading. Consider dividing them into clusters before aligning.
Can you elaborate a little more on generating clusters? I mean, how can I do that, and once I have the clusters, how do I proceed with the downstream phylogeny?
I believe what Asaf meant is to cluster the protein sequences first, i.e. group similar protein sequences into clusters that meet a user-defined similarity threshold. This can be done with clustering tools such as CD-HIT; have a look at it.
After that, align each cluster separately.
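As a rough illustration of the "cluster first, then align each cluster" step, the sketch below groups sequences by cluster so that each group can be written out and aligned on its own. It assumes the cluster file follows the `.clstr` layout that CD-HIT typically writes (a `>Cluster N` line followed by member lines containing `>sequence_id...`), and that the IDs there match the FASTA IDs exactly, which may not hold if the IDs were truncated. The helper names `parse_clstr` and `split_by_cluster` are hypothetical.

```python
import re
from collections import defaultdict


def parse_clstr(text):
    """Parse CD-HIT-style .clstr text into {cluster_name: [sequence_ids]}."""
    clusters = defaultdict(list)
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:]  # e.g. "Cluster 0"
            continue
        # Member lines are assumed to look like: "0  120aa, >seqA... *"
        match = re.search(r">(.+?)\.\.\.", line)
        if match and current is not None:
            clusters[current].append(match.group(1))
    return dict(clusters)


def split_by_cluster(seqs, clusters, min_size=2):
    """Group {id: sequence} by cluster, keeping clusters with >= min_size members."""
    groups = {}
    for cluster_name, ids in clusters.items():
        members = {i: seqs[i] for i in ids if i in seqs}
        if len(members) >= min_size:
            groups[cluster_name] = members
    return groups
```

Each group returned by `split_by_cluster` can then be saved as its own FASTA file and passed to whichever aligner is already in use. The `min_size` cutoff is an arbitrary choice here; singleton clusters have nothing to align against.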
Usually by using biological knowledge. I don't know what you're trying to achieve, but if, for instance, one wanted to generate a phylogenetic tree of 3,000 bacterial species based on one protein, the strategy I would suggest is to generate a tree for each phylum and then combine the trees. I think there should be a balance between brute-force methods (feeding the algorithm everything) and a refined understanding of the problem.