Question: Get summary of aligned sequences
Asked 4 months ago by Chirag Parsania (University of Macau):


I have aligned a set of ~3000 protein sequences. Next, I want to generate a phylogenetic tree, but first I want to check the quality of the alignment. Is there a quick way to get a summary of the aligned sequences? Also, can anyone suggest a tool to process the alignment before I do the phylogenetic analysis? Currently, I am using trimAl to trim the alignment.

Thanks, Chirag.

Tags: alignment, sequence, fasta

Do you mean a multiple sequence alignment of 3000 proteins? That sounds like a huge number.

written 4 months ago by Asaf

Yes. I want to generate statistics from it, for example the distribution of gaps and the number of conserved columns.

written 4 months ago by Chirag Parsania
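The statistics mentioned above (gap distribution, number of conserved columns) can be computed directly from an aligned FASTA file. Below is a minimal Python sketch, not a replacement for a dedicated tool. The function names `read_fasta` and `alignment_summary` are made up for illustration, `-` and `.` are assumed to be the gap characters, and "conserved" is assumed to mean a gap-free column with a single residue type.

```python
def read_fasta(lines):
    """Parse FASTA lines into a dict of {header: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {k: "".join(v) for k, v in records.items()}


def alignment_summary(seqs, gap_chars="-."):
    """Return basic per-alignment statistics for equal-length aligned sequences."""
    seqs = list(seqs)
    if not seqs:
        raise ValueError("no sequences given")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences differ in length; this is not an alignment")

    gap_fraction = []   # fraction of gaps in each column
    conserved = 0       # gap-free columns with one residue type (assumed definition)
    for column in zip(*seqs):
        residues = [c.upper() for c in column if c not in gap_chars]
        gaps = len(column) - len(residues)
        gap_fraction.append(gaps / len(column))
        if gaps == 0 and len(set(residues)) == 1:
            conserved += 1

    return {
        "n_sequences": len(seqs),
        "n_columns": length,
        "conserved_columns": conserved,
        "gap_fraction_per_column": gap_fraction,
        "overall_gap_fraction": sum(gap_fraction) / length if length else 0.0,
    }


if __name__ == "__main__":
    demo = [">a", "MK-LV", ">b", "MKALV", ">c", "MR-LV"]
    print(alignment_summary(read_fasta(demo).values()))
```

For a real file, pass an open file handle to `read_fasta` instead of the demo list. Comparing the gap-fraction list before and after trimming gives a quick view of what trimAl removed.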

That's a huge number. Unless the sequences are almost identical, I expect this alignment to be misleading. Consider dividing them into clusters before aligning.

written 4 months ago by Asaf

Can you elaborate a little more on generating clusters? I mean, how can I do that, and once I have the clusters, how do I proceed with the downstream phylogeny?

written 4 months ago by Chirag Parsania

I believe what Asaf meant is to cluster the protein sequences first, i.e. group similar protein sequences into clusters that meet a user-defined similarity threshold. This can be done with a clustering tool such as CD-HIT.

After that, align each cluster separately.

modified 3 months ago • written 3 months ago by lakhujanivijay
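To align each cluster separately, the cluster membership first has to be read back from the clustering output. Here is a rough Python sketch for that step, assuming the usual CD-HIT `.clstr` layout (a `>Cluster N` header followed by member lines such as `0	120aa, >seqA... *`). The function name `parse_clstr` is hypothetical, and the sequence names in a `.clstr` file may be truncated depending on how CD-HIT was run, so check them against your FASTA headers.

```python
import re
from collections import defaultdict

# Member lines look like: "0<TAB>120aa, >seqA... *" (assumed .clstr layout)
MEMBER = re.compile(r">(.+?)\.\.\.")


def parse_clstr(lines):
    """Group sequence names by cluster from CD-HIT .clstr lines."""
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = line[1:]          # e.g. "Cluster 0"
        elif line and current is not None:
            match = MEMBER.search(line)
            if match:
                clusters[current].append(match.group(1))
    return dict(clusters)


if __name__ == "__main__":
    demo = [
        ">Cluster 0",
        "0\t120aa, >seqA... *",
        "1\t118aa, >seqB... at 95.00%",
        ">Cluster 1",
        "0\t200aa, >seqC... *",
    ]
    for name, members in parse_clstr(demo).items():
        print(name, members)
```

Each list of names can then be used to write one FASTA file per cluster, which is aligned on its own.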

Usually by using biological knowledge. I don't know what you're trying to achieve, but if, for instance, one wanted to generate a phylogenetic tree of 3,000 bacterial species based on one protein, the strategy I would suggest is to generate a tree for each phylum and then combine the trees. I think there should be a balance between brute-force methods (let's feed the algorithm with everything) and a refined understanding of the problem.

written 4 months ago by Asaf