Question: Clustering Large Dataset In Terms Of Sequence Similarity Using Eg... Blastclust
7.6 years ago by
kajendiran56 wrote:

Dear all, thank you for your time. I have a dataset containing 15,000 sequences. I want to build a tree, so my plan was to cluster the sequences with BlastClust (a module of the BLAST package) and then use one reference sequence from each cluster to build a crude tree. BlastClust has been running for some time now, but I have no idea whether this is going to work or how long it will take.

Are there any other ways of going about this with such a large set of sequences?

Ideally, I wanted to do a sequence alignment, use that alignment to build a tree (which I agree will be complex with that number of sequences), and then look at the evolution of those sequences.

I tried MAFFT for the sequence alignment; it did not give me any errors, but it gave me no output either.

Any suggestions would be appreciated.

7.6 years ago by
Andreas wrote:

The classical approach to creating a tree is to compute an alignment and build the tree from that. However, for such a large number of sequences you have to use some tricks. I guess BlastClust is one of them, but it really depends on what you need this tree for. Depending on the application, CD-HIT (see Chris' post) or UCLUST/USEARCH are alternatives.

If you want to stick to the classical approach, which needs an alignment, then your only options are MAFFT (make sure to use its PartTree mode!) and Clustal Omega, and they will only work with sequences of reasonably small size. Once you have an alignment, the tree building needs to be done with something really fast as well. One option is FastTree, which starts from a neighbor-joining-style tree and refines it into an approximately-maximum-likelihood tree.
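To make the cluster, align, and tree steps concrete, here is a minimal Python sketch that only assembles the three command lines and, optionally, runs them. It assumes `cd-hit` (or `cd-hit-est` for nucleotides), `mafft`, and `FastTree` are installed and on the PATH. The output file names and the 0.9 identity threshold are illustrative assumptions, not values from this thread. Note that MAFFT writes its alignment to standard output, so the sketch redirects it into a file; running MAFFT without a redirect or capture can look like "no output".

```python
import subprocess


def build_pipeline(fasta, identity=0.9, nucleotide=False):
    """Return [(argv, stdout_path_or_None), ...] for cluster -> align -> tree.

    File names below are hypothetical placeholders chosen for this sketch.
    """
    reps = "representatives.fasta"      # one representative per cluster
    aln = "representatives.aln.fasta"   # alignment of the representatives
    tree = "representatives.nwk"        # Newick tree

    # CD-HIT for proteins, CD-HIT-EST for nucleotide sequences.
    cluster_tool = "cd-hit-est" if nucleotide else "cd-hit"
    cluster = [cluster_tool, "-i", fasta, "-o", reps, "-c", str(identity)]

    # MAFFT PartTree mode; the alignment goes to stdout.
    align = ["mafft", "--parttree", "--retree", "2", reps]

    # FastTree also writes the tree to stdout; -nt selects nucleotide models.
    tree_cmd = ["FastTree"] + (["-nt"] if nucleotide else []) + [aln]

    return [(cluster, None), (align, aln), (tree_cmd, tree)]


def run_pipeline(steps):
    """Run each step in order, redirecting stdout to a file where requested."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as out:
                subprocess.run(argv, check=True, stdout=out)
```

Calling `run_pipeline(build_pipeline("seqs.fasta"))` would execute the three tools in sequence; inspecting the result of `build_pipeline` first is a cheap way to check the commands before committing to a long run.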


PS: You might want to have a look at the just published paper "Ultrafast clustering algorithms for metagenomic sequence analysis" by Li et al. if you're dealing with NGS data, especially from metagenomics.


Thank you for your extensive suggestions. I have managed to use CD-HIT, but you are correct that this is not ideal. I managed to use Clustal Omega to build an alignment, and I will use FastTree as you suggested. Although I am not using NGS data, I will look at the paper you suggested as well. Thank you once again.

Reply written 7.6 years ago by kajendiran56
7.6 years ago by
Chris wrote:

Have a look at CD-HIT [1]. It should take only minutes to cluster that many sequences.
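For intuition about why this kind of clustering is fast, here is a toy Python sketch of greedy incremental clustering, the general strategy CD-HIT is built on: sequences are processed from longest to shortest, and each one either joins the first representative it is similar enough to or becomes a new representative. This is not CD-HIT's implementation. The similarity function below is a stand-in based on `difflib` (CD-HIT uses short-word filtering and alignment-based identity), and the function names and threshold are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity score in [0, 1]; NOT the identity CD-HIT computes."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs, threshold=0.9):
    """Cluster a {name: sequence} dict greedily.

    Returns {representative_name: [member_names]}; each representative is
    also listed as the first member of its own cluster.
    """
    clusters = {}
    # Longest sequences first, so representatives are the longest members.
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        for rep in clusters:
            if similarity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[name] = [name]
    return clusters


toy = {
    "a": "MKTAYIAKQRQISFVKSHFSRQ",
    "b": "MKTAYIAKQRQISFVKSHFSR",
    "c": "GGGGGGGGGGWWWWWWWW",
}
clusters = greedy_cluster(toy)
```

Each sequence is compared only against the current representatives rather than against every other sequence, which is what keeps the number of comparisons manageable; the keys of `clusters` are the reference sequences one would carry forward into a crude tree.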



Thank you for your suggestion. I have used this effectively to build a crude tree, and I am amazed at how quickly it does this. Thank you.

Reply written 7.6 years ago by kajendiran56