Question: Blastclust Has Been Deprecated. Does Anyone Know Why?
6.1 years ago by
tyler.weirick120 wrote:

I am working on a machine learning project using SVMs. One of the steps in preparing my data sets is to reduce the sequence similarity within each class to 40%. I have compared CD-HIT and BLASTCLUST for this step. BLASTCLUST keeps more sequences than CD-HIT. It is tempting to use the BLASTCLUST output, as larger data sets are preferable for my work, but I am worried because BLASTCLUST has been deprecated in the BLAST+ package. Does anyone know why BLASTCLUST was deprecated? Or why I am getting significantly more clusters from BLASTCLUST than from CD-HIT?

clustering blast+ • 4.8k views
5.0 years ago by
Jose Manuel Duarte290 wrote:

In my understanding, BLASTCLUST and CD-HIT are algorithmically quite different. BLASTCLUST clusters by running exhaustive all-to-all pairwise BLAST alignments, which makes it slow but accurate. In contrast, CD-HIT clusters by using heuristics to find high-identity segments, which makes it very fast but not as exact as BLASTCLUST.
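To illustrate how two clustering strategies can disagree on the same input, here is a minimal Python sketch. It is not the code of either tool. It assumes a simplified model in which the all-to-all approach merges clusters by single linkage (any similar pair joins two clusters) and the fast approach assigns each sequence greedily to the first representative it matches. The function names, the sequence IDs and the `similar` relation are all hypothetical.

```python
from itertools import combinations


def single_linkage(ids, similar):
    """All-to-all model: every similar pair merges two clusters (union-find)."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(ids, 2):
        if similar(a, b):
            parent[find(a)] = find(b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), set()).add(i)
    return sorted(clusters.values(), key=lambda c: sorted(c))


def greedy_incremental(ids, similar):
    """Greedy model: a sequence joins the first representative it matches,
    otherwise it becomes the representative of a new cluster."""
    reps = []
    clusters = {}
    for i in ids:
        for r in reps:
            if similar(i, r):
                clusters[r].add(i)
                break
        else:
            reps.append(i)
            clusters[i] = {i}
    return [clusters[r] for r in reps]


# Hypothetical toy data: A~B and B~C pass the threshold, A~C does not.
SIMILAR_PAIRS = {frozenset("AB"), frozenset("BC")}
IDS = ["A", "B", "C"]


def similar(a, b):
    return frozenset((a, b)) in SIMILAR_PAIRS


if __name__ == "__main__":
    print("single linkage:", single_linkage(IDS, similar))
    print("greedy incremental:", greedy_incremental(IDS, similar))
```

On this toy input the single-linkage model returns one cluster and the greedy model returns two, because C is never compared with B once B has been absorbed by A. The sketch only shows that the strategies can diverge. The direction and size of the difference between real BLASTCLUST and CD-HIT runs also depend on how each tool defines identity and on its coverage settings.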

So I think the two programs target different use cases. For instance, I use BLASTCLUST to cluster sequences from the PDB: it is accurate, and since the number of sequences is not enormous (around 100,000 at the moment), it only takes a few hours to run.

That's why I upvoted the question. In my opinion, it is indeed an issue for the community that BLASTCLUST is now deprecated.

6.1 years ago by
Josh Herr5.6k (University of Nebraska) wrote:

I am not certain why BLASTCLUST has been deprecated, but there are good archives of legacy BLAST versions. It may have been deprecated because numerous clustering programs, such as CD-HIT and UCLUST, have evolved over the last 5 years, and no one has decided to develop or maintain new versions of BLASTCLUST.

I am not sure why you are getting significantly more clusters using BLASTCLUST than CD-HIT. You did not tell us how many more sequences your BLASTCLUST run retains compared with your CD-HIT run. Even with identical input sequences, you may get different clusterings because of the algorithmic differences between BLASTCLUST and CD-HIT, so if the input sequences differ you will obviously get different clusters. Have you tried other clustering programs on the exact same sequences?
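One way to quantify how two runs on the same sequences differ is sketched below in Python. It assumes each tool's output has already been parsed into a list of sets of sequence IDs; the parsing step is left out, and the function names and example IDs are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Return every unordered pair of IDs that shares a cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(frozenset(p) for p in combinations(sorted(cluster), 2))
    return pairs


def compare_clusterings(a, b):
    """Summarise how clusterings a and b (lists of sets of IDs) differ."""
    ids_a = set().union(*a)
    ids_b = set().union(*b)
    pairs_a = co_clustered_pairs(a)
    pairs_b = co_clustered_pairs(b)
    return {
        "n_clusters_a": len(a),
        "n_clusters_b": len(b),
        # IDs seen by only one run mean the inputs were not identical.
        "ids_only_in_a": ids_a - ids_b,
        "ids_only_in_b": ids_b - ids_a,
        "pairs_in_both": len(pairs_a & pairs_b),
        "pairs_only_in_a": len(pairs_a - pairs_b),
        "pairs_only_in_b": len(pairs_b - pairs_a),
    }


if __name__ == "__main__":
    # Hypothetical results from two tools run on the same four sequences.
    run_a = [{"s1", "s2", "s3"}, {"s4"}]
    run_b = [{"s1", "s2"}, {"s3"}, {"s4"}]
    for key, value in compare_clusterings(run_a, run_b).items():
        print(key, value)
```

If `ids_only_in_a` or `ids_only_in_b` is non-empty, the two runs did not receive the same input, and that should be fixed before the cluster counts are compared.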

6.1 years ago by
cacaucenturion210 wrote:

You can try BLAST 2.2.14, which contains BLASTCLUST!
