Question

Blastclust Has Been Deprecated. Does Anyone Know Why?

5

12.1 years ago

tyler.weirick ▴ 120

I am working on a machine learning project using SVMs. One of the steps in preparing my data sets is to reduce the sequence similarity within each class to 40%. I have compared CD-HIT and BLASTCLUST for this step. BLASTCLUST keeps more sequences than CD-HIT. It is tempting to use the BLASTCLUST output, as larger data sets are preferable for my work, but I am worried because BLASTCLUST has been deprecated from the BLAST+ package. Does anyone know why BLASTCLUST was deprecated? Or why I am getting significantly more clusters from BLASTCLUST than from CD-HIT?

clustering blast+ • 8.8k views

updated 10.9 years ago by Jose Manuel Duarte ▴ 340 • written 12.1 years ago by tyler.weirick ▴ 120

score 5 · Answer 1 · 2014-08-05

As I understand it, BLASTCLUST and CD-HIT are algorithmically quite different. BLASTCLUST clusters from exhaustive all-against-all pairwise BLAST alignments, which makes it slow but accurate. CD-HIT instead uses heuristics to find high-identity segments, which makes it very fast but less exact than BLASTCLUST.
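To illustrate how two clustering strategies can disagree on the same data, here is a minimal Python sketch. It assumes, as far as I know from the tools' documentation, that BLASTCLUST uses single-linkage clustering and that CD-HIT uses greedy incremental clustering with the longest sequences as representatives. The sequence names, lengths, and pairwise identities are made up for illustration, and neither function reproduces the real programs (no alignment, no coverage thresholds, no word filtering).

```python
from itertools import combinations

def single_linkage(ids, identity, threshold):
    """Put two sequences in the same cluster whenever a chain of pairwise
    hits at or above the threshold connects them (union-find)."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(ids, 2):
        if identity.get(frozenset((a, b)), 0.0) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())

def greedy_incremental(ids, lengths, identity, threshold):
    """Visit sequences longest first; each one joins the first existing
    representative it matches, otherwise it becomes a new representative."""
    clusters = {}  # representative -> members
    for seq in sorted(ids, key=lambda s: -lengths[s]):
        for rep in clusters:
            if identity.get(frozenset((seq, rep)), 0.0) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            clusters[seq] = [seq]
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical toy data: B is similar to both A and C, but A and C are not
# similar to each other.
ids = ["A", "B", "C"]
lengths = {"A": 300, "B": 200, "C": 290}
identity = {
    frozenset(("A", "B")): 0.50,
    frozenset(("B", "C")): 0.50,
    frozenset(("A", "C")): 0.20,
}

single = single_linkage(ids, identity, threshold=0.40)
greedy = greedy_incremental(ids, lengths, identity, threshold=0.40)
print("single linkage:    ", single)
print("greedy incremental:", greedy)
```

In this toy case single linkage chains A, B, and C into one cluster, while the greedy pass keeps C as a separate representative. That is the opposite direction from what the question reports, so it is not an explanation of that result. It only shows that identical pairwise identities can give different cluster counts, and with the real tools other settings (such as length-coverage requirements) also affect the outcome.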

So I think the two programs target different use cases. For instance, I use BLASTCLUST to cluster sequences from the PDB: it is accurate, and the number of sequences is not enormous (around 100,000 at the moment), so it only takes a few hours to run.

That's why I upvoted the question: in my opinion, it is indeed a problem for the community that BLASTCLUST is now deprecated.

score 2 · Answer 2 · 2013-06-12

I am not certain why BLASTCLUST has been deprecated, but there are good archives of legacy BLAST versions. It may have been deprecated because numerous clustering programs, such as CD-HIT and UCLUST, have changed in the last 5 years, and no one has decided to develop or maintain new versions of BLASTCLUST.

I am not sure why you are getting significantly more clusters with BLASTCLUST than with CD-HIT. You did not tell us how many more sequences are in your BLASTCLUST run than in your CD-HIT run. Even with identical input sequences, the two programs can produce different clusterings because their algorithms differ, so if the input sequences differ, the clusterings will obviously differ too. Have you tried other clustering programs on exactly the same sequences?
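If it helps to put numbers on the difference, the cluster counts and sizes can be read straight from the two output files. The sketch below assumes the usual output formats: BLASTCLUST writes one cluster per line with whitespace-separated sequence IDs, and CD-HIT writes a `.clstr` file in which each cluster starts with a `>Cluster N` line followed by one line per member. The sample strings are invented for illustration; in practice you would read the text from your own output files.

```python
def blastclust_cluster_sizes(text):
    """One non-empty line per cluster; IDs are separated by whitespace."""
    return [len(line.split()) for line in text.splitlines() if line.strip()]

def cdhit_cluster_sizes(text):
    """Each '>Cluster N' line opens a cluster; every following non-empty
    line is one member of that cluster."""
    sizes = []
    for line in text.splitlines():
        if line.startswith(">Cluster"):
            sizes.append(0)
        elif line.strip() and sizes:
            sizes[-1] += 1
    return sizes

# Invented sample outputs, only to show the two formats.
blastclust_out = "seq1 seq2 seq3\nseq4\nseq5\n"
cdhit_clstr = (
    ">Cluster 0\n"
    "0\t300aa, >seq1... *\n"
    "1\t280aa, >seq2... at 45.00%\n"
    ">Cluster 1\n"
    "0\t250aa, >seq4... *\n"
)

bc_sizes = blastclust_cluster_sizes(blastclust_out)
cd_sizes = cdhit_cluster_sizes(cdhit_clstr)
print("BLASTCLUST:", len(bc_sizes), "clusters, sizes", bc_sizes)
print("CD-HIT:    ", len(cd_sizes), "clusters, sizes", cd_sizes)
```

Comparing the size distributions, and the total number of sequences covered by each file, would show whether the two runs really started from the same input.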

score 1 · Answer 3 · 2013-06-12

1

12.1 years ago

cacaucenturion ▴ 250

You can try BLAST 2.2.14; it contains BLASTCLUST!
