I am working on a machine learning project using SVMs. One of the steps in preparing my data sets is to reduce the sequence similarity within each class to 40%. I have compared CD-HIT and BLASTCLUST for this step. BLASTCLUST keeps more sequences than CD-HIT. It is tempting to use the BLASTCLUST output, as larger data sets are preferable for my work, but I am worried because BLASTCLUST has been deprecated and is not part of the BLAST+ package. Does anyone know why BLASTCLUST was deprecated? Or why I am getting significantly more clusters from BLASTCLUST than from CD-HIT?