Protein Sequence Clustering (CD-HIT vs BLASTClust + Problems in BlastClust)
7.6 years ago
BDK_compbio ▴ 130

I used both BlastClust and CD-HIT to cluster protein sequences. The CD-HIT output looks good, but it creates too many clusters (6000 clusters for 18000 protein sequences). With BlastClust I got fewer clusters, but I found a major bug: the first cluster always contains the largest number of sequences, and all the others contain very few (1405 sequences in the first cluster and fewer than 40 in each of the others). When I checked the sequences of the first cluster manually, I found that they are not similar at all, while the other clusters seem fine. Has anybody had a similar problem? Should I use CD-HIT instead of BlastClust? Is there a better tool for protein sequence clustering?
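To put numbers on a cluster size distribution like the one described above, the CD-HIT `.clstr` output can be summarised with a few lines of Python. This is only a sketch: it assumes the usual `.clstr` layout (a `>Cluster N` header followed by one line per member), and the sequence names in the example are made up.

```python
def cluster_sizes(clstr_text):
    """Return the number of member lines under each '>Cluster' header."""
    sizes = []
    for line in clstr_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            sizes.append(0)      # a new cluster starts here
        elif sizes:
            sizes[-1] += 1       # member line of the current cluster
    return sizes


# Tiny hypothetical example in the .clstr layout (names are invented).
example = """>Cluster 0
0	350aa, >seqA... *
1	348aa, >seqB... at 92.50%
>Cluster 1
0	120aa, >seqC... *
"""

sizes = cluster_sizes(example)
singletons = sum(1 for s in sizes if s == 1)
print(len(sizes), "clusters,", singletons, "singleton(s), largest has", max(sizes))
```

If most of the 6000 clusters turn out to be singletons, the large cluster count may simply reflect many sequences without a close relative at the chosen threshold.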

BlastClust Clustering blast CD-HIT • 4.9k views
Have you tried simply increasing the identity threshold when using CD-HIT?

Also, what is the biological reasoning behind the clustering? It sounds like these sequences are already highly similar. Are you trying to divide them by a certain mutation or something?

Yes, I ran CD-HIT with sequence identity thresholds of 0.9, 0.8, 0.7, 0.6, 0.5 and 0.4, and I still get a large number of clusters. I am clustering the homologous genes of three different viruses.
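One thing worth checking when sweeping thresholds like this: the CD-HIT user guide recommends a smaller word size (`-n`) as the identity threshold (`-c`) drops. The sketch below builds one command line per threshold. The file names are placeholders, the helper names are hypothetical, and the handling of the exact boundary values (0.7, 0.6, 0.5) is an assumption, since the guide's ranges overlap there.

```python
def cdhit_word_size(identity):
    """Word size (-n) suggested by the CD-HIT guide for a protein identity threshold (-c)."""
    if not 0.4 <= identity <= 1.0:
        raise ValueError("plain cd-hit is documented for thresholds from 0.4 to 1.0")
    if identity >= 0.7:
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    return 2


def cdhit_command(infile, outfile, identity):
    """Build the argument list for one cd-hit run (nothing is executed here)."""
    return ["cd-hit", "-i", infile, "-o", outfile,
            "-c", str(identity), "-n", str(cdhit_word_size(identity))]


# Placeholder file names; one command per threshold tried in the post.
for c in (0.9, 0.8, 0.7, 0.6, 0.5, 0.4):
    print(" ".join(cdhit_command("proteins.fasta", f"clusters_{round(c * 100)}", c)))
```

The lists can be passed to `subprocess.run` if CD-HIT is installed and on the `PATH`.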

7.6 years ago
h.mon 34k

For amino acid sequences, a faster alternative is USEARCH. BLASTClust is part of the older, deprecated BLAST suite, so there is little chance of bug fixes (and I am not certain you found a bug).

You do not provide the commands you executed, nor examples of the dissimilar sequences, so I have no clue if your results are good or not.

edit: your post does not seem to fit the "tool" category; it is a regular question.

7.6 years ago
BDK_compbio ▴ 130

For BlastClust, I tried both the command line (blastclust -i infile -o outfile -p T -L 1 -b T -S 75) and the web server (http://toolkit.tuebingen.mpg.de/blastclust). I also took all the sequences of the first cluster and ran blastclust on them alone, and I got a similar result: 700 sequences in the first cluster and fewer than 20 in each of the other clusters. I took two random sequences from the first cluster and did a pairwise alignment, and they are not similar.

I will try USEARCH and see if it gives better results.
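One possible explanation for the giant first cluster, not confirmed anywhere in this thread, is chaining. BLASTClust is documented as using single-linkage clustering: a sequence joins a cluster if it matches at least one member, so two sequences can end up together without matching each other directly. The toy sketch below uses made-up pairs, not real BLAST hits, to show the effect with a small union-find.

```python
def single_linkage(n_items, similar_pairs):
    """Group items 0..n_items-1 so that any pair in similar_pairs shares a cluster."""
    parent = list(range(n_items))

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a   # merge the two clusters

    clusters = {}
    for item in range(n_items):
        clusters.setdefault(find(item), []).append(item)
    return sorted(clusters.values(), key=len, reverse=True)


# Hypothetical hits: 0~1, 1~2 and 2~3 pass the threshold, but 0 and 3 do not match.
pairs = [(0, 1), (1, 2), (2, 3), (4, 5)]
clusters = single_linkage(7, pairs)
print(clusters)
```

Here items 0 and 3 share a cluster although they were never reported as similar, which is the same pattern as two random members of a large cluster failing a pairwise alignment. Whether this is what happened with the viral proteins would need checking against the actual BLAST hits.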