I used both BlastClust and CD-HIT to cluster protein sequences. The CD-HIT output looks good, but it creates too many clusters (6000 clusters for 18000 protein sequences). With BlastClust I got fewer clusters, but I found what looks like a major bug: the first cluster always contains by far the most sequences, and all the others contain very few (1405 sequences in the first cluster and fewer than 40 in each of the others). When I checked the sequences in the first cluster manually, I found that they are not similar at all, whereas the other clusters seem fine. Has anybody had a similar problem? Should I use CD-HIT instead of BlastClust? Is there a better tool for protein sequence clustering?
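
In case it helps anyone reproduce the size skew, here is a minimal sketch of how the cluster sizes can be tallied. It assumes the BlastClust output file has one cluster per line with whitespace-separated sequence identifiers; the function names and the toy identifiers are only illustrative and are not part of either tool.

```python
from typing import Dict, List


def cluster_sizes(blastclust_text: str) -> List[int]:
    """Return the number of sequence IDs on each non-empty line.

    Assumption: one cluster per line, IDs separated by whitespace.
    """
    sizes = []
    for line in blastclust_text.splitlines():
        ids = line.split()
        if ids:  # skip blank lines
            sizes.append(len(ids))
    return sizes


def summarize(sizes: List[int]) -> Dict[str, float]:
    """Summarize how dominant the largest cluster is."""
    total = sum(sizes)
    largest = max(sizes) if sizes else 0
    return {
        "clusters": len(sizes),
        "sequences": total,
        "largest": largest,
        # Fraction of all sequences that fall in the largest cluster.
        "largest_fraction": largest / total if total else 0.0,
    }


if __name__ == "__main__":
    # Toy input with made-up identifiers, not real data.
    toy_output = "seqA seqB seqC\nseqD seqE\nseqF\n"
    print(summarize(cluster_sizes(toy_output)))
```

A large `largest_fraction` together with many tiny clusters would match what I am seeing, although the numbers alone do not say whether the big cluster is biologically meaningful.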