Question: Protein Sequence Clustering (CD-HIT vs BLASTClust + Problems in BlastClust)
sbdk82 (United States) wrote, 4.3 years ago:

I used both BLASTClust and CD-HIT to cluster protein sequences. The CD-HIT output looks good, but it creates too many clusters (6000 clusters for 18000 protein sequences). With BLASTClust I got fewer clusters, but I found what looks like a major bug: the first cluster always contains the largest number of sequences, and all the others contain very few (1405 sequences in the first cluster and fewer than 40 in each of the others). When I checked the sequences of the first cluster manually, I found that they are not similar at all, whereas the other clusters seem to be fine. Has anybody had a similar problem? Should I use CD-HIT instead of BLASTClust? Is there a better tool for protein sequence clustering?
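To judge whether 6000 clusters is really "too many", it may help to look at the cluster size distribution rather than the count alone, since a large share of the clusters could be singletons. Below is a minimal Python sketch, assuming the usual CD-HIT `.clstr` layout (a `>Cluster N` header line followed by one line per member sequence). The toy lines and sequence names are invented for illustration and are not taken from the data described above.

```python
def cluster_sizes(clstr_lines):
    """Return the number of member sequences in each cluster of a CD-HIT .clstr file."""
    sizes = []
    for line in clstr_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            sizes.append(0)   # a header line opens a new cluster
        elif sizes:
            sizes[-1] += 1    # every other non-empty line is one member
    return sizes


def singleton_fraction(sizes):
    """Fraction of clusters that contain exactly one sequence."""
    if not sizes:
        return 0.0
    return sum(1 for s in sizes if s == 1) / len(sizes)


# Invented toy input in the .clstr layout (not real results).
TOY = (
    ">Cluster 0\n"
    "0\t312aa, >seqA... *\n"
    "1\t305aa, >seqB... at 92.10%\n"
    ">Cluster 1\n"
    "0\t150aa, >seqC... *\n"
    ">Cluster 2\n"
    "0\t98aa, >seqD... *\n"
).splitlines()

if __name__ == "__main__":
    sizes = cluster_sizes(TOY)
    print("clusters:", len(sizes))
    print("largest cluster:", max(sizes))
    print("singleton fraction:", round(singleton_fraction(sizes), 2))
```

For a real run, the lines of the `.clstr` file written next to the CD-HIT output can be passed in place of `TOY`, for example via `open(path).read().splitlines()`.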

 


Have you tried simply increasing the identity threshold when using CD-HIT?

Also, what is the biological reasoning behind the clustering? It sounds like these sequences are already highly similar. Are you trying to divide them by a certain mutation or something?

Reply written 4.3 years ago by David Westergaard

Yes, I ran CD-HIT with sequence identity thresholds of 0.9, 0.8, 0.7, 0.6, 0.5 and 0.4. I am still getting a large number of clusters. I am clustering the homologous genes of three different viruses.
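One thing worth checking for a threshold sweep like this: the CD-HIT user guide recommends a different word size (`-n`) for each protein identity range (5 for 0.7 to 1.0, 4 for 0.6 to 0.7, 3 for 0.5 to 0.6, 2 for 0.4 to 0.5). The Python sketch below only builds the command lines and does not run anything. The input name `proteins.fasta` and the output naming scheme are hypothetical, and taking the larger word size at a boundary value such as 0.7 is a choice made here, not something the guide specifies.

```python
def word_size_for(threshold):
    """Word size (-n) suggested by the CD-HIT guide for a protein identity threshold (-c)."""
    if not 0.4 <= threshold <= 1.0:
        raise ValueError("cd-hit protein thresholds are expected in the range 0.4 to 1.0")
    if threshold >= 0.7:
        return 5
    if threshold >= 0.6:
        return 4
    if threshold >= 0.5:
        return 3
    return 2


def cd_hit_command(fasta, threshold):
    """Build (but do not run) a cd-hit argument list for one threshold."""
    out = f"{fasta}.c{int(round(threshold * 100))}"  # hypothetical output naming
    return [
        "cd-hit",
        "-i", fasta,
        "-o", out,
        "-c", str(threshold),
        "-n", str(word_size_for(threshold)),
    ]


if __name__ == "__main__":
    for c in (0.9, 0.8, 0.7, 0.6, 0.5, 0.4):
        print(" ".join(cd_hit_command("proteins.fasta", c)))
```

Each printed line can be pasted into a shell, or the list can be handed to `subprocess.run` if CD-HIT is installed and on the `PATH`.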

Reply written 4.3 years ago by sbdk82
h.mon (Brazil) wrote, 4.3 years ago:

For amino acids, a faster alternative is USEARCH. BLASTClust is part of the older, deprecated BLAST suite, so there is little chance of bug fixes (and I am not certain you found a bug).

You did not provide the commands you executed, nor examples of the dissimilar sequences, so I cannot tell whether your results are good or not.

Edit: your post does not seem to fit the "tool" category; it is a regular question.

sbdk82 (United States) wrote, 4.3 years ago:

For BLASTClust, I tried both the command line (blastclust -i infile -o outfile -p T -L 1 -b T -S 75) and the web server (http://toolkit.tuebingen.mpg.de/blastclust). I also took all the sequences of the first cluster and ran blastclust on them alone, and got a similar result: 700 sequences in the first cluster and fewer than 20 in each of the others. I took two random sequences from the first cluster and did a pairwise alignment, and they are not similar.

I will try USEARCH and see if that gives better results.
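A possible explanation, offered as a hypothesis rather than a diagnosis: BLASTClust groups sequences by single linkage, so two sequences end up in the same cluster whenever a chain of pairwise hits connects them, even if the two themselves do not align. The Python sketch below illustrates that chaining effect with a small union-find. The IDs and the list of "similar" pairs are invented, and this is not BLASTClust's actual code.

```python
def single_linkage(ids, similar_pairs):
    """Cluster ids so that any chain of similar pairs ends up in one cluster."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra   # merge the two clusters

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    # Largest cluster first (Python's sort is stable, so ties keep input order).
    return sorted(clusters.values(), key=len, reverse=True)


if __name__ == "__main__":
    ids = ["A", "B", "C", "D", "E"]
    # Only neighbouring pairs pass the threshold; A and D never match directly.
    pairs = [("A", "B"), ("B", "C"), ("C", "D")]
    print(single_linkage(ids, pairs))
```

Here A and D share a cluster although no hit links them directly. A few multi-domain or otherwise "bridging" proteins could in principle have the same effect on a real data set, which would be consistent with two random members of the large cluster failing to align.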
