How to speed up CD-HIT clustering?
5.6 years ago
bitpir ▴ 240

I'm trying to use CD-HIT to cluster ~250M CDS sequences at both the nucleotide and protein levels. These are mostly NR-like sequences from NCBI. According to the paper, it takes ~140 minutes to cluster 4M sequences with 8 cores. When I ran the job, it took more than 12 hours to process 1M sequences. I tried increasing the number of CPUs to 24, but the speed barely changed. Below are the commands I used to run the clustering. Any help is appreciated. Thanks!

cd-hit-v4.6.8-2017-1208/cd-hit-est -i f1.nuc -o f1.nuc.out -n 10 -M 0 -T 8 -c 0.95 -r 0
cd-hit-v4.6.8-2017-1208/cd-hit -i f1.pep -o f1.pep.out -n 5 -M 0 -T 8 -c 0.95
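
For scale, the two rates quoted in the question can be compared with a quick back-of-the-envelope calculation. This is only a rough sketch: it uses nothing but the figures given above, treats 12 hours as a lower bound on the observed run time, and assumes the paper's benchmark sequences and the ones here are comparable in length and redundancy, which may not hold.

```shell
# Throughput comparison using only the figures quoted in the question.
paper_seqs=4000000      # sequences clustered in the paper's benchmark
paper_minutes=140       # reported time with 8 cores
observed_seqs=1000000   # sequences processed in the run described above
observed_minutes=720    # 12 hours, a lower bound on the observed time

# Sequences per minute for each case, rounded to the nearest integer.
paper_rate=$(awk -v s="$paper_seqs" -v m="$paper_minutes" \
  'BEGIN { printf "%.0f", s / m }')
observed_rate=$(awk -v s="$observed_seqs" -v m="$observed_minutes" \
  'BEGIN { printf "%.0f", s / m }')

# Ratio of the two unrounded rates.
slowdown=$(awk -v ps="$paper_seqs" -v pm="$paper_minutes" \
               -v os="$observed_seqs" -v om="$observed_minutes" \
  'BEGIN { printf "%.1f", (ps / pm) / (os / om) }')

echo "paper:    ${paper_rate} seqs/min"
echo "observed: at most ${observed_rate} seqs/min"
echo "slowdown: at least ${slowdown}x"
```

On these numbers the run is at least about 20 times slower per sequence than the published benchmark, so the gap is much larger than the difference in thread count alone would suggest.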
cdhit protein clustering nucleotide clustering • 1.6k views
