cd-hit-para usage and slow CD-HIT speed
9 months ago

I am writing to seek assistance regarding the usage of CD-HIT software for clustering a dataset of 135,000 nucleotide sequences. Currently, I am working on a cluster with 16 CPUs, and the maximum time limit available on this cluster is one week.

I have tried to speed up CD-HIT with various parameters, such as -T and -M, but none of them has significantly reduced the run time.

Moreover, I have come across the option of using cd-hit-para, but I am uncertain about its usage since it requires specifying IP addresses, whereas I only have information regarding the number of CPUs available on the cluster.

I would greatly appreciate any assistance or guidance you can provide to help optimize the CD-HIT clustering process in my current setup.

Thank you in advance for your support.

processing cdhit-para DNA cdhit • 1.1k views

Maybe not a solution but have you thought about using other clustering tools (e.g. UCLUST or MMseqs2)?

9 months ago
Mensur Dlakic ★ 27k

I think cd-hit-est is used for DNA sequences, but you should verify.

It is not just how many sequences you have - 135,000 is actually a small(ish) number - but how long they are. If they are prokaryotic genome-size long, or worse yet if they are eukaryotic chromosomes, it might take a very long time regardless of how many CPUs are available. Generally speaking, cd-hit is not meant for very long sequences.
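
To check whether sequence length is the bottleneck, it may help to look at the length distribution before clustering. The following is a minimal sketch, assuming a plain FASTA input; the function names `fasta_lengths` and `summarize` and the demo records are made up for illustration and are not part of CD-HIT.

```python
import io
import statistics


def fasta_lengths(handle):
    """Yield the sequence length of each record in a FASTA stream."""
    length = None  # None until the first header line is seen
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def summarize(lengths):
    """Return basic statistics for an iterable of sequence lengths."""
    lengths = sorted(lengths)
    if not lengths:
        raise ValueError("no sequences found")
    return {
        "count": len(lengths),
        "min": lengths[0],
        "median": statistics.median(lengths),
        "max": lengths[-1],
        "total_bases": sum(lengths),
    }


if __name__ == "__main__":
    # Hypothetical in-memory example; for a real file use
    # open("your_sequences.fasta") instead of io.StringIO.
    demo = io.StringIO(">a\nACGT\nAC\n>b\nACGTACGTAC\n>c\nA\n")
    print(summarize(fasta_lengths(demo)))
```

If the maximum or median length turns out to be very large, that would be consistent with the slow run times regardless of the number of CPUs.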


Yes, you are right.

In my case, cd-hit-est is being used for nucleotide sequences.

These are long eccDNA sequences belonging to HM. Even after 7 days, cd-hit-est has clustered only a few thousand of them.

I am not sure what role the -T and -M parameters play if they do not speed up the clustering at all.

8 months ago
meyezili • 0

You can find some explanations for parameters in http://www.bioinformatics.org/cd-hit/cd-hit-user-guide.

Setting the CD-HIT parameter -T 0 uses all CPUs defined in the SLURM script, and setting -M 0 removes the limit on memory usage.

-T functions well in my work.
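
As a rough illustration of how -T and -M fit into a cd-hit-est call inside a SLURM job, here is a minimal sketch. It only assembles the command line and does not run anything. The helper name `build_cd_hit_est_command`, the file names, and the 0.95 identity threshold are placeholders chosen for this example. It also assumes the job was submitted with `--cpus-per-task`, so that SLURM sets `SLURM_CPUS_PER_TASK`; if the variable is missing, the sketch falls back to -T 0.

```python
import os
import shlex


def build_cd_hit_est_command(fasta_in, out_prefix, identity=0.95,
                             threads=None, memory_mb=0):
    """Assemble a cd-hit-est argument list (hypothetical helper).

    threads=None reads SLURM_CPUS_PER_TASK and falls back to 0,
    which cd-hit-est interprets as "use all CPUs".
    memory_mb=0 means no memory limit.
    """
    if threads is None:
        threads = int(os.environ.get("SLURM_CPUS_PER_TASK", "0"))
    return [
        "cd-hit-est",
        "-i", fasta_in,        # input FASTA
        "-o", out_prefix,      # output file prefix
        "-c", str(identity),   # sequence identity threshold
        "-T", str(threads),    # number of threads
        "-M", str(memory_mb),  # memory limit in MB
    ]


if __name__ == "__main__":
    # Placeholder file names; replace with your own paths.
    cmd = build_cd_hit_est_command("eccdna.fasta", "eccdna_clustered")
    print(shlex.join(cmd))
```

The printed line can be pasted into the SLURM batch script, or the list can be passed to `subprocess.run` once the paths are real. Passing the thread count explicitly makes it easier to confirm in the CD-HIT log that the job is actually using the CPUs it was allocated.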
