Question

CD-HIT and cd-hit-para: slow clustering speed

9 months ago

ahteshamabbasi1996 ▴ 10

I am writing to seek assistance regarding the usage of CD-HIT software for clustering a dataset of 135,000 nucleotide sequences. Currently, I am working on a cluster with 16 CPUs, and the maximum time limit available on this cluster is one week.

I have tried to speed up CD-HIT with various parameters, such as -T and -M, but none of them has reduced the run time significantly.

Moreover, I have come across the option of using cd-hit-para, but I am uncertain about its usage since it requires specifying IP addresses, whereas I only have information regarding the number of CPUs available on the cluster.

I would greatly appreciate any assistance or guidance you can provide to help optimize the CD-HIT clustering process in my current setup.

Thank you in advance for your support.

processing cdhit-para DNA cdhit • 1.1k views

updated 8 months ago by meyezili • 0 • written 9 months ago by ahteshamabbasi1996 ▴ 10

Maybe not a solution but have you thought about using other clustering tools (e.g. UCLUST or MMseqs2)?

9 months ago by biofalconch ★ 1.1k

score 0 · Answer 1 · 2023-07-14

9 months ago

Mensur Dlakic ★ 27k

I think cd-hit-est is used for DNA sequences, but you should verify.

What matters is not just how many sequences you have (135,000 is actually a smallish number) but how long they are. If they are as long as prokaryotic genomes, or worse yet eukaryotic chromosomes, clustering might take a very long time regardless of how many CPUs are available. Generally speaking, cd-hit is not meant for very long sequences.
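Since run time depends so much on sequence length, it may help to look at the length distribution before clustering. Below is a minimal Python sketch that does this for a FASTA file using only the standard library. The function names `fasta_lengths` and `summarize` are made up for illustration, and the tiny in-memory FASTA stands in for your real input.

```python
import io
import statistics


def fasta_lengths(handle):
    """Yield (header, sequence_length) for each record in a FASTA stream."""
    header, length = None, 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, length
            header, length = line[1:], 0
        elif header is not None:
            length += len(line)  # sequences may span several lines
    if header is not None:
        yield header, length


def summarize(lengths):
    """Return basic statistics for an iterable of sequence lengths."""
    lengths = sorted(lengths)
    if not lengths:
        return {"count": 0}
    return {
        "count": len(lengths),
        "min": lengths[0],
        "median": statistics.median(lengths),
        "max": lengths[-1],
        "total": sum(lengths),
    }


if __name__ == "__main__":
    # Hypothetical three-record FASTA used only as a demonstration.
    demo = io.StringIO(">a\nACGT\nAC\n>b\nACGTACGTAC\n>c\nA\n")
    print(summarize(length for _, length in fasta_lengths(demo)))
```

For a real file, open it with `open("your_sequences.fasta")` and pass the handle to `fasta_lengths`. If the maximum or median turns out to be very large, that would be consistent with the slow clustering described here.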


Yes, you are right.

In my case, cd-hit-est is being used for nucleotide sequences.

These sequences are actually long eccDNA sequences belonging to HM. But even in 7 days, cd-hit-est is able to cluster only a few thousand of them.

I am not sure what the -T and -M parameters are for if they cannot speed up the clustering at all.

9 months ago by ahteshamabbasi1996 ▴ 10

score 0 · Answer 2 · 2023-08-28

You can find explanations of the parameters in the CD-HIT user guide: http://www.bioinformatics.org/cd-hit/cd-hit-user-guide

With -T 0, CD-HIT uses all CPUs defined in the SLURM script. With -M 0, it may use the available memory without a limit.

-T works well in my own work.
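As a rough illustration of how these options fit together, here is a minimal Python sketch that builds a cd-hit-est command line. It assumes the job was submitted with `--cpus-per-task`, so that SLURM sets `SLURM_CPUS_PER_TASK`; if the variable is missing it falls back to -T 0. The helper name `cdhit_est_command`, the file names, and the 0.9 identity threshold are placeholders, not recommendations.

```python
import os
import shlex


def cdhit_est_command(fasta_in, out_prefix, identity=0.9, threads=None, memory_mb=0):
    """Build an argument list for cd-hit-est.

    threads=None reads SLURM_CPUS_PER_TASK and falls back to 0 (all CPUs).
    memory_mb=0 means no memory limit (-M 0).
    """
    if threads is None:
        # Assumption: the SLURM job requested CPUs with --cpus-per-task.
        threads = int(os.environ.get("SLURM_CPUS_PER_TASK", "0"))
    return [
        "cd-hit-est",
        "-i", str(fasta_in),    # input FASTA
        "-o", str(out_prefix),  # output file name
        "-c", str(identity),    # sequence identity threshold
        "-T", str(threads),     # number of threads, 0 = all CPUs
        "-M", str(memory_mb),   # memory limit in MB, 0 = unlimited
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = cdhit_est_command("eccdna.fasta", "eccdna_c90")
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Printing the command first makes it easy to check in the job log which thread and memory settings were really passed.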