Cut down CD-HIT-EST time
18 months ago
Neeraja • 0

Hello,

I am trying to cluster a FASTA file with ~1.98M sequences at an 80% identity threshold with CD-HIT-EST, and it seems to take a really long time (more than the 14 days my supercomputing cluster allows per job). I am running it with maximum memory and cores (2.9 TB, 80 cores). I have read here that a step-down approach could reduce the run time: for example, clustering my initial FASTA at a 95% threshold, then 90%, 85%, and lastly 80%, each time using the CD-HIT output FASTA from the previous run as input.

Is this a feasible approach? Are there other options for clustering ~1.98M nucleotide sequences at an 80% threshold that would be much faster?

Thanks!

nucleotide fasta CD-HIT CD-HIT-EST • 1.7k views

vsearch is a great cd-hit alternative, but as Mensur Dlakic commented, it sounds like there is something wrong with your command.
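In case you want to try it, a minimal vsearch run for greedy clustering at 80% identity could look like the sketch below (the output names, the .uc file, and the thread count are illustrative placeholders, not taken from the original post):

# Greedy, length-sorted clustering at 80% identity; adjust paths and threads.
vsearch --cluster_fast /fs/PAS/Neeraja_MOLD/Trinity.fasta \
    --id 0.8 \
    --centroids /fs/PAS/Neeraja_MOLD/vsearch_centroids_80.fasta \
    --uc /fs/PAS/Neeraja_MOLD/vsearch_clusters_80.uc \
    --threads 20

--cluster_fast sorts the input by length before clustering, which mirrors CD-HIT's longest-first behaviour, and --centroids writes one representative sequence per cluster.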

18 months ago
Mensur Dlakic ★ 27k

What you want to do is a viable approach.

However, there should be no problem, nor should it take weeks, to cluster ~2 million sequences unless they are very long (millions of bases). I would look more closely at the command and check whether it is actually using all the cores. In fact, showing your command and its output (or log file) should be the default any time a question like this is asked, since having that information never hurts.
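If you do end up using the step-down route, a minimal sketch of the successive runs is shown below (the intermediate file names and the -T/-M values are placeholders; the -n word sizes follow the ranges recommended in the cd-hit-est user guide for each identity threshold):

# Each run clusters the representatives written by the previous run.
cd-hit-est -i Trinity.fasta  -o clust95.fasta -c 0.95 -n 10 -T 20 -M 100000
cd-hit-est -i clust95.fasta  -o clust90.fasta -c 0.90 -n 8  -T 20 -M 100000
cd-hit-est -i clust90.fasta  -o clust85.fasta -c 0.85 -n 6  -T 20 -M 100000
cd-hit-est -i clust85.fasta  -o clust80.fasta -c 0.80 -n 5  -T 20 -M 100000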

Thanks for the replies. This is my command:

"${cd_hit_est}" -o /fs/PAS/Neeraja_MOLD/CD_HIT_EST_1.fasta -c 0.8 -i /fs/PAS/Neeraja_MOLD/Trinity.fasta -n 4 -T 0 -M 0

This is the beginning (and part of the end) of the output. It has been running for over 24 hours and has processed 90,000 sequences:

total number of CPUs in the system is 80
Actual number of CPUs to be used: 80

total seq: 1983474
longest and shortest : 42016 and 269
Total letters: 1586207764
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 1851M
Buffer          : 80 X 44M = 3583M
Table           : 2 X 31M = 63M
Miscellaneous   : 26M
Total           : 5525M

Table limit with the given memory limit:
Max number of representatives: 248606
Max number of word counting entries: 35333738


# comparing sequences from          0  to        498
---------- new table with      228 representatives
.
.
.
.
# comparing sequences from      67094  to      90464
....................
..........    70000  finished      28085  clusters
......
..........    80000  finished      32077  clusters
.......
..........    90000  finished      36093  clusters

The clustering will be slowest at the beginning, as it goes through the longest sequences first. Still, with a database of this size, it should not take weeks as you originally suggested. Were you projecting that based on the fact that it took a day for 100,000 sequences? Clustering speeds up rapidly once it gets to shorter sequences, so you cannot extrapolate from the first 100K sequences.

Also, you will get faster clustering with -n 5 instead of -n 4 for an 80% threshold. I suggest you specify both the memory and the cores explicitly, e.g. -T 20 -M 100000. You may use a different number of threads, but you should not need more than 100 GB for this dataset.
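Applied to the command above, that would be something along these lines (20 threads and 100 GB are just the example values from this comment, not requirements):

# Same run as before, but with -n 5 and explicit thread/memory limits.
"${cd_hit_est}" -i /fs/PAS/Neeraja_MOLD/Trinity.fasta \
    -o /fs/PAS/Neeraja_MOLD/CD_HIT_EST_1.fasta \
    -c 0.8 -n 5 -T 20 -M 100000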

Okay, great, thank you! Yes, I projected based on the first 90K sequences, and that was the mistake in my estimate. This is great to know; I will use what you suggested!

-T 0

Hopefully that is a typo, or is that how you specify "use all cores available"?

Not a typo, and yes, -T 0 means use all available cores.
