how to increase cd-hit capacity

Hello everybody, I have a FASTA file containing 441,648 sequences. I'm running cd-hit on it, but I get no output. I think it's because my FASTA file is too big and cd-hit can't handle this amount of data. Am I right? If so, is there any way to increase cd-hit's capacity so that it can process this many sequences?

I'd appreciate it if anyone could help me with this.

cdhit sequence identity

This should be a manageable number of sequences for cd-hit to handle - I have done millions of proteins without any problem. It would help if you shared the exact command, the whole output, and your computer configuration, especially the memory.
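
To get the memory and CPU details, the standard Linux commands are enough:

free -h
lscpu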

Also, you may want to take a look at MMseqs2.
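
For example, a roughly equivalent MMseqs2 clustering run might look like this (a sketch; clusterRes and tmp are arbitrary output/scratch names):

mmseqs easy-cluster Mesophiles_final.fasta clusterRes tmp --min-seq-id 0.4 --threads 8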

Thanks for responding.

cd-hit -i ../TP_MP_Project/Mesophiles_final.fasta -o ../TP_MP_Project/cdhit_output -c 0.4 -n 2 -M 16000 -T 8 >& log &

This is the command I ran, and I got no output. I ran it on a Linux server; after a while I realized that cd-hit was no longer running and that there was no "cdhit_output" file.

The server configuration is as follows:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       36 bits physical, 48 bits virtual
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               42
Model name:          Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
Stepping:            7
CPU MHz:             3623.375
CPU max MHz:         3800.0000
CPU min MHz:         1600.0000
BogoMIPS:            6823.04
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8192K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d

The output I was asking about would come from the same command you listed above, but without redirection (no >& log at the end). Clustering at 40% identity is a slow process and may not generate enough output (the ~4K needed to flush the write buffer) for anything to be placed into your log file. I'm not sure whether you killed the program or it quit on its own; that will be easier to determine without redirection.
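
If you do want to keep a log, one option (a sketch; stdbuf is a GNU coreutils tool, not part of cd-hit) is to force line buffering so messages appear in the log right away:

stdbuf -oL -eL cd-hit -i ../TP_MP_Project/Mesophiles_final.fasta -o ../TP_MP_Project/cdhit_output -c 0.4 -n 2 -M 16000 -T 8 >& log &

You can then follow progress with tail -f log.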

I suggest you first make sure that cd-hit is working by clustering at high identity:

cd-hit -i ../TP_MP_Project/Mesophiles_final.fasta -o ../TP_MP_Project/cdhit_output.90 -c 0.9 -M 16000 -T 8

If that works, you can use cdhit_output.90 as your input for clustering at 40%, which will cut down the runtime.
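
For example, the second step could look like this (the output name cdhit_output.40 is just illustrative):

cd-hit -i ../TP_MP_Project/cdhit_output.90 -o ../TP_MP_Project/cdhit_output.40 -c 0.4 -n 2 -M 16000 -T 8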

Any reason why you are using that specific set of parameters? What is the goal?

As Mensur Dlakic already pointed out, I think it might be due to your parameter settings. Moreover, you will get very few meaningful results with those parameters: 40% identity and a word size of 2? I personally never go lower than about 80% identity; your clusters still have to make sense in the end.

I'm trying to build a dataset of mesophile proteins. I'm going to use the cd-hit representative sequences as a dataset for machine learning methods, so they shouldn't be similar to each other, to avoid redundant information during the learning step. I set word size = 2 because the cd-hit user's guide says to use a word size of 2 for thresholds of 0.4 ~ 0.5. I chose a cutoff of 40% because the articles I found used a 40% cutoff for dataset construction.
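
For reference, the word-size recommendations from the cd-hit user's guide for proteins are:

-n 5 for thresholds 0.7 ~ 1.0
-n 4 for thresholds 0.6 ~ 0.7
-n 3 for thresholds 0.5 ~ 0.6
-n 2 for thresholds 0.4 ~ 0.5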
