CD-HIT doesn't report the total number of sequences correctly, any known fix?
22 months ago
pcardonap • 0

I am using CD-HIT to reduce redundancy in a dataset of 20405 peptides. CD-HIT seems to work fine, but it identifies only 18404 peptides, as shown in the output below:

    ================================================================
Program: CD-HIT, V4.8.1 (+OpenMP), Nov 13 2019, 13:22:53
Command: cd-hit -i BD_final_con_nombres.fasta -o BDpos.fa -c 0.9 -g 1 -T 0 -M 0 -n 5

Started: Mon Dec 16 16:51:57 2019
================================================================
Output
----------------------------------------------------------------
total number of CPUs in the system is 12
Actual number of CPUs to be used: 12

total seq: 18404
longest and shortest : 300 and 11
Total letters: 737624
Sequences have been sorted

18404  finished       9584  clusters

Approximated maximum memory consumption: 265M
writing new database
writing clustering information
program completed !


I am sure that the FASTA file is correctly formatted in the form:

SEQUENCE

Also the command:

grep -c '>' BD_final_con_nombres.fasta


returns the correct number of peptides (20405).

Does anybody know any way to fix this?

22 months ago
Mensur Dlakic ★ 14k

cd-hit ignores sequences shorter than the throwaway length, which is set with the -l switch. The default is 10, so I surmise that you have 2001 sequences (20405 - 18404) that meet that criterion. This fits your log, where the shortest sequence read is 11 residues long. Use -l 1 (maybe -l 0) to include all sequences.
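
To check this before rerunning, you could count how many records in the input fall at or below the throwaway length. The following is only a rough sketch: the function names are made up for illustration, and it assumes the cutoff is inclusive (length <= 10), which is what the log's shortest length of 11 suggests but which I have not confirmed against the CD-HIT source. Change the comparison to `<` if your build behaves differently.

```python
def read_fasta_lengths(lines):
    """Yield (header, sequence_length) for each record in FASTA-formatted lines."""
    header, length = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, length
            header, length = line[1:], 0
        elif header is not None:
            length += len(line)  # sequences may span several lines
    if header is not None:
        yield header, length


def count_short(lines, throwaway=10):
    """Return (total, short): all records, and records with length <= throwaway.

    The inclusive comparison is an assumption about how CD-HIT applies -l.
    """
    total = short = 0
    for _, length in read_fasta_lengths(lines):
        total += 1
        if length <= throwaway:
            short += 1
    return total, short


# Small in-memory demo; for a real file, pass an open file handle instead,
# e.g. count_short(open("BD_final_con_nombres.fasta")).
demo = [">a", "ACDEFGHIKL", ">b", "ACDEFGHIKLM", ">c", "ACD", "EFG"]
total, short = count_short(demo)
print(f"{total} records, {short} at or below the throwaway length")
```

If the second number comes out as 2001 for your file, the discrepancy is fully explained by the -l default.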