CD-HIT Doesn't report the total number of sequences correctly, any known fix?
22 months ago
pcardonap • 0

I am using CD-HIT to reduce redundancy in a dataset of 20405 peptides. CD-HIT seems to work fine, but it identifies only 18404 peptides, as shown in the output below:

================================================================
Program: CD-HIT, V4.8.1 (+OpenMP), Nov 13 2019, 13:22:53
Command: cd-hit -i BD_final_con_nombres.fasta -o BDpos.fa -c 0.9 -g 1 -T 0 -M 0 -n 5

Started: Mon Dec 16 16:51:57 2019
================================================================
Output
----------------------------------------------------------------
total number of CPUs in the system is 12
Actual number of CPUs to be used: 12

total seq: 18404
longest and shortest : 300 and 11
Total letters: 737624
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 3M
Buffer          : 12 X 10M = 129M
Table           : 2 X 65M = 131M
Miscellaneous   : 0M
Total           : 263M

Table limit with the given memory limit:
Max number of representatives: 744016
Max number of word counting entries: 14908239

# comparing sequences from          0  to       1314
.---------- new table with      840 representatives
# comparing sequences from       1314  to       2534
----------    994 remaining sequences to the next cycle
---------- new table with      187 representatives
# comparing sequences from       1540  to       2744
----------   1023 remaining sequences to the next cycle
---------- new table with      117 representatives
# comparing sequences from       1721  to       2912
----------   1010 remaining sequences to the next cycle
---------- new table with      110 representatives
# comparing sequences from       1902  to       3080
----------    996 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from       2084  to       3249
----------    962 remaining sequences to the next cycle
---------- new table with      123 representatives
# comparing sequences from       2287  to       3438
----------    953 remaining sequences to the next cycle
---------- new table with      116 representatives
# comparing sequences from       2485  to       3622
----------    958 remaining sequences to the next cycle
---------- new table with      117 representatives
# comparing sequences from       2664  to       3788
----------    935 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from       2853  to       3963
----------    932 remaining sequences to the next cycle
---------- new table with      124 representatives
# comparing sequences from       3031  to       4129
----------    891 remaining sequences to the next cycle
---------- new table with      113 representatives
# comparing sequences from       3238  to       4321
----------    700 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from       3621  to       4676
----------    844 remaining sequences to the next cycle
---------- new table with      115 representatives
# comparing sequences from       3832  to       4872
----------    822 remaining sequences to the next cycle
---------- new table with      154 representatives
# comparing sequences from       4050  to       5075
----------    760 remaining sequences to the next cycle
---------- new table with      127 representatives
# comparing sequences from       4315  to       5321
----------    768 remaining sequences to the next cycle
---------- new table with      138 representatives
# comparing sequences from       4553  to       5542
----------    737 remaining sequences to the next cycle
---------- new table with      118 representatives
# comparing sequences from       4805  to       5776
----------    727 remaining sequences to the next cycle
---------- new table with      111 representatives
# comparing sequences from       5049  to       6002
----------    707 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from       5295  to       6231
----------    651 remaining sequences to the next cycle
---------- new table with      127 representatives
# comparing sequences from       5580  to       6496
----------    629 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from       5867  to       6762
----------    563 remaining sequences to the next cycle
---------- new table with      115 representatives
# comparing sequences from       6199  to       7070
----------    585 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from       6485  to       7336
----------    521 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from       6815  to       7642
----------    545 remaining sequences to the next cycle
---------- new table with      116 representatives
# comparing sequences from       7097  to       7904
----------    514 remaining sequences to the next cycle
---------- new table with      127 representatives
# comparing sequences from       7390  to       8176
----------    550 remaining sequences to the next cycle
---------- new table with      110 representatives
# comparing sequences from       7626  to       8395
----------    551 remaining sequences to the next cycle
---------- new table with      123 representatives
# comparing sequences from       7844  to       8598
----------    529 remaining sequences to the next cycle
---------- new table with      118 representatives
# comparing sequences from       8069  to       8807
----------    465 remaining sequences to the next cycle
---------- new table with      139 representatives
# comparing sequences from       8342  to       9060
----------    438 remaining sequences to the next cycle
---------- new table with      140 representatives
# comparing sequences from       8622  to       9320
----------    431 remaining sequences to the next cycle
---------- new table with      130 representatives
# comparing sequences from       8889  to       9568
----------    392 remaining sequences to the next cycle
---------- new table with      117 representatives
# comparing sequences from       9176  to       9835
----------    377 remaining sequences to the next cycle
---------- new table with      114 representatives
# comparing sequences from       9458  to      10097
----------    364 remaining sequences to the next cycle
---------- new table with      130 representatives
# comparing sequences from       9733  to      10352
----------    373 remaining sequences to the next cycle
---------- new table with      122 representatives
# comparing sequences from       9979  to      10580
..........    10000  finished       5044  clusters
----------    326 remaining sequences to the next cycle
---------- new table with      113 representatives
# comparing sequences from      10254  to      10836
----------    296 remaining sequences to the next cycle
---------- new table with      124 representatives
# comparing sequences from      10540  to      11101
----------    285 remaining sequences to the next cycle
---------- new table with      107 representatives
# comparing sequences from      10816  to      11358
----------    260 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from      11098  to      11619
----------    245 remaining sequences to the next cycle
---------- new table with      130 representatives
# comparing sequences from      11374  to      11876
----------    277 remaining sequences to the next cycle
---------- new table with      157 representatives
# comparing sequences from      11599  to      12085
----------    246 remaining sequences to the next cycle
---------- new table with      146 representatives
# comparing sequences from      11839  to      12307
----------    223 remaining sequences to the next cycle
---------- new table with      146 representatives
# comparing sequences from      12084  to      12535
----------    225 remaining sequences to the next cycle
---------- new table with      128 representatives
# comparing sequences from      12310  to      12745
----------    225 remaining sequences to the next cycle
---------- new table with      117 representatives
# comparing sequences from      12520  to      12940
----------    184 remaining sequences to the next cycle
---------- new table with      108 representatives
# comparing sequences from      12756  to      13159
----------    190 remaining sequences to the next cycle
---------- new table with      131 representatives
# comparing sequences from      12969  to      13357
----------    180 remaining sequences to the next cycle
---------- new table with      122 representatives
# comparing sequences from      13177  to      13550
----------    154 remaining sequences to the next cycle
---------- new table with      129 representatives
# comparing sequences from      13396  to      13753
----------    167 remaining sequences to the next cycle
---------- new table with      102 representatives
# comparing sequences from      13586  to      13930
----------    149 remaining sequences to the next cycle
---------- new table with      115 representatives
# comparing sequences from      13781  to      14111
----------    143 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from      13968  to      14284
----------     99 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from      14185  to      14486
----------    112 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from      14374  to      14661
----------     69 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from      14592  to      14864
----------     78 remaining sequences to the next cycle
---------- new table with      118 representatives
# comparing sequences from      14786  to      15044
----------     76 remaining sequences to the next cycle
---------- new table with      115 representatives
# comparing sequences from      14968  to      15213
----------     72 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from      15141  to      15374
----------     51 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from      15323  to      15543
----------     53 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from      15490  to      15698
----------      9 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from      15689  to      15882
....................---------- new table with       89 representatives
# comparing sequences from      15882  to      16062
----------      1 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from      16061  to      16228
----------      2 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from      16226  to      16381
----------     11 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from      16370  to      16515
..................---------- new table with       90 representatives
# comparing sequences from      16515  to      16649
...................---------- new table with       77 representatives
# comparing sequences from      16649  to      16774
..................---------- new table with       73 representatives
# comparing sequences from      16774  to      16890
...................---------- new table with       57 representatives
# comparing sequences from      16890  to      16998
..................---------- new table with       56 representatives
# comparing sequences from      16998  to      17098
..................---------- new table with       59 representatives
# comparing sequences from      17098  to      17191
...................---------- new table with       63 representatives
# comparing sequences from      17191  to      17277
.................---------- new table with       47 representatives
# comparing sequences from      17277  to      17357
................---------- new table with       49 representatives
# comparing sequences from      17357  to      17431
..................---------- new table with       42 representatives
# comparing sequences from      17431  to      18404
.....................---------- new table with      536 representatives

18404  finished       9584  clusters

Approximated maximum memory consumption: 265M
writing new database
writing clustering information
program completed !


I am sure that the FASTA file is correctly formatted, in the form:

SEQUENCE

Also the command:

grep -c '>' BD_final_con_nombres.fasta


returns the correct number of peptides (20405).

Does anybody know any way to fix this?

22 months ago
Mensur Dlakic ★ 14k

cd-hit ignores sequences shorter than the throw-away length, which is set with the -l switch. The default is 10, so I surmise that 2001 of your sequences (20405 - 18404) meet that criterion. Use -l 1 (maybe -l 0) to include all sequences.
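To check this explanation before re-running cd-hit, you could count how many records in the input fall at or below the throw-away length. The following is a minimal Python sketch, not part of cd-hit: the function names `read_fasta_lengths` and `count_short` are made up for illustration, and it assumes the cutoff is inclusive (length <= 10 is discarded), which is consistent with the log reporting a shortest sequence of 11 but is not proven by it.

```python
def read_fasta_lengths(lines):
    """Yield (header, sequence_length) for each record in FASTA-formatted lines."""
    header, length = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, length
            header, length = line[1:], 0
        elif header is not None:
            length += len(line)  # sequences may span several lines
    if header is not None:
        yield header, length


def count_short(lines, throw_away=10):
    """Return (total_records, records_with_length_at_or_below_throw_away).

    Assumption: the cutoff is inclusive, i.e. length <= throw_away is dropped.
    """
    total = short = 0
    for _, seq_len in read_fasta_lengths(lines):
        total += 1
        if seq_len <= throw_away:
            short += 1
    return total, short


# Tiny in-memory example: lengths are 10, 11 and 6 (the last one is wrapped).
demo = [">a", "ACDEFGHIKL", ">b", "ACDEFGHIKLM", ">c", "ACD", "EFG"]
total, short = count_short(demo)
print(f"{total} records, {short} at or below the cutoff, {total - short} kept")
```

For the real file, pass an open file handle instead of `demo`, for example `count_short(open("BD_final_con_nombres.fasta"))`. If the second number comes out as 2001, the short-sequence filter accounts for the whole difference between 20405 and 18404.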