Question: CD-HIT doesn't report the total number of sequences correctly; any known fix?
pcardonap wrote:

I am using CD-HIT to reduce redundancy in a dataset of 20405 peptides. CD-HIT seems to work fine, but it reads only 18404 sequences, as shown in the output below:

================================================================
Program: CD-HIT, V4.8.1 (+OpenMP), Nov 13 2019, 13:22:53
Command: cd-hit -i BD_final_con_nombres.fasta -o BDpos.fa -c
         0.9 -g 1 -T 0 -M 0 -n 5

Started: Mon Dec 16 16:51:57 2019
================================================================
                            Output                              
----------------------------------------------------------------
total number of CPUs in the system is 12
Actual number of CPUs to be used: 12

total seq: 18404
longest and shortest : 300 and 11
Total letters: 737624
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 3M
Buffer          : 12 X 10M = 129M
Table           : 2 X 65M = 131M
Miscellaneous   : 0M
Total           : 263M

Table limit with the given memory limit:
Max number of representatives: 744016
Max number of word counting entries: 14908239

# comparing sequences from          0  to       1314
.---------- new table with      840 representatives
# comparing sequences from       1314  to       2534
----------    994 remaining sequences to the next cycle
---------- new table with      187 representatives
# comparing sequences from       1540  to       2744
----------   1023 remaining sequences to the next cycle
---------- new table with      117 representatives
# comparing sequences from       1721  to       2912
----------   1010 remaining sequences to the next cycle
---------- new table with      110 representatives
# comparing sequences from       1902  to       3080
----------    996 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from       2084  to       3249
----------    962 remaining sequences to the next cycle
---------- new table with      123 representatives
# comparing sequences from       2287  to       3438
----------    953 remaining sequences to the next cycle
---------- new table with      116 representatives
# comparing sequences from       2485  to       3622
----------    958 remaining sequences to the next cycle
---------- new table with      117 representatives
# comparing sequences from       2664  to       3788
----------    935 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from       2853  to       3963
----------    932 remaining sequences to the next cycle
---------- new table with      124 representatives
# comparing sequences from       3031  to       4129
----------    891 remaining sequences to the next cycle
---------- new table with      113 representatives
# comparing sequences from       3238  to       4321
----------    700 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from       3621  to       4676
----------    844 remaining sequences to the next cycle
---------- new table with      115 representatives
# comparing sequences from       3832  to       4872
----------    822 remaining sequences to the next cycle
---------- new table with      154 representatives
# comparing sequences from       4050  to       5075
----------    760 remaining sequences to the next cycle
---------- new table with      127 representatives
# comparing sequences from       4315  to       5321
----------    768 remaining sequences to the next cycle
---------- new table with      138 representatives
# comparing sequences from       4553  to       5542
----------    737 remaining sequences to the next cycle
---------- new table with      118 representatives
# comparing sequences from       4805  to       5776
----------    727 remaining sequences to the next cycle
---------- new table with      111 representatives
# comparing sequences from       5049  to       6002
----------    707 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from       5295  to       6231
----------    651 remaining sequences to the next cycle
---------- new table with      127 representatives
# comparing sequences from       5580  to       6496
----------    629 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from       5867  to       6762
----------    563 remaining sequences to the next cycle
---------- new table with      115 representatives
# comparing sequences from       6199  to       7070
----------    585 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from       6485  to       7336
----------    521 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from       6815  to       7642
----------    545 remaining sequences to the next cycle
---------- new table with      116 representatives
# comparing sequences from       7097  to       7904
----------    514 remaining sequences to the next cycle
---------- new table with      127 representatives
# comparing sequences from       7390  to       8176
----------    550 remaining sequences to the next cycle
---------- new table with      110 representatives
# comparing sequences from       7626  to       8395
----------    551 remaining sequences to the next cycle
---------- new table with      123 representatives
# comparing sequences from       7844  to       8598
----------    529 remaining sequences to the next cycle
---------- new table with      118 representatives
# comparing sequences from       8069  to       8807
----------    465 remaining sequences to the next cycle
---------- new table with      139 representatives
# comparing sequences from       8342  to       9060
----------    438 remaining sequences to the next cycle
---------- new table with      140 representatives
# comparing sequences from       8622  to       9320
----------    431 remaining sequences to the next cycle
---------- new table with      130 representatives
# comparing sequences from       8889  to       9568
----------    392 remaining sequences to the next cycle
---------- new table with      117 representatives
# comparing sequences from       9176  to       9835
----------    377 remaining sequences to the next cycle
---------- new table with      114 representatives
# comparing sequences from       9458  to      10097
----------    364 remaining sequences to the next cycle
---------- new table with      130 representatives
# comparing sequences from       9733  to      10352
----------    373 remaining sequences to the next cycle
---------- new table with      122 representatives
# comparing sequences from       9979  to      10580
..........    10000  finished       5044  clusters
----------    326 remaining sequences to the next cycle
---------- new table with      113 representatives
# comparing sequences from      10254  to      10836
----------    296 remaining sequences to the next cycle
---------- new table with      124 representatives
# comparing sequences from      10540  to      11101
----------    285 remaining sequences to the next cycle
---------- new table with      107 representatives
# comparing sequences from      10816  to      11358
----------    260 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from      11098  to      11619
----------    245 remaining sequences to the next cycle
---------- new table with      130 representatives
# comparing sequences from      11374  to      11876
----------    277 remaining sequences to the next cycle
---------- new table with      157 representatives
# comparing sequences from      11599  to      12085
----------    246 remaining sequences to the next cycle
---------- new table with      146 representatives
# comparing sequences from      11839  to      12307
----------    223 remaining sequences to the next cycle
---------- new table with      146 representatives
# comparing sequences from      12084  to      12535
----------    225 remaining sequences to the next cycle
---------- new table with      128 representatives
# comparing sequences from      12310  to      12745
----------    225 remaining sequences to the next cycle
---------- new table with      117 representatives
# comparing sequences from      12520  to      12940
----------    184 remaining sequences to the next cycle
---------- new table with      108 representatives
# comparing sequences from      12756  to      13159
----------    190 remaining sequences to the next cycle
---------- new table with      131 representatives
# comparing sequences from      12969  to      13357
----------    180 remaining sequences to the next cycle
---------- new table with      122 representatives
# comparing sequences from      13177  to      13550
----------    154 remaining sequences to the next cycle
---------- new table with      129 representatives
# comparing sequences from      13396  to      13753
----------    167 remaining sequences to the next cycle
---------- new table with      102 representatives
# comparing sequences from      13586  to      13930
----------    149 remaining sequences to the next cycle
---------- new table with      115 representatives
# comparing sequences from      13781  to      14111
----------    143 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from      13968  to      14284
----------     99 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from      14185  to      14486
----------    112 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from      14374  to      14661
----------     69 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from      14592  to      14864
----------     78 remaining sequences to the next cycle
---------- new table with      118 representatives
# comparing sequences from      14786  to      15044
----------     76 remaining sequences to the next cycle
---------- new table with      115 representatives
# comparing sequences from      14968  to      15213
----------     72 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from      15141  to      15374
----------     51 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from      15323  to      15543
----------     53 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from      15490  to      15698
----------      9 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from      15689  to      15882
....................---------- new table with       89 representatives
# comparing sequences from      15882  to      16062
----------      1 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from      16061  to      16228
----------      2 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from      16226  to      16381
----------     11 remaining sequences to the next cycle
---------- new table with      100 representatives
# comparing sequences from      16370  to      16515
..................---------- new table with       90 representatives
# comparing sequences from      16515  to      16649
...................---------- new table with       77 representatives
# comparing sequences from      16649  to      16774
..................---------- new table with       73 representatives
# comparing sequences from      16774  to      16890
...................---------- new table with       57 representatives
# comparing sequences from      16890  to      16998
..................---------- new table with       56 representatives
# comparing sequences from      16998  to      17098
..................---------- new table with       59 representatives
# comparing sequences from      17098  to      17191
...................---------- new table with       63 representatives
# comparing sequences from      17191  to      17277
.................---------- new table with       47 representatives
# comparing sequences from      17277  to      17357
................---------- new table with       49 representatives
# comparing sequences from      17357  to      17431
..................---------- new table with       42 representatives
# comparing sequences from      17431  to      18404
.....................---------- new table with      536 representatives

    18404  finished       9584  clusters

Approximated maximum memory consumption: 265M
writing new database
writing clustering information
program completed !

I am sure that the FASTA file is correctly formatted, in the form:

>header

SEQUENCE

Also, the command:

grep -c '>' BD_final_con_nombres.fasta

returns the correct number of peptides (20405).

Does anybody know of a way to fix this?

Mensur Dlakic wrote:

cd-hit ignores sequences shorter than the throwaway length, which is set with the -l switch. The default is 10, so I surmise that the 2001 missing sequences (20405 - 18404) fall under that cutoff; this fits the log, which reports a shortest sequence of 11. Use -l 1 (maybe -l 0) to include all sequences.
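One way to check this explanation before rerunning cd-hit is to count how many input sequences fall at or below the cutoff. The following is only a rough sketch: the function names are made up for illustration, and treating a sequence of exactly 10 residues as discarded is an assumption inferred from the log (shortest kept sequence is 11), not something confirmed here.

```python
def sequence_lengths(fasta_lines):
    """Return the length of every record in an iterable of FASTA lines."""
    lengths = []
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between header and sequence
        if line.startswith(">"):
            lengths.append(0)  # a header starts a new record
        elif lengths:
            lengths[-1] += len(line)  # sequences may span several lines
    return lengths


def count_short(fasta_lines, throwaway_length=10):
    """Return (total records, records with length <= throwaway_length).

    ASSUMPTION: the boundary is inclusive (<=); adjust if cd-hit
    behaves differently in your version.
    """
    lengths = sequence_lengths(fasta_lines)
    short = sum(1 for n in lengths if n <= throwaway_length)
    return len(lengths), short


# Small in-memory example: lengths are 10, 11 and 6 residues.
demo = [">pep1", "ACDEFGHIKL", "", ">pep2", "ACDEFGHIKLM", ">pep3", "ACD", "EFG"]
total, short = count_short(demo)
print(total, short)  # 3 2
```

To run it on the real input, pass an open file handle instead of the list, for example `count_short(open("BD_final_con_nombres.fasta"))`. If the throwaway length is indeed the cause, the second number should account for the 2001 missing sequences.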
