Question: CD-HIT doesn't report the total number of sequences correctly; any known fix?

pcardonap wrote:

I am using CD-HIT to reduce redundancy in a dataset of 20405 peptides. CD-HIT seems to work fine, but it reports only 18404 sequences, as shown in the output below:

```
================================================================
Program: CD-HIT, V4.8.1 (+OpenMP), Nov 13 2019, 13:22:53
Command: cd-hit -i BD_final_con_nombres.fasta -o BDpos.fa -c
0.9 -g 1 -T 0 -M 0 -n 5
Started: Mon Dec 16 16:51:57 2019
================================================================
Output
----------------------------------------------------------------
total number of CPUs in the system is 12
Actual number of CPUs to be used: 12
total seq: 18404
longest and shortest : 300 and 11
Total letters: 737624
Sequences have been sorted
Approximated minimal memory consumption:
Sequence : 3M
Buffer : 12 X 10M = 129M
Table : 2 X 65M = 131M
Miscellaneous : 0M
Total : 263M
Table limit with the given memory limit:
Max number of representatives: 744016
Max number of word counting entries: 14908239
# comparing sequences from 0 to 1314
.---------- new table with 840 representatives
# comparing sequences from 1314 to 2534
---------- 994 remaining sequences to the next cycle
---------- new table with 187 representatives
# comparing sequences from 1540 to 2744
---------- 1023 remaining sequences to the next cycle
---------- new table with 117 representatives
# comparing sequences from 1721 to 2912
---------- 1010 remaining sequences to the next cycle
---------- new table with 110 representatives
# comparing sequences from 1902 to 3080
---------- 996 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 2084 to 3249
---------- 962 remaining sequences to the next cycle
---------- new table with 123 representatives
# comparing sequences from 2287 to 3438
---------- 953 remaining sequences to the next cycle
---------- new table with 116 representatives
# comparing sequences from 2485 to 3622
---------- 958 remaining sequences to the next cycle
---------- new table with 117 representatives
# comparing sequences from 2664 to 3788
---------- 935 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 2853 to 3963
---------- 932 remaining sequences to the next cycle
---------- new table with 124 representatives
# comparing sequences from 3031 to 4129
---------- 891 remaining sequences to the next cycle
---------- new table with 113 representatives
# comparing sequences from 3238 to 4321
---------- 700 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 3621 to 4676
---------- 844 remaining sequences to the next cycle
---------- new table with 115 representatives
# comparing sequences from 3832 to 4872
---------- 822 remaining sequences to the next cycle
---------- new table with 154 representatives
# comparing sequences from 4050 to 5075
---------- 760 remaining sequences to the next cycle
---------- new table with 127 representatives
# comparing sequences from 4315 to 5321
---------- 768 remaining sequences to the next cycle
---------- new table with 138 representatives
# comparing sequences from 4553 to 5542
---------- 737 remaining sequences to the next cycle
---------- new table with 118 representatives
# comparing sequences from 4805 to 5776
---------- 727 remaining sequences to the next cycle
---------- new table with 111 representatives
# comparing sequences from 5049 to 6002
---------- 707 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 5295 to 6231
---------- 651 remaining sequences to the next cycle
---------- new table with 127 representatives
# comparing sequences from 5580 to 6496
---------- 629 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 5867 to 6762
---------- 563 remaining sequences to the next cycle
---------- new table with 115 representatives
# comparing sequences from 6199 to 7070
---------- 585 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 6485 to 7336
---------- 521 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 6815 to 7642
---------- 545 remaining sequences to the next cycle
---------- new table with 116 representatives
# comparing sequences from 7097 to 7904
---------- 514 remaining sequences to the next cycle
---------- new table with 127 representatives
# comparing sequences from 7390 to 8176
---------- 550 remaining sequences to the next cycle
---------- new table with 110 representatives
# comparing sequences from 7626 to 8395
---------- 551 remaining sequences to the next cycle
---------- new table with 123 representatives
# comparing sequences from 7844 to 8598
---------- 529 remaining sequences to the next cycle
---------- new table with 118 representatives
# comparing sequences from 8069 to 8807
---------- 465 remaining sequences to the next cycle
---------- new table with 139 representatives
# comparing sequences from 8342 to 9060
---------- 438 remaining sequences to the next cycle
---------- new table with 140 representatives
# comparing sequences from 8622 to 9320
---------- 431 remaining sequences to the next cycle
---------- new table with 130 representatives
# comparing sequences from 8889 to 9568
---------- 392 remaining sequences to the next cycle
---------- new table with 117 representatives
# comparing sequences from 9176 to 9835
---------- 377 remaining sequences to the next cycle
---------- new table with 114 representatives
# comparing sequences from 9458 to 10097
---------- 364 remaining sequences to the next cycle
---------- new table with 130 representatives
# comparing sequences from 9733 to 10352
---------- 373 remaining sequences to the next cycle
---------- new table with 122 representatives
# comparing sequences from 9979 to 10580
.......... 10000 finished 5044 clusters
---------- 326 remaining sequences to the next cycle
---------- new table with 113 representatives
# comparing sequences from 10254 to 10836
---------- 296 remaining sequences to the next cycle
---------- new table with 124 representatives
# comparing sequences from 10540 to 11101
---------- 285 remaining sequences to the next cycle
---------- new table with 107 representatives
# comparing sequences from 10816 to 11358
---------- 260 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 11098 to 11619
---------- 245 remaining sequences to the next cycle
---------- new table with 130 representatives
# comparing sequences from 11374 to 11876
---------- 277 remaining sequences to the next cycle
---------- new table with 157 representatives
# comparing sequences from 11599 to 12085
---------- 246 remaining sequences to the next cycle
---------- new table with 146 representatives
# comparing sequences from 11839 to 12307
---------- 223 remaining sequences to the next cycle
---------- new table with 146 representatives
# comparing sequences from 12084 to 12535
---------- 225 remaining sequences to the next cycle
---------- new table with 128 representatives
# comparing sequences from 12310 to 12745
---------- 225 remaining sequences to the next cycle
---------- new table with 117 representatives
# comparing sequences from 12520 to 12940
---------- 184 remaining sequences to the next cycle
---------- new table with 108 representatives
# comparing sequences from 12756 to 13159
---------- 190 remaining sequences to the next cycle
---------- new table with 131 representatives
# comparing sequences from 12969 to 13357
---------- 180 remaining sequences to the next cycle
---------- new table with 122 representatives
# comparing sequences from 13177 to 13550
---------- 154 remaining sequences to the next cycle
---------- new table with 129 representatives
# comparing sequences from 13396 to 13753
---------- 167 remaining sequences to the next cycle
---------- new table with 102 representatives
# comparing sequences from 13586 to 13930
---------- 149 remaining sequences to the next cycle
---------- new table with 115 representatives
# comparing sequences from 13781 to 14111
---------- 143 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 13968 to 14284
---------- 99 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 14185 to 14486
---------- 112 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 14374 to 14661
---------- 69 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 14592 to 14864
---------- 78 remaining sequences to the next cycle
---------- new table with 118 representatives
# comparing sequences from 14786 to 15044
---------- 76 remaining sequences to the next cycle
---------- new table with 115 representatives
# comparing sequences from 14968 to 15213
---------- 72 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 15141 to 15374
---------- 51 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 15323 to 15543
---------- 53 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 15490 to 15698
---------- 9 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 15689 to 15882
....................---------- new table with 89 representatives
# comparing sequences from 15882 to 16062
---------- 1 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 16061 to 16228
---------- 2 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 16226 to 16381
---------- 11 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 16370 to 16515
..................---------- new table with 90 representatives
# comparing sequences from 16515 to 16649
...................---------- new table with 77 representatives
# comparing sequences from 16649 to 16774
..................---------- new table with 73 representatives
# comparing sequences from 16774 to 16890
...................---------- new table with 57 representatives
# comparing sequences from 16890 to 16998
..................---------- new table with 56 representatives
# comparing sequences from 16998 to 17098
..................---------- new table with 59 representatives
# comparing sequences from 17098 to 17191
...................---------- new table with 63 representatives
# comparing sequences from 17191 to 17277
.................---------- new table with 47 representatives
# comparing sequences from 17277 to 17357
................---------- new table with 49 representatives
# comparing sequences from 17357 to 17431
..................---------- new table with 42 representatives
# comparing sequences from 17431 to 18404
.....................---------- new table with 536 representatives
18404 finished 9584 clusters
Approximated maximum memory consumption: 265M
writing new database
writing clustering information
program completed !
```

I am sure that the FASTA file is correctly formatted, in the form:

>header

SEQUENCE

Also, the command:

```
grep -c '>' BD_final_con_nombres.fasta
```

returns the correct number of peptides (20405).
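Since `grep` only counts header lines, it says nothing about sequence lengths. The log above reports the shortest sequence as 11, so one thing that may be worth checking is whether the missing peptides are all very short. The sketch below is a minimal, dependency-free way to count records and tally those at or below a length cutoff. The function names and the `max_short_len=10` cutoff are illustrative assumptions, not part of CD-HIT; the cutoff is chosen only because CD-HIT's usage text lists `-l` (length of throw-away sequences) with a default of 10, which may or may not explain the gap here.

```python
import io


def fasta_lengths(handle):
    """Yield (header, sequence_length) for each record in a FASTA stream."""
    header, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, length
            header, length = line[1:], 0
        elif header is not None:
            # Sequence lines may be wrapped, so lengths are summed.
            length += len(line)
    if header is not None:
        yield header, length


def count_short(handle, max_short_len=10):
    """Return (total_records, records_with_length <= max_short_len).

    max_short_len=10 is an assumption for illustration only.
    """
    total = short = 0
    for _, n in fasta_lengths(handle):
        total += 1
        if n <= max_short_len:
            short += 1
    return total, short


# Small in-memory demo; the third record is wrapped over two lines.
demo = io.StringIO(">a\nACDEFGHIKL\n>b\nACDEFGHIKLM\n>c\nACD\nEFG\n")
demo_total, demo_short = count_short(demo)
print(f"total records: {demo_total}, at or below cutoff: {demo_short}")
```

To run it on the real input, open `BD_final_con_nombres.fasta` and pass the file handle to `count_short`. If the short-record count comes out close to 20405 - 18404 = 2001, the length cutoff is a plausible explanation; if not, the cause lies elsewhere.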

Does anybody know of a way to fix this?

written 11 months ago by pcardonap • modified 11 months ago by Mensur Dlakic