11.6 years ago
Lee Katz
I am clustering genes with CD-HIT, and I can't find these answers in the manual. Question 1: is the executable cd-hit-est the best way to cluster genes?
Question 2: how do I keep identifiers from being truncated in the output file? Example below. I would like to extract the sequences after clustering, and renaming the genes to something shorter would be difficult in my workflow.
>Cluster 0
0 8262nt, >AM263198_am263198_c... *
>Cluster 1
0 6669nt, >CP001175_cp001175_c... *
>Cluster 2
0 277nt, >ADXE00000000_adxe01... at +/97.83%
1 6597nt, >CP001175_cp001175_c... *
>Cluster 3
0 771nt, >ADXH00000000_adxh01... at +/95.07%
1 6588nt, >AE017262_ae017262_c... *
2 6588nt, >FM242711_fm242711_c... at +/99.61%
Never mind the second question. The answer is to use -d 100 (or another large number).
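For extracting sequences after clustering, here is a minimal Python sketch of reading the cluster listing shown above. It is not part of CD-HIT: the function name `parse_clstr` and the returned dictionary layout are my own assumptions, and it assumes every member line keeps the `index length, >identifier... status` layout from the example, with `...` following the identifier.

```python
import re

# Matches a member line such as "0 8262nt, >AM263198_am263198_c... *".
# Groups: sequence length, identifier (text before "..."), trailing status.
MEMBER_RE = re.compile(r"^\d+\s+(\d+)(?:nt|aa),\s+>(.+?)\.\.\.\s+(.*)$")


def parse_clstr(lines):
    """Group member lines of a cluster listing under their '>Cluster N' header.

    Hypothetical helper: returns {cluster_name: [member_dict, ...]}.
    """
    clusters = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # Header line: start a new cluster, dropping the leading ">".
            current = line[1:]
            clusters[current] = []
            continue
        match = MEMBER_RE.match(line)
        if match is None or current is None:
            # Skip anything that does not look like a member line.
            continue
        length, seq_id, status = match.groups()
        clusters[current].append({
            "id": seq_id,
            "length": int(length),
            # A lone "*" marks the representative sequence of the cluster.
            "representative": status.strip() == "*",
        })
    return clusters


def representative_ids(clusters):
    """Return the representative identifier of each cluster, in file order."""
    return [m["id"] for members in clusters.values()
            for m in members if m["representative"]]
```

With a file on disk, something like `parse_clstr(open(path))` should work, where `path` is whatever your cluster file is called. The identifiers it returns are only as complete as the listing itself, so they can be matched against the FASTA headers only when the run used a large enough -d.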