Clustering Genes, Getting The Full Identifier

11.1 years ago

Lee Katz ★ 3.1k

I am clustering genes with CD-HIT, and I can't find these answers in the manual. Question 1: is the cd-hit-est executable the best way to cluster genes?

Question 2: how do I keep identifiers from being truncated in the output file? Example below. I would like to extract the sequences after clustering, and it would be difficult in my workflow to rename the genes to something shorter.

>Cluster 0
0       8262nt, >AM263198_am263198_c... *
>Cluster 1
0       6669nt, >CP001175_cp001175_c... *
>Cluster 2
0       277nt, >ADXE00000000_adxe01... at +/97.83%
1       6597nt, >CP001175_cp001175_c... *
>Cluster 3
0       771nt, >ADXH00000000_adxh01... at +/95.07%
1       6588nt, >AE017262_ae017262_c... *
2       6588nt, >FM242711_fm242711_c... at +/99.61%
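For extracting sequences after clustering, a listing like the one above can be read into a simple per-cluster structure. The following is a minimal sketch, assuming the member lines always follow the `index  length(nt|aa), >identifier... status` layout seen in this example; the names `Member`, `MEMBER_RE`, and `parse_clstr` are made up for illustration and are not part of CD-HIT.

```python
import re
from typing import Dict, Iterable, List, NamedTuple


class Member(NamedTuple):
    seq_id: str              # identifier as written in the .clstr file (may be truncated)
    length: int              # sequence length reported by CD-HIT
    is_representative: bool  # True for the line that ends in "*"


# Assumed layout of a member line: "0   8262nt, >SOME_ID... *"
# or "0   277nt, >SOME_ID... at +/97.83%".
MEMBER_RE = re.compile(r"^\d+\s+(\d+)(?:nt|aa),\s+>(.+?)\.\.\.\s+(\*|at\s.+)$")


def parse_clstr(lines: Iterable[str]) -> Dict[str, List[Member]]:
    """Group the member lines of a .clstr listing under their cluster header."""
    clusters: Dict[str, List[Member]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:]  # e.g. "Cluster 0"
            clusters[current] = []
            continue
        match = MEMBER_RE.match(line)
        if match is None or current is None:
            raise ValueError(f"unexpected line in cluster listing: {raw!r}")
        length, seq_id, status = match.groups()
        clusters[current].append(Member(seq_id, int(length), status == "*"))
    return clusters


example = """\
>Cluster 0
0       8262nt, >AM263198_am263198_c... *
>Cluster 1
0       6669nt, >CP001175_cp001175_c... *
>Cluster 2
0       277nt, >ADXE00000000_adxe01... at +/97.83%
1       6597nt, >CP001175_cp001175_c... *
>Cluster 3
0       771nt, >ADXH00000000_adxh01... at +/95.07%
1       6588nt, >AE017262_ae017262_c... *
2       6588nt, >FM242711_fm242711_c... at +/99.61%
"""

clusters = parse_clstr(example.splitlines())
for name, members in clusters.items():
    representative = next(m.seq_id for m in members if m.is_representative)
    print(name, representative, len(members))
```

With truncated identifiers, as here, the parsed `seq_id` values cannot be matched reliably against the FASTA headers (Cluster 1 and Cluster 2 show the same truncated representative), which is why the full identifier matters for this workflow.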


Never mind the second question. The answer is to use -d 100 (or another large number), which sets the length of the description written to the .clstr file.
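As a rough illustration of where the -d option goes, here is a minimal sketch that assembles a cd-hit-est command line from Python. The file names `genes.fasta` and `genes_clustered`, the 0.95 identity threshold, and the helper name `build_cd_hit_est_command` are assumptions for the example, not values from this thread; only -i, -o, -c, and -d are used. The CD-HIT usage text also describes -d 0 as taking the FASTA defline up to the first space, which may be worth checking against your installed version.

```python
import os
import shutil
import subprocess
from typing import List


def build_cd_hit_est_command(
    fasta_in: str,
    out_prefix: str,
    identity: float = 0.95,        # assumed threshold, not from the thread
    description_length: int = 100,  # the -d value suggested in the reply
) -> List[str]:
    """Assemble the argument list for a cd-hit-est run (hypothetical helper)."""
    # Note: for identity thresholds below 0.95 the CD-HIT guide recommends
    # a smaller word size via -n; this sketch leaves -n at its default.
    return [
        "cd-hit-est",
        "-i", fasta_in,                 # input FASTA of nucleotide sequences
        "-o", out_prefix,               # output prefix; clusters go to <prefix>.clstr
        "-c", str(identity),            # sequence identity threshold
        "-d", str(description_length),  # description length in the .clstr file
    ]


if __name__ == "__main__":
    cmd = build_cd_hit_est_command("genes.fasta", "genes_clustered")
    print(" ".join(cmd))
    # Only run if the executable and the (hypothetical) input file are present.
    if shutil.which("cd-hit-est") and os.path.isfile("genes.fasta"):
        subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems if file names contain spaces.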

11.1 years ago by Lee Katz ★ 3.1k
