Clustering Genes, Getting The Full Identifier
11.1 years ago
Lee Katz ★ 3.1k

I am clustering genes with CD-HIT and can't find answers to the following questions in the manual. Question 1: is the cd-hit-est executable the best way to cluster genes?

Question 2: how do I keep identifiers from being truncated in the output file? An example is below. I would like to extract the sequences after clustering, and renaming the genes to something shorter would be difficult in my workflow.

>Cluster 0
0       8262nt, >AM263198_am263198_c... *
>Cluster 1
0       6669nt, >CP001175_cp001175_c... *
>Cluster 2
0       277nt, >ADXE00000000_adxe01... at +/97.83%
1       6597nt, >CP001175_cp001175_c... *
>Cluster 3
0       771nt, >ADXH00000000_adxh01... at +/95.07%
1       6588nt, >AE017262_ae017262_c... *
2       6588nt, >FM242711_fm242711_c... at +/99.61%

Never mind the second question. The answer is to use -d 100 (or another large number).
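
For anyone who wants to pull the identifiers back out of the cluster file afterwards, here is a minimal Python sketch. It assumes the .clstr layout shown in the question (a ">Cluster N" header followed by member lines, with "*" marking the representative) and that -d was set large enough for the identifiers to be written in full. The function name parse_clstr and the dictionary layout are my own choices for illustration, not part of CD-HIT.

```python
import re

# Matches member lines of a .clstr file, for example:
#   "0       8262nt, >AM263198_am263198_c... *"
#   "0       277nt, >ADXE00000000_adxe01... at +/97.83%"
# Group 1: sequence length, group 2: identifier, group 3: "*" or "at ...".
MEMBER_RE = re.compile(r"^\d+\s+(\d+)(?:nt|aa),\s+>(.+?)\.\.\.\s+(\*|at\s+.+)$")


def parse_clstr(lines):
    """Return {cluster name: {"representative": id or None, "members": [ids]}}."""
    clusters = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # Header line: start a new cluster, e.g. "Cluster 3".
            current = line[1:].strip()
            clusters[current] = {"representative": None, "members": []}
            continue
        match = MEMBER_RE.match(line)
        if match is None or current is None:
            # Skip anything that does not look like a member line.
            continue
        _length, seq_id, status = match.groups()
        clusters[current]["members"].append(seq_id)
        if status == "*":
            clusters[current]["representative"] = seq_id
    return clusters


if __name__ == "__main__":
    # Sample lines copied from the question (identifiers still truncated here).
    sample = [
        ">Cluster 2",
        "0       277nt, >ADXE00000000_adxe01... at +/97.83%",
        "1       6597nt, >CP001175_cp001175_c... *",
    ]
    for name, info in parse_clstr(sample).items():
        print(name, info["representative"], info["members"])
```

To use it on a real run, pass an open file handle for the .clstr file to parse_clstr. The representative identifiers can then be matched against the headers of the input FASTA to extract the sequences.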
