Question

CD-HIT clustering: how to include full sequence names in the cluster file

0

5.1 years ago

Sean R Johnson ▴ 120

Is there any way to trace which sequences are included in each cd-hit cluster?

I'm aware of the .clstr file that gets generated, but the names are truncated, and I can't figure out a way to get the whole names.

For example, here is a snippet of a cluster file from a recent cd-hit run:

>Cluster 100
0       570aa, >ncbi|1353254|Penici... at 86.67%
1       570aa, >ncbi|500485|Penicil... at 87.02%
2       486aa, >ncbi|1170229|Penici... at 91.36%
3       486aa, >ncbi|1170230|Penici... at 91.36%
4       486aa, >ncbi|1170230|Penici... at 91.36%
5       572aa, >ncbi|27334|Penicill... at 99.65%
6       1967aa, >ncbi|27334|Penicill... *
7       570aa, >ncbi|5078|Penicilli... at 87.54%
8       570aa, >ncbi|5078|Penicilli... at 87.54%
9       1967aa, >ncbi|40296|Penicill... at 96.24%
10      1967aa, >ncbi|40296|Penicill... at 96.24%
11      570aa, >ncbi|1346256|Penici... at 85.09%
12      570aa, >ncbi|1439352|Penici... at 86.67%
13      572aa, >ncbi|60172|Penicill... at 92.13%
14      572aa, >ncbi|60172|Penicill... at 92.48%
15      571aa, >ncbi|2136024|Penici... at 90.72%
16      572aa, >ncbi|293382|Penicil... at 91.43%

I would like to make a fasta file from just the sequences in this cluster, but I can't because the part of the name shown is not enough to uniquely identify the sequences.

The obvious solution would be to rename the sequences before running cd-hit, then rename them back afterwards, but it seems like there should be a more direct way.
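
A minimal sketch of that rename-and-restore workaround, assuming the FASTA file is read as a list of lines. The function name `rename_fasta` and the `seq` prefix are made up for illustration; the short IDs are kept to a few characters so that truncation like that shown in the snippet above should not make them ambiguous.

```python
def rename_fasta(lines, prefix="seq"):
    """Replace every FASTA header with a short unique ID.

    Returns (new_lines, id_to_name), where id_to_name maps each short ID
    back to the original header text (without the leading '>').
    """
    id_to_name = {}
    new_lines = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Hypothetical ID scheme: seq0, seq1, seq2, ...
            short_id = f"{prefix}{len(id_to_name)}"
            id_to_name[short_id] = line[1:]
            new_lines.append(">" + short_id)
        else:
            # Sequence lines pass through unchanged.
            new_lines.append(line)
    return new_lines, id_to_name
```

After clustering, the short IDs read from the cluster file can be looked up in `id_to_name` to recover the original names.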

cdhit sequence clustering • 3.4k views

0

Where did you get the sequences from? If 1353254 refers to a GI number, you should be able to get the sequence from NCBI using Entrez Direct.

ADD REPLY • link 5.1 years ago by GenoMax 141k

0

They aren't from GenBank. They are gene models from genome assemblies. The ncbi|[0-9]+| prefix indicates the NCBI Taxonomy ID. I name them that way to be compatible with the BBMap taxonomy tools.
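
For illustration, the naming convention described here can be parsed with the same pattern. This is only a sketch; the helper name `taxid_of` is hypothetical and not part of any tool mentioned in this thread.

```python
import re

# Matches the "ncbi|<digits>|" prefix described above.
TAXID_PREFIX = re.compile(r"^ncbi\|([0-9]+)\|")

def taxid_of(name):
    """Return the NCBI Taxonomy ID encoded in a sequence name, or None."""
    match = TAXID_PREFIX.match(name)
    return int(match.group(1)) if match else None
```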

ADD REPLY • link 5.1 years ago by Sean R Johnson ▴ 120

score 1 · Answer 1 · 2019-02-26

1

5.1 years ago

GenoMax 141k

I see this option in the README:

-d  length of description in .clstr file, default 20
    if set to 0, it takes the fasta defline and stops at first space

You could remove the spaces from your FASTA headers and then set this number high enough to capture the full name.
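
One way to strip the spaces, sketched in Python. The helper name `despace_header` and the underscore filler are assumptions for illustration; any character that does not already occur in the names would do.

```python
import re

def despace_header(line, filler="_"):
    """Replace whitespace runs in a FASTA header line with a filler character.

    Non-header (sequence) lines are returned as-is, minus the trailing newline.
    """
    line = line.rstrip("\n")
    if not line.startswith(">"):
        return line
    # Trim the ends first so no trailing filler is appended to the name.
    return ">" + re.sub(r"\s+", filler, line[1:].strip())
```

Applying this to every line of the input before running cd-hit leaves each header as a single space-free token.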

0

Thanks! I didn't have any spaces in my sequence names, so setting -d 0 did the trick. Don't know why I didn't see that option in the help.
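
With full names in the .clstr file (for example after a run along the lines of `cd-hit -i input.fasta -o output -d 0`, where the file names are placeholders), pulling one cluster's sequences into a FASTA file could be sketched as below. The function names are hypothetical, and the regular expression assumes member lines shaped like those in the snippet from the question.

```python
import re
from collections import defaultdict

# Member line, e.g. "5   572aa, >name... at 99.65%" or "6   1967aa, >name... *"
MEMBER = re.compile(r"^\d+\s+\d+(?:aa|nt), >(.+?)\.\.\. (?:\*|at .+)$")

def parse_clstr(lines):
    """Map each cluster number to the list of sequence names it contains."""
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.rstrip()
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
        elif line and current is not None:
            match = MEMBER.match(line)
            if match:
                clusters[current].append(match.group(1))
    return dict(clusters)

def select_records(fasta_lines, wanted):
    """Yield the FASTA lines of records whose first header token is in wanted."""
    keep = False
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            tokens = line[1:].split()
            keep = bool(tokens) and tokens[0] in wanted
        if keep:
            yield line
```

For the cluster in the question, `select_records(fasta_lines, set(parse_clstr(clstr_lines)[100]))` would yield the lines to write to the new FASTA file.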

ADD REPLY • link 5.1 years ago by Sean R Johnson ▴ 120

0

Just to add, this option doesn't appear to exist in PSI-CD-HIT or some of the other tools in the package; it seems to be available only in the 'basic' version.

ADD REPLY • link 5.1 years ago by Joe 21k