CD-HIT clustering: include full sequence names in the cluster file
3.8 years ago

Is there any way to trace which sequences are included in each cd-hit cluster?

I'm aware of the .clstr file that gets generated, but the names are truncated, and I can't figure out a way to get the whole names.

For example, here is a snippet of a cluster file from a recent cd-hit run:

>Cluster 100
0       570aa, >ncbi|1353254|Penici... at 86.67%
1       570aa, >ncbi|500485|Penicil... at 87.02%
2       486aa, >ncbi|1170229|Penici... at 91.36%
3       486aa, >ncbi|1170230|Penici... at 91.36%
4       486aa, >ncbi|1170230|Penici... at 91.36%
5       572aa, >ncbi|27334|Penicill... at 99.65%
6       1967aa, >ncbi|27334|Penicill... *
7       570aa, >ncbi|5078|Penicilli... at 87.54%
8       570aa, >ncbi|5078|Penicilli... at 87.54%
9       1967aa, >ncbi|40296|Penicill... at 96.24%
10      1967aa, >ncbi|40296|Penicill... at 96.24%
11      570aa, >ncbi|1346256|Penici... at 85.09%
12      570aa, >ncbi|1439352|Penici... at 86.67%
13      572aa, >ncbi|60172|Penicill... at 92.13%
14      572aa, >ncbi|60172|Penicill... at 92.48%
15      571aa, >ncbi|2136024|Penici... at 90.72%
16      572aa, >ncbi|293382|Penicil... at 91.43%


I would like to make a FASTA file from just the sequences in this cluster, but I can't, because the part of the name shown is not enough to uniquely identify the sequences.

The obvious solution would be to rename the sequences before running cd-hit, then rename them back afterwards, but it seems like there should be a more direct way.

cdhit sequence clustering

Where did you get the sequences from? If 1353254 refers to a GI number, you should be able to get the sequence from NCBI using Entrez Direct.


They aren't from GenBank; they are gene models from genome assemblies. The ncbi|[0-9]+| prefix is the NCBI Taxonomy ID, which I include for compatibility with the BBMap taxonomy tools.

3.8 years ago
GenoMax 123k

I see this option in the README:

-d  length of description in .clstr file, default 20
    if set to 0, it takes the fasta defline and stops at first space


You could remove the spaces from your FASTA headers and get the full name by setting this number high enough.


Thanks! I didn't have any spaces in my sequence names, so setting -d 0 did the trick. I don't know why I didn't see that option in the help.
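
Once the .clstr file carries full names, pulling one cluster out as FASTA is mostly a parsing job. The following is a minimal sketch, not part of cd-hit itself: the function names are made up for illustration, and it assumes each member line has the form "index, length, >name... status" as in the snippet in the question, and that each FASTA name is the header text up to the first whitespace.

```python
import re
from collections import defaultdict

# Member lines look like: "0\t570aa, >some_name... at 86.67%".
# Greedy match so a name that itself contains "..." is kept whole.
MEMBER_RE = re.compile(r">(.+)\.\.\.")


def parse_clstr(lines):
    """Map cluster number -> list of member names found in .clstr lines."""
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
        elif line and current is not None:
            match = MEMBER_RE.search(line)
            if match:
                clusters[current].append(match.group(1))
    return dict(clusters)


def read_fasta(lines):
    """Yield (name, sequence); name is the header up to the first whitespace."""
    name, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None:
            chunks.append(line.strip())
    if name is not None:
        yield name, "".join(chunks)


def cluster_fasta(clstr_lines, fasta_lines, cluster_id):
    """Return FASTA text containing only the members of one cluster."""
    wanted = set(parse_clstr(clstr_lines).get(cluster_id, []))
    return "".join(
        f">{name}\n{seq}\n"
        for name, seq in read_fasta(fasta_lines)
        if name in wanted
    )
```

To use it, open the cd-hit input FASTA and the matching .clstr file, pass both file objects to cluster_fasta along with the cluster number (100 for the example in the question), and write the returned text to a new file.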


Just to add: this option doesn't appear to exist in PSI-CD-HIT or some of the other tools; it seems to be available only in the 'basic' version.
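
For tools that lack -d, the rename-and-restore workaround mentioned in the question can be sketched roughly as below. This is only an illustration under stated assumptions: the helper names and the "seq" prefix are hypothetical, the short IDs are assumed to stay well under the description length the tool prints, and member lines are assumed to use the same ">name..." form as the cd-hit .clstr snippet above.

```python
import re


def shorten_names(fasta_lines, prefix="seq"):
    """Return (renamed_lines, mapping), where mapping is short ID -> original header."""
    mapping = {}
    renamed = []
    for line in fasta_lines:
        if line.startswith(">"):
            # Number the records in input order: seq1, seq2, ...
            short = f"{prefix}{len(mapping) + 1}"
            mapping[short] = line[1:].rstrip("\n")
            renamed.append(f">{short}\n")
        else:
            renamed.append(line if line.endswith("\n") else line + "\n")
    return renamed, mapping


def restore_names(clstr_lines, mapping):
    """Swap short IDs in .clstr member lines back to the original headers."""
    pattern = re.compile(r">(\S+?)\.\.\.")

    def swap(match):
        short = match.group(1)
        # Leave any ID that is not in the mapping untouched.
        return f">{mapping.get(short, short)}..."

    return [pattern.sub(swap, line) for line in clstr_lines]
```

Write the renamed lines to a temporary FASTA file, cluster that file, then run restore_names over the resulting .clstr lines. Keeping the mapping in a file as well makes it possible to restore the names later.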