CD-HIT clustering: include full sequence names in the cluster file
5.1 years ago

Is there any way to trace which sequences are included in each cd-hit cluster?

I'm aware of the .clstr file that gets generated, but the names are truncated, and I can't figure out a way to get the whole names.

For example, here is a snippet of a cluster file from a recent cd-hit run:

>Cluster 100
0       570aa, >ncbi|1353254|Penici... at 86.67%
1       570aa, >ncbi|500485|Penicil... at 87.02%
2       486aa, >ncbi|1170229|Penici... at 91.36%
3       486aa, >ncbi|1170230|Penici... at 91.36%
4       486aa, >ncbi|1170230|Penici... at 91.36%
5       572aa, >ncbi|27334|Penicill... at 99.65%
6       1967aa, >ncbi|27334|Penicill... *
7       570aa, >ncbi|5078|Penicilli... at 87.54%
8       570aa, >ncbi|5078|Penicilli... at 87.54%
9       1967aa, >ncbi|40296|Penicill... at 96.24%
10      1967aa, >ncbi|40296|Penicill... at 96.24%
11      570aa, >ncbi|1346256|Penici... at 85.09%
12      570aa, >ncbi|1439352|Penici... at 86.67%
13      572aa, >ncbi|60172|Penicill... at 92.13%
14      572aa, >ncbi|60172|Penicill... at 92.48%
15      571aa, >ncbi|2136024|Penici... at 90.72%
16      572aa, >ncbi|293382|Penicil... at 91.43%

I would like to make a fasta file from just the sequences in this cluster, but I can't because the part of the name shown is not enough to uniquely identify the sequences.

The obvious solution would be to rename the sequences before running cd-hit, then rename them back afterwards, but it seems like there should be a more direct way.

cdhit sequence clustering • 3.4k views

Where did you get the sequences from? If 1353254 is a GI number, you should be able to retrieve the sequence from NCBI using Entrez Direct.


They aren't from GenBank. They are gene models from genome assemblies. The number in the ncbi|[0-9]+| prefix is the NCBI Taxonomy ID. I use that format to be compatible with the BBMap taxonomy tools.

5.1 years ago
GenoMax 141k

I see this option in the README:

-d  length of description in .clstr file, default 20
    if set to 0, it takes the fasta defline and stops at first space

You could remove the spaces from your FASTA headers and set this number high enough to get the full name.


Thanks! I didn't have any spaces in my sequence names, so setting -d 0 did the trick. Don't know why I didn't see that option in the help.
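
For anyone landing here with the same goal (a FASTA file for a single cluster), here is a rough Python sketch of one way to do it once the .clstr file carries full names. It assumes the .clstr lines look like the snippet in the question, that names contain no spaces, and that cd-hit was run with -d 0. The function names (parse_clstr, read_fasta, cluster_fasta) are made up for this example and are not part of cd-hit.

```python
import re

# Matches a member line such as "0   570aa, >name... at 86.67%"
# or the representative line, which ends in "*" instead of "at ...".
MEMBER_RE = re.compile(r"^\d+\s+\d+(?:aa|nt),\s+>(.+?)\.\.\.\s+(\*|at\s+.+)$")


def parse_clstr(lines):
    """Return {cluster_number: [sequence names]} from .clstr lines."""
    clusters = {}
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            clusters[current] = []
        elif current is not None:
            match = MEMBER_RE.match(line)
            if match:
                clusters[current].append(match.group(1))
    return clusters


def read_fasta(lines):
    """Yield (name, sequence); name is the defline up to the first space."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def cluster_fasta(clstr_lines, fasta_lines, cluster_number):
    """Return FASTA text holding only the members of one cluster."""
    wanted = set(parse_clstr(clstr_lines).get(cluster_number, []))
    records = [
        f">{name}\n{seq}\n"
        for name, seq in read_fasta(fasta_lines)
        if name in wanted
    ]
    return "".join(records)
```

To use it, open the .clstr file and the original input FASTA, pass both file objects to cluster_fasta along with the cluster number (100 for the example in the question), and write the returned text to a new file. Sequences are written on a single line each, and descriptions after the first space are dropped.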


Just to add, this option doesn't appear to exist in PSI-CD-HIT and some of the other tools; it seems to be available only in the 'basic' cd-hit program.
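
For the tools without -d, the rename-then-restore workaround mentioned in the question could be sketched roughly as below. This is only an illustration: shorten_headers and restore_names are hypothetical helper names, and the "seq" prefix is an arbitrary choice that keeps IDs short enough to survive the truncation shown in the question.

```python
def shorten_headers(fasta_lines, prefix="seq"):
    """Replace each defline with a short unique ID.

    Returns (new_lines, id_to_original), where id_to_original maps each
    short ID back to the full original defline (without the leading ">").
    """
    new_lines, id_to_original = [], {}
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            short_id = f"{prefix}{len(id_to_original) + 1}"
            id_to_original[short_id] = line[1:]
            new_lines.append(">" + short_id)
        else:
            new_lines.append(line)
    return new_lines, id_to_original


def restore_names(names, id_to_original):
    """Map short IDs from the cluster file back to the original names."""
    return [id_to_original.get(name, name) for name in names]
```

Write new_lines to a temporary FASTA file, cluster that file, then pass the short IDs read from the cluster output through restore_names. Keep the mapping around (for example as a tab-separated file) if the clustering and the lookup happen in separate runs.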
