Can we retrieve protein sequences from CD-HIT clusters?
6 months ago

Hello. Can we retrieve all the protein sequences, in FASTA format, from the clusters produced by CD-HIT?

CD-Hit protein sequences • 308 views

There is not enough information in this one-line question to provide a useful answer. Please add more details. What are you clustering (DNA or protein)? Where do you want to retrieve the sequences from?


I am running this command:

cd-hit -i merge_all_788_proteins.faa -o out_788_cd-hit50 -c 0.5 -n 2 -M 0

This gives me a FASTA file containing the representative sequence of each cluster (the longest one) and a .clstr file listing the members of every cluster.

I need to extract all the protein sequences belonging to cluster 0, cluster 1, and so on, so that I can run a multiple sequence alignment on each cluster.


make_multi_seq.pl (LINK) included in CD-HIT will do what you need based on the description.

6 months ago
Joe 19k

I have code that will work to achieve this:

https://github.com/jrjhealey/bioinfo-tools/blob/master/ParseCDHIT.py

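Just be aware that, because of the way CD-HIT writes sequence names to the .clstr file, all your sequences must be uniquely named (and ideally short). You will also need to run CD-HIT with the -d parameter set to 0, so that each name is written up to its first whitespace rather than being cut off at the default description length.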
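
For anyone who wants to see the general idea without following the link, here is a minimal sketch of the approach. It is not the linked ParseCDHIT.py script, just an illustration that assumes CD-HIT was run with -d 0 and that every FASTA ID (the header text up to the first whitespace) is unique. The function names and the "Cluster_N.faa" output naming are my own choices, not anything CD-HIT defines.

```python
import re
from pathlib import Path

# In a .clstr member line the sequence name sits between ">" and the "..."
# that CD-HIT appends, e.g. "0\t120aa, >protA... *".
# Assumption: the sequence IDs themselves do not contain "...".
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def parse_clstr(lines):
    """Map each cluster name (e.g. 'Cluster_0') to its list of member IDs."""
    clusters = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:].replace(" ", "_")
            clusters[current] = []
        elif current is not None:
            match = MEMBER_RE.search(line)
            if match:
                clusters[current].append(match.group(1))
    return clusters


def read_fasta(lines):
    """Read FASTA lines into {id: sequence}, where id is the first header word."""
    parts = {}
    seq_id = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            words = line[1:].split()
            seq_id = words[0] if words else ""
            parts[seq_id] = []
        elif line and seq_id is not None:
            parts[seq_id].append(line)
    return {key: "".join(chunks) for key, chunks in parts.items()}


def format_fasta(members, records, width=60):
    """Return FASTA text for the given member IDs, wrapped at `width` columns."""
    out = []
    for seq_id in members:
        seq = records[seq_id]  # a KeyError here means the IDs do not match
        out.append(f">{seq_id}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


def write_cluster_fastas(clusters, records, out_dir, min_size=2):
    """Write one FASTA file per cluster with at least `min_size` members."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for name, members in clusters.items():
        if len(members) < min_size:
            continue  # singletons are of no use for an alignment
        target = out_path / f"{name}.faa"
        target.write_text(format_fasta(members, records))
        written.append(target)
    return written
```

With the file names from the command earlier in the thread (the output directory name is made up), usage would look roughly like this:

```python
with open("out_788_cd-hit50.clstr") as fh:
    clusters = parse_clstr(fh)
with open("merge_all_788_proteins.faa") as fh:
    records = read_fasta(fh)
write_cluster_fastas(clusters, records, "clusters_out")
```

Each resulting .faa file can then be passed to the aligner of your choice.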
