Can we retrieve protein sequences from the clusters of CD-HIT?
Hello, can we retrieve all protein sequences in FASTA format from the clusters we get from CD-HIT?

There is not enough information in this one-line question to provide a useful answer. Please add more details. What are you clustering (DNA or protein)? Where do you want to retrieve the sequences from?

I am running this command:

cd-hit -i merge_all_788_proteins.faa -o out_788_cd-hit50 -c 0.5 -n 2 -M 0


This gives me a FASTA file containing the representative sequence of each cluster (the longest one) and a .clstr file listing the members of every cluster.

I need to get all the protein sequences belonging to Cluster 0, Cluster 1, and so on, so that I can run a multiple sequence alignment on each cluster.

make_multi_seq.pl (LINK) included in CD-HIT will do what you need based on the description.

6 months ago
Joe 19k

I have code that will work to achieve this:

https://github.com/jrjhealey/bioinfo-tools/blob/master/ParseCDHIT.py

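Just be aware that, because of the way CD-HIT writes sequence names into the .clstr file, all your sequences must be uniquely named (and ideally the names should be short). You will also need to run CD-HIT with the -d parameter set to 0, so that each name is written out in full up to the first whitespace instead of being truncated.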
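
If you would rather see the idea spelled out, here is a minimal sketch of the same approach in plain Python. It is not the linked script, just an illustration. It assumes the .clstr file was produced with -d 0, that member lines look like `0	120aa, >protA... *`, and that every FASTA header starts with a unique ID followed by optional whitespace and description. The function names (parse_clstr, read_fasta, cluster_fasta, write_cluster) are made up for this example.

```python
import re
from collections import defaultdict

# In a .clstr member line the sequence name sits between ">" and "...".
MEMBER_RE = re.compile(r">(\S+?)\.\.\.")


def parse_clstr(lines):
    """Map cluster number -> list of member sequence IDs."""
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            # Header line, e.g. ">Cluster 0"
            current = int(line.split()[1])
        elif line and current is not None:
            match = MEMBER_RE.search(line)
            if match:
                clusters[current].append(match.group(1))
    return dict(clusters)


def read_fasta(lines):
    """Map sequence ID (header text up to the first whitespace) -> sequence."""
    records = {}
    name = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.strip())
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


def cluster_fasta(clusters, records, cluster_id):
    """Return FASTA text containing every member of one cluster."""
    return "".join(
        f">{seq_id}\n{records[seq_id]}\n" for seq_id in clusters[cluster_id]
    )


def write_cluster(clstr_path, fasta_path, cluster_id, out_path):
    """Read both files from disk and write one cluster to its own FASTA file."""
    with open(clstr_path) as handle:
        clusters = parse_clstr(handle)
    with open(fasta_path) as handle:
        records = read_fasta(handle)
    with open(out_path, "w") as out:
        out.write(cluster_fasta(clusters, records, cluster_id))
```

With the files from the command above (after re-running with -d 0), something like write_cluster("out_788_cd-hit50.clstr", "merge_all_788_proteins.faa", 0, "cluster_0.faa") should give you a FASTA file for Cluster 0 that you can pass to your aligner. The output file name is arbitrary. A KeyError from cluster_fasta usually means a name in the .clstr file was truncated or does not match the FASTA header.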