Question

Can we retrieve protein sequences from CD-HIT clusters?

0

3.3 years ago

sharmatina189059 ▴ 110

Hello. Can we retrieve all the protein sequences, in FASTA format, from the clusters produced by CD-HIT?

CD-Hit protein sequences • 2.1k views

ADD COMMENT • link updated 2.3 years ago by Md ▴ 10 • written 3.3 years ago by sharmatina189059 ▴ 110

0

There is not enough information in this one-line question to provide a useful answer. Please add more details. What are you clustering (DNA or protein)? Where do you want to retrieve the sequences from?

ADD REPLY • link 3.3 years ago by GenoMax 141k

0

I am running this command:

cd-hit -i merge_all_788_proteins.faa -o out_788_cd-hit50 -c 0.5 -n 2 -M 0

This gives me a FASTA file containing all the representative sequences (the longest sequence of each cluster) and a .clstr file listing all the clusters.

I need to get all the protein sequences from cluster 0, cluster 1, and so on, so that I can run a multiple sequence alignment on each cluster.

ADD REPLY • link updated 3.3 years ago by Ram 43k • written 3.3 years ago by sharmatina189059 ▴ 110

0

make_multi_seq.pl (LINK), which is included with CD-HIT, will do what you need based on your description.

ADD REPLY • link 3.3 years ago by GenoMax 141k

0

To run CD-HIT clustering, do we have to merge all the proteins into a single file? If so, how do we do that? Or is it possible to cluster all the FASTA files by keeping them in a single directory?

ADD REPLY • link 2.3 years ago by Md ▴ 10
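
The command earlier in this thread passes a single merged file to cd-hit with -i, so one way to handle a directory of FASTA files is to concatenate them first. Below is a minimal sketch of that step in Python. The function name merge_fasta_files, the "*.faa" pattern, and the directory and file names in the usage comment are assumptions made for illustration, not part of CD-HIT.

```python
from pathlib import Path


def merge_fasta_files(input_dir, output_path, pattern="*.faa"):
    """Concatenate every file in input_dir matching pattern into output_path.

    Returns the number of files that were merged.
    """
    output_path = Path(output_path)
    merged = 0
    with open(output_path, "w") as out:
        for path in sorted(Path(input_dir).glob(pattern)):
            # Skip the merged file itself if it lives in the same directory.
            if path.resolve() == output_path.resolve():
                continue
            text = path.read_text()
            out.write(text)
            # Make sure the next file's header starts on a new line.
            if text and not text.endswith("\n"):
                out.write("\n")
            merged += 1
    return merged


# Hypothetical usage (paths are placeholders):
# merge_fasta_files("proteomes", "merged_proteins.faa")
```

This sketch does not rename anything, so if two input files reuse the same sequence ID the merged file will contain duplicates. It is worth checking that the IDs are unique before clustering.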

score 0 · Answer 1 · 2021-01-10

I have code that will work to achieve this:

https://github.com/jrjhealey/bioinfo-tools/blob/master/ParseCDHIT.py

Just be aware that because of a limitation of the way CD-HIT writes the names out, all your sequences must be uniquely named (and ideally short). You will also need to ensure you run CD-HIT with the -d parameter set to 0.
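
For anyone who wants to see the idea without the linked script, here is a minimal sketch of the same approach. It is not the code from the link above. It assumes CD-HIT was run with -d 0, so that each member line in the .clstr file holds the full sequence ID (up to the first whitespace) followed by "...", and it assumes all IDs are unique. The function names and the output naming scheme are my own choices for illustration.

```python
import re
from collections import defaultdict

# A member line looks like: "1\t118aa, >protB.1... at 95.00%"
MEMBER_RE = re.compile(r">(\S+?)\.\.\.(?:\s|$)")


def parse_clstr(lines):
    """Return {cluster number: [member IDs]} from the lines of a .clstr file."""
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
        elif line and current is not None:
            match = MEMBER_RE.search(line)
            if match:
                clusters[current].append(match.group(1))
    return dict(clusters)


def read_fasta(lines):
    """Return {sequence ID: sequence}; the ID is the header up to the first whitespace."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = []
        elif line and name is not None:
            records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}


def write_cluster_fastas(clstr_path, fasta_path, out_prefix):
    """Write one FASTA file per cluster, named <out_prefix>_<cluster number>.faa."""
    with open(clstr_path) as handle:
        clusters = parse_clstr(handle)
    with open(fasta_path) as handle:
        sequences = read_fasta(handle)
    for number, members in clusters.items():
        with open(f"{out_prefix}_{number}.faa", "w") as out:
            for name in members:
                # A KeyError here means an ID in the .clstr file is missing
                # from the FASTA file (for example, a truncated name).
                out.write(f">{name}\n{sequences[name]}\n")


# Example with the file names from this thread (after re-running with -d 0):
# write_cluster_fastas("out_788_cd-hit50.clstr",
#                      "merge_all_788_proteins.faa",
#                      "cluster")
```

Each resulting file holds every member of one cluster, not just the representative, so it can be passed directly to an alignment tool. Note that fasta_path should be the original input file, not the representative-only output file.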