Can we retrieve all protein sequences in FASTA format from the clusters we get from CD-HIT?
There is not enough information in this one-line question to provide a useful answer. Please add additional details. What are you clustering (DNA/protein)? Where do you want to retrieve the sequences from?
I am running this command:
cd-hit -i merge_all_788_proteins.faa -o out_788_cd-hit50 -c 0.5 -n 2 -M 0
This gives me a FASTA file containing the representative sequence (the longest one) of each cluster, and a .clstr file listing the members of every cluster.
I need to get all the protein sequences belonging to Cluster 0, Cluster 1, and so on, so that I can run a multiple sequence alignment on each cluster.
make_multi_seq.pl (LINK) included in CD-HIT will do what you need based on the description.
I have code that will work to achieve this:
Just be aware that because of a limitation of the way CD-HIT writes the names out, all your sequences must be uniquely named (and ideally short). You will also need to ensure you run CD-HIT with the -d parameter set to 0.
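For anyone who wants to see the idea spelled out, here is a minimal Python sketch of one way to do it. It is not the script mentioned above, just an illustration under a few assumptions: the .clstr member lines look like `0	2799aa, >name... *`, CD-HIT was run with -d 0 so the names are not truncated, and every sequence ID (the header text up to the first whitespace) is unique. The function names and the `cluster_N.faa` output naming are my own, not part of CD-HIT.

```python
import re
from pathlib import Path

# Member lines in a .clstr file look like: "0\t2799aa, >seq_name... *"
# Greedy match, so the name ends at the last "..." on the line.
MEMBER_RE = re.compile(r">(.+)\.\.\.")


def parse_clstr(lines):
    """Return {cluster number: [sequence IDs]} from .clstr lines."""
    clusters = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            clusters[current] = []
        elif current is not None:
            match = MEMBER_RE.search(line)
            if match:
                clusters[current].append(match.group(1))
    return clusters


def read_fasta(lines):
    """Return {sequence ID: sequence}; the ID is the header up to the first whitespace."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None and line:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def write_cluster_fastas(clusters, records, out_dir):
    """Write one FASTA file per cluster (cluster_0.faa, cluster_1.faa, ...)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for cluster_id, names in clusters.items():
        path = out_dir / f"cluster_{cluster_id}.faa"
        with open(path, "w") as handle:
            for name in names:
                # Raises KeyError if a clustered name is missing from the FASTA,
                # which usually means the names were truncated (rerun with -d 0).
                handle.write(f">{name}\n{records[name]}\n")
        paths.append(path)
    return paths
```

To use it with the command from the question, open merge_all_788_proteins.faa and out_788_cd-hit50.clstr, pass the file handles to `read_fasta` and `parse_clstr`, and hand both results to `write_cluster_fastas` with an output directory of your choice. Each resulting file can then go straight into an alignment tool.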