parsing cd-hit result
4.8 years ago
I have clustered sequences using cd-hit-est and now I want to pull the parent (representative) sequence out of each cluster. Any suggestions?

4.8 years ago
Joseph Hughes ★ 2.9k

In the .clstr file, the representative sequence of each cluster is on the line that ends with a * symbol, so you can use that to pull it out. I think this script I wrote will do it: https://github.com/josephhughes/TCRclust/blob/master/sort_cdhit.pl

using:

sort_cdhit.pl -i INFILE.fa -o OUTFILE_rep.fa -clstr INFILE.clstr -rep


You will need to make sure you use the option -d 0 when you run cd-hit to be sure to get the complete identifier in the .clstr output file.
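If you would rather not depend on the Perl script, the same idea can be sketched in Python. This is only a minimal sketch, assuming the usual .clstr layout in which a member line looks like `0	150nt, >seq1... *` and the representative is the line ending in `*`. The function name `representative_ids` and the sample data are made up for illustration.

```python
import re

# Captures the identifier between ">" and the "..." that cd-hit appends.
ID_PATTERN = re.compile(r">(.+?)\.\.\.")


def representative_ids(clstr_lines):
    """Return the IDs found on lines ending in '*' (cluster representatives)."""
    reps = []
    for line in clstr_lines:
        line = line.rstrip()
        if line.startswith(">Cluster"):
            continue  # cluster header, not a member line
        if line.endswith("*"):
            match = ID_PATTERN.search(line)
            if match:
                reps.append(match.group(1))
    return reps


# Hypothetical .clstr content, assuming the usual cd-hit-est layout.
sample = [
    ">Cluster 0",
    "0\t150nt, >seq1... *",
    "1\t148nt, >seq2... at +/98.65%",
    ">Cluster 1",
    "0\t120nt, >seq3... at -/97.50%",
    "1\t130nt, >seq4... *",
]

print(representative_ids(sample))  # prints ['seq1', 'seq4']
```

To use it on a real file, pass the open file handle instead of `sample`, e.g. `representative_ids(open("INFILE.clstr"))`. The IDs it returns are only complete if the clustering was run with -d 0, as noted above.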

4.8 years ago
Ram 34k

Yes. grep.

Read the user guide - it mentions a pattern you can use to isolate the representative sequences.

There's also the clstr2txt script included with cd-hit, which converts the output into a more parsing-friendly format.
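To give an idea of what a flatter layout might look like, here is a small Python sketch that turns .clstr lines into tab-separated rows. The choice of columns (cluster number, sequence ID, representative flag) is an assumption made for illustration and is not claimed to match what clstr2txt actually prints. The function name `clstr_to_rows` and the sample data are hypothetical.

```python
import re

# Captures the identifier between ">" and the "..." that cd-hit appends.
ID_PATTERN = re.compile(r">(.+?)\.\.\.")


def clstr_to_rows(clstr_lines):
    """Yield (cluster_number, sequence_id, is_representative) tuples."""
    cluster = None
    for line in clstr_lines:
        line = line.rstrip()
        if line.startswith(">Cluster"):
            cluster = int(line.split()[1])  # ">Cluster 12" -> 12
            continue
        match = ID_PATTERN.search(line)
        if match and cluster is not None:
            # The representative line is the one ending in "*".
            yield cluster, match.group(1), line.endswith("*")


# Hypothetical .clstr content, assuming the usual cd-hit-est layout.
sample = [
    ">Cluster 0",
    "0\t150nt, >seq1... *",
    "1\t148nt, >seq2... at +/98.65%",
    ">Cluster 1",
    "0\t120nt, >seq3... at -/97.50%",
    "1\t130nt, >seq4... *",
]

for cluster, seq_id, is_rep in clstr_to_rows(sample):
    print(f"{cluster}\t{seq_id}\t{int(is_rep)}")
```

With one row per sequence, picking out the representatives becomes a simple filter on the last column.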

@Ram, there is no such pattern mentioned there. Sorry if I am missing it.