CD-HIT clustering output analysis - exporting results according to number of hits
5.1 years ago
EulnayM • 0

I am using CD-HIT-2d to analyse CDSs shared in several genomic regions between 10 genomes. I have managed to obtain the clustering output, which provides information on the number of species that contain homologous CDSs.

For downstream analysis, I need to identify the genomes which contain the CDSs to determine their degree of conservation among the 10 genomes. Currently I am manually analysing these datasets by grouping them according to the number of genome hits (example: CDS shared by 10 genomes, 9 genomes, 8 genomes and so on).

I would like to know whether an awk or bash text-manipulation script could count the number of hits between two consecutive cluster headers and then group and export the data accordingly. As I have a very large number of datasets, such a script would shorten the analysis time immensely.

A sample of the result is shown below:

[Image: CD-HIT-2d clustering result sample]

Thank you very much in advance for any suggestions and help.


CD-HIT awk bash shell • 2.5k views

Many useful scripts ship with cd-hit, including one that transforms the cluster file into a format that is far easier to parse. I didn't really understand what you want to do, but some of the other scripts that ship with cd-hit might be even better suited to the task. Note that the documentation of these scripts is very poor, and you will generally have to look at the code to see what they do.
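If the goal is simply to count the member lines between consecutive `>Cluster` headers and group clusters by that count, a minimal Python sketch (rather than awk) could look like the one below. It assumes the usual `.clstr` layout, where each member line contains the sequence ID between `>` and `...`. The function names and the output file naming are hypothetical, chosen only for illustration. Also note that the member count equals the number of genomes only if each genome contributes at most one sequence per cluster.

```python
import re
from collections import defaultdict

# Member lines look like: "1<TAB>2214aa, >seq_id... at 80.00%"
# (assumed standard .clstr layout; the ID sits between ">" and "...").
MEMBER_RE = re.compile(r">(\S+?)\.\.\.")


def parse_clstr(lines):
    """Yield (cluster_name, [sequence_ids]) for each cluster in the input."""
    name, members = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">Cluster"):
            # A new header closes the previous cluster, if there was one.
            if name is not None:
                yield name, members
            name, members = line[1:], []
        else:
            match = MEMBER_RE.search(line)
            if match:
                members.append(match.group(1))
    # Emit the last cluster, which has no header after it.
    if name is not None:
        yield name, members


def group_by_hit_count(lines):
    """Return {member_count: [(cluster_name, [sequence_ids]), ...]}."""
    groups = defaultdict(list)
    for name, members in parse_clstr(lines):
        groups[len(members)].append((name, members))
    return dict(groups)


def export_groups(groups, prefix="clusters"):
    """Write one tab-separated file per count (hypothetical file naming)."""
    for count, clusters in groups.items():
        with open(f"{prefix}_{count}_hits.tsv", "w") as out:
            for name, members in clusters:
                out.write(name + "\t" + ",".join(members) + "\n")
```

To use it, open the cluster file (for example a placeholder `result.clstr`), pass the file handle to `group_by_hit_count`, and hand the returned dictionary to `export_groups`. Each output file then lists the clusters with that number of members, one cluster per line, followed by the comma-separated sequence IDs.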

