I am using CD-HIT-2d to analyse CDSs shared across several genomic regions in 10 genomes. I have obtained the clustering output, which shows how many species contain each homologous CDS.
For downstream analysis, I need to identify which genomes contain each CDS in order to determine its degree of conservation among the 10 genomes. Currently I am analysing these datasets manually, grouping the CDSs according to the number of genome hits (for example: CDSs shared by 10 genomes, 9 genomes, 8 genomes and so on).
Is it possible to write a text-manipulation script in awk or bash that counts the number of hits between two consecutive cluster headers, and then groups and exports the data accordingly? As I have a very large number of datasets, such a script would shorten the analysis time immensely.
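To make the request concrete, here is a minimal sketch of the counting and grouping step, written in Python rather than awk or bash. It assumes the standard CD-HIT `.clstr` layout, where each cluster starts with a header line such as `>Cluster 0` and each member line carries the sequence ID between `>` and `...`. The function names `parse_clstr`, `group_by_hits` and `write_report` are hypothetical, and the sketch treats each member line as one hit, which only equals the number of genomes if every genome contributes at most one CDS per cluster.

```python
import re
from collections import defaultdict

# Assumed member-line layout: "<index>\t<length>aa, ><sequence id>... <* or at NN%>"
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def parse_clstr(lines):
    """Return {cluster name: [sequence ids]} from .clstr-style lines."""
    clusters = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A cluster header, e.g. ">Cluster 0"
            current = line[1:].strip()
            clusters[current] = []
        elif current is not None:
            # A member line: pull the sequence ID out from between ">" and "..."
            match = MEMBER_RE.search(line)
            if match:
                clusters[current].append(match.group(1))
    return clusters


def group_by_hits(clusters):
    """Return {number of member lines: [cluster names]}."""
    groups = defaultdict(list)
    for name, members in clusters.items():
        groups[len(members)].append(name)
    return dict(groups)


def write_report(clusters, out):
    """Write clusters grouped by hit count, largest group first."""
    groups = group_by_hits(clusters)
    for hits in sorted(groups, reverse=True):
        out.write(f"# {hits} hits: {len(groups[hits])} clusters\n")
        for name in groups[hits]:
            out.write(f"{name}\t{','.join(clusters[name])}\n")
```

It could be used as `write_report(parse_clstr(open("output.clstr")), sys.stdout)` after `import sys`, where `output.clstr` is a placeholder file name. To count genomes rather than member lines, the sequence IDs would first have to be mapped to genome names, which depends on how the IDs are formatted.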
A sample of the results is shown below:
Thank you very much in advance for any suggestions and help.