I have a cluster file from cd-hit-est containing thousands of clusters. One of the clusters looks like this:
>Cluster 1
0	22nt, >4972_98_CAATTGCAGCGACGCGCCCATT... *
1	21nt, >2017_373_CGGTGCAGCGACGCGCCCATT... at +/85.71%
2	21nt, >6627_68_CAGTGCAACGACGCGCCCATT... at +/85.71%
3	21nt, >3668_146_CAGTGCGGCGACGCGCCCATT... at +/85.71%
4	21nt, >17_8379_CAGTGCAGCGACGCGCCCATT... at +/90.48%
5	21nt, >14958_26_CAGTGCAGCGACGCGCCCAAT... at +/85.71%
6	21nt, >89394_3_TGGTGCAGCGACGCGCCCATT... at +/85.71%
7	20nt, >11579_35_CAGGCAGCGACGCGCCCATT... at +/85.00%
In each line, the number between the two underscores, just before the nucleotide sequence, is the number of reads for that sequence. For example, the first sequence has 98 reads, the second has 373, and so on.
I want to sort the sequences within every cluster in this file by this value, in descending order, so that the sequence with the most reads comes first. In the example above, the sequence with 8379 reads (>17_8379_...) should move to the top, followed by the one with 373 reads, and so on. The position of the highest-count line within each cluster is random, and there are about 20,000 clusters.
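To make the desired behaviour concrete, here is a minimal Python sketch of the sorting. It assumes every cluster starts with a header line beginning with ">Cluster" and every member line contains ">ID_READS_SEQUENCE". The function names (read_count, sort_clusters, sort_clstr_file) are hypothetical, and the sketch leaves the leading index numbers and the "*" representative marker exactly as they are.

```python
import re

# Assumed member-line layout: "<idx> <len>nt, ><id>_<reads>_<sequence>... <status>"
# The capture group is the read count between the two underscores.
READS_RE = re.compile(r">[^_\s]+_(\d+)_")


def read_count(line):
    """Return the read count embedded in a member line, or 0 if none is found."""
    match = READS_RE.search(line)
    return int(match.group(1)) if match else 0


def sort_clusters(lines):
    """Yield lines with each cluster's members sorted by read count, highest first."""
    members = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">Cluster"):
            # A new header closes the previous cluster: emit its sorted members.
            yield from sorted(members, key=read_count, reverse=True)
            members = []
            yield line
        elif line.strip():
            members.append(line)
    # Emit the members of the last cluster in the file.
    yield from sorted(members, key=read_count, reverse=True)


def sort_clstr_file(in_path, out_path):
    """Read a .clstr file and write a copy with every cluster sorted by reads."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for out_line in sort_clusters(src):
            dst.write(out_line + "\n")
```

It could be called as sort_clstr_file("input.clstr", "sorted.clstr"), where both file names are placeholders. Because the file is processed one cluster at a time, 20,000 clusters should not be a problem for memory. Lines with equal read counts keep their original relative order.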
Thank you so much for your help