I have a cluster file from cd-hit-est containing thousands of clusters. One of the clusters looks like this:
>Cluster 1
0 22nt, >4972_98_CAATTGCAGCGACGCGCCCATT... *
1 21nt, >2017_373_CGGTGCAGCGACGCGCCCATT... at +/85.71%
2 21nt, >6627_68_CAGTGCAACGACGCGCCCATT... at +/85.71%
3 21nt, >3668_146_CAGTGCGGCGACGCGCCCATT... at +/85.71%
4 21nt, >17_8379_CAGTGCAGCGACGCGCCCATT... at +/90.48%
5 21nt, >14958_26_CAGTGCAGCGACGCGCCCAAT... at +/85.71%
6 21nt, >89394_3_TGGTGCAGCGACGCGCCCATT... at +/85.71%
7 20nt, >11579_35_CAGGCAGCGACGCGCCCATT... at +/85.00%
In each line, the number just before the nucleotide sequence, set off by underscores, is the number of reads for that sequence. For example, the first sequence has 98 reads, the second has 373, and so on.
I want to sort the lines within every cluster in this file by this value, in descending order, so that the sequence with the most reads comes first in its cluster. In the cluster above, that would be the line with 8379 reads. The position of such lines within each cluster is random, and there are 20,000 clusters.
Thank you so much for your help
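One possible approach, as a minimal Python sketch rather than a definitive solution: read the file cluster by cluster and reorder the member lines by the read count embedded in the sequence name. It assumes every name follows the `>ID_READS_SEQUENCE` layout described above. The names `READS_RE`, `read_count`, `sort_clusters`, and the script name `sort_clstr.py` are hypothetical, chosen only for illustration. The leading member index (0, 1, 2, ...) is left unchanged, so it no longer reflects the position after sorting.

```python
import re
import sys

# Assumed name layout: ">ID_READS_SEQUENCE..." (e.g. ">2017_373_CGGTG...").
READS_RE = re.compile(r">[^_\s]+_(\d+)_")


def read_count(line):
    """Return the read count embedded in a member line, or 0 if none is found."""
    match = READS_RE.search(line)
    return int(match.group(1)) if match else 0


def sort_clusters(lines):
    """Yield lines with each cluster's members sorted by read count, descending."""
    members = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">Cluster"):
            # A new cluster header: emit the previous cluster's members first.
            yield from sorted(members, key=read_count, reverse=True)
            members = []
            yield line
        elif line.strip():
            members.append(line)
    # Emit the members of the last cluster in the file.
    yield from sorted(members, key=read_count, reverse=True)


if __name__ == "__main__":
    # Usage: python sort_clstr.py input.clstr > sorted.clstr
    with open(sys.argv[1]) as handle:
        for out_line in sort_clusters(handle):
            print(out_line)
```

Because only one cluster is held in memory at a time, 20,000 clusters should not be a problem. Lines with equal read counts keep their original relative order, since Python's sort is stable.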
Please do not post screenshots of data. It prevents people who want to help from doing so since no one wants to type this wall of text in manually.
Copy/paste the data into the post and then use the 101010 button to format it as code. This will maintain the formatting.
Thanks, it is done.