How to sort cd-hit-est cluster file
7 months ago
Mo ▴ 40

I have a cluster file from cd-hit-est containing thousands of clusters. One of the clusters looks like this:

>Cluster 1
0       22nt, >4972_98_CAATTGCAGCGACGCGCCCATT... *
1       21nt, >2017_373_CGGTGCAGCGACGCGCCCATT... at +/85.71%
2       21nt, >6627_68_CAGTGCAACGACGCGCCCATT... at +/85.71%
3       21nt, >3668_146_CAGTGCGGCGACGCGCCCATT... at +/85.71%
4       21nt, >17_8379_CAGTGCAGCGACGCGCCCATT... at +/90.48%
5       21nt, >14958_26_CAGTGCAGCGACGCGCCCAAT... at +/85.71%
6       21nt, >89394_3_TGGTGCAGCGACGCGCCCATT... at +/85.71%
7       20nt, >11579_35_CAGGCAGCGACGCGCCCATT... at +/85.00%

In each line, the number just before the nucleotide sequence, delimited by underscores (_), is the number of reads for that sequence. For example, the first sequence has 98 reads, the second has 373, and so on.

I want to sort the sequences within every cluster by this value, in descending order. In the example above, the second line should move to the top because it has the most reads of any sequence in the cluster. The position of the top line within each cluster is random, and there are 20,000 clusters.

Thank you so much for your help

sort cd-hit-est • 538 views

Please do not post screenshots of data. It prevents people who want to help from doing so since no one wants to type this wall of text in manually.

Copy/paste the data into the post and then use the 101010 button to format it as code. This will maintain the formatting.

Thanks, it is done.

7 months ago

Something like this? Add two new columns (the cluster name and the read count), sort on those columns, then restore the original format.

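# -k2,2nr sorts the read count numerically in descending order, so the most-read sequence comes first.
# -k1,1 compares cluster names as text, so clusters may be reordered (e.g. "Cluster 10" before "Cluster 2").
awk -F '_' '/^>/ {cluster=$0;next;} {printf("%s:%s:%s\n",cluster,$2,$0);}'  input.txt  |\
sort -t ':' -k1,1 -k2,2nr |\
awk -F ':' 'BEGIN {P="";} {if(P!=$1) {print $1; P=$1;} print $3}'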
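
If the original order of the clusters matters, the same decorate, sort, undecorate idea can be adapted along these lines. This is only a rough sketch: the function name `sort_clusters` is made up for illustration, it assumes every member name has the form `>ID_COUNT_SEQUENCE` as in the question, and it relies on the `-s` (stable) option that GNU and BSD `sort` provide.

```shell
#!/bin/sh
# Sketch: sort the members of each cluster by read count, highest first,
# while keeping the clusters themselves in their original order.
# Assumption: member names look like >ID_COUNT_SEQUENCE, as in the question.
sort_clusters() {
    tab=$(printf '\t')
    # Decorate each line with: cluster ordinal, kind (0 = header, 1 = member), read count.
    awk -v OFS='\t' '
        /^>Cluster/ { n++; print n, 0, 0, $0; next }
        { split($0, parts, "_"); print n, 1, parts[2] + 0, $0 }
    ' "$@" |
    # Sort by cluster ordinal, then header before members, then read count descending.
    sort -s -t "$tab" -k1,1n -k2,2n -k3,3nr |
    # Undecorate: drop the three helper columns and keep the original line.
    cut -f4-
}
```

It could then be called as `sort_clusters input.txt > sorted.txt`. Sequences with equal read counts keep their original relative order because of the stable sort.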

Hi,

Thank you so much, this has worked :)
