Question

How to sort cd-hit-est cluster file

0

7 months ago

Mo ▴ 40

I have a cluster file from cd-hit-est containing thousands of clusters. One of the clusters looks like this:

>Cluster 1
0       22nt, >4972_98_CAATTGCAGCGACGCGCCCATT... *
1       21nt, >2017_373_CGGTGCAGCGACGCGCCCATT... at +/85.71%
2       21nt, >6627_68_CAGTGCAACGACGCGCCCATT... at +/85.71%
3       21nt, >3668_146_CAGTGCGGCGACGCGCCCATT... at +/85.71%
4       21nt, >17_8379_CAGTGCAGCGACGCGCCCATT... at +/90.48%
5       21nt, >14958_26_CAGTGCAGCGACGCGCCCAAT... at +/85.71%
6       21nt, >89394_3_TGGTGCAGCGACGCGCCCATT... at +/85.71%
7       20nt, >11579_35_CAGGCAGCGACGCGCCCATT... at +/85.00%

In each line, the number just before the nucleotide sequence, delimited by underscores (_), is the number of reads for that sequence. For example, the first sequence has 98 reads, the second has 373, and so on.

I want to sort the lines within every cluster in this file by this value in descending order, so that the sequence with the most reads comes first in its cluster (in the cluster above, that is the line with 8379 reads). The position of such lines within each cluster is random, and there are 20,000 clusters.

Thank you so much for your help.

sort cd-hit-est • 538 views

ADD COMMENT • link 7 months ago by Mo ▴ 40

0

Please do not post screenshots of data. It prevents people who want to help from doing so since no one wants to type this wall of text in manually.

Copy/paste the data into the post and then use the 101010 button to format it as code. This will maintain the formatting.

ADD REPLY • link 7 months ago by GenoMax 141k

0

Thanks, it is done.

ADD REPLY • link 7 months ago by Mo ▴ 40

score 1 · Accepted Answer · 2023-09-24

1

7 months ago

Pierre Lindenbaum 161k

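Something like this? Prepend two new columns (the cluster header and the read count), sort on those columns, then strip them again. The read count is sorted numerically in descending order (nr), so the most abundant sequence comes first in each cluster; the clusters themselves come out in the lexical order of their header lines.

# 1. tag each member line with its cluster header and read count (2nd '_'-separated field)
# 2. sort by cluster header, then by read count, numerically, largest first
# 3. drop the two helper columns and print each cluster header once
awk -F '_' '/^>/ {cluster=$0;next;} {printf("%s:%s:%s\n",cluster,$2,$0);}'  input.txt  |\
sort -t ':' -k1,1 -k2,2nr |\
awk -F ':' 'BEGIN {P="";} {if(P!=$1) {print $1; P=$1;} print $3}'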
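
For anyone who prefers to keep the clusters in their original file order, the same idea can be sketched in Python. This is only an illustrative alternative, not part of the pipeline above. It assumes every member line contains ">id_reads_sequence" as in the example cluster; the names read_count and sort_clusters are made up for this sketch. The leading member index (0, 1, 2, ...) is left as it was and is not renumbered.

```python
import re
import sys
from typing import Iterable, Iterator, List

# Assumption: member lines look like
#   "1       21nt, >2017_373_CGGTGCAGCGACGCGCCCATT... at +/85.71%"
# i.e. ">" + id + "_" + read count + "_" + sequence.
READS = re.compile(r">[^_\s]+_(\d+)_")


def read_count(line: str) -> int:
    """Return the read count embedded in a member line (0 if not found)."""
    match = READS.search(line)
    return int(match.group(1)) if match else 0


def sort_clusters(lines: Iterable[str]) -> Iterator[str]:
    """Yield the file with each cluster's members sorted by read count, largest first."""
    members: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new cluster header: flush the previous cluster, then emit the header.
            yield from sorted(members, key=read_count, reverse=True)
            members = []
            yield line
        elif line.strip():
            members.append(line)
    # Flush the last cluster.
    yield from sorted(members, key=read_count, reverse=True)


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        for out in sort_clusters(handle):
            print(out)
```

Saved under a name of your choice (for example sort_clusters.py, a hypothetical file name), it would be run as: python sort_clusters.py input.txt > sorted.txt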

ADD COMMENT • link 7 months ago by Pierre Lindenbaum 161k

0

Hi,

Thank you so much, this has worked :)

ADD REPLY • link 7 months ago by Mo ▴ 40