Question: Get the top X number of lines per unique value in one column, once you've sorted a text file using 'sort'
0
25 days ago, niccolo.alfano (0) wrote:

Hi everybody,

I have a text file with 19 columns (divided by tab) which I have sorted using a command such as:

sort -t$'\t' -k1,1 -k11,11g -k12,12gr -k3,3g file > file_sorted

Now I would like to keep the top X number of lines per unique value in column 1. I know that if I do:

sort -u -k1,1 --merge file_sorted > file_sorted_merged

I will keep only the 1st line for each unique value in column 1. How can I keep the top X (for example, the top 5) lines for the same value in column 1 from the sorted file?

Thanks a lot in advance

sort shell bash • 187 views
modified 25 days ago by GenoMax (95k) • written 25 days ago by niccolo.alfano (0)

EDIT: You should edit your question and add how this is related to bioinformatics, or the post might be closed as off-topic.

Either switch to something that keeps more state in memory, like R or Python, or use sub-shells. The sub-shell would pick the X unique values in the column, and you could then use awk to pick N matching lines for each value the sub-shell emits.

There would be an awful lot of trial and error and column-specific wrangling if you use awk, so I'd recommend using R.

modified 25 days ago • written 25 days ago by _r_am (32k)
5
25 days ago, shenwei356 (5.7k, China) wrote:

I have a tool, csvtk, whose uniq command can do exactly what you want; check the last example.

csvtk uniq -t -f 1 -n 5

The logic behind it is simple: use a map/hash table (column value -> count) to track how many times you have met a row with a certain value in the column you care about. If the count is <= N, print the line.
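For readers without csvtk, the same counting idea can be sketched in plain awk. This is only an illustration of the hash-table logic described above, not csvtk's actual implementation; the function name top_n_per_key is made up for the example, and it assumes tab-separated input that is already sorted in the order you want to keep.

```shell
#!/usr/bin/env bash
# Sketch: keep at most N lines per distinct value of column 1.
# top_n_per_key is a hypothetical helper name, not part of csvtk or sort.
# Assumes tab-separated input on stdin, already in the desired order.
top_n_per_key() {
    local n="$1"
    # seen[$1] counts the rows met so far for this column-1 value.
    # The pattern is true (so awk prints the line) only while that
    # count is still <= n; later rows with the same key are dropped.
    awk -F'\t' -v n="$n" '++seen[$1] <= n + 0'
}

# Usage (file names taken from the question):
#   top_n_per_key 5 < file_sorted > file_sorted_top5
```

Because the counter lives in an associative array, lines with the same key do not need to be adjacent; the output simply preserves input order, so the earlier sort decides which lines count as "top".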

modified 25 days ago • written 25 days ago by shenwei356 (5.7k)

Cool! It does exactly what I was looking for! Thanks a lot.

written 25 days ago by niccolo.alfano (0)

I've moved shenwei's comment to an answer. Please accept it so the post is marked as solved.

written 25 days ago by _r_am (32k)