Get the top X number of lines per unique value in one column, once you've sorted a text file using 'sort'
16 months ago

Hi everybody,

I have a text file with 19 columns (divided by tab) which I have sorted using a command such as:

sort -t$'\t' -k1,1 -k11,11g -k12,12gr -k3,3g file > file_sorted


Now I would like to keep the top X number of lines per unique value in column 1. I know that if I do:

sort -u -k1,1 --merge file_sorted > file_sorted_merged


I will keep only the 1st line for each unique value in column 1. How can I keep the top X (for example, the top 5) lines for the same value in column 1 from the sorted file?

bash shell sort

EDIT: You should edit your question and add how this is related to bioinformatics, or the post might be closed as off-topic.

Either switch to something with more in-memory state, like R or Python, or use sub-shells: the sub-shell picks the unique values in the column, and awk then picks the top N lines for each of those values.

There would be an awful lot of trial and error and column-specific wrangling if you use awk, so I'd recommend using R.
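That said, for a file already sorted on the key column, a single awk pass is usually enough on its own. A minimal sketch (the sample data and N=2 are hypothetical; the question's file has 19 tab-separated columns and would use N=5):

```shell
# Hypothetical tiny sorted input: key in column 1.
printf 'a\t1\na\t2\na\t3\nb\t1\n' > file_sorted

# Keep at most 2 lines per unique value in column 1.
# ++seen[$1] increments a per-key counter; awk prints a line while the
# condition is true, i.e. for the first 2 occurrences of each key.
awk -F'\t' '++seen[$1] <= 2' file_sorted > file_top
```

Because the file is pre-sorted, the kept lines for each key are that key's top-ranked ones.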

16 months ago

I've got a tool, csvtk, whose uniq command can do exactly what you want; check the last example.

csvtk uniq -t -f 1 -n 5


The logic behind it is simple: use a map/hash table (column value -> count) to track how many times you have seen a row with a certain value in the column you care about. If the count is <= N, print the line.
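That counting logic can be written out explicitly in awk (a sketch with hypothetical 2-column data and N=2; the array `count` is the map described above):

```shell
# Hypothetical sorted sample; the real file has 19 tab-separated columns.
printf 'g1\ta\ng1\tb\ng1\tc\ng2\td\n' > file_sorted

# count[$1] maps column-1 value -> number of rows seen so far for that value.
awk -F'\t' -v N=2 '
{
    count[$1]++            # one more row seen for this key
    if (count[$1] <= N)    # still within the top N for this key
        print
}' file_sorted > file_top
```

No sorting is done here, so the input must already be ordered the way you want the "top" lines ranked.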


Cool! It does exactly what I was looking for! Thanks a lot.


I've moved shenwei's comment to an answer. Please accept it so the post is marked as solved.