Question: how to remove rows based on certain characters
5 days ago by
zqwu0 wrote:

Dear all,

I have a tab-separated file with over 30,000 rows, and I want to remove some rows based on the values in certain columns.

For example:

Name    Len Name2   Order
KCNQ2_32937 2535    KCNQ2   32937
KCNQ2_32938 2733    KCNQ2   32938
KCNQ2_32939 2616    KCNQ2   32939
KCNQ2_32940 2544    KCNQ2   32940
KCNQ2_32941 1809    KCNQ2   32941


The filter works like this:

When several rows share the same value in the Name2 column, I want to keep only the row with the largest value in the Len column:

Name    Len Name2   Order
KCNQ2_32938 2733    KCNQ2   32938


How can I do it like this?


R
modified 5 days ago by 5heikki6.1k • written 5 days ago by zqwu0
5 days ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum91k wrote:

Sort on column 3 and then on column 2 (numeric, descending), then run a stable, unique sort on column 3 so that only the first row of each group (the one with the largest Len) is kept:

sort -t $'\t' -k3,3 -k2,2rn input.tsv | sort -t $'\t' -k3,3 -u --stable
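
As a minimal sketch (not part of the original answer), the two-pass sort can be tried on the sample rows from the question. The file name `sample.tsv` and the two `GENEB` rows are hypothetical, added only to show a second group. The header is set aside first, because `sort` would otherwise treat it as an ordinary data row, and the delimiter is a single tab character.

```shell
# Use the C locale so the sort order is reproducible.
export LC_ALL=C
tab=$(printf '\t')

# Sample rows from the question; the GENEB rows are hypothetical.
{
  printf 'Name\tLen\tName2\tOrder\n'
  printf 'KCNQ2_32937\t2535\tKCNQ2\t32937\n'
  printf 'KCNQ2_32938\t2733\tKCNQ2\t32938\n'
  printf 'KCNQ2_32939\t2616\tKCNQ2\t32939\n'
  printf 'KCNQ2_32940\t2544\tKCNQ2\t32940\n'
  printf 'KCNQ2_32941\t1809\tKCNQ2\t32941\n'
  printf 'GENEB_1\t100\tGENEB\t1\n'
  printf 'GENEB_2\t300\tGENEB\t2\n'
} > sample.tsv

# Print the header unchanged, then filter the data rows:
# pass 1 groups by Name2 with the largest Len first,
# pass 2 keeps only the first row of each Name2 group.
result=$(
  head -n 1 sample.tsv
  tail -n +2 sample.tsv |
    sort -t "$tab" -k3,3 -k2,2rn |
    sort -t "$tab" -k3,3 -u -s
)
printf '%s\n' "$result"
```

This should print the header followed by one row per Name2 value, each carrying the largest Len of its group.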
modified 4 days ago • written 5 days ago by Pierre Lindenbaum91k

thanks. It is fast and exactly what I need.

written 5 days ago by zqwu0

If this answer solved your problem then go ahead and "accept" (green check mark). @5heikki's answer which appears to have been written almost at the same time may also be fine and can be accepted in addition to @Pierre's.

modified 4 days ago • written 4 days ago by genomax224k
5 days ago by
5heikki6.1k wrote:
sort -t $'\t' -k3,3 -k2,2gr file | sort -t $'\t' -u -k3,3

Also: man sort

written 5 days ago by 5heikki6.1k
