Question: Bash: find duplicates based on 2 columns and keep duplicate with lowest value in a different column
0
gravatar for m93
14 months ago by
m93100
m93100 wrote:

I have a file like this:

chr snpid position alleleA alleleB score
chr1 id1 1720593 A T 2
chr4 id2 1795915 T A 2
chr9 id3 2405331 C G 0
chr10 id4 2791133 G C 1
chr11 id5 2849732 A C 2
chr13 id6 3220462 C A 4
chr19 id8 3278835 T A 3
chr19 id8 3278835 T A 4

I would like to first find the duplicates rows based on column2. I managed (I think) to do this using:

awk '{if (x[$2]) { y[$2]++; print $0; if (y[$2] == 1) { print x[$2] } } x[$2] = $0}' filename.txt > duplicates.txt

This gives me:

chr19 id8 3278835 T A 3
chr19 id8 3278835 T A 4

However, when I get the duplicates (and there will be multiple pairs in my actual file), for each pair, I want to find the row with the lowest value in column 6. This means my final output for this example would be a file with the row:

 chr14 id8 3278835 T A 3

Is there any way I can easily combine both steps? Should I be using a loop? Any help would be deeply appreciated.

awk bash duplicates • 2.4k views
ADD COMMENTlink modified 14 months ago by shenwei3564.2k • written 14 months ago by m93100
6
gravatar for Kevin Blighe
14 months ago by
Kevin Blighe32k
Republic of Ireland
Kevin Blighe32k wrote:

This is a somewhat cheeky way of doing it.

Re-using your code, I then numerically sort the data based on column 6 so that, for the duplicates that are printed, the min value in column 6 will always appear first. I then pipe this further into another awk command that only prints unique lines based on a key comprised of columns 1-5. When awk does this, it always prints the first line in the matching pair/group, which, because we've sorted it, will always have the minimum value in column 6.

awk '{if (x[$2]) { y[$2]++; print $0; if (y[$2] == 1) { print x[$2] } } x[$2] = $0}' filename.txt | sort -k6n | awk '!x[$1$2$3$4$5]++'

chr19 id8 3278835 T A 3

I have tested it and it works when there are other duplicate fields in the data, and also if you ever had bizarre values like 001 or 02 (sort with -kn will know that 001 is less than 02 or 2)

ADD COMMENTlink modified 14 months ago • written 14 months ago by Kevin Blighe32k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 769 users visited in the last hour