Bash: find duplicates based on 2 columns and keep duplicate with lowest value in a different column
1
0
Entering edit mode
6.7 years ago
m98 ▴ 420

I have a file like this:

chr snpid position alleleA alleleB score
chr1 id1 1720593 A T 2
chr4 id2 1795915 T A 2
chr9 id3 2405331 C G 0
chr10 id4 2791133 G C 1
chr11 id5 2849732 A C 2
chr13 id6 3220462 C A 4
chr19 id8 3278835 T A 3
chr19 id8 3278835 T A 4

I would like to first find the duplicates rows based on column2. I managed (I think) to do this using:

awk '{if (x[$2]) { y[$2]++; print $0; if (y[$2] == 1) { print x[$2] } } x[$2] = $0}' filename.txt > duplicates.txt

This gives me:

chr19 id8 3278835 T A 3
chr19 id8 3278835 T A 4

However, when I get the duplicates (and there will be multiple pairs in my actual file), for each pair, I want to find the row with the lowest value in column 6. This means my final output for this example would be a file with the row:

 chr14 id8 3278835 T A 3

Is there any way I can easily combine both steps? Should I be using a loop? Any help would be deeply appreciated.

bash awk duplicates • 11k views
ADD COMMENT
6
Entering edit mode
6.7 years ago

This is a somewhat cheeky way of doing it.

Re-using your code, I then numerically sort the data based on column 6 so that, for the duplicates that are printed, the min value in column 6 will always appear first. I then pipe this further into another awk command that only prints unique lines based on a key comprised of columns 1-5. When awk does this, it always prints the first line in the matching pair/group, which, because we've sorted it, will always have the minimum value in column 6.

awk '{if (x[$2]) { y[$2]++; print $0; if (y[$2] == 1) { print x[$2] } } x[$2] = $0}' filename.txt | sort -k6n | awk '!x[$1$2$3$4$5]++'

chr19 id8 3278835 T A 3

I have tested it and it works when there are other duplicate fields in the data, and also if you ever had bizarre values like 001 or 02 (sort with -kn will know that 001 is less than 02 or 2)

ADD COMMENT

Login before adding your answer.

Traffic: 1298 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6