Question

Tutorial:Removing rows with duplicate values based on 1 or more key values - linux awk

5

Entering edit mode

6.9 years ago

Kevin Blighe 88k

Excel has a great feature to remove rows with duplicate values, but how to do it in linux?

Assuming data is tab-delimited and is sorted or unsorted:

Remove duplicates using column 1 as key

awk '!a[$1]++' MyData.tsv > MyData.DupsRemoved.tsv

Remove duplicates using columns 1, 2, and 3 as key

awk '!a[$1$2$3]++' MyData.tsv > MyData.DupsRemoved.tsv

If you have a different separator, specify with -F, e.g. awk -F ',' for comma-separated

awk linux bash • 14k views

ADD COMMENT • link updated 14 months ago by Ram 44k • written 6.9 years ago by Kevin Blighe 88k

2

Entering edit mode

awk '!a[$1$2$3]++' is not ok for this data, some special joinning symbols are needed.

col1    col2    col3
1       2       3
12      3
1               23

ADD REPLY • link 6.9 years ago by shenwei356 8.6k

0

Entering edit mode

Kevin, it will be great if you can explain this to us.

ADD REPLY • link 6.9 years ago by lakhujanivijay 5.8k

0

Entering edit mode

Hi Vijay, sorry that I did not give any sample data.

If we have the following data in MyData.tsv:

chr start   end ref var
2   345000  345000  T   G
2   345000  345000  T   G
2   376876  376876  A   C
2   345000  345000  T   C
2   345000  345000  T   G

awk '!a[$1]++' MyData.tsv (using column #1 as key) will produce:

chr start   end ref var
2   345000  345000  T   G

awk '!a[$1$2$3$4$5]++' MyData.tsv (using all columns as key) will produce:

chr start   end ref var
2   345000  345000  T   G
2   376876  376876  A   C
2   345000  345000  T   C

It is mainly useful for very large datasets of any type when you want to remove any duplicate rows

ADD REPLY • link 6.9 years ago by Kevin Blighe 88k

0

Entering edit mode

This is a neat trick, thank you!

ADD REPLY • link 6.9 years ago by Ram 44k

score 4 · Answer 1 · 2017-09-10

It's fast as compared to a sort-based approach, but one should be careful building a hash table from very large datasets, to be sure that one has a computer with sufficient memory to store the intermediate hash table.

Another option with very large datasets is to sort the input. It is easy to remove duplicates from a sorted list. For non-BED files, one could specify LC_ALL=C and use sort | uniq or, better, to use sort -u to get uniques.

Sorting takes time, but it usually uses far less memory. Setting LC_ALL=C treats input as if it has single-byte characters, which speeds up sorting considerably. This will almost always work for genomic data, which rarely contains two- or four-byte characters such as those found in extended Unicode.

Processing of multibyte characters requires more resources and is slower. If you tell your computer to assume the input has single-byte characters, fewer resources are needed.

If you're sorting BED files (like your sample TSV file, minus the header line), one could use a sort-bed - | uniq approach. The sort-bed tool uses some tricks to be faster than GNU sort at sorting BED files.