Tutorial: Removing rows with duplicate values based on one or more key columns - Linux awk

Excel has a great feature to remove rows with duplicate values, but how do you do it in Linux?

Assuming the data is tab-delimited (it can be sorted or unsorted):

Remove duplicates using column 1 as key

awk '!a[$1]++' MyData.tsv > MyData.DupsRemoved.tsv

Remove duplicates using columns 1, 2, and 3 as key

awk '!a[$1$2$3]++' MyData.tsv > MyData.DupsRemoved.tsv

If your data has a different separator, specify it with -F, e.g. awk -F ',' for comma-separated data.
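
For example, on a comma-separated file (the file names here are just placeholders):

awk -F ',' '!a[$1]++' MyData.csv > MyData.DupsRemoved.csv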

awk '!a[$1$2$3]++' is not OK for this data; some separator is needed when joining the fields into the key:

col1    col2    col3
1       2       3
12      3
1               23
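
One way to avoid those collisions (a sketch, not part of the original post) is awk's multi-dimensional subscript: a comma between the fields joins them with SUBSEP, a control character that is very unlikely to appear in the data, so the three rows above get three distinct keys:

awk '!a[$1,$2,$3]++' MyData.tsv > MyData.DupsRemoved.tsv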

Kevin, it would be great if you could explain this to us.

Hi Vijay, sorry that I did not give any sample data.

If we have the following data in MyData.tsv:

chr  start   end     ref  var
2    345000  345000  T    G
2    345000  345000  T    G
2    376876  376876  A    C
2    345000  345000  T    C
2    345000  345000  T    G

awk '!a[$1]++' MyData.tsv (using column #1 as key) will produce:

chr  start   end     ref  var
2    345000  345000  T    G

awk '!a[$1$2$3$4$5]++' MyData.tsv (using all columns as key) will produce:

chr  start   end     ref  var
2    345000  345000  T    G
2    376876  376876  A    C
2    345000  345000  T    C

It is mainly useful for very large datasets of any type where you want to remove duplicate rows.
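
As an aside (not from the original reply), if the goal is simply to drop rows that are identical across every column, keying on the whole line sidesteps any question of how the fields are joined:

awk '!a[$0]++' MyData.tsv > MyData.DupsRemoved.tsv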

This is a neat trick, thank you!

The awk approach is fast compared with a sort-based approach, but be careful when building a hash table from a very large dataset: make sure your computer has enough memory to hold the intermediate table.

Another option for very large datasets is to sort the input, since it is easy to remove duplicates from a sorted list. For non-BED files, you can set LC_ALL=C and use sort | uniq or, better, sort -u to get unique rows.

Sorting takes time, but it usually uses far less memory. Setting LC_ALL=C treats the input as if it contains only single-byte characters, which speeds up sorting considerably. This will almost always work for genomic data, which rarely contains the two- or four-byte characters found in extended Unicode.

Processing multibyte characters requires more resources and is slower; telling your computer to assume the input contains only single-byte characters means fewer resources are needed.
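
As a concrete sketch (the file names are placeholders), whole-line duplicates can be removed with:

LC_ALL=C sort -u MyData.tsv > MyData.DupsRemoved.tsv

and, if the file has a header line that should stay at the top:

( head -n 1 MyData.tsv; tail -n +2 MyData.tsv | LC_ALL=C sort -u ) > MyData.DupsRemoved.tsv

Note that, unlike the awk one-liner, sorting does not preserve the original row order.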

If you're sorting BED files (like your sample TSV file, minus the header line), you could use a sort-bed - | uniq approach. The sort-bed tool uses some tricks to be faster than GNU sort at sorting BED files.
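
For example, assuming BEDOPS is installed and the input is a valid, headerless BED file:

sort-bed MyData.bed | uniq > MyData.DupsRemoved.bed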
