Tutorial: Removing rows with duplicate values based on one or more key columns - linux awk
Kevin Blighe (University College London Cancer Institute) wrote:

Excel has a great feature for removing rows with duplicate values, but how do you do it on Linux?

Assuming the data is tab-delimited (it can be sorted or unsorted):

Remove duplicates using column 1 as the key:

awk '!a[$1]++' MyData.tsv > MyData.DupsRemoved.tsv

Remove duplicates using columns 1, 2, and 3 as the key:

awk '!a[$1$2$3]++' MyData.tsv > MyData.DupsRemoved.tsv

If you have a different separator, specify it with -F, e.g. awk -F ',' for comma-separated data.
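For example, a minimal sketch for a comma-separated file (the filenames here are just placeholders):

awk -F ',' '!a[$1]++' MyData.csv > MyData.DupsRemoved.csv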


awk '!a[$1$2$3]++' is not OK for data like this; a special joining symbol between the fields is needed (see the sketch after the example data).

col1    col2    col3
1       2       3
12      3
1               23
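For example, awk's comma subscript inserts the built-in SUBSEP character between the fields, which keeps the three data rows above distinct; a minimal sketch (the filename is a placeholder):

awk '!a[$1,$2,$3]++' MyData.tsv > MyData.DupsRemoved.tsv

Using a[$1 FS $2 FS $3] (joining with the field separator itself) is another common option.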
-- shenwei356

Good catch! How would you join the fields inside the a[$1$2$3] notation?

-- Ram

I would hope that these blank fields were coded as 'NA'; otherwise, this type of data will cause problems for virtually all programs/scripts.

-- Kevin Blighe

Kevin, it would be great if you could explain this to us.

-- Vijay Lakhujani

Hi Vijay, sorry that I did not give any sample data.

If we have the following data in MyData.tsv:

chr start   end ref var
2   345000  345000  T   G
2   345000  345000  T   G
2   376876  376876  A   C
2   345000  345000  T   C
2   345000  345000  T   G

awk '!a[$1]++' MyData.tsv (using column #1 as key) will produce:

chr start   end ref var
2   345000  345000  T   G

awk '!a[$1$2$3$4$5]++' MyData.tsv (using all columns as key) will produce:

chr start   end ref var
2   345000  345000  T   G
2   376876  376876  A   C
2   345000  345000  T   C

It is mainly useful for very large datasets of any type when you want to remove duplicate rows.
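If every column should form the key, a whole-line subscript also works; a minimal sketch (the filename is a placeholder):

awk '!a[$0]++' MyData.tsv > MyData.DupsRemoved.tsv

Note that this treats lines as duplicates only if they match byte-for-byte, including whitespace.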

-- Kevin Blighe

This is a neat trick, thank you!

-- Ram
Alex Reynolds (Seattle, WA USA) wrote:

It's fast compared to a sort-based approach, but one should be careful when building a hash table from very large datasets: make sure the computer has sufficient memory to hold the intermediate hash table.

Another option for very large datasets is to sort the input, since it is easy to remove duplicates from a sorted list. For non-BED files, one could specify LC_ALL=C and use sort | uniq or, better, sort -u to get unique lines.

Sorting takes time, but it usually uses far less memory. Setting LC_ALL=C treats input as if it has single-byte characters, which speeds up sorting considerably. This will almost always work for genomic data, which rarely contains two- or four-byte characters such as those found in extended Unicode.

Processing of multibyte characters requires more resources and is slower. If you tell your computer to assume the input has single-byte characters, fewer resources are needed.
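A minimal sketch of the sort-based approach (placeholder filenames; the second form keeps a single header line in place):

LC_ALL=C sort -u MyData.noheader.tsv > MyData.DupsRemoved.tsv

(head -n 1 MyData.tsv; tail -n +2 MyData.tsv | LC_ALL=C sort -u) > MyData.DupsRemoved.tsv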

If you're sorting BED files (like your sample TSV file, minus the header line), you could use a sort-bed - | uniq approach. The sort-bed tool uses some tricks to be faster than GNU sort at sorting BED files.
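A minimal sketch, assuming sort-bed from BEDOPS is installed and the header line is stripped first (filenames are placeholders):

tail -n +2 MyData.tsv | sort-bed - | uniq > MyData.DupsRemoved.bed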
