Tutorial: Removing rows with duplicate values based on 1 or more key values - linux awk
Kevin Blighe (Republic of Ireland) wrote 11 days ago:

Excel has a great feature to remove rows with duplicate values, but how do you do it in Linux?

Assuming the data is tab-delimited (it can be sorted or unsorted):

Remove duplicates using column 1 as key

awk '!a[$1]++' MyData.tsv > MyData.DupsRemoved.tsv

Remove duplicates using columns 1, 2, and 3 as key

awk '!a[$1$2$3]++' MyData.tsv > MyData.DupsRemoved.tsv

If you have a different separator, specify it with -F, e.g. awk -F ',' for comma-separated data.
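
For example, a quick sketch with a hypothetical comma-separated file (same idea, different separator):

awk -F ',' '!a[$1]++' MyData.csv > MyData.DupsRemoved.csv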

Tags: bash • awk • linux • tutorial • duplicates

awk '!a[$1$2$3]++' is not OK for this data; some special joining symbol is needed between the fields:

col1    col2    col3
1       2       3
12      3
1               23
written 11 days ago by shenwei356

Good catch! How would you join the fields in the a[$1$2$3] notation?

written 11 days ago by Ram
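
A minimal sketch of one possible fix: use awk's comma subscript, which joins the fields with the built-in SUBSEP character, so that distinct rows no longer collapse onto the same concatenated key:

awk -F '\t' '!a[$1,$2,$3]++' MyData.tsv > MyData.DupsRemoved.tsv

Setting -F '\t' also preserves empty columns, so the three data rows above produce three different keys instead of all becoming 123.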

I would hope that these blank fields were coded as 'NA'; otherwise, this type of data will cause problems for virtually all programs/scripts.

written 10 days ago by Kevin Blighe

Kevin, it would be great if you could explain this to us.

written 11 days ago by Vijay Lakhujani

Hi Vijay, sorry that I did not give any sample data.

If we have the following data in MyData.tsv:

chr start   end ref var
2   345000  345000  T   G
2   345000  345000  T   G
2   376876  376876  A   C
2   345000  345000  T   C
2   345000  345000  T   G

awk '!a[$1]++' MyData.tsv (using column #1 as key) will produce:

chr start   end ref var
2   345000  345000  T   G

awk '!a[$1$2$3$4$5]++' MyData.tsv (using all columns as key) will produce:

chr start   end ref var
2   345000  345000  T   G
2   376876  376876  A   C
2   345000  345000  T   C

It is mainly useful for very large datasets of any type, when you want to remove any duplicate rows.

written 11 days ago by Kevin Blighe

This is a neat trick, thank you!

written 11 days ago by Ram
Alex Reynolds (Seattle, WA, USA) wrote 10 days ago:

The awk hash approach is fast compared with a sort-based approach, but one should be careful when building a hash table from a very large dataset: be sure your computer has sufficient memory to hold the intermediate hash table.

Another option for very large datasets is to sort the input; it is easy to remove duplicates from a sorted list. For non-BED files, one could set LC_ALL=C and use sort | uniq or, better, sort -u to get unique lines.

Sorting takes time, but it usually uses far less memory. Setting LC_ALL=C treats input as if it has single-byte characters, which speeds up sorting considerably. This will almost always work for genomic data, which rarely contains two- or four-byte characters such as those found in extended Unicode.

Processing of multibyte characters requires more resources and is slower. If you tell your computer to assume the input has single-byte characters, fewer resources are needed.
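
For example, a minimal sketch on the tab-delimited file from above (note that a plain sort -u will sort the header line in with the data; the second form keeps the header on top):

LC_ALL=C sort -u MyData.tsv > MyData.DupsRemoved.tsv

( head -n 1 MyData.tsv ; tail -n +2 MyData.tsv | LC_ALL=C sort -u ) > MyData.DupsRemoved.tsv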

If you're sorting BED files (like your sample TSV file, minus the header line), you could use a sort-bed - | uniq approach. The sort-bed tool uses some tricks to be faster than GNU sort at sorting BED files.
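
A sketch of that approach on the sample file above, assuming sort-bed (from BEDOPS) is installed and that the extra ref/var columns ride along after the first three BED fields; the tail call just strips the header:

tail -n +2 MyData.tsv | sort-bed - | uniq > MyData.DupsRemoved.bed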

written 10 days ago by Alex Reynolds