Tutorial: Removing rows with duplicate values based on one or more key columns - Linux awk
Kevin Blighe wrote:

Excel has a great feature for removing rows with duplicate values, but how can we do it on Linux?

Assuming the data is tab-delimited (it can be sorted or unsorted):

Remove duplicates using column 1 as key

awk '!a[$1]++' MyData.tsv > MyData.DupsRemoved.tsv
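
For anyone new to the idiom: a is an associative array keyed on column 1. The expression a[$1]++ evaluates to the number of times that key has been seen so far (0 on the first occurrence) and then increments it, and the leading ! makes the pattern true only for that first occurrence, so awk prints the line. A longer-hand sketch of the same logic (the array name 'seen' is just an illustrative choice):

awk '{
  # seen[] is an associative array keyed on the first field
  if (!($1 in seen)) {   # first time this key appears
    print                # print the whole line
  }
  seen[$1] = 1           # mark the key as seen
}' MyData.tsv > MyData.DupsRemoved.tsv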

Remove duplicates using columns 1, 2, and 3 as key

awk '!a[$1$2$3]++' MyData.tsv > MyData.DupsRemoved.tsv

If you have a different separator, specify it with -F, e.g. awk -F ',' for comma-separated data
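
For example, a sketch of the same one-liner on a hypothetical comma-separated file (MyData.csv is an illustrative name, not from the post):

awk -F ',' '!a[$1]++' MyData.csv > MyData.DupsRemoved.csv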

awk '!a[$1$2$3]++' is not OK for data like the following; a joining symbol is needed between the fields, otherwise different rows can collapse to the same key.

col1    col2    col3
1       2       3
12      3
1               23
— shenwei356

Good catch! How would you join the fields in the a[$1$2$3] notation?

— Ram
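
One way to avoid the collision (a sketch, not an answer given in the thread): put a separator between the fields in the array key, either with awk's comma subscript syntax, which joins the subscripts with the built-in SUBSEP character, or explicitly with the field separator FS:

awk '!a[$1,$2,$3]++' MyData.tsv > MyData.DupsRemoved.tsv
awk '!a[$1 FS $2 FS $3]++' MyData.tsv > MyData.DupsRemoved.tsv

With the comma form the key is effectively $1 SUBSEP $2 SUBSEP $3, and SUBSEP defaults to a non-printing character (\034) that is very unlikely to appear in tabular data, so the three example rows above no longer collapse to the same key. For truly tab-delimited data with empty fields, combine this with -F '\t' so that the empty fields stay in their columns.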

I would hope that these blank fields were coded as 'NA'; otherwise, this type of data will cause problems for virtually all programs and scripts.

— Kevin Blighe

Kevin, it would be great if you could explain this to us.

— Vijay Lakhujani

Hi Vijay, sorry that I did not give any sample data.

If we have the following data in MyData.tsv:

chr start   end ref var
2   345000  345000  T   G
2   345000  345000  T   G
2   376876  376876  A   C
2   345000  345000  T   C
2   345000  345000  T   G

awk '!a[$1]++' MyData.tsv (using column #1 as key) will produce:

chr start   end ref var
2   345000  345000  T   G

awk '!a[$1$2$3$4$5]++' MyData.tsv (using all columns as key) will produce:

chr start   end ref var
2   345000  345000  T   G
2   376876  376876  A   C
2   345000  345000  T   C

It is mainly useful for very large datasets of any type where you want to remove duplicate rows.

— Kevin Blighe

This is a neat trick, thank you!

— Ram
Alex Reynolds wrote:

It's fast compared with a sort-based approach, but be careful when building a hash table from a very large dataset: make sure the computer has enough memory to hold the intermediate hash table.

Another option with very large datasets is to sort the input. It is easy to remove duplicates from a sorted list. For non-BED files, one could specify LC_ALL=C and use sort | uniq or, better, sort -u to get the unique rows.

Sorting takes time, but it usually uses far less memory. Setting LC_ALL=C treats input as if it has single-byte characters, which speeds up sorting considerably. This will almost always work for genomic data, which rarely contains two- or four-byte characters such as those found in extended Unicode.

Processing of multibyte characters requires more resources and is slower. If you tell your computer to assume the input has single-byte characters, fewer resources are needed.
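
Put together, a sketch of the sort-based approach on a general text file, deduplicating whole lines (file names follow the ones used above):

LC_ALL=C sort MyData.tsv | uniq > MyData.DupsRemoved.tsv
LC_ALL=C sort -u MyData.tsv > MyData.DupsRemoved.tsv

Note that the output is in sorted order rather than the original row order. If you only want uniqueness on a key column rather than the whole line, GNU sort can restrict the comparison with -k, e.g. sort -t $'\t' -k1,1 -u, though unlike the awk idiom it does not guarantee that the surviving row is the first one from the original file.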

If you're sorting BED files (like your sample TSV file, minus the header line), you could use a sort-bed - | uniq approach, as sketched below. The sort-bed tool uses some tricks to be faster than GNU sort at sorting BED files.
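
A sketch of that pipeline on the sample file, assuming the first line is a header and the remaining columns are BED-compatible (the tail command to drop the header is my addition, not part of the answer):

tail -n +2 MyData.tsv | sort-bed - | uniq > MyData.DupsRemoved.bed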
