It's fast compared with a sort-based approach, but be careful when building a hash table from very large datasets: make sure your computer has enough memory to hold the intermediate hash table.
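For context, a minimal sketch of the hash-based one-liner being discussed (the input file here is made up for illustration):

```shell
# a[$0] uses the whole line as a key in awk's in-memory hash table,
# so memory use grows with the number of distinct lines.
printf 'x\ny\nx\n' > lines.txt
awk '!a[$0]++' lines.txt   # prints only the first occurrence of each line
```

The expression `!a[$0]++` is true the first time a line is seen (the count is zero) and false afterwards, so each distinct line is printed once, in its original order.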
Another option with very large datasets is to sort the input; it is easy to remove duplicates from a sorted list. For non-BED files, one could set LC_ALL=C and use sort | uniq or, better, sort -u to get unique lines.
Sorting takes time, but it usually uses far less memory. Setting LC_ALL=C treats the input as if it contains only single-byte characters, which speeds up sorting considerably. This will almost always work for genomic data, which rarely contains two- or four-byte characters such as those found in extended Unicode. Processing multibyte characters requires more resources and is slower, so telling your computer to assume single-byte input saves work.
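The sort-based approach above can be sketched as follows (the filename is hypothetical):

```shell
# Sort-based de-duplication of a large TSV.
# LC_ALL=C forces byte-wise comparison, which is much faster than
# locale-aware collation and fine for typical genomic text.
printf 'chr2\t100\nchr1\t50\nchr2\t100\n' > sample.tsv
LC_ALL=C sort -u sample.tsv > sample.uniq.tsv
# sample.uniq.tsv now holds each distinct line once, in byte order.
```

Note that, unlike the awk hash approach, this does not preserve the original line order.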
If you're sorting BED files (like your sample TSV file, minus the header line), you could use a sort-bed - | uniq approach. The sort-bed tool uses some tricks to be faster than GNU sort at sorting BED files.
awk '!a[$1$2$3]++' is not OK for this data; the fields need a separator when they are joined into the key.
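To illustrate the problem with a made-up two-column file: plain concatenation of fields can merge distinct rows, because awk has no way to tell where one field ends and the next begins. Using a comma in the subscript makes awk join the fields with its built-in SUBSEP separator, which avoids the collision:

```shell
# Rows ("a","bc") and ("ab","c") both concatenate to the key "abc".
printf 'a\tbc\nab\tc\n' > demo.tsv
awk '!a[$1$2]++' demo.tsv    # collision: only the first row survives
awk '!a[$1,$2]++' demo.tsv   # comma inserts SUBSEP: both rows survive
```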
Kevin, it would be great if you could explain this to us.
Hi Vijay, sorry that I did not give any sample data.
If we have the following data in MyData.tsv:
awk '!a[$1]++' MyData.tsv (using column #1 as key) will produce:
awk '!a[$1$2$3$4$5]++' MyData.tsv (using all columns as key) will produce:
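Since the sample data didn't come through above, here is a small made-up file showing the difference between a one-column key and an all-columns key (with a separator in the subscript, per the earlier comment):

```shell
# Hypothetical data: chr1 appears twice with different positions.
printf 'chr1\t10\nchr1\t20\nchr2\t10\n' > MyDemo.tsv
awk '!a[$1]++' MyDemo.tsv      # column 1 as key: one row per chromosome
awk '!a[$1,$2]++' MyDemo.tsv   # both columns as key: all rows are distinct
```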
It is mainly useful for very large datasets of any type, when you want to remove duplicate rows.
This is a neat trick, thank you!