10 days ago by
Seattle, WA USA
It's fast compared to a sort-based approach, but be careful when building a hash table from a very large dataset: make sure the computer has enough memory to hold the intermediate hash table.
Another option with very large datasets is to sort the input. It is easy to remove duplicates from a sorted list. For non-BED files, one could specify LC_ALL=C and use sort | uniq or, better, sort -u to get unique lines.
Sorting takes time, but it usually uses far less memory. Setting LC_ALL=C treats input as if it has single-byte characters, which speeds up sorting considerably. This will almost always work for genomic data, which rarely contains two- or four-byte characters such as those found in extended Unicode.
Processing of multibyte characters requires more resources and is slower. If you tell your computer to assume the input has single-byte characters, fewer resources are needed.
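As a minimal sketch of the sort-based approach for a non-BED file: the file names and sample contents below are made up for illustration. LC_ALL=C is set only for the sort command itself, so the rest of the shell session keeps its normal locale.

```shell
# Hypothetical working directory and input file (names are illustrative only).
work=$(mktemp -d)
printf 'chr2\t5\t9\nchr1\t10\t20\nchr2\t5\t9\nchr1\t10\t20\n' > "$work/regions.txt"

# LC_ALL=C makes sort compare raw bytes instead of using locale-aware collation.
# -u keeps a single copy of each distinct line.
LC_ALL=C sort -u "$work/regions.txt" > "$work/regions.uniq.txt"

cat "$work/regions.uniq.txt"
```

If you prefer the sort | uniq form, put LC_ALL=C directly in front of sort, since that is the command doing the expensive comparisons.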
If you're sorting BED files (like your sample TSV file, minus the header line), you could use a sort-bed - | uniq approach. The sort-bed tool uses some tricks to sort BED files faster than GNU sort.
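Here is a rough sketch of the BED case, assuming a tab-separated file with a single header line. The file names and contents are hypothetical. The fallback branch is my own assumption for machines without sort-bed installed; it is an ordinary GNU sort by chromosome, start, and end, not a reimplementation of sort-bed.

```shell
# Hypothetical TSV with one header line (names and contents are illustrative).
bed_tmp=$(mktemp -d)
printf 'chrom\tstart\tend\nchr2\t5\t9\nchr1\t10\t20\nchr2\t5\t9\n' > "$bed_tmp/sample.tsv"

tab=$(printf '\t')

if command -v sort-bed >/dev/null 2>&1; then
    # tail -n +2 drops the header; sort-bed reads standard input via "-".
    tail -n +2 "$bed_tmp/sample.tsv" | sort-bed - | uniq > "$bed_tmp/sample.uniq.bed"
else
    # Assumed fallback: sort by chromosome (bytewise), then numeric start and end.
    tail -n +2 "$bed_tmp/sample.tsv" \
        | LC_ALL=C sort -t "$tab" -k1,1 -k2,2n -k3,3n \
        | uniq > "$bed_tmp/sample.uniq.bed"
fi

cat "$bed_tmp/sample.uniq.bed"
```

Because uniq only collapses adjacent identical lines, it has to come after the sort step in either branch.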