marit.hetland, 5.0 years ago
Hi, I have output from snp-dists (https://github.com/tseemann/snp-dists) in molten format, e.g.:
seq1    seq2    1
seq1    seq3    2
seq2    seq1    1
seq2    seq3    3
seq3    seq1    2
seq3    seq2    3
The third column gives the number of SNPs between the pair of sequences in columns 1 and 2. As you can see, every pair is listed twice, once as seq1 seq2 and once as seq2 seq1. How can I remove these duplicate rows, preferably in R or bash?
Let's do code golf with benchmarks. While we are at it, here is my Python version:
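The snippet itself is not reproduced in this copy of the thread. As a stand-in, here is a minimal sketch of one way to do it in Python; it is not necessarily the code that was benchmarked. The function name `unique_pairs` is made up for illustration, and the sketch assumes whitespace-separated columns and that the first occurrence of each pair is the one to keep.

```python
import io


def unique_pairs(lines):
    """Yield one (seq_a, seq_b, distance) row per unordered pair.

    Hypothetical helper: assumes three whitespace-separated columns,
    as in the molten example from the question.
    """
    seen = set()
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        a, b, dist = fields
        # Order the two names so (seq1, seq2) and (seq2, seq1) share a key.
        key = (a, b) if a <= b else (b, a)
        if key not in seen:
            seen.add(key)
            yield a, b, dist


# The six example rows from the question.
example = io.StringIO(
    "seq1\tseq2\t1\n"
    "seq1\tseq3\t2\n"
    "seq2\tseq1\t1\n"
    "seq2\tseq3\t3\n"
    "seq3\tseq1\t2\n"
    "seq3\tseq2\t3\n"
)

rows = list(unique_pairs(example))
for a, b, dist in rows:
    print(a, b, dist, sep="\t")
```

On the example this keeps the rows seq1 seq2 1, seq1 seq3 2 and seq2 seq3 3. To run it on a real file, pass an open file handle instead of the `io.StringIO` object.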
Benchmark: a file with 1 million entries (file size 1.7 MB).
My Python version took 0.1 seconds and used 18 MB of RAM.
The awk version took 0.3 seconds and used about 14 MB of RAM.
The first version of the R code took 0.5 seconds and used about 400 MB of RAM.
The simpler R code took 3 seconds and used about 400 MB of RAM.
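For anyone who wants to try a similar comparison, here is a rough sketch of how a Python version could be timed. It is not how the numbers above were obtained. The helpers `unique_pairs` and `make_molten` are hypothetical, the input is synthetic, and `tracemalloc` only counts memory allocated by Python objects, so its peak is not directly comparable to the process-level RAM figures quoted above.

```python
import random
import time
import tracemalloc


def unique_pairs(lines):
    """Yield one row per unordered pair (hypothetical helper)."""
    seen = set()
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        a, b, dist = fields
        key = (a, b) if a <= b else (b, a)
        if key not in seen:
            seen.add(key)
            yield a, b, dist


def make_molten(n_seqs, seed=0):
    """Build synthetic molten rows for every ordered pair of distinct names."""
    rng = random.Random(seed)
    names = [f"seq{i}" for i in range(n_seqs)]
    dist = {}
    rows = []
    for a in names:
        for b in names:
            if a == b:
                continue
            key = (a, b) if a <= b else (b, a)
            # Both orientations of a pair get the same made-up distance.
            d = dist.setdefault(key, rng.randint(0, 50))
            rows.append(f"{a}\t{b}\t{d}\n")
    return rows


rows = make_molten(200)  # 200 * 199 = 39,800 synthetic rows

tracemalloc.start()
start = time.perf_counter()
kept = sum(1 for _ in unique_pairs(rows))
elapsed = time.perf_counter() - start
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"{len(rows)} rows in, {kept} rows out")
print(f"{elapsed:.3f} s, peak traced memory {peak / 1e6:.1f} MB")
```

Raising `n_seqs` scales the input quadratically, so about 1,000 sequences gives roughly a million rows.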