Question

Remove rows with duplicate values based on two columns

1

Entering edit mode

5.0 years ago

marit.hetland ▴ 50

Hi, I have output from snp-dists (https://github.com/tseemann/snp-dists) in molten format, e.g.:

seq1    seq2    1
seq1    seq3    2
seq2    seq1    1
seq2    seq3    3
seq3    seq1    2
seq3    seq2    3

The third column gives the number of SNPs between the pair of sequences given in columns 1 and 2. As you can see, these values are duplicated, as it shows both the combination seq1 seq2 and seq2 seq1. How can I (in R or bash preferably) remove the duplicate values?

R bash snp-dists SNP • 1.7k views

ADD COMMENT • link updated 5.0 years ago by zx8754 12k • written 5.0 years ago by marit.hetland ▴ 50

1

Entering edit mode

5.0 years ago

Friederike 9.0k

in R, one way could be this:

>library(magrittr)
>df <- data.frame(one = sort(paste0("seq", rep(1:3,2))),
                 two = c(paste0("seq", c(2,3,1,2,1,2))),
                 SNP = c(1,2,1,3,2,3))
# add a column with a composite label that will always have the smaller seq first
>df$three <- apply(df, 1, function(x) paste(sort(c(x[["one"]], x[["two"]])), collapse=","))
# determine duplicates based on the composite label and the SNP entry
>df$duplicate <- df[, c("three","SNP")] %>% duplicated
# subset to include only those rows where $duplicate is FALSE (= opposite of TRUE)
>df[!df$duplicate,]
    one  two SNP     three duplicate
1 seq1 seq2   1 seq1,seq2     FALSE
2 seq1 seq3   2 seq1,seq3     FALSE
4 seq2 seq2   3 seq2,seq2     FALSE
6 seq3 seq2   3 seq2,seq3     FALSE

ADD COMMENT • link 5.0 years ago by Friederike 9.0k

0

Entering edit mode

5.0 years ago

mohammadhassanj ▴ 260

a python solution

file = map(lambda x:x.strip().split("\t"),open("yourTabDelimitedColumnFile.txt").readlines())
result = {}
for seq1,seq2,snp in file:
  if seq1+seq2 in result or (seq2+seq1 in result and result[seq2+seq1][-1] == snp):
    continue 
  result[seq1+seq2] = [seq1,seq2,snp]
with open("result.txt","w") as file :
  for line in result.values():
    file.write("\t".join(line)+"\n")

ADD COMMENT • link 5.0 years ago by mohammadhassanj ▴ 260

score 4 · Accepted Answer · 2020-10-19

4

Entering edit mode

5.0 years ago

zx8754 12k

Using awk:

$ awk '!(seen[$1,$2]++ || seen[$2,$1]++)' test.txt
seq1 seq2 1
seq1 seq3 2
seq2 seq3 3

Using R:

# example data
x <- read.table(text = "seq1    seq2    1
seq1    seq3    2
seq2    seq1    1
seq2    seq3    3
seq3    seq1    2
seq3    seq2    3")

# sort column values, then get unique
unique(data.frame(c1 = pmin(x$V1, x$V2), c2 = pmax(x$V1, x$V2), value = x$V3))
#     c1   c2 value
# 1 seq1 seq2     1
# 2 seq1 seq3     2
# 4 seq2 seq3     3

Using R again, a bit simpler and scales better when we have more than 2 columns, (Related StackOverflow post):

x[ !duplicated(apply(x[, 1:2], 1, sort), MARGIN = 2), ]

ADD COMMENT • link 5.0 years ago by zx8754 12k

3

Entering edit mode

Let's do code golf with benchmarks, here is my Python version if we are at it:

import sys
seen = set()
for line in sys.stdin:
    a,b,c = line.split()
    if (a,b) not in seen:
       seen.add((a,b))
       seen.add((b,a))
       print (line, end='')

Benchmark: a file with 1 million entries (file size 1.7MB)

Python code above took 0.1 seconds and 18MB RAM.
The awk version took 0.3 seconds and used about 14 MB RAM
First version of the R code took 0.5 seconds and used about 400MB of RAM.
Simpler R code took 3 seconds and used about 400MB of RAM.

ADD REPLY • link 5.0 years ago by Istvan Albert 103k