Remove rows with duplicate values based on two columns

Hi, I have output from snp-dists (https://github.com/tseemann/snp-dists) in molten format, e.g.:

seq1    seq2    1
seq1    seq3    2
seq2    seq1    1
seq2    seq3    3
seq3    seq1    2
seq3    seq2    3

The third column gives the number of SNPs between the pair of sequences given in columns 1 and 2. As you can see, these values are duplicated, because each pair appears in both orders (seq1 seq2 and seq2 seq1). How can I remove the duplicates, preferably in R or bash?

Tags: R, bash, snp-dists, SNP
Answer (zx8754):

Using awk (the seen array records each pair in both orders, so only the first occurrence of an unordered pair is printed):

$ awk '!(seen[$1,$2]++ || seen[$2,$1]++)' test.txt
seq1 seq2 1
seq1 seq3 2
seq2 seq3 3
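
If a plain shell pipeline is preferred, here is a minimal sketch of the same idea (assuming three whitespace-separated columns, as in the example): print each pair with the lexicographically smaller name first, then collapse identical lines. Note the result comes back sorted rather than in input order:

$ awk -v OFS='\t' '{ print ($1 < $2 ? $1 : $2), ($1 < $2 ? $2 : $1), $3 }' test.txt | sort -u
seq1    seq2    1
seq1    seq3    2
seq2    seq3    3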

Using R:

# example data
x <- read.table(text = "seq1    seq2    1
seq1    seq3    2
seq2    seq1    1
seq2    seq3    3
seq3    seq1    2
seq3    seq2    3")

# put the smaller name first with pmin/pmax, then keep unique rows
unique(data.frame(c1 = pmin(x$V1, x$V2), c2 = pmax(x$V1, x$V2), value = x$V3))
#     c1   c2 value
# 1 seq1 seq2     1
# 2 seq1 seq3     2
# 4 seq2 seq3     3
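
One assumption to flag: pmin()/pmax() compare character vectors elementwise and lexicographically, which is what puts the smaller name in c1. read.table() returns character columns by default in R >= 4.0; on older versions add stringsAsFactors = FALSE. For example:

pmin(c("seq2", "seq3"), c("seq1", "seq2"))
# [1] "seq1" "seq2"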

Using R again, a bit simpler, and it scales better when there are more than two columns (see the related StackOverflow post):

x[ !duplicated(apply(x[, 1:2], 1, sort), MARGIN = 2), ]
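
For the example data this keeps rows 1, 2 and 4, the same result as above:

#     V1   V2 V3
# 1 seq1 seq2  1
# 2 seq1 seq3  2
# 4 seq2 seq3  3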
Comment:

Let's do code golf with benchmarks; here is my Python version while we're at it:

import sys

seen = set()
for line in sys.stdin:
    a, b, c = line.split()
    if (a, b) not in seen:
        # record both orders so the reversed duplicate is skipped later
        seen.add((a, b))
        seen.add((b, a))
        print(line, end='')
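
To run it (file names here are just placeholders), save the script as, say, dedup.py and stream the molten table through it:

python3 dedup.py < snp_dists.tsv > dedup.tsv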

Benchmark: a file with 1 million entries (file size 1.7MB):

1. The Python code above took 0.1 seconds and 18 MB of RAM.
2. The awk version took 0.3 seconds and about 14 MB of RAM.
3. The first R version took 0.5 seconds and about 400 MB of RAM.
4. The simpler R version took 3 seconds and about 400 MB of RAM.

Answer:

In R, one way could be this:

library(magrittr)
df <- data.frame(one = sort(paste0("seq", rep(1:3, 2))),
                 two = paste0("seq", c(2, 3, 1, 3, 1, 2)),
                 SNP = c(1, 2, 1, 3, 2, 3))
# add a composite label that always puts the smaller seq first
df$three <- apply(df, 1, function(x) paste(sort(c(x[["one"]], x[["two"]])), collapse = ","))
# flag duplicates based on the composite label and the SNP entry
df$duplicate <- df[, c("three", "SNP")] %>% duplicated
# keep only the rows where $duplicate is FALSE
df[!df$duplicate, ]
    one  two SNP     three duplicate
1 seq1 seq2   1 seq1,seq2     FALSE
2 seq1 seq3   2 seq1,seq3     FALSE
4 seq2 seq3   3 seq2,seq3     FALSE
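
If the helper columns are not wanted in the final result, subset them away at the end, e.g.:

df[!df$duplicate, c("one", "two", "SNP")]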
Answer:

A Python solution:

result = {}
with open("yourTabDelimitedColumnFile.txt") as infile:
    for line in infile:
        seq1, seq2, snp = line.rstrip("\n").split("\t")
        # tuple keys avoid the collisions plain string concatenation can produce;
        # checking both orders drops the reversed duplicates
        if (seq1, seq2) in result or (seq2, seq1) in result:
            continue
        result[(seq1, seq2)] = [seq1, seq2, snp]

# dicts preserve insertion order (Python 3.7+), so input order is kept
with open("result.txt", "w") as outfile:
    for row in result.values():
        outfile.write("\t".join(row) + "\n")