Question

Comparing values of a column if it falls between values of two columns of a different data set and creating new column

1

Entering edit mode

4.5 years ago

eDNAuRNA ▴ 20

Hi everyone,

I have two unequal data sets.

A data set
V1     V2          V3
6   42721754    42721769
6   42721757    42721772
6   42721760    42721775
6   42721763    42721778
6   42721766    42721781
6   42721769    42721784
6   42721772    42721787
6   42721775    42721790

B data set
V2              AF
42721757    0.003067485
42721760    0.006134969
42721763    0.006134969
42721766    0.003067485
42721769    0.006134969
42721772    0.006134969
42721775    0.003067485
42721778    0.006134969
42721781    0.003067485
42721784    0.003067485
42721787    0.009202454
42721790    0.009202454

I want to check if the value of V2 in B data set falls between the values of V2 and V3 of A data set. And when its true, I want value of AF in B data set to be added as a new column in A dataset. There are two important points. 1. B data set is 80 rows while A data set is 6000 rows. 2. value of V2 in B may be repeated, which shouldn't get thrown out in final output.

an urgent help with be appreciated.

Thanks,

awk R linux • 1.8k views

ADD COMMENT • link updated 4.5 years ago by zx8754 11k • written 4.5 years ago by eDNAuRNA ▴ 20

0

Entering edit mode

If you can convert "B data set" as bed file, you could use bedtools intersect.

ADD REPLY • link 4.5 years ago by Rashedul Islam ▴ 470

0

Entering edit mode

Thanks Rashedul, however bedtools might not work because of redundancy.

ADD REPLY • link 4.5 years ago by eDNAuRNA ▴ 20

0

Entering edit mode

Thanks Kevin,

That was quick. I tried but it didn't work for me, that's weird. I have exact same data sets and i copied your code. It didn't reproduce the product you have shown. Can you please take another look? Your all assumptions are correct except one thing I would like to clarify that I also want to use 'less than or equal to' and 'more than or equal to' options here. So, will >= and <= work?

Thanks a bunch.

ADD REPLY • link 4.5 years ago by eDNAuRNA ▴ 20

0

Entering edit mode

Hey, which system are you using? - Mac? Are you sure that your files are tab-delimited

ADD REPLY • link 4.5 years ago by Kevin Blighe 87k

0

Entering edit mode

I am using RedHat Enterprise Linux 7 and yes the files are tab -delimited. Will it work on mac? Thanks for the help again. Really appreciate and looking forward to solve this problem.

ADD REPLY • link 4.5 years ago by eDNAuRNA ▴ 20

0

Entering edit mode

Hey Kevin, I am trying to do it in R as well but I think there is a common problem which is causing this not to work with real dataset. My apologies. With example data it works on Mac but its not working with real dataset. I realized that the real dataset has many regions in A.txt which doesn't overlap with the B.txt. Please note that in real dataset, A.txt has about 10,000 rows while B.txt has about 80 rows. Can we introduce NA or zero for the entries which are present in A.txt but absent in B.txt? Thanks for your help.

ADD REPLY • link 4.5 years ago by eDNAuRNA ▴ 20

0

Entering edit mode

You just accepted our answers? - is the issue resolved?

ADD REPLY • link 4.5 years ago by Kevin Blighe 87k

0

Entering edit mode

I accepted answer for the example data given. However, I am still struggling with the real data.

ADD REPLY • link 4.5 years ago by eDNAuRNA ▴ 20

0

Entering edit mode

Feel free to ask a new question with example data that is more representative to your real data, and provide expected output.

ADD REPLY • link 4.5 years ago by zx8754 11k

0

Entering edit mode

This is called "merge on overlap", see this related StackOverflow post:

Overlap join with start and end positions

ADD REPLY • link 4.5 years ago by zx8754 11k

score 4 · Accepted Answer · 2019-10-13

Hey, you can try the idea of Rashedul; however, using BEDTools, you may encounter issues with the sorting of the regions and also the fact that you implied how the co-ordinates in B can be repeated.

Here is a solution in trusty AWK:

cat A.txt 
V1  V2  V3
6   42721754    42721769
6   42721757    42721772
6   42721760    42721775
6   42721763    42721778
6   42721766    42721781
6   42721769    42721784
6   42721772    42721787
6   42721775    42721790

cat B.txt 
V2  AF
42721757    0.003067485
42721760    0.006134969
42721763    0.006134969
42721766    0.003067485
42721769    0.006134969
42721772    0.006134969
42721775    0.003067485
42721778    0.006134969
42721781    0.003067485
42721784    0.003067485
42721787    0.009202454
42721790    0.009202454

awk 'FNR==NR \
  {if (NR>1) {arr[$1]=$2}; next} \
  {if (FNR>1) \
    {for (pos in arr) \
      if ((pos > $2) && (pos < $3)) \
        print $0"\t"arr[pos]}}' FS='\t' B.txt FS='\t' A.txt 
6   42721754    42721769    0.003067485
6   42721754    42721769    0.006134969
6   42721754    42721769    0.006134969
6   42721754    42721769    0.003067485
6   42721757    42721772    0.006134969
6   42721757    42721772    0.006134969
6   42721757    42721772    0.003067485
6   42721757    42721772    0.006134969
6   42721760    42721775    0.006134969
6   42721760    42721775    0.003067485
6   42721760    42721775    0.006134969
6   42721760    42721775    0.006134969
6   42721763    42721778    0.003067485
6   42721763    42721778    0.006134969
6   42721763    42721778    0.006134969
6   42721763    42721778    0.003067485
6   42721766    42721781    0.006134969
6   42721766    42721781    0.006134969
6   42721766    42721781    0.003067485
6   42721766    42721781    0.006134969
6   42721769    42721784    0.006134969
6   42721769    42721784    0.003067485
6   42721769    42721784    0.006134969
6   42721769    42721784    0.003067485
6   42721772    42721787    0.003067485
6   42721772    42721787    0.006134969
6   42721772    42721787    0.003067485
6   42721772    42721787    0.003067485
6   42721775    42721790    0.006134969
6   42721775    42721790    0.003067485
6   42721775    42721790    0.003067485
6   42721775    42721790    0.009202454

This assumes that the files are tab-delimited and that they genuinely have the headers that you imply they have (the code instructs to skip the first line of each). If the files are not tab-delimited, then modify the values set for FS.

Also, it assumes that you want to match between the regions specified in A.txt, so, 'less than' or 'greater than', but not 'less than or equal to' or 'greater than or equal to'. Modify > and < in the code to suit your needs.

Kevin

score 4 · Accepted Answer · 2019-10-14

4

Entering edit mode

4.5 years ago

zx8754 11k

Using R data.table:

library(data.table)

# example data
A <- fread(text = "
V1     V2          V3
6   42721754    42721769
6   42721757    42721772
6   42721760    42721775
6   42721763    42721778
6   42721766    42721781
6   42721769    42721784
6   42721772    42721787
6   42721775    42721790")

B <- fread(text = "
V2              AF
42721757    0.003067485
42721760    0.006134969
42721763    0.006134969
42721766    0.003067485
42721769    0.006134969
42721772    0.006134969
42721775    0.003067485
42721778    0.006134969
42721781    0.003067485
42721784    0.003067485
42721787    0.009202454
42721790    0.009202454")


# set the same keys for both A and B, start and end positions.
# If we have multiple chromosomes, we need to include chrom as key, too.
setDT(A, key = c("V2", "V3"))
setDT(B)
B <- B[, .(V2, V3 = V2, AF)]
setkeyv(B, c("V2", "V3"))

# merge with overlap, check the "type" argument for merge types, see ?foverlaps
res <- foverlaps(B, A)
head(res)
#    V1       V2       V3     i.V2     i.V3          AF
# 1:  6 42721754 42721769 42721757 42721757 0.003067485
# 2:  6 42721757 42721772 42721757 42721757 0.003067485
# 3:  6 42721754 42721769 42721760 42721760 0.006134969
# 4:  6 42721757 42721772 42721760 42721760 0.006134969
# 5:  6 42721760 42721775 42721760 42721760 0.006134969
# 6:  6 42721754 42721769 42721763 42721763 0.006134969

ADD COMMENT • link 4.5 years ago by zx8754 11k

0

Entering edit mode

Didn't work.

"Error in setkeyv(B, c("V2", "V3")) : some columns are not in the data.table: V2"

Thanks though.

ADD REPLY • link 4.5 years ago by eDNAuRNA ▴ 20

0

Entering edit mode

With provided example data set it works just fine. Do not expect direct copy pasting would 100% work, try to understand what each line is doing.

ADD REPLY • link 4.5 years ago by zx8754 11k

0

Entering edit mode

My apologies. You are absolutely right. With example data it works 100%. But its not working with real dataset. I realized that the real dataset has many regions in A.txt which doesn't overlap with the B.txt. Please note that in real dataset, A.txt has about 10,000 rows while B.txt has about 80 rows. Can we introduce NA or zero for the entries which are present in A.txt but absent in B.txt? Thanks for your help.

ADD REPLY • link 4.5 years ago by eDNAuRNA ▴ 20