Question: Comparing values of a column if it falls between values of two columns of a different data set and creating new column
1
gravatar for eDNAuRNA
5 weeks ago by
eDNAuRNA20
eDNAuRNA20 wrote:

Hi everyone,

I have two unequal data sets.

A data set
V1     V2          V3
6   42721754    42721769
6   42721757    42721772
6   42721760    42721775
6   42721763    42721778
6   42721766    42721781
6   42721769    42721784
6   42721772    42721787
6   42721775    42721790

B data set
V2              AF
42721757    0.003067485
42721760    0.006134969
42721763    0.006134969
42721766    0.003067485
42721769    0.006134969
42721772    0.006134969
42721775    0.003067485
42721778    0.006134969
42721781    0.003067485
42721784    0.003067485
42721787    0.009202454
42721790    0.009202454

I want to check if the value of V2 in B data set falls between the values of V2 and V3 of A data set. And when its true, I want value of AF in B data set to be added as a new column in A dataset. There are two important points. 1. B data set is 80 rows while A data set is 6000 rows. 2. value of V2 in B may be repeated, which shouldn't get thrown out in final output.

an urgent help with be appreciated.

Thanks,

linux awk R • 334 views
ADD COMMENTlink modified 5 weeks ago by zx87548.4k • written 5 weeks ago by eDNAuRNA20

If you can convert "B data set" as bed file, you could use bedtools intersect.

ADD REPLYlink written 5 weeks ago by Rashedul Islam340

Thanks Rashedul, however bedtools might not work because of redundancy.

ADD REPLYlink written 5 weeks ago by eDNAuRNA20

Thanks Kevin,

That was quick. I tried but it didn't work for me, that's weird. I have exact same data sets and i copied your code. It didn't reproduce the product you have shown. Can you please take another look? Your all assumptions are correct except one thing I would like to clarify that I also want to use 'less than or equal to' and 'more than or equal to' options here. So, will >= and <= work?

Thanks a bunch.

ADD REPLYlink written 5 weeks ago by eDNAuRNA20

Hey, which system are you using? - Mac? Are you sure that your files are tab-delimited

ADD REPLYlink written 5 weeks ago by Kevin Blighe51k

I am using RedHat Enterprise Linux 7 and yes the files are tab -delimited. Will it work on mac? Thanks for the help again. Really appreciate and looking forward to solve this problem.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by eDNAuRNA20

Hey Kevin, I am trying to do it in R as well but I think there is a common problem which is causing this not to work with real dataset. My apologies. With example data it works on Mac but its not working with real dataset. I realized that the real dataset has many regions in A.txt which doesn't overlap with the B.txt. Please note that in real dataset, A.txt has about 10,000 rows while B.txt has about 80 rows. Can we introduce NA or zero for the entries which are present in A.txt but absent in B.txt? Thanks for your help.

ADD REPLYlink written 4 weeks ago by eDNAuRNA20

You just accepted our answers? - is the issue resolved?

ADD REPLYlink written 4 weeks ago by Kevin Blighe51k

I accepted answer for the example data given. However, I am still struggling with the real data.

ADD REPLYlink written 4 weeks ago by eDNAuRNA20

Feel free to ask a new question with example data that is more representative to your real data, and provide expected output.

ADD REPLYlink written 4 weeks ago by zx87548.4k

This is called "merge on overlap", see this related StackOverflow post:

ADD REPLYlink written 5 weeks ago by zx87548.4k
4
gravatar for Kevin Blighe
5 weeks ago by
Kevin Blighe51k
Kevin Blighe51k wrote:

Hey, you can try the idea of Rashedul; however, using BEDTools, you may encounter issues with the sorting of the regions and also the fact that you implied how the co-ordinates in B can be repeated.

Here is a solution in trusty AWK:

cat A.txt 
V1  V2  V3
6   42721754    42721769
6   42721757    42721772
6   42721760    42721775
6   42721763    42721778
6   42721766    42721781
6   42721769    42721784
6   42721772    42721787
6   42721775    42721790

cat B.txt 
V2  AF
42721757    0.003067485
42721760    0.006134969
42721763    0.006134969
42721766    0.003067485
42721769    0.006134969
42721772    0.006134969
42721775    0.003067485
42721778    0.006134969
42721781    0.003067485
42721784    0.003067485
42721787    0.009202454
42721790    0.009202454

awk 'FNR==NR \
  {if (NR>1) {arr[$1]=$2}; next} \
  {if (FNR>1) \
    {for (pos in arr) \
      if ((pos > $2) && (pos < $3)) \
        print $0"\t"arr[pos]}}' FS='\t' B.txt FS='\t' A.txt 
6   42721754    42721769    0.003067485
6   42721754    42721769    0.006134969
6   42721754    42721769    0.006134969
6   42721754    42721769    0.003067485
6   42721757    42721772    0.006134969
6   42721757    42721772    0.006134969
6   42721757    42721772    0.003067485
6   42721757    42721772    0.006134969
6   42721760    42721775    0.006134969
6   42721760    42721775    0.003067485
6   42721760    42721775    0.006134969
6   42721760    42721775    0.006134969
6   42721763    42721778    0.003067485
6   42721763    42721778    0.006134969
6   42721763    42721778    0.006134969
6   42721763    42721778    0.003067485
6   42721766    42721781    0.006134969
6   42721766    42721781    0.006134969
6   42721766    42721781    0.003067485
6   42721766    42721781    0.006134969
6   42721769    42721784    0.006134969
6   42721769    42721784    0.003067485
6   42721769    42721784    0.006134969
6   42721769    42721784    0.003067485
6   42721772    42721787    0.003067485
6   42721772    42721787    0.006134969
6   42721772    42721787    0.003067485
6   42721772    42721787    0.003067485
6   42721775    42721790    0.006134969
6   42721775    42721790    0.003067485
6   42721775    42721790    0.003067485
6   42721775    42721790    0.009202454

This assumes that the files are tab-delimited and that they genuinely have the headers that you imply they have (the code instructs to skip the first line of each). If the files are not tab-delimited, then modify the values set for FS.

Also, it assumes that you want to match between the regions specified in A.txt, so, 'less than' or 'greater than', but not 'less than or equal to' or 'greater than or equal to'. Modify > and < in the code to suit your needs.

Kevin

ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by Kevin Blighe51k
4
gravatar for zx8754
5 weeks ago by
zx87548.4k
London
zx87548.4k wrote:

Using R data.table:

library(data.table)

# example data
A <- fread(text = "
V1     V2          V3
6   42721754    42721769
6   42721757    42721772
6   42721760    42721775
6   42721763    42721778
6   42721766    42721781
6   42721769    42721784
6   42721772    42721787
6   42721775    42721790")

B <- fread(text = "
V2              AF
42721757    0.003067485
42721760    0.006134969
42721763    0.006134969
42721766    0.003067485
42721769    0.006134969
42721772    0.006134969
42721775    0.003067485
42721778    0.006134969
42721781    0.003067485
42721784    0.003067485
42721787    0.009202454
42721790    0.009202454")


# set the same keys for both A and B, start and end positions.
# If we have multiple chromosomes, we need to include chrom as key, too.
setDT(A, key = c("V2", "V3"))
setDT(B)
B <- B[, .(V2, V3 = V2, AF)]
setkeyv(B, c("V2", "V3"))

# merge with overlap, check the "type" argument for merge types, see ?foverlaps
res <- foverlaps(B, A)
head(res)
#    V1       V2       V3     i.V2     i.V3          AF
# 1:  6 42721754 42721769 42721757 42721757 0.003067485
# 2:  6 42721757 42721772 42721757 42721757 0.003067485
# 3:  6 42721754 42721769 42721760 42721760 0.006134969
# 4:  6 42721757 42721772 42721760 42721760 0.006134969
# 5:  6 42721760 42721775 42721760 42721760 0.006134969
# 6:  6 42721754 42721769 42721763 42721763 0.006134969
ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by zx87548.4k

Didn't work.

"Error in setkeyv(B, c("V2", "V3")) : some columns are not in the data.table: V2"

Thanks though.

ADD REPLYlink written 5 weeks ago by eDNAuRNA20

With provided example data set it works just fine. Do not expect direct copy pasting would 100% work, try to understand what each line is doing.

ADD REPLYlink written 5 weeks ago by zx87548.4k

My apologies. You are absolutely right. With example data it works 100%. But its not working with real dataset. I realized that the real dataset has many regions in A.txt which doesn't overlap with the B.txt. Please note that in real dataset, A.txt has about 10,000 rows while B.txt has about 80 rows. Can we introduce NA or zero for the entries which are present in A.txt but absent in B.txt? Thanks for your help.

ADD REPLYlink written 4 weeks ago by eDNAuRNA20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1415 users visited in the last hour