Question

Evaluate overlap of regions

1

Entering edit mode

5.4 years ago

franc.jian ▴ 40

Dear all,

I am struggling with what should probably be a trivial problem. I have 2 seg files with the following structure:

chr    start   end  feature.1  feature.2 
1       1000  1200   0.5          0.001


chr    start   end  feature.1  feature.2  
1       1100  1400   0.7          0.004

Let's call them A and B. I want to obtain a unique file like this:

chr    start   end     feature.1.a      feature.1.b
1       1000    1100     0.5             NA
1       1100    1200       0.5            0.7
1       1200    1400     NA             0.7

If a region is covered by only A or B, then it would contain only the corresponding values and it would have NA in the other fields. Otherwise, I want to retrieve both of them. This looks close to what bedtools -intersect does, and indeed I am trying to use the package Bedr to solve this task, but it seems to leave out regions covered only in one file. Maybe GenomicRanges is another possibility. If you have any hint on this it would be very helpful to me. Thank you

genome R • 1.3k views

ADD COMMENT • link updated 5.4 years ago by benformatics 3.9k • written 5.4 years ago by franc.jian ▴ 40

1

Entering edit mode

See if this example in this answer helps:

A: Struggling to filter data with R, tidyverse

ADD REPLY • link 5.4 years ago by zx8754 11k

0

Entering edit mode

Thank you, even though the operation is not symmetrical it definitely helped out with my problem. I can add it as a solution myself or you can add it and I will close the post. Thank you again.

ADD REPLY • link 5.4 years ago by franc.jian ▴ 40

0

Entering edit mode

Feel free to answer your own question, and accept as answer.

ADD REPLY • link 5.4 years ago by zx8754 11k

1

Entering edit mode

There's a --partition option in BEDOPS bedops that solves this pretty easily. Some people don't seem to like using this kit for some reason, but if you're interested, let me know and I'll put together an answer.

ADD REPLY • link 5.4 years ago by Alex Reynolds 35k

1

Entering edit mode

5.4 years ago

benformatics 3.9k

Given two GenomicRanges objects a and b.

library(GenomicRanges)
a <- GRanges("chr1:1000-1200")
b <- GRanges("chr1:1100-1400")
a$feature.1 <- 0.5
b$feature.1 <- 0.7

get.overlap.info <- function(a,b){
   overlap.r <- intersect(a,b)
   different.r1 <- setdiff(a,b)
   different.r2 <- setdiff(b,a)
   different.r <- c(different.r1,different.r2)
   full.r <- c(different.r,overlap.r)
   ol.a <- findOverlaps(full.r,a)
   ol.b <- findOverlaps(full.r,b)
   full.r$feature.1.a <- NA
   full.r$feature.1.b <- NA
   full.r$feature.1.a[queryHits(ol.a)] <- a$feature.1[subjectHits(ol.a)]
   full.r$feature.1.b[queryHits(ol.b)] <- b$feature.1[subjectHits(ol.b)]
   return(full.r)
}

new.ranges <- get.overlap.info(a,b)
new.ranges
GRanges object with 3 ranges and 2 metadata columns:
      seqnames    ranges strand | feature.1.a feature.1.b
         <Rle> <IRanges>  <Rle> |   <numeric>   <numeric>
  [1]     chr1 1000-1099      * |         0.5        <NA>
  [2]     chr1 1201-1400      * |        <NA>         0.7
  [3]     chr1 1100-1200      * |         0.5         0.7
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

ADD COMMENT • link 5.4 years ago by benformatics 3.9k

0

Entering edit mode

you could also use sort() on that final object so that it is sorted

ADD REPLY • link 5.4 years ago by benformatics 3.9k

score 2 · Accepted Answer · 2018-11-23

2

Entering edit mode

5.4 years ago

franc.jian ▴ 40

I managed to deal with the issue using data.table::foverlaps. I report an example from another thread.

A: Struggling to filter data with R, tidyverse

The solution needs to be adapted but essentially solves the problem.

ADD COMMENT • link 5.4 years ago by franc.jian ▴ 40