Question: findOverlaps function in R
5
gravatar for SGMS
5.1 years ago by
SGMS70
European Union
SGMS70 wrote:

Hi everyone,

I have used the "findOverlaps" function in R to find which positions of my two datasets overlap. I have also used "countOverlaps" to see how many overlaps I have.

What I want to do now, is to find the exact coordinates which overlapped. I was looking how to do it using findOverlaps but it has no such options and by doing a google search, I couldn't find much help. 

A little help on this would be greatly appreciated. 

Thank you

ADD COMMENTlink modified 5.1 years ago by Sean Davis26k • written 5.1 years ago by SGMS70

I have switched to foverlaps (data.table package) whenever I working with genomic regions.

A fast binary-search based overlap join of two data.tables. This is very much inspired by findOverlaps function from the bioconductor package IRanges...

Simply intersect two data tables and extract overlapping region. Please post example data and I will show how it may be done.

ADD REPLYlink modified 14 months ago by Ram32k • written 5.1 years ago by PoGibas4.9k

Thank you for your reply Pgibas. Example data as required:

1st dataset:

1   668630  850204
1   713044  755966
1   738570  742020
1   766600  769112

2nd dataset:

1    149854870    149909434
1    17325224     17400665
1    19971779     20026055
1    24180459     24259817

Thanks again!

ADD REPLYlink modified 2.2 years ago by Ram32k • written 5.1 years ago by SGMS70

What is the expected output? Do you want a new set of regions that represents the overlaps between dataset 1 and 2?

ADD REPLYlink modified 2.2 years ago by Ram32k • written 5.1 years ago by Sean Davis26k

yes! exactly that!

ADD REPLYlink written 5.1 years ago by SGMS70

You datasets don't overlap.

ADD REPLYlink modified 14 months ago by Ram32k • written 5.1 years ago by PoGibas4.9k

I personally use mergeByOverlaps and subsetByOverlaps for these cases.

ADD REPLYlink written 5.1 years ago by Giovanni M Dall'Olio27k
8
gravatar for PoGibas
5.1 years ago by
PoGibas4.9k
Vilnius
PoGibas4.9k wrote:

I have done this with dummy data.

Lets say we have we have regions A (groupA) and regions B (groupB).

groupA <- data.table(
    chr = rep("chr1", 5),
    start = seq(0, 1000, 100),
    end = seq(50, 1050, 100))
setkey(groupA, chr, start, end)

groupB <- data.table(
    chr = rep("chr1", 5),
    start = seq(25, 1025, 100),
    end = seq(75, 1075, 100))
setkey(groupB, chr, start, end)

Check:

  1. If your datasets are data.table class(groupA), if not do setDT(groupA)
  2. If keys are chr start end, if not do setkey(groupA, chr, start, end)

# Find overlaps
over <- foverlaps(groupA, groupB, nomatch = 0)
# Extract exact regions
over2 <- data.table(
    chr = over$chr,
    start = over[, ifelse(start > i.start, start, i.start)],
    end = over[, ifelse(end < i.end, end, i.end)])
ADD COMMENTlink modified 2.2 years ago by Ram32k • written 5.1 years ago by PoGibas4.9k

Thanks so much for this. My coordinates actually overlap -my datasets are much larger than the above-. I used your code on my data and it works perfectly. I have double-checked the number of hits with findOverlaps and the number is the same, but now I also have the overlapped coordinates! Thanks again :)

ADD REPLYlink written 5.1 years ago by SGMS70

If it helped to solve your problem mark the answer :)

ADD REPLYlink written 5.1 years ago by PoGibas4.9k

@Pgibas If I want to label the values, for example in a new column I assign two values, "Exact" or "Partial" based on if the value in groupA exactly falls within the range of region in groupB and if there is a partial overlap, respectively. How can I do that?

ADD REPLYlink written 4.6 years ago by Bioinformatist Newbie250
7
gravatar for Sean Davis
5.1 years ago by
Sean Davis26k
National Institutes of Health, Bethesda, MD
Sean Davis26k wrote:

The findOverlaps() method is not the right tool for the job. Instead, you want the set intersection which can be obtained with intersect() on two GRanges objects.

# create two GRanges objects for testing
gr1 = GRanges(seqnames='chr1',ranges=IRanges(start=seq(1,1001,100),width=50))
gr2 = GRanges(seqnames='chr1',ranges=IRanges(start=seq(25,1025,100),width=50))

# and "intersect" them, finding the set of regions
# formed by the overlaps of gr1 and gr2
gr3 = intersect(gr1,gr2)

# and the result
gr3

GRanges object with 11 ranges and 0 metadata columns:
       seqnames       ranges strand
          <Rle>    <IRanges>  <Rle>
   [1]     chr1 [  25,   50]      *
   [2]     chr1 [ 125,  150]      *
   [3]     chr1 [ 225,  250]      *
   [4]     chr1 [ 325,  350]      *
   [5]     chr1 [ 425,  450]      *
   [6]     chr1 [ 525,  550]      *
   [7]     chr1 [ 625,  650]      *
   [8]     chr1 [ 725,  750]      *
   [9]     chr1 [ 825,  850]      *
  [10]     chr1 [ 925,  950]      *
  [11]     chr1 [1025, 1050]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
ADD COMMENTlink modified 2.2 years ago by Ram32k • written 5.1 years ago by Sean Davis26k

Thank you! A much easier solution! I'm wondering: I get 137 ranges, meaning we have 137 overlapped coordinates right?

When I was using countOverlaps though, it gave me 195. Shouldn't the number of these two be the same -despite the fact that "intersect" also gives us the coordinates-?

Thanks again!

ADD REPLYlink modified 14 months ago by Ram32k • written 5.1 years ago by SGMS70
1

If regions in dataset1 overlap with other regions in dataset1, they will end up being "collapsed" into a single region in the output of intersect(). This is the discrepancy with countOverlaps, which will count the overlapping regions from dataset1 twice.

ADD REPLYlink modified 14 months ago by Ram32k • written 5.1 years ago by Sean Davis26k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2730 users visited in the last hour
_