Question

Finding overlaps between GRanges objects

2

Entering edit mode

3.7 years ago

mickey_95 ▴ 110

Hi,

I have a list of ChIP-seq peak regions and I am curious about those peaks overlapping annotated promoter regions. Inspired by this tutorial

http://cbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/chip-seq_exercises.html#functional-annotation-of-chip-seq-peaks

(more specifically, sections 5.3 and 5.4), I have come up with two approaches using two different findOverlaps methods (Genomic Alignments R package) that leave me with slightly different and thus confusing results:

Approach 1: Using subsetByOverlaps()

# retrieving promoter regions    
promoters <- promoters(genes, upstream=2000, downstream=200)

# retrieving peaks that overlap a promoter region
peaksProm <- subsetByOverlaps(peaks, promoters) # 1695 ranges

# retrieving genes that have a peak overlapping their promoter region
genesWithPromPeaks <- subsetByOverlaps(promoters, peaks) # 2231 ranges

Approach 2: Using findOverlaps()

# retrieving promoter regions
promoters <- promoters(genes, upstream=2000, downstream=200)

PromPeaks <- findOverlaps(peaks, prom) # 2246 hits

# retrieving peaks that overlap a promoter region
peaksProm <- peaks[queryHits(PromPeaks)] # 2246 ranges

# retrieving genes that have a peak overlapping their promoter region
genesWithPromPeaks <- genes[subjectHits(PromPeaks)] # 2246 ranges

Approach 2 gives the same number of genes and peaks, which is what I would expect. I am confused by the different results between the two approaches, but also by the different number of genes and peaks retrieved with approach 1. I was expecting both functions to yield the same results, but I might be missing something here. I would be thankful for any input and help!

GRanges promoter peak IRanges ChIP-Seq • 12k views

ADD COMMENT • link updated 3.7 years ago by Jimbou ▴ 950 • written 3.7 years ago by mickey_95 ▴ 110

score 8 · Answer 1 · 2020-11-13

No need for GenomicRanges objects. Try package valr which depends a lot on tidyverse

library(tidyverse)
library(valr)

# transform example data from above to data.frame and add required column `"chrom"`
gr1 <-   as.data.frame(gr1) %>% 
  mutate(chrom = "chr1") %>% 
  select(chrom, start, end) 
gr2 <- as.data.frame(gr2) %>% 
  mutate(chrom = "chr1") %>% 
  select(chrom, start, end) 

# get intersect e.g overlap  

valr::bed_intersect(gr1, gr2, suffix = c("_gr1", "_gr2"))
# A tibble: 2 x 6
  chrom start_gr1 end_gr1 start_gr2 end_gr2 .overlap
  <chr>     <int>   <int>     <int>   <int>    <int>
1 chr1         10      14        11      12        1
2 chr1         10      14        13      14        1


# and plot
bed_glyph(bed_intersect(gr1, gr2))

score 7 · Answer 2 · 2020-11-13

Example data.

library("GenomicRanges")

gr1 <- GRanges(seqnames="I", ranges=IRanges(start=seq(10, 30, 10), width=5), strand="+")
> gr1
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        I     10-14      +
  [2]        I     20-24      +
  [3]        I     30-34      +
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

gr2 <- GRanges(seqnames="I", ranges=IRanges(start=c(11, 13, 40), width=2), strand="+")    
> gr2
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        I     11-12      +
  [2]        I     13-14      +
  [3]        I     40-41      +
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

findOverlaps will return an index in each sample for every overlap. For the case where there are multiple overlaps for a range, there is an index created for each overlap. This is why changing the order of the subject and query doesn't change the number of rows.

> findOverlaps(gr1, gr2)
Hits object with 2 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  [2]         1           2
  -------
  queryLength: 3 / subjectLength: 3

subsetByOverlap returns only the ranges in the first object that have overlaps with any ranges in the second object. The order of the subject and query matters in the number of rows returned. For example, when one query range overlaps two subject ranges, only one row is returned. However, if you flip the subject and query ranges, for that particular overlap two rows would then be returned.

> subsetByOverlaps(gr1, gr2)
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        I     10-14      +
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

For more information on these functions refer to their help documentation.