Question: Finding overlaps between GRanges objects
0
gravatar for mickey_95
3 months ago by
mickey_9530
mickey_9530 wrote:

Hi,

I have a list of ChIP-seq peak regions and I am curious about those peaks overlapping annotated promoter regions. Inspired by this tutorial

http://cbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/chip-seq_exercises.html#functional-annotation-of-chip-seq-peaks

(more specifically, sections 5.3 and 5.4), I have come up with two approaches using two different findOverlaps methods (Genomic Alignments R package) that leave me with slightly different and thus confusing results:

Approach 1: Using subsetByOverlaps()

# retrieving promoter regions    
promoters <- promoters(genes, upstream=2000, downstream=200)

# retrieving peaks that overlap a promoter region
peaksProm <- subsetByOverlaps(peaks, promoters) # 1695 ranges

# retrieving genes that have a peak overlapping their promoter region
genesWithPromPeaks <- subsetByOverlaps(promoters, peaks) # 2231 ranges

Approach 2: Using findOverlaps()

# retrieving promoter regions
promoters <- promoters(genes, upstream=2000, downstream=200)

PromPeaks <- findOverlaps(peaks, prom) # 2246 hits

# retrieving peaks that overlap a promoter region
peaksProm <- peaks[queryHits(PromPeaks)] # 2246 ranges

# retrieving genes that have a peak overlapping their promoter region
genesWithPromPeaks <- genes[subjectHits(PromPeaks)] # 2246 ranges

Approach 2 gives the same number of genes and peaks, which is what I would expect. I am confused by the different results between the two approaches, but also by the different number of genes and peaks retrieved with approach 1. I was expecting both functions to yield the same results, but I might be missing something here. I would be thankful for any input and help!

ADD COMMENTlink modified 3 months ago by Jimbou820 • written 3 months ago by mickey_9530
3
gravatar for rpolicastro
3 months ago by
rpolicastro3.9k
Bloomington, IN
rpolicastro3.9k wrote:

Example data.

library("GenomicRanges")

gr1 <- GRanges(seqnames="I", ranges=IRanges(start=seq(10, 30, 10), width=5), strand="+")
> gr1
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        I     10-14      +
  [2]        I     20-24      +
  [3]        I     30-34      +
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

gr2 <- GRanges(seqnames="I", ranges=IRanges(start=c(11, 13, 40), width=2), strand="+")    
> gr2
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        I     11-12      +
  [2]        I     13-14      +
  [3]        I     40-41      +
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

findOverlaps will return an index in each sample for every overlap. For the case where there are multiple overlaps for a range, there is an index created for each overlap. This is why changing the order of the subject and query doesn't change the number of rows.

> findOverlaps(gr1, gr2)
Hits object with 2 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  [2]         1           2
  -------
  queryLength: 3 / subjectLength: 3

subsetByOverlap returns only the ranges in the first object that have overlaps with any ranges in the second object. The order of the subject and query matters in the number of rows returned. For example, when one query range overlaps two subject ranges, only one row is returned. However, if you flip the subject and query ranges, for that particular overlap two rows would then be returned.

> subsetByOverlaps(gr1, gr2)
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        I     10-14      +
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

For more information on these functions refer to their help documentation.

ADD COMMENTlink modified 3 months ago • written 3 months ago by rpolicastro3.9k
2
gravatar for Jimbou
3 months ago by
Jimbou820
Germany
Jimbou820 wrote:

No need for GenomicRanges objects. Try package valr which depends a lot on tidyverse

library(tidyverse)
library(valr)

# transform example data from above to data.frame and add required column `"chrom"`
gr1 <-   as.data.frame(gr1) %>% 
  mutate(chrom = "chr1") %>% 
  select(chrom, start, end) 
gr2 <- as.data.frame(gr2) %>% 
  mutate(chrom = "chr1") %>% 
  select(chrom, start, end) 

# get intersect e.g overlap  

valr::bed_intersect(gr1, gr2, suffix = c("_gr1", "_gr2"))
# A tibble: 2 x 6
  chrom start_gr1 end_gr1 start_gr2 end_gr2 .overlap
  <chr>     <int>   <int>     <int>   <int>    <int>
1 chr1         10      14        11      12        1
2 chr1         10      14        13      14        1


# and plot
bed_glyph(bed_intersect(gr1, gr2))
ADD COMMENTlink modified 3 months ago • written 3 months ago by Jimbou820
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1153 users visited in the last hour
_