Question

Finding overlapping ranges in R

7.9 years ago

EVR ▴ 610

Hi,

I have a set of intervals in a data frame and a query interval. I want to find the intervals that overlap the query interval, and also the intervals that overlap those overlapping intervals. For example, consider the following data frame:

df=data.frame(Id=rep("A1",23),start=c(11176,11176,11176,11176,11176,11176,11176,11177,11177,11177,11177,11177,11177,11178,11178,11179,11179,11179,11233,11233,11233,11233,11233),end=11205,11206,11206,11206,11206,11206,11207,11206,11206,11208,11206,11208,11209,11206,11206,11203,11204,11204,11263,11263,11263,11263,11264))

Suppose my query interval is 11176 to 11205. In the data frame df, I would like to find the intervals that overlap my query interval and also the intervals that overlap those overlapping intervals.

Below is my R code, but for some reason it is not giving me the output I want. I expect the output 11179 and 11204, but somehow my code outputs only the range 11178 and 11206.

temp_start = 11176
temp_end = 11206
for (i in 1:dim(df)[1])
{
  final_start = temp_start
  final_end = temp_end
  if ((findInterval(final_end, c(df$start[i], df$end[i]), rightmost.closed = TRUE, left.open = TRUE) == 1L) ||
      (findInterval(final_start, c(df$start[i], df$end[i]), rightmost.closed = TRUE, left.open = TRUE) == 1L))
  {
    final_start = df$start[i]
    final_end = df$end[i]
    print(final_start)
    print(final_end)
  }
}

The above code takes the query start (11176) and query end (11206), stored in temp_start and temp_end, as input. It then checks whether the start or the end falls within an interval in the data frame df. If so, that interval is taken, and its start or end is then checked against the next interval in the for loop.

Any guidance would be highly appreciated. Thanks in advance.

RNA-Seq R Overlapping-ranges • 9.0k views


There is a typo in your first example: you should add "=c(" after "end" in the declaration of df. Moreover, this data frame seems to contain many duplicated entries. In any case, I would recommend GRanges for working with genomic ranges.

ADD REPLY • link 7.9 years ago by Giovanni M Dall'Olio 28k

score 1 · Answer 1 · 2016-06-13

Check out the IRanges and GRanges packages in R.

See also Partial or complete overlap of two genomic ranges

Then in the data frame df, I would like to find the intervals that overlap my query interval range and also intervals that overlap the overlapping intervals of query range.

This can be achieved by running a findOverlaps query of the ranges against themselves, then iterating over the result and generating the first-order self-overlapping extension of the query by computing the normalized intervals for each query and its self-overlaps.
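
As a rough, dependency-free illustration of that first-order extension, the sketch below uses base R only. The helper `overlaps_any` and the variable names are made up for this example (with IRanges, `findOverlaps` would do the overlap test), the data are the unique intervals from the question, and intervals are assumed to be closed on both ends.

```r
# Unique intervals from the question's data frame.
df <- data.frame(
  Id    = "A1",
  start = c(11176, 11176, 11176, 11177, 11177, 11177, 11178, 11179, 11179, 11233, 11233),
  end   = c(11205, 11206, 11207, 11206, 11208, 11209, 11206, 11203, 11204, 11263, 11264)
)

# Hypothetical helper: TRUE for every row of `d` that shares at least one
# position with any of the closed intervals [qs[j], qe[j]].
overlaps_any <- function(d, qs, qe) {
  vapply(seq_len(nrow(d)),
         function(i) any(d$start[i] <= qe & d$end[i] >= qs),
         logical(1))
}

query_start <- 11176
query_end   <- 11205

# Step 1: intervals that overlap the query directly.
direct <- df[overlaps_any(df, query_start, query_end), ]

# Step 2: intervals that overlap any of the step 1 hits.
second <- df[overlaps_any(df, direct$start, direct$end), ]

# Span covered by the query plus everything found in step 2.
extended <- c(start = min(query_start, second$start),
              end   = max(query_end, second$end))
print(extended)  # start = 11176, end = 11209 for this data
```

The two intervals starting at 11233 are never reached, because nothing that overlaps the query extends past 11209.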

score 0 · Answer 2 · 2016-06-13

7.9 years ago

H.Hasani ▴ 990

Similar to IRanges and GRanges, you can try genomeIntervals.

score 0 · Answer 3 · 2016-06-13

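As suggested by others, use GRanges for genomic range intersections.

> library(GenomicRanges)  # GRanges, makeGRangesFromDataFrame, subsetByOverlaps
> library(magrittr)       # %>%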

> df=data.frame(Id=rep("A1",23),start=c(11176,11176,11176,11176,11176,11176,11176,11177,11177,11177,11177,11177,11177,11178,11178,11179,11179,11179,11233,11233,11233,11233,11233),end=c(11205,11206,11206,11206,11206,11206,11207,11206,11206,11208,11206,11208,11209,11206,11206,11203,11204,11204,11263,11263,11263,11263,11264))
> gr = makeGRangesFromDataFrame(df, seqnames.field="Id")
> gr
GRanges object with 23 ranges and 0 metadata columns:
       seqnames         ranges strand
          <Rle>      <IRanges>  <Rle>
   [1]       A1 [11176, 11205]      *
   [2]       A1 [11176, 11206]      *
   [3]       A1 [11176, 11206]      *
   [4]       A1 [11176, 11206]      *
   [5]       A1 [11176, 11206]      *
   ...      ...            ...    ...
  [19]       A1 [11233, 11263]      *
  [20]       A1 [11233, 11263]      *
  [21]       A1 [11233, 11263]      *
  [22]       A1 [11233, 11263]      *
  [23]       A1 [11233, 11264]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

> gr %>% unique
GRanges object with 11 ranges and 0 metadata columns:
       seqnames         ranges strand
          <Rle>      <IRanges>  <Rle>
   [1]       A1 [11176, 11205]      *
   [2]       A1 [11176, 11206]      *
   [3]       A1 [11176, 11207]      *
   [4]       A1 [11177, 11206]      *
   [5]       A1 [11177, 11208]      *
   [6]       A1 [11177, 11209]      *
   [7]       A1 [11178, 11206]      *
   [8]       A1 [11179, 11203]      *
   [9]       A1 [11179, 11204]      *
  [10]       A1 [11233, 11263]      *
  [11]       A1 [11233, 11264]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

> query.gr = GRanges("A1", IRanges(start=11176, end=11205))
> subsetByOverlaps(gr, query.gr)
GRanges object with 18 ranges and 0 metadata columns:
       seqnames         ranges strand
          <Rle>      <IRanges>  <Rle>
   [1]       A1 [11176, 11205]      *
   [2]       A1 [11176, 11206]      *
   [3]       A1 [11176, 11206]      *
   [4]       A1 [11176, 11206]      *
   [5]       A1 [11176, 11206]      *
   ...      ...            ...    ...
  [14]       A1 [11178, 11206]      *
  [15]       A1 [11178, 11206]      *
  [16]       A1 [11179, 11203]      *
  [17]       A1 [11179, 11204]      *
  [18]       A1 [11179, 11204]      *
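
The result above contains only the ranges that overlap the query directly. If ranges connected to the query through a chain of overlaps are also wanted, one possible approach (a sketch, not necessarily what the question intends) is to merge overlapping ranges with `reduce()` first and then pick the merged block that the query hits. Note that `reduce()` with its default settings also merges directly adjacent ranges. The object names `merged` and `hit` are chosen for this example.

```r
library(GenomicRanges)

# Unique intervals from the question, built directly as a GRanges object.
gr <- GRanges("A1", IRanges(
  start = c(11176, 11176, 11176, 11177, 11177, 11177, 11178, 11179, 11179, 11233, 11233),
  end   = c(11205, 11206, 11207, 11206, 11208, 11209, 11206, 11203, 11204, 11263, 11264)
))
query.gr <- GRanges("A1", IRanges(start = 11176, end = 11205))

# Collapse every group of mutually connected ranges into one block.
merged <- reduce(gr)

# Keep the block(s) that the query overlaps.
hit <- subsetByOverlaps(merged, query.gr)

start(hit)  # 11176
end(hit)    # 11209
```

The original ranges belonging to that block can then be recovered with `subsetByOverlaps(gr, hit)`.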

score 0 · Answer 4 · 2016-06-13

7.9 years ago

poisonAlien ★ 3.2k

GRanges and IRanges are okay if your data is small, but they are too slow for larger datasets!

Use foverlaps from data.table if your data is huge. It's crazy fast.
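
A minimal sketch of what that could look like for the question's data, assuming the data.table package is installed. The table names `ranges`, `query` and `hits` are chosen for this example, and the coordinates are stored as integers.

```r
library(data.table)

# Unique intervals from the question.
ranges <- data.table(
  Id    = "A1",
  start = c(11176L, 11176L, 11176L, 11177L, 11177L, 11177L, 11178L, 11179L, 11179L, 11233L, 11233L),
  end   = c(11205L, 11206L, 11207L, 11206L, 11208L, 11209L, 11206L, 11203L, 11204L, 11263L, 11264L)
)
query <- data.table(Id = "A1", start = 11176L, end = 11205L)

# foverlaps() needs the second table to be keyed, with the interval
# start and end as the last two key columns.
setkey(ranges, Id, start, end)

# type = "any" keeps every kind of overlap; nomatch = NULL drops query
# rows without a match instead of returning NA rows.
hits <- foverlaps(query, ranges,
                  by.x = c("Id", "start", "end"),
                  type = "any", nomatch = NULL)
print(hits)
```

For this data, `hits` has one row per interval that overlaps the query, which leaves out the two intervals starting at 11233.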
