Question: Finding overlapping ranges in R
2
2.5 years ago by
EVR510
Earth
EVR510 wrote:

Hi,

I have a set of intervals in a data frame and a query interval. I want to find the intervals that overlap the query range, and also the intervals that in turn overlap those. For example, consider the following data frame:

``````df=data.frame(Id=rep("A1",23),start=c(11176,11176,11176,11176,11176,11176,11176,11177,11177,11177,11177,11177,11177,11178,11178,11179,11179,11179,11233,11233,11233,11233,11233),end=11205,11206,11206,11206,11206,11206,11207,11206,11206,11208,11206,11208,11209,11206,11206,11203,11204,11204,11263,11263,11263,11263,11264))
``````

Suppose my query interval is `11176` to `11205`. In the data frame `df`, I would like to find the intervals that overlap my query interval, and also the intervals that overlap those overlapping intervals.

Below is my R code, but for some reason it is not giving me the output I want. I expect the output `11179` and `11204`, but somehow my code outputs only the range `11178` and `11206`.

``````
temp_start = 11176
temp_end = 11206
for (i in 1:dim(df)[1])
{
  final_start = temp_start
  final_end = temp_end
  if ((findInterval(final_end, c(df$start[i], df$end[i]), rightmost.closed = T, left.open = T) == 1L) ||
      (findInterval(final_start, c(df$start[i], df$end[i]), rightmost.closed = T, left.open = T) == 1L))
  {
    final_start = df$start[i]
    final_end = df$end[i]
    print(final_start)
    print(final_end)
  }
}
``````

The code above takes the query start (`11176`) and query end (`11206`) as input. It then checks whether either `temp_start` or `temp_end` falls within an interval in the data frame `df`. If so, that interval is taken, and its start or end is in turn checked against the next interval in the for loop.

Any guidance would be highly appreciated. Thanks in advance.

rna-seq overlapping-ranges R • 2.4k views
modified 2.5 years ago by poisonAlien2.6k • written 2.5 years ago by EVR510

There is a typo in your first example: the `end` vector is missing its `c(`, so `end=11205,...` should read `end=c(11205,...` in the declaration of `df`. This data frame also contains many identical, duplicated entries. In any case, I would recommend GRanges for working with genomic ranges.
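
For reference, a corrected version of the data frame from the question might look like the sketch below. It adds the missing `c(` and uses base R's `unique()` to drop the duplicated rows; the name `df_unique` is only for illustration and does not appear in the question.

```r
# Corrected construction of the example data frame: note the c( after end=
df <- data.frame(
  Id    = rep("A1", 23),
  start = c(11176, 11176, 11176, 11176, 11176, 11176, 11176,
            11177, 11177, 11177, 11177, 11177, 11177,
            11178, 11178, 11179, 11179, 11179,
            11233, 11233, 11233, 11233, 11233),
  end   = c(11205, 11206, 11206, 11206, 11206, 11206, 11207,
            11206, 11206, 11208, 11206, 11208, 11209,
            11206, 11206, 11203, 11204, 11204,
            11263, 11263, 11263, 11263, 11264)
)

# Drop rows that are exact duplicates (df_unique is an illustrative name)
df_unique <- unique(df)

nrow(df)         # 23
nrow(df_unique)  # 11
```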

1
2.5 years ago by
Bergen, Norway
Michael Dondrup45k wrote:

Check out the IRanges and GenomicRanges packages in R (GenomicRanges provides the GRanges class).

> Then in the data frame df, I would like find the intervals that overlap my query interval range and also intervals that overlap the overlapping intervals of query range.

This can be achieved by running a `findOverlaps` query of the ranges against themselves, then iterating over the result and generating the first-order self-overlapping extension of the query by computing the normalized intervals for each query and its self-overlaps.
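
The idea of extending a query through chains of overlaps can also be sketched in plain base R, without `findOverlaps`. This is not the IRanges-based procedure described above but a simple stand-in that assumes closed intervals; the helper `transitive_overlaps` and the toy coordinates are invented for illustration.

```r
# Hypothetical helper (not from any package): return the indices of every
# interval reachable from the query through a chain of overlaps.
# Intervals are treated as closed, i.e. [start, end].
transitive_overlaps <- function(starts, ends, q_start, q_end) {
  # Intervals that overlap the query directly
  hit <- starts <= q_end & ends >= q_start
  repeat {
    # Extent covered by the query plus everything collected so far
    lo <- min(c(q_start, starts[hit]))
    hi <- max(c(q_end, ends[hit]))
    # Anything overlapping that extent overlaps at least one collected interval
    new_hit <- starts <= hi & ends >= lo
    if (all(new_hit == hit)) break   # nothing new was added, so stop
    hit <- new_hit
  }
  which(hit)
}

# Toy data: interval 1 overlaps the query, 2 overlaps 1, 3 overlaps 2,
# and 4 is isolated.
starts <- c(1, 8, 14, 30)
ends   <- c(10, 15, 20, 40)
transitive_overlaps(starts, ends, q_start = 2, q_end = 5)  # 1 2 3
```

Because overlapping intervals merge into one contiguous extent, it is enough to track the minimum start and maximum end seen so far rather than every collected interval.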

0
2.5 years ago by
H.Hasani630
Freiburg, Germany
H.Hasani630 wrote:

Similar to IRanges and GRanges, you can also try the genomeIntervals package.

0
2.5 years ago by
London, UK
Giovanni M Dall'Olio26k wrote:

As suggested by others, use GRanges for genomic ranges intersections.

``````
> library(GenomicRanges)  # provides makeGRangesFromDataFrame, GRanges, subsetByOverlaps
> library(magrittr)       # provides %>%
> df=data.frame(Id=rep("A1",23),start=c(11176,11176,11176,11176,11176,11176,11176,11177,11177,11177,11177,11177,11177,11178,11178,11179,11179,11179,11233,11233,11233,11233,11233),end=c(11205,11206,11206,11206,11206,11206,11207,11206,11206,11208,11206,11208,11209,11206,11206,11203,11204,11204,11263,11263,11263,11263,11264))
> gr = makeGRangesFromDataFrame(df, seqnames.field="Id")
> gr
GRanges object with 23 ranges and 0 metadata columns:
seqnames         ranges strand
<Rle>      <IRanges>  <Rle>
[1]       A1 [11176, 11205]      *
[2]       A1 [11176, 11206]      *
[3]       A1 [11176, 11206]      *
[4]       A1 [11176, 11206]      *
[5]       A1 [11176, 11206]      *
...      ...            ...    ...
[19]       A1 [11233, 11263]      *
[20]       A1 [11233, 11263]      *
[21]       A1 [11233, 11263]      *
[22]       A1 [11233, 11263]      *
[23]       A1 [11233, 11264]      *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths

> gr %>% unique
GRanges object with 11 ranges and 0 metadata columns:
seqnames         ranges strand
<Rle>      <IRanges>  <Rle>
[1]       A1 [11176, 11205]      *
[2]       A1 [11176, 11206]      *
[3]       A1 [11176, 11207]      *
[4]       A1 [11177, 11206]      *
[5]       A1 [11177, 11208]      *
[6]       A1 [11177, 11209]      *
[7]       A1 [11178, 11206]      *
[8]       A1 [11179, 11203]      *
[9]       A1 [11179, 11204]      *
[10]       A1 [11233, 11263]      *
[11]       A1 [11233, 11264]      *
-------
seqinfo: 1 sequence from an unspecified genome; no seq

> query.gr = GRanges("A1", IRanges(start=11176, end=11205))
> subsetByOverlaps(gr, unique(query.gr))
GRanges object with 18 ranges and 0 metadata columns:
seqnames         ranges strand
<Rle>      <IRanges>  <Rle>
[1]       A1 [11176, 11205]      *
[2]       A1 [11176, 11206]      *
[3]       A1 [11176, 11206]      *
[4]       A1 [11176, 11206]      *
[5]       A1 [11176, 11206]      *
...      ...            ...    ...
[14]       A1 [11178, 11206]      *
[15]       A1 [11178, 11206]      *
[16]       A1 [11179, 11203]      *
[17]       A1 [11179, 11204]      *
[18]       A1 [11179, 11204]      *
``````
0
2.5 years ago by
poisonAlien2.6k
Asgard
poisonAlien2.6k wrote:

GRanges and IRanges are okay if your data is small, but they can be too slow for larger datasets!

Use `foverlaps` from data.table if your data is huge. It's crazy fast.
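
As a rough sketch of what a `foverlaps` call could look like for this kind of query, assuming the data.table package is installed: the table names `ranges`, `query`, and `hits` and the three example rows are made up for illustration, and integer coordinates are used in both tables so the column types match.

```r
library(data.table)

# A few illustrative intervals (not the full data set from the question)
ranges <- data.table(
  Id    = "A1",
  start = c(11176L, 11179L, 11233L),
  end   = c(11205L, 11204L, 11263L)
)

# The query interval; foverlaps requires the second table to be keyed,
# with the last two key columns giving the interval start and end
query <- data.table(Id = "A1", start = 11176L, end = 11205L)
setkey(query, Id, start, end)

# type = "any" keeps any kind of overlap; nomatch = NULL drops rows of
# `ranges` that do not overlap the query at all
hits <- foverlaps(ranges, query,
                  by.x = c("Id", "start", "end"),
                  type = "any", nomatch = NULL)

nrow(hits)  # 2
```

Only the first two intervals overlap the query here; the third starts after the query ends and is dropped.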