How to prevent reduce from being performed in GRanges during setdiff
3
2
Entering edit mode
10 months ago
Bosberg ▴ 40

In the genomic ranges package for R, is there an option in setdiff that prevents the collapsing of adjacent ranges? for example, If I have the following:

gr1 = GRanges object 
  [1]     chr1      1-10      *
  [2]     chr1     11-20      *
  [3]     chr1     21-30      *

gr2 = GRanges object
  [1]     chr1     18-25      *

and I take setdiff:

What I get:

setdiff( gr1, gr2 )
  [1]     chr1      1-17      *   # <-- 1-10 and 11-20 are reduced together automatically; I don't want this.
  [2]     chr1     26-30      *

The output stops at 17 and picks up again at 26 to avoid gr2 from 18-25, so at least that much is correct, but unfortunately, my output now has a single continuous range from 1 to 17 instead of 1-10, and a different GRange directly adjacent from 11-17. There's a reduce operation being done automatically that I want to suppress.

What I want:

setdiff( gr1, gr2, <Some_option_to_suppress_reduce> )
  [1]     chr1      1-10      *
  [2]     chr1     11-17      *
  [3]     chr1     26-30      *

I don’t want these first two regions to be reduced into one.

What I've tried:

The best solution I've come up with so far is to convert to a list and then back to GRange, like this:

unlist( GRangesList( lapply( 1:length(gr1), function(i) setdiff ( gr1[i], gr2) ) ))

..which does what I want but with all the converting between data types it's really slow and inefficient. Is there an option to turn off reduce directly (or some other more elegant solution)?

blunt-end genomicranges setdiff • 538 views
ADD COMMENT
0
Entering edit mode

Apparently PyRanges does this by default with the function subtract() --which, I think is a much more logical default setting for something called 'subtract' (but for a function called 'setdiff()' I can understand this choice of default behavior; it would be nice if there were an equivalent 'subtract' function for GRanges though).

https://pyranges.readthedocs.io/en/latest/autoapi/pyranges/index.html#pyranges.PyRanges.subtract

Solution=switch to PyRanges, or use function below.

ADD REPLY
2
Entering edit mode
9 months ago
Bosberg ▴ 40

Thanks to Rongxin for putting me on the right track here. I wrote the following function in case anyone encounters this in the future (editted March 22 to account for the fact that bedtools assumes base-0 start position; if anyone encounters problems with this, please reply and ping me.):

library(GenomicRanges)
library(bedtoolsr)
# Written by bosberg on Biostars, March 22, 2021. Free to use for scholarly purposes.

GRanges_subtract <- function( gr1, gr2 )
{
  # Subtract GRange object gr2 from gr1, but unlike setdiff, preserve individual
  # ranges in gr1
  df_1 = data.frame( seqnames=seqnames(gr1), start=start(gr1)-1, end=end(gr1), strand=strand(gr1), mcols( gr1 ) )
  df_2 = data.frame( seqnames=seqnames(gr2), start=start(gr2)-1, end=end(gr2), strand=strand(gr2), mcols( gr2 ) )
  #                                                          ^ -1 --> convert to base-0 start for bedtools
  result = bedtoolsr::bt.subtract(df_1, df_2)

  if ( length(result)==0 ){
    # subtraction has left nothing remaining. Return empty GRanges obj.
    return( GRanges() )
  } else {

    colnames( result ) = colnames( df_1 )
    result$start=result$start+1
    #                        ^ reset to base-1 notation consistent with GRanges

    return ( GRanges( result ) )
  }
}
ADD COMMENT
1
Entering edit mode
9 months ago
Rongxin ▴ 30

Hello Bosberg,

I understand your question. I am sorry that I have no idea about using setdiff to solve this problem.

However, I would like to recommend you an alternative R package (bedtoolsr) which may solve this problem.

Example code

install.packages("devtools")
devtools::install_github("PhanstielLab/bedtoolsr")

library(GenomicRanges)
library(bedtoolsr)

A <- GRanges("chr1", IRanges(c(1, 11, 21), c(10, 20, 30)))
B <- GRanges("chr1", IRanges(18, 25))

A.2 <- data.frame(seqnames=seqnames(A), starts=start(A), ends=end(A))
B.2 <- data.frame(seqnames=seqnames(B), starts=start(B) - 1, ends=end(B) + 1) # boundary should be handled carefully
print(bedtoolsr::bt.subtract(A.2, B.2))
# result
#    V1 V2 V3
#1 chr1  1 10
#2 chr1 11 17
#3 chr1 26 30
# then convert data.frame to GRanges
ADD COMMENT
0
Entering edit mode

Thanks Rongxin. I added a few lines to make it a closed function to call above. I also found some edge cases that indicated to me it's better to shift the boundary on the start position for both incoming dataFrames (as in the function I provided above). Your general idea was right though, and it led to my solution, so thanks very much.

ADD REPLY
0
Entering edit mode
5 weeks ago
mrtn • 0

Here is a solution based on R/GenomicRanges only:

subtract.GRanges <- function(x, y){
  stopifnot(isDisjoint(x))
  z <- disjoin(c(x, y))
  return(z[!overlapsAny(z, y)])
}

Lets try it on your input:

x <- GRanges(c('chr1:1-10', 'chr1:11-20', 'chr1:21-30'))
y <- GRanges('chr1:18-25')
subtract.GRanges(x, y)

The output is:

GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      1-10      *
  [2]     chr1     11-17      *
  [3]     chr1     26-30      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Here a graphical representation of how it works:

Graphical representation of the subtract.GRanges()

First the two sets of GRanges x and y are combined and disjoined so that we get new GRanges z with all regions and breakpoints present in x and y. Then, all ranges in z that overlap y are removed to get the final result.

One caveat is that the method only works if the ranges in x are disjoint. Disjoint means that there must not be any overlapping ranges within a GRanges object. If this method would be applied on x with overlapping ranges the result might not be as expected, so the function above checks this condition.

ADD COMMENT
0
Entering edit mode

Thanks for suggestion mrtn, this is great if indeed one can assume no overlapping ranges (e.g. as in my example where 1-10 and 11-20 are adjacent but not overlapping). In general though, I think a "subtract" function should be robust against possible overlapping regions --hence, if I had regions of, say, 1-12, and 8-20, there may still be some reason why I've partitioned them as such and want to keep them distinct. I think the minimal operation of "subtract" should not merge these ranges together by default.

ADD REPLY

Login before adding your answer.

Traffic: 2291 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6