How to prevent reduce from being performed in GRanges during setdiff
3
2
Entering edit mode
19 months ago
Bosberg ▴ 50

In the genomic ranges package for R, is there an option in setdiff that prevents the collapsing of adjacent ranges? for example, If I have the following:

gr1 = GRanges object
[1]     chr1      1-10      *
[2]     chr1     11-20      *
[3]     chr1     21-30      *

gr2 = GRanges object
[1]     chr1     18-25      *


and I take setdiff:

What I get:

setdiff( gr1, gr2 )
[1]     chr1      1-17      *   # <-- 1-10 and 11-20 are reduced together automatically; I don't want this.
[2]     chr1     26-30      *


The output stops at 17 and picks up again at 26 to avoid gr2 from 18-25, so at least that much is correct, but unfortunately, my output now has a single continuous range from 1 to 17 instead of 1-10, and a different GRange directly adjacent from 11-17. There's a reduce operation being done automatically that I want to suppress.

What I want:

setdiff( gr1, gr2, <Some_option_to_suppress_reduce> )
[1]     chr1      1-10      *
[2]     chr1     11-17      *
[3]     chr1     26-30      *


I don’t want these first two regions to be reduced into one.

What I've tried:

The best solution I've come up with so far is to convert to a list and then back to GRange, like this:

unlist( GRangesList( lapply( 1:length(gr1), function(i) setdiff ( gr1[i], gr2) ) ))


..which does what I want but with all the converting between data types it's really slow and inefficient. Is there an option to turn off reduce directly (or some other more elegant solution)?

blunt-end genomicranges setdiff • 1.1k views
0
Entering edit mode

Apparently PyRanges does this by default with the function subtract() --which, I think is a much more logical default setting for something called 'subtract' (but for a function called 'setdiff()' I can understand this choice of default behavior; it would be nice if there were an equivalent 'subtract' function for GRanges though).

Solution=switch to PyRanges, or use function below.

2
Entering edit mode
19 months ago
Bosberg ▴ 50

Thanks to Rongxin for putting me on the right track here. I wrote the following function in case anyone encounters this in the future (editted March 22 to account for the fact that bedtools assumes base-0 start position; if anyone encounters problems with this, please reply and ping me.):

library(GenomicRanges)
library(bedtoolsr)
# Written by bosberg on Biostars, March 22, 2021. Free to use for scholarly purposes.

GRanges_subtract <- function( gr1, gr2 )
{
# Subtract GRange object gr2 from gr1, but unlike setdiff, preserve individual
# ranges in gr1
df_1 = data.frame( seqnames=seqnames(gr1), start=start(gr1)-1, end=end(gr1), strand=strand(gr1), mcols( gr1 ) )
df_2 = data.frame( seqnames=seqnames(gr2), start=start(gr2)-1, end=end(gr2), strand=strand(gr2), mcols( gr2 ) )
#                                                          ^ -1 --> convert to base-0 start for bedtools
result = bedtoolsr::bt.subtract(df_1, df_2)

if ( length(result)==0 ){
# subtraction has left nothing remaining. Return empty GRanges obj.
return( GRanges() )
} else {

colnames( result ) = colnames( df_1 )
result$start=result$start+1
#                        ^ reset to base-1 notation consistent with GRanges

return ( GRanges( result ) )
}
}

1
Entering edit mode
19 months ago
Rongxin ▴ 40

Hello Bosberg,

I understand your question. I am sorry that I have no idea about using setdiff to solve this problem.

However, I would like to recommend you an alternative R package (bedtoolsr) which may solve this problem.

Example code

install.packages("devtools")
devtools::install_github("PhanstielLab/bedtoolsr")

library(GenomicRanges)
library(bedtoolsr)

A <- GRanges("chr1", IRanges(c(1, 11, 21), c(10, 20, 30)))
B <- GRanges("chr1", IRanges(18, 25))

A.2 <- data.frame(seqnames=seqnames(A), starts=start(A), ends=end(A))
B.2 <- data.frame(seqnames=seqnames(B), starts=start(B) - 1, ends=end(B) + 1) # boundary should be handled carefully
print(bedtoolsr::bt.subtract(A.2, B.2))
# result
#    V1 V2 V3
#1 chr1  1 10
#2 chr1 11 17
#3 chr1 26 30
# then convert data.frame to GRanges

0
Entering edit mode

Thanks Rongxin. I added a few lines to make it a closed function to call above. I also found some edge cases that indicated to me it's better to shift the boundary on the start position for both incoming dataFrames (as in the function I provided above). Your general idea was right though, and it led to my solution, so thanks very much.

0
Entering edit mode
11 months ago
mrtn • 0

Here is a solution based on R/GenomicRanges only:

subtract.GRanges <- function(x, y){
stopifnot(isDisjoint(x))
z <- disjoin(c(x, y))
return(z[!overlapsAny(z, y)])
}


Lets try it on your input:

x <- GRanges(c('chr1:1-10', 'chr1:11-20', 'chr1:21-30'))
y <- GRanges('chr1:18-25')
subtract.GRanges(x, y)


The output is:

GRanges object with 3 ranges and 0 metadata columns:
seqnames    ranges strand
<Rle> <IRanges>  <Rle>
[1]     chr1      1-10      *
[2]     chr1     11-17      *
[3]     chr1     26-30      *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths


Here a graphical representation of how it works:

First the two sets of GRanges x and y are combined and disjoined so that we get new GRanges z with all regions and breakpoints present in x and y. Then, all ranges in z that overlap y are removed to get the final result.

One caveat is that the method only works if the ranges in x are disjoint. Disjoint means that there must not be any overlapping ranges within a GRanges object. If this method would be applied on x with overlapping ranges the result might not be as expected, so the function above checks this condition.

0
Entering edit mode

Thanks for suggestion mrtn, this is great if indeed one can assume no overlapping ranges (e.g. as in my example where 1-10 and 11-20 are adjacent but not overlapping). In general though, I think a "subtract" function should be robust against possible overlapping regions --hence, if I had regions of, say, 1-12, and 8-20, there may still be some reason why I've partitioned them as such and want to keep them distinct. I think the minimal operation of "subtract" should not merge these ranges together by default.