Question

GenomicRanges : How to reduce a GRanges by gene

2

Entering edit mode

4.8 years ago

Nicolas Rosewick 10k

Hi,

I've a GRanges representing a intervals for all genes in the genome. A lot of these intervals are overlapping. I would like to use the reduce() from GenomicRanges package in order to make a non-overlapping set of interval. However I would like to do it for each gene separately. Thus for one specific gene, intervals for this gene should not overlap ; but intervals for different genes may overlap. One solution would like to split the GRanges by gene and apply reduce() on each subset but I'm wondering if there is a more efficient way ?

Thanks

Actual data

chrom   start   end hgnc
1   100 200 MYC
1   150 300 MYC
1   400 500 MYC
1   150 230 TP53
1   200 350 TP53
1   420 550 TP53

expected result

chrom   start   end hgnc
1   100 300 MYC
1   400 500 MYC
1   150 350 TP53
1   420 550 TP53

My actual solution :

# gene is the dataframe used to create the initial GRanges
 do.call(rbind,lapply(
  split(gene,gene$hgnc),
  function(x){
    as.data.frame(
      reduce(
        GRanges(x$chrom,IRanges(x$start,x$end))))}))

genomicranges granges reduce R • 9.3k views

ADD COMMENT • link updated 4.2 years ago by ATpoint 81k • written 4.8 years ago by Nicolas Rosewick 10k

2

Entering edit mode

Do you expect 2 rows for this example data? If yes, then group by hgnc, get min(start) max(end) ?

library(data.table)

x[ , list(chr = head(chrom, 1), start = min(start), end = max(end)), by = hgnc]
#    hgnc chr     start       end
# 1:  MYC   8 128747680 128753674
# 2: TP53  17   7565097   7590856

ADD REPLY • link 4.8 years ago by zx8754 11k

0

Entering edit mode

The example is maybe not the best indeed. In this case I expect two lines yes. But I can have more than 1 line per gene in the end (if there is multiple non-overlapping intervals ; that's why I use the reduce() function )

ADD REPLY • link 4.8 years ago by Nicolas Rosewick 10k

0

Entering edit mode

Please provide better data.

As long as genes do not overlap, simply doing this would work, too? reduce(GRanges(x$chrom, IRanges(x$start, x$end)))

ADD REPLY • link 4.8 years ago by zx8754 11k

0

Entering edit mode

Just changed with dummy data more suited to the question. and expect result.

ADD REPLY • link 4.8 years ago by Nicolas Rosewick 10k

8

Entering edit mode

4.2 years ago

mike.deberardine ▴ 110

For anyone still looking for a solution, this is much faster than previous responses. Starting from the dataframe (x):

gr <- GRanges(x)
# in short:
final <- unlist(reduce(split(gr, gr$hgnc)))

Here is in more detail, with description of each step

# split into GRangesList
grl <- split(gr, gr$hgnc)
# all(names(grl) == sort(unique(gr$hgnc))

# reduce each element (independently reduce ranges for each gene)
grl_redux <- reduce(grl) # element-wise, like lapply(grl, reduce) 
# all(names(grl_redux) == names(grl)) & all(lengths(grl_redux) <= lengths(grl))

# return single GRanges, with rownames derived from hgnc
final <- unlist(grl_redux)

Note that the rownames(final) won't all be unique if some of the genes had non-contiguous sections.

ADD COMMENT • link updated 4.1 years ago by zx8754 11k • written 4.2 years ago by mike.deberardine ▴ 110

0

Entering edit mode

How is this different from zx8754's solution? Please elaborate on why you think this answer warrants updating a well-addressed months-old post.

ADD REPLY • link 4.2 years ago by Ram 43k

2

Entering edit mode

Sorry about that. Mine is orders of magnitude faster for a large genelist, doesn't convert to a data.table, and doesn't throw away chromosome names

ADD REPLY • link 4.2 years ago by mike.deberardine ▴ 110

0

Entering edit mode

That's wonderful. Thank you for addressing my questions.

ADD REPLY • link 4.2 years ago by Ram 43k

0

Entering edit mode

Edited the answer, we just needed to add chrom into grouping. As the example data only had chrom==1 I didn't add it to grouping, now fixed.

ADD REPLY • link 4.2 years ago by zx8754 11k

2

Entering edit mode

4.2 years ago

ATpoint 81k

Edit: There is some issue with the markdown that is messing up the code highlighting. Code might appear malformatted.

Two working solutions have been offered, lets offer a third one. The one from zx8754 is the fastest, but it drops chromosome names. The one from mike.deberardine keeps chromosome names and uses GRanges as Nicolas Rosewick requested but is notably slower based on my testing, see below.

Edit: See the comment on this answer for further speed comparisons, GRanges-based solutions outperform data.table for larger objects.

My solution is similar to the proposed solution from mike.deberardine but more simplistic as it simply concatenates chromosome and gene name into a new seqnames() argument so one can use reduce() directly without any split() which saves time. One then simply makes a new output GRanges() based on the output of reduced(). It is notably faster than the existing GRanges-based solution but a bit slower than the data.table solution, maybe a good compromise.

ADD COMMENT • link 4.2 years ago by ATpoint 81k

1

Entering edit mode

Mike's solution is just 2 lines, the rest is the same code with description. Also, microbenchmark accepts multiple solutions. Something like microbenchmark(x = {...}, y = {...}, z = {...}, ...)

ADD REPLY • link 4.2 years ago by zx8754 11k

0

Entering edit mode

Good to know about these multiple input!

ADD REPLY • link 4.2 years ago by ATpoint 81k

0

Entering edit mode

Here an example with a large object, in this case the coordinates of all annotated mouse transcripts from Ensembl. Here GRanges-based solutions are way faster, mine the fastest, from mike.deberardine the second fastest, and the data.table-based solution is super-slow, not sure why this is.

> dim(mmusculus_genes)
[1] 141368      4

>microbenchmark(fromZx(data.table(large.input)), times = 10)
## (... not finished after 2 min)

> microbenchmark(fromMike(data.table(large.input)), times = 10)
Unit: seconds
                              expr      min       lq     mean  median       uq      max neval
 fromMike(data.table(large.input)) 1.602599 1.604307 1.677332 1.65432 1.702985 1.863461    10

> microbenchmark(fromMe(data.table(large.input)), times = 10)
Unit: milliseconds
                            expr      min       lq     mean   median       uq      max neval
 fromMe(data.table(large.input)) 677.6056 688.0287 734.5014 699.4426 782.0847 866.0241    10

As one can see mine takes 700ms, the other GR-based one 1.6 seconds even for such a large object, so it does not really matter for the end user what to use.

ADD REPLY • link 4.2 years ago by ATpoint 81k

0

Entering edit mode

Just want to emphasize that mine is actually only 2 lines long; the other lines were just for explanation

ADD REPLY • link 4.1 years ago by mike.deberardine ▴ 110

score 7 · Accepted Answer · 2019-06-26

Looks like we can't avoid reduce(), but we can avoid do.call(rbind, lapply(... using data.table:

library(data.table) 
library(GenomicRanges) # reduce function

# example data
x <- fread("
chrom   start   end hgnc
1   100 200 MYC
1   150 300 MYC
1   400 500 MYC
1   150 230 TP53
1   200 350 TP53
1   420 550 TP53")

x[, as.data.table(reduce(IRanges(start, end))), by = .(chrom, hgnc)]
#    chrom hgnc start end width
# 1:     1  MYC   100 300   201
# 2:     1  MYC   400 500   101
# 3:     1 TP53   150 350   201
# 4:     1 TP53   420 550   131

Code stolen and adapted from StackOverflow post:

Finding all overlaps in one iteration of foverlap in R's data.table