Question: GenomicRanges : How to reduce a GRanges by gene
2
gravatar for Nicolas Rosewick
9 months ago by
Belgium, Brussels
Nicolas Rosewick8.7k wrote:

Hi,

I've a GRanges representing a intervals for all genes in the genome. A lot of these intervals are overlapping. I would like to use the reduce() from GenomicRanges package in order to make a non-overlapping set of interval. However I would like to do it for each gene separately. Thus for one specific gene, intervals for this gene should not overlap ; but intervals for different genes may overlap. One solution would like to split the GRanges by gene and apply reduce() on each subset but I'm wondering if there is a more efficient way ?

Thanks

Actual data

chrom   start   end hgnc
1   100 200 MYC
1   150 300 MYC
1   400 500 MYC
1   150 230 TP53
1   200 350 TP53
1   420 550 TP53

expected result

chrom   start   end hgnc
1   100 300 MYC
1   400 500 MYC
1   150 350 TP53
1   420 550 TP53

My actual solution :

# gene is the dataframe used to create the initial GRanges
 do.call(rbind,lapply(
  split(gene,gene$hgnc),
  function(x){
    as.data.frame(
      reduce(
        GRanges(x$chrom,IRanges(x$start,x$end))))}))
granges reduce genomicranges R • 607 views
ADD COMMENTlink modified 5 weeks ago by ATpoint32k • written 9 months ago by Nicolas Rosewick8.7k

Do you expect 2 rows for this example data? If yes, then group by hgnc, get min(start) max(end) ?

library(data.table)

x[ , list(chr = head(chrom, 1), start = min(start), end = max(end)), by = hgnc]
#    hgnc chr     start       end
# 1:  MYC   8 128747680 128753674
# 2: TP53  17   7565097   7590856
ADD REPLYlink modified 9 months ago • written 9 months ago by zx87549.1k

The example is maybe not the best indeed. In this case I expect two lines yes. But I can have more than 1 line per gene in the end (if there is multiple non-overlapping intervals ; that's why I use the reduce() function )

ADD REPLYlink modified 9 months ago • written 9 months ago by Nicolas Rosewick8.7k

Please provide better data.

As long as genes do not overlap, simply doing this would work, too? reduce(GRanges(x$chrom, IRanges(x$start, x$end)))

ADD REPLYlink modified 9 months ago • written 9 months ago by zx87549.1k

Just changed with dummy data more suited to the question. and expect result.

ADD REPLYlink written 9 months ago by Nicolas Rosewick8.7k
4
gravatar for zx8754
9 months ago by
zx87549.1k
London
zx87549.1k wrote:

Looks like we can't avoid reduce(), but we can avoid do.call(rbind, lapply(... using data.table:

library(data.table) 
library(GenomicRanges) # reduce function

# example data
x <- fread("
chrom   start   end hgnc
1   100 200 MYC
1   150 300 MYC
1   400 500 MYC
1   150 230 TP53
1   200 350 TP53
1   420 550 TP53")

x[, as.data.table(reduce(IRanges(start, end))), by = .(chrom, hgnc)]
#    chrom hgnc start end width
# 1:     1  MYC   100 300   201
# 2:     1  MYC   400 500   101
# 3:     1 TP53   150 350   201
# 4:     1 TP53   420 550   131

Code stolen and adapted from StackOverflow post:

ADD COMMENTlink modified 5 weeks ago • written 9 months ago by zx87549.1k

thanks @zx8754 . That's what I thought. I was just wondering if there was an obscure GRanges function/parameter that may help this.

ADD REPLYlink written 9 months ago by Nicolas Rosewick8.7k
4
gravatar for mike.deberardine
5 weeks ago by
mike.deberardine60 wrote:

For anyone still looking for a solution, this is much faster than previous responses. Starting from the dataframe (x):

gr <- GRanges(x)
# in short:
final <- unlist(reduce(split(gr, gr$hgnc)))

Here is in more detail, with description of each step

# split into GRangesList
grl <- split(gr, gr$hgnc)
# all(names(grl) == sort(unique(gr$hgnc))

# reduce each element (independently reduce ranges for each gene)
grl_redux <- reduce(grl) # element-wise, like lapply(grl, reduce) 
# all(names(grl_redux) == names(grl)) & all(lengths(grl_redux) <= lengths(grl))

# return single GRanges, with rownames derived from hgnc
final <- unlist(grl_redux)

Note that the rownames(final) won't all be unique if some of the genes had non-contiguous sections.

ADD COMMENTlink modified 5 weeks ago by zx87549.1k • written 5 weeks ago by mike.deberardine60

How is this different from zx8754's solution? Please elaborate on why you think this answer warrants updating a well-addressed months-old post.

ADD REPLYlink written 5 weeks ago by RamRS26k
1

Sorry about that. Mine is orders of magnitude faster for a large genelist, doesn't convert to a data.table, and doesn't throw away chromosome names

ADD REPLYlink written 5 weeks ago by mike.deberardine60

That's wonderful. Thank you for addressing my questions.

ADD REPLYlink written 5 weeks ago by RamRS26k

Edited the answer, we just needed to add chrom into grouping. As the example data only had chrom==1 I didn't add it to grouping, now fixed.

ADD REPLYlink written 5 weeks ago by zx87549.1k
2
gravatar for ATpoint
5 weeks ago by
ATpoint32k
Germany
ATpoint32k wrote:

Edit: There is some issue with the markdown that is messing up the code highlighting. Code might appear malformatted.

Two working solutions have been offered, lets offer a third one. The one from zx8754 is the fastest, but it drops chromosome names. The one from mike.deberardine keeps chromosome names and uses GRanges as Nicolas Rosewick requested but is notably slower based on my testing, see below.

Edit: See the comment on this answer for further speed comparisons, GRanges-based solutions outperform data.table for larger objects.

My solution is similar to the proposed solution from mike.deberardine but more simplistic as it simply concatenates chromosome and gene name into a new seqnames() argument so one can use reduce() directly without any split() which saves time. One then simply makes a new output GRanges() based on the output of reduced(). It is notably faster than the existing GRanges-based solution but a bit slower than the data.table solution, maybe a good compromise.

ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by ATpoint32k
1

Mike's solution is just 2 lines, the rest is the same code with description. Also, microbenchmark accepts multiple solutions. Something like microbenchmark(x = {...}, y = {...}, z = {...}, ...)

ADD REPLYlink written 5 weeks ago by zx87549.1k

Good to know about these multiple input!

ADD REPLYlink written 5 weeks ago by ATpoint32k

Here an example with a large object, in this case the coordinates of all annotated mouse transcripts from Ensembl. Here GRanges-based solutions are way faster, mine the fastest, from mike.deberardine the second fastest, and the data.table-based solution is super-slow, not sure why this is.

> dim(mmusculus_genes)
[1] 141368      4

>microbenchmark(fromZx(data.table(large.input)), times = 10)
## (... not finished after 2 min)

> microbenchmark(fromMike(data.table(large.input)), times = 10)
Unit: seconds
                              expr      min       lq     mean  median       uq      max neval
 fromMike(data.table(large.input)) 1.602599 1.604307 1.677332 1.65432 1.702985 1.863461    10

> microbenchmark(fromMe(data.table(large.input)), times = 10)
Unit: milliseconds
                            expr      min       lq     mean   median       uq      max neval
 fromMe(data.table(large.input)) 677.6056 688.0287 734.5014 699.4426 782.0847 866.0241    10

As one can see mine takes 700ms, the other GR-based one 1.6 seconds even for such a large object, so it does not really matter for the end user what to use.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by ATpoint32k

Just want to emphasize that mine is actually only 2 lines long; the other lines were just for explanation

ADD REPLYlink written 5 weeks ago by mike.deberardine60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1006 users visited in the last hour