Question: GenomicRanges : How to reduce a GRanges by gene
2
gravatar for Nicolas Rosewick
20 months ago by
Belgium, Brussels
Nicolas Rosewick9.3k wrote:

Hi,

I've a GRanges representing a intervals for all genes in the genome. A lot of these intervals are overlapping. I would like to use the reduce() from GenomicRanges package in order to make a non-overlapping set of interval. However I would like to do it for each gene separately. Thus for one specific gene, intervals for this gene should not overlap ; but intervals for different genes may overlap. One solution would like to split the GRanges by gene and apply reduce() on each subset but I'm wondering if there is a more efficient way ?

Thanks

Actual data

chrom   start   end hgnc
1   100 200 MYC
1   150 300 MYC
1   400 500 MYC
1   150 230 TP53
1   200 350 TP53
1   420 550 TP53

expected result

chrom   start   end hgnc
1   100 300 MYC
1   400 500 MYC
1   150 350 TP53
1   420 550 TP53

My actual solution :

# gene is the dataframe used to create the initial GRanges
 do.call(rbind,lapply(
  split(gene,gene$hgnc),
  function(x){
    as.data.frame(
      reduce(
        GRanges(x$chrom,IRanges(x$start,x$end))))}))
granges reduce genomicranges R • 2.3k views
ADD COMMENTlink modified 12 months ago by ATpoint46k • written 20 months ago by Nicolas Rosewick9.3k

Do you expect 2 rows for this example data? If yes, then group by hgnc, get min(start) max(end) ?

library(data.table)

x[ , list(chr = head(chrom, 1), start = min(start), end = max(end)), by = hgnc]
#    hgnc chr     start       end
# 1:  MYC   8 128747680 128753674
# 2: TP53  17   7565097   7590856
ADD REPLYlink modified 20 months ago • written 20 months ago by zx87549.9k

The example is maybe not the best indeed. In this case I expect two lines yes. But I can have more than 1 line per gene in the end (if there is multiple non-overlapping intervals ; that's why I use the reduce() function )

ADD REPLYlink modified 20 months ago • written 20 months ago by Nicolas Rosewick9.3k

Please provide better data.

As long as genes do not overlap, simply doing this would work, too? reduce(GRanges(x$chrom, IRanges(x$start, x$end)))

ADD REPLYlink modified 20 months ago • written 20 months ago by zx87549.9k

Just changed with dummy data more suited to the question. and expect result.

ADD REPLYlink written 20 months ago by Nicolas Rosewick9.3k
4
gravatar for zx8754
20 months ago by
zx87549.9k
London
zx87549.9k wrote:

Looks like we can't avoid reduce(), but we can avoid do.call(rbind, lapply(... using data.table:

library(data.table) 
library(GenomicRanges) # reduce function

# example data
x <- fread("
chrom   start   end hgnc
1   100 200 MYC
1   150 300 MYC
1   400 500 MYC
1   150 230 TP53
1   200 350 TP53
1   420 550 TP53")

x[, as.data.table(reduce(IRanges(start, end))), by = .(chrom, hgnc)]
#    chrom hgnc start end width
# 1:     1  MYC   100 300   201
# 2:     1  MYC   400 500   101
# 3:     1 TP53   150 350   201
# 4:     1 TP53   420 550   131

Code stolen and adapted from StackOverflow post:

ADD COMMENTlink modified 12 months ago • written 20 months ago by zx87549.9k

thanks @zx8754 . That's what I thought. I was just wondering if there was an obscure GRanges function/parameter that may help this.

ADD REPLYlink written 20 months ago by Nicolas Rosewick9.3k
6
gravatar for mike.deberardine
12 months ago by
mike.deberardine80 wrote:

For anyone still looking for a solution, this is much faster than previous responses. Starting from the dataframe (x):

gr <- GRanges(x)
# in short:
final <- unlist(reduce(split(gr, gr$hgnc)))

Here is in more detail, with description of each step

# split into GRangesList
grl <- split(gr, gr$hgnc)
# all(names(grl) == sort(unique(gr$hgnc))

# reduce each element (independently reduce ranges for each gene)
grl_redux <- reduce(grl) # element-wise, like lapply(grl, reduce) 
# all(names(grl_redux) == names(grl)) & all(lengths(grl_redux) <= lengths(grl))

# return single GRanges, with rownames derived from hgnc
final <- unlist(grl_redux)

Note that the rownames(final) won't all be unique if some of the genes had non-contiguous sections.

ADD COMMENTlink modified 12 months ago by zx87549.9k • written 12 months ago by mike.deberardine80

How is this different from zx8754's solution? Please elaborate on why you think this answer warrants updating a well-addressed months-old post.

ADD REPLYlink written 12 months ago by Ram32k
1

Sorry about that. Mine is orders of magnitude faster for a large genelist, doesn't convert to a data.table, and doesn't throw away chromosome names

ADD REPLYlink written 12 months ago by mike.deberardine80

That's wonderful. Thank you for addressing my questions.

ADD REPLYlink written 12 months ago by Ram32k

Edited the answer, we just needed to add chrom into grouping. As the example data only had chrom==1 I didn't add it to grouping, now fixed.

ADD REPLYlink written 12 months ago by zx87549.9k
2
gravatar for ATpoint
12 months ago by
ATpoint46k
ATpoint46k wrote:

Edit: There is some issue with the markdown that is messing up the code highlighting. Code might appear malformatted.

Two working solutions have been offered, lets offer a third one. The one from zx8754 is the fastest, but it drops chromosome names. The one from mike.deberardine keeps chromosome names and uses GRanges as Nicolas Rosewick requested but is notably slower based on my testing, see below.

Edit: See the comment on this answer for further speed comparisons, GRanges-based solutions outperform data.table for larger objects.

My solution is similar to the proposed solution from mike.deberardine but more simplistic as it simply concatenates chromosome and gene name into a new seqnames() argument so one can use reduce() directly without any split() which saves time. One then simply makes a new output GRanges() based on the output of reduced(). It is notably faster than the existing GRanges-based solution but a bit slower than the data.table solution, maybe a good compromise.

ADD COMMENTlink modified 12 months ago • written 12 months ago by ATpoint46k
1

Mike's solution is just 2 lines, the rest is the same code with description. Also, microbenchmark accepts multiple solutions. Something like microbenchmark(x = {...}, y = {...}, z = {...}, ...)

ADD REPLYlink written 12 months ago by zx87549.9k

Good to know about these multiple input!

ADD REPLYlink written 12 months ago by ATpoint46k

Here an example with a large object, in this case the coordinates of all annotated mouse transcripts from Ensembl. Here GRanges-based solutions are way faster, mine the fastest, from mike.deberardine the second fastest, and the data.table-based solution is super-slow, not sure why this is.

> dim(mmusculus_genes)
[1] 141368      4

>microbenchmark(fromZx(data.table(large.input)), times = 10)
## (... not finished after 2 min)

> microbenchmark(fromMike(data.table(large.input)), times = 10)
Unit: seconds
                              expr      min       lq     mean  median       uq      max neval
 fromMike(data.table(large.input)) 1.602599 1.604307 1.677332 1.65432 1.702985 1.863461    10

> microbenchmark(fromMe(data.table(large.input)), times = 10)
Unit: milliseconds
                            expr      min       lq     mean   median       uq      max neval
 fromMe(data.table(large.input)) 677.6056 688.0287 734.5014 699.4426 782.0847 866.0241    10

As one can see mine takes 700ms, the other GR-based one 1.6 seconds even for such a large object, so it does not really matter for the end user what to use.

ADD REPLYlink modified 12 months ago • written 12 months ago by ATpoint46k

Just want to emphasize that mine is actually only 2 lines long; the other lines were just for explanation

ADD REPLYlink written 12 months ago by mike.deberardine80
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1304 users visited in the last hour
_