GenomicRanges : How to reduce a GRanges by gene
3
2
Entering edit mode
4.8 years ago

Hi,

I've a GRanges representing a intervals for all genes in the genome. A lot of these intervals are overlapping. I would like to use the reduce() from GenomicRanges package in order to make a non-overlapping set of interval. However I would like to do it for each gene separately. Thus for one specific gene, intervals for this gene should not overlap ; but intervals for different genes may overlap. One solution would like to split the GRanges by gene and apply reduce() on each subset but I'm wondering if there is a more efficient way ?

Thanks

Actual data

chrom   start   end hgnc
1   100 200 MYC
1   150 300 MYC
1   400 500 MYC
1   150 230 TP53
1   200 350 TP53
1   420 550 TP53

expected result

chrom   start   end hgnc
1   100 300 MYC
1   400 500 MYC
1   150 350 TP53
1   420 550 TP53

My actual solution :

# gene is the dataframe used to create the initial GRanges
 do.call(rbind,lapply(
  split(gene,gene$hgnc),
  function(x){
    as.data.frame(
      reduce(
        GRanges(x$chrom,IRanges(x$start,x$end))))}))
genomicranges granges reduce R • 9.3k views
ADD COMMENT
2
Entering edit mode

Do you expect 2 rows for this example data? If yes, then group by hgnc, get min(start) max(end) ?

library(data.table)

x[ , list(chr = head(chrom, 1), start = min(start), end = max(end)), by = hgnc]
#    hgnc chr     start       end
# 1:  MYC   8 128747680 128753674
# 2: TP53  17   7565097   7590856
ADD REPLY
0
Entering edit mode

The example is maybe not the best indeed. In this case I expect two lines yes. But I can have more than 1 line per gene in the end (if there is multiple non-overlapping intervals ; that's why I use the reduce() function )

ADD REPLY
0
Entering edit mode

Please provide better data.

As long as genes do not overlap, simply doing this would work, too? reduce(GRanges(x$chrom, IRanges(x$start, x$end)))

ADD REPLY
0
Entering edit mode

Just changed with dummy data more suited to the question. and expect result.

ADD REPLY
7
Entering edit mode
4.8 years ago
zx8754 11k

Looks like we can't avoid reduce(), but we can avoid do.call(rbind, lapply(... using data.table:

library(data.table) 
library(GenomicRanges) # reduce function

# example data
x <- fread("
chrom   start   end hgnc
1   100 200 MYC
1   150 300 MYC
1   400 500 MYC
1   150 230 TP53
1   200 350 TP53
1   420 550 TP53")

x[, as.data.table(reduce(IRanges(start, end))), by = .(chrom, hgnc)]
#    chrom hgnc start end width
# 1:     1  MYC   100 300   201
# 2:     1  MYC   400 500   101
# 3:     1 TP53   150 350   201
# 4:     1 TP53   420 550   131

Code stolen and adapted from StackOverflow post:

ADD COMMENT
0
Entering edit mode

thanks @zx8754 . That's what I thought. I was just wondering if there was an obscure GRanges function/parameter that may help this.

ADD REPLY
8
Entering edit mode
4.2 years ago

For anyone still looking for a solution, this is much faster than previous responses. Starting from the dataframe (x):

gr <- GRanges(x)
# in short:
final <- unlist(reduce(split(gr, gr$hgnc)))

Here is in more detail, with description of each step

# split into GRangesList
grl <- split(gr, gr$hgnc)
# all(names(grl) == sort(unique(gr$hgnc))

# reduce each element (independently reduce ranges for each gene)
grl_redux <- reduce(grl) # element-wise, like lapply(grl, reduce) 
# all(names(grl_redux) == names(grl)) & all(lengths(grl_redux) <= lengths(grl))

# return single GRanges, with rownames derived from hgnc
final <- unlist(grl_redux)

Note that the rownames(final) won't all be unique if some of the genes had non-contiguous sections.

ADD COMMENT
0
Entering edit mode

How is this different from zx8754's solution? Please elaborate on why you think this answer warrants updating a well-addressed months-old post.

ADD REPLY
2
Entering edit mode

Sorry about that. Mine is orders of magnitude faster for a large genelist, doesn't convert to a data.table, and doesn't throw away chromosome names

ADD REPLY
0
Entering edit mode

That's wonderful. Thank you for addressing my questions.

ADD REPLY
0
Entering edit mode

Edited the answer, we just needed to add chrom into grouping. As the example data only had chrom==1 I didn't add it to grouping, now fixed.

ADD REPLY
2
Entering edit mode
4.2 years ago
ATpoint 81k

Edit: There is some issue with the markdown that is messing up the code highlighting. Code might appear malformatted.

Two working solutions have been offered, lets offer a third one. The one from zx8754 is the fastest, but it drops chromosome names. The one from mike.deberardine keeps chromosome names and uses GRanges as Nicolas Rosewick requested but is notably slower based on my testing, see below.

Edit: See the comment on this answer for further speed comparisons, GRanges-based solutions outperform data.table for larger objects.

My solution is similar to the proposed solution from mike.deberardine but more simplistic as it simply concatenates chromosome and gene name into a new seqnames() argument so one can use reduce() directly without any split() which saves time. One then simply makes a new output GRanges() based on the output of reduced(). It is notably faster than the existing GRanges-based solution but a bit slower than the data.table solution, maybe a good compromise.

ADD COMMENT
1
Entering edit mode

Mike's solution is just 2 lines, the rest is the same code with description. Also, microbenchmark accepts multiple solutions. Something like microbenchmark(x = {...}, y = {...}, z = {...}, ...)

ADD REPLY
0
Entering edit mode

Good to know about these multiple input!

ADD REPLY
0
Entering edit mode

Here an example with a large object, in this case the coordinates of all annotated mouse transcripts from Ensembl. Here GRanges-based solutions are way faster, mine the fastest, from mike.deberardine the second fastest, and the data.table-based solution is super-slow, not sure why this is.

> dim(mmusculus_genes)
[1] 141368      4

>microbenchmark(fromZx(data.table(large.input)), times = 10)
## (... not finished after 2 min)

> microbenchmark(fromMike(data.table(large.input)), times = 10)
Unit: seconds
                              expr      min       lq     mean  median       uq      max neval
 fromMike(data.table(large.input)) 1.602599 1.604307 1.677332 1.65432 1.702985 1.863461    10

> microbenchmark(fromMe(data.table(large.input)), times = 10)
Unit: milliseconds
                            expr      min       lq     mean   median       uq      max neval
 fromMe(data.table(large.input)) 677.6056 688.0287 734.5014 699.4426 782.0847 866.0241    10

As one can see mine takes 700ms, the other GR-based one 1.6 seconds even for such a large object, so it does not really matter for the end user what to use.

ADD REPLY
0
Entering edit mode

Just want to emphasize that mine is actually only 2 lines long; the other lines were just for explanation

ADD REPLY

Login before adding your answer.

Traffic: 1734 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6