GenomicRanges : How to reduce a GRanges by gene
3
2
Entering edit mode
2.9 years ago

Hi,

I've a GRanges representing a intervals for all genes in the genome. A lot of these intervals are overlapping. I would like to use the reduce() from GenomicRanges package in order to make a non-overlapping set of interval. However I would like to do it for each gene separately. Thus for one specific gene, intervals for this gene should not overlap ; but intervals for different genes may overlap. One solution would like to split the GRanges by gene and apply reduce() on each subset but I'm wondering if there is a more efficient way ?

Thanks

Actual data

chrom   start   end hgnc
1   100 200 MYC
1   150 300 MYC
1   400 500 MYC
1   150 230 TP53
1   200 350 TP53
1   420 550 TP53


expected result

chrom   start   end hgnc
1   100 300 MYC
1   400 500 MYC
1   150 350 TP53
1   420 550 TP53


My actual solution :

# gene is the dataframe used to create the initial GRanges
do.call(rbind,lapply(
split(gene,gene$hgnc), function(x){ as.data.frame( reduce( GRanges(x$chrom,IRanges(x$start,x$end))))}))

genomicranges granges reduce R • 5.2k views
2
Entering edit mode

Do you expect 2 rows for this example data? If yes, then group by hgnc, get min(start) max(end) ?

library(data.table)

x[ , list(chr = head(chrom, 1), start = min(start), end = max(end)), by = hgnc]
#    hgnc chr     start       end
# 1:  MYC   8 128747680 128753674
# 2: TP53  17   7565097   7590856

0
Entering edit mode

The example is maybe not the best indeed. In this case I expect two lines yes. But I can have more than 1 line per gene in the end (if there is multiple non-overlapping intervals ; that's why I use the reduce() function )

0
Entering edit mode

As long as genes do not overlap, simply doing this would work, too? reduce(GRanges(x$chrom, IRanges(x$start, x$end))) ADD REPLY 0 Entering edit mode Just changed with dummy data more suited to the question. and expect result. ADD REPLY 6 Entering edit mode 2.9 years ago zx8754 11k Looks like we can't avoid reduce(), but we can avoid do.call(rbind, lapply(... using data.table: library(data.table) library(GenomicRanges) # reduce function # example data x <- fread(" chrom start end hgnc 1 100 200 MYC 1 150 300 MYC 1 400 500 MYC 1 150 230 TP53 1 200 350 TP53 1 420 550 TP53") x[, as.data.table(reduce(IRanges(start, end))), by = .(chrom, hgnc)] # chrom hgnc start end width # 1: 1 MYC 100 300 201 # 2: 1 MYC 400 500 101 # 3: 1 TP53 150 350 201 # 4: 1 TP53 420 550 131  Code stolen and adapted from StackOverflow post: ADD COMMENT 0 Entering edit mode thanks @zx8754 . That's what I thought. I was just wondering if there was an obscure GRanges function/parameter that may help this. ADD REPLY 7 Entering edit mode 2.2 years ago For anyone still looking for a solution, this is much faster than previous responses. Starting from the dataframe (x): gr <- GRanges(x) # in short: final <- unlist(reduce(split(gr, gr$hgnc)))


Here is in more detail, with description of each step

# split into GRangesList
grl <- split(gr, gr$hgnc) # all(names(grl) == sort(unique(gr$hgnc))

# reduce each element (independently reduce ranges for each gene)
grl_redux <- reduce(grl) # element-wise, like lapply(grl, reduce)
# all(names(grl_redux) == names(grl)) & all(lengths(grl_redux) <= lengths(grl))

# return single GRanges, with rownames derived from hgnc
final <- unlist(grl_redux)


Note that the rownames(final) won't all be unique if some of the genes had non-contiguous sections.

0
Entering edit mode

How is this different from zx8754's solution? Please elaborate on why you think this answer warrants updating a well-addressed months-old post.

2
Entering edit mode

Sorry about that. Mine is orders of magnitude faster for a large genelist, doesn't convert to a data.table, and doesn't throw away chromosome names

0
Entering edit mode

That's wonderful. Thank you for addressing my questions.

0
Entering edit mode

Edited the answer, we just needed to add chrom into grouping. As the example data only had chrom==1 I didn't add it to grouping, now fixed.

2
Entering edit mode
2.2 years ago
ATpoint 60k

Edit: There is some issue with the markdown that is messing up the code highlighting. Code might appear malformatted.

Two working solutions have been offered, lets offer a third one. The one from zx8754 is the fastest, but it drops chromosome names. The one from mike.deberardine keeps chromosome names and uses GRanges as Nicolas Rosewick requested but is notably slower based on my testing, see below.

Edit: See the comment on this answer for further speed comparisons, GRanges-based solutions outperform data.table for larger objects.

My solution is similar to the proposed solution from mike.deberardine but more simplistic as it simply concatenates chromosome and gene name into a new seqnames() argument so one can use reduce() directly without any split() which saves time. One then simply makes a new output GRanges() based on the output of reduced(). It is notably faster than the existing GRanges-based solution but a bit slower than the data.table solution, maybe a good compromise.

1
Entering edit mode

Mike's solution is just 2 lines, the rest is the same code with description. Also, microbenchmark accepts multiple solutions. Something like microbenchmark(x = {...}, y = {...}, z = {...}, ...)

0
Entering edit mode

Good to know about these multiple input!

0
Entering edit mode

Here an example with a large object, in this case the coordinates of all annotated mouse transcripts from Ensembl. Here GRanges-based solutions are way faster, mine the fastest, from mike.deberardine the second fastest, and the data.table-based solution is super-slow, not sure why this is.

> dim(mmusculus_genes)
[1] 141368      4

>microbenchmark(fromZx(data.table(large.input)), times = 10)
## (... not finished after 2 min)

> microbenchmark(fromMike(data.table(large.input)), times = 10)
Unit: seconds
expr      min       lq     mean  median       uq      max neval
fromMike(data.table(large.input)) 1.602599 1.604307 1.677332 1.65432 1.702985 1.863461    10

> microbenchmark(fromMe(data.table(large.input)), times = 10)
Unit: milliseconds
expr      min       lq     mean   median       uq      max neval
fromMe(data.table(large.input)) 677.6056 688.0287 734.5014 699.4426 782.0847 866.0241    10


As one can see mine takes 700ms, the other GR-based one 1.6 seconds even for such a large object, so it does not really matter for the end user what to use.

0
Entering edit mode

Just want to emphasize that mine is actually only 2 lines long; the other lines were just for explanation