Hi,
I have a GRanges object representing intervals for all genes in the genome. Many of these intervals overlap. I would like to use reduce() from the GenomicRanges package to make a non-overlapping set of intervals, but separately for each gene: intervals belonging to the same gene should not overlap, while intervals from different genes may. One solution would be to split the GRanges by gene and apply reduce() to each subset, but I'm wondering whether there is a more efficient way?
Thanks
Example data:
chrom   start   end hgnc
1   100 200 MYC
1   150 300 MYC
1   400 500 MYC
1   150 230 TP53
1   200 350 TP53
1   420 550 TP53
Expected result:
chrom   start   end hgnc
1   100 300 MYC
1   400 500 MYC
1   150 350 TP53
1   420 550 TP53
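My current solution:
library(GenomicRanges)
# gene is the data frame used to create the initial GRanges
# split by gene, reduce each subset, then bind the results back together
do.call(rbind, lapply(
  split(gene, gene$hgnc),
  function(x) {
    as.data.frame(
      reduce(
        GRanges(x$chrom, IRanges(x$start, x$end))))
  }))
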
Do you expect 2 rows for this example data? If yes, then group by hgnc and take min(start) and max(end)?
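For reference, here is a minimal base-R sketch of the min/max grouping suggested above, run on the example data from the question. The data frame name `gene` comes from the question; `starts`, `ends` and `span` are just illustrative names. It shows what this shortcut returns, not a replacement for reduce().

```r
# Example data from the question
gene <- data.frame(
  chrom = "1",
  start = c(100, 150, 400, 150, 200, 420),
  end   = c(200, 300, 500, 230, 350, 550),
  hgnc  = c("MYC", "MYC", "MYC", "TP53", "TP53", "TP53")
)

# Smallest start and largest end per chromosome and gene
starts <- aggregate(start ~ chrom + hgnc, data = gene, FUN = min)
ends   <- aggregate(end ~ chrom + hgnc, data = gene, FUN = max)

# Join the two summaries on their shared columns (chrom, hgnc)
span <- merge(starts, ends)
print(span)
```

This gives exactly one interval per gene (MYC 100-500, TP53 150-550), so any gap between non-overlapping intervals of the same gene is closed. It only matches reduce() when each gene's intervals form a single overlapping block.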
The example is maybe not the best, indeed. In this case I expect two lines, yes. But I can end up with more than one line per gene (if there are multiple non-overlapping intervals; that's why I use the reduce() function).
Please provide better data.
As long as genes do not overlap, wouldn't simply doing this work, too?
reduce(GRanges(x$chrom, IRanges(x$start, x$end)))
I just replaced the example with dummy data better suited to the question, and added the expected result.
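One possible way to avoid the explicit lapply(), sketched here on the example data: split() on a GRanges returns a GRangesList, and reduce() on a GRangesList merges overlaps within each list element only. The names `gr`, `by_gene`, `flat` and `res` are illustrative, and I have not benchmarked this against the split/lapply version.

```r
library(GenomicRanges)

# Example data from the question
gene <- data.frame(
  chrom = "1",
  start = c(100, 150, 400, 150, 200, 420),
  end   = c(200, 300, 500, 230, 350, 550),
  hgnc  = c("MYC", "MYC", "MYC", "TP53", "TP53", "TP53")
)

# Keep the gene symbol as a metadata column
gr <- GRanges(gene$chrom, IRanges(gene$start, gene$end), hgnc = gene$hgnc)

# One list element per gene; overlaps are merged within each gene only
by_gene <- reduce(split(gr, gr$hgnc))

# Flatten back to a single GRanges; the names carry the gene symbol
flat <- unlist(by_gene)
flat$hgnc <- names(flat)
names(flat) <- NULL

res <- as.data.frame(flat)[, c("seqnames", "start", "end", "hgnc")]
print(res)
```

Because the reduction is done per list element, MYC and TP53 intervals that overlap each other stay separate, which is not the case when reduce() is called once on the ungrouped GRanges.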