Question: Merge duplicate GRanges ranges
2
gravatar for ifreecell
4.6 years ago by
ifreecell170
Germany
ifreecell170 wrote:

Hi there, I have a GRanges object that I wanna merge duplicate ranges and append counts as a metadata columns. I know similar thing can be easily done using uniq -c, but I want do it in R/Bioconductor. Is there any suggestion?

>orginalGRanges
GRanges with 6 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     Chr1   [2, 24]      +
  [2]     Chr1   [2, 25]      +
  [3]     Chr1   [2, 25]      +
  [4]     Chr1   [4, 25]      +
  [5]     Chr1   [5, 25]      +
  [6]     Chr1   [6, 21]      +
  ---
  seqlengths:
           Chr1
             NA
>expectedGRanges
GRanges with 5 ranges and 1 metadata column:
      seqnames    ranges strand |      hits
         <Rle> <IRanges>  <Rle> | <numeric>
  [1]     Chr1   [2, 24]      + |         1
  [2]     Chr1   [2, 25]      + |         2
  [3]     Chr1   [4, 25]      + |         1
  [4]     Chr1   [5, 25]      + |         1
  [5]     Chr1   [6, 21]      + |         1
  ---
  seqlengths:
           Chr1
             NA
granges merge R • 4.7k views
ADD COMMENTlink modified 4.6 years ago by Devon Ryan89k • written 4.6 years ago by ifreecell170
3
gravatar for Devon Ryan
4.6 years ago by
Devon Ryan89k
Freiburg, Germany
Devon Ryan89k wrote:

You're looking for the countOverlaps() function:

g <- GRanges(seqnames=c(rep("Chr1",6)), strand=rep("+",6), ranges=IRanges(start=c(2,2,2,4,5,6), end=c(24,25,25,25,25,21)))
g2 <- unique(g)
g2$hits<- countOverlaps(g2, g, type="equal")

You could also run countOverlaps(g,g,type="equal") first and then use unique(). You'll get the same result either way.

ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by Devon Ryan89k

I am testing your approach on a sorted 5 million ranges object, it took hours before Rstudio encountered a fetal error. BTW, countOverlaps is really RAM consuming, it ate up all RAM (64G) on my computer.

ADD REPLYlink written 4.6 years ago by ifreecell170

Yeah, I imagine that that could prove pretty memory heavy, since I don't think it assumes that things are sorted. You might split() by chromosome and then use lapply(). Perhaps that'll be a bit more reasonable.

ADD REPLYlink written 4.6 years ago by Devon Ryan89k

There is only one chromosome in this GRanges, I don't think split() will help. I wrote the GRanges to a bed file, and it took one second for uniq -c to compute. I really appreciate it if the entire workflow can be done in Bioconductor,  is there other workaround? 

ADD REPLYlink written 4.6 years ago by ifreecell170

Is it the unique() step that takes forever or just countOverlaps (or both)? I would guess that it's just the countOverlaps() function. Since you only ever care about exact overlaps/matches, you might be best off writing a custom function to do this. At least if you have to do this more than once or twice.

ADD REPLYlink written 4.6 years ago by Devon Ryan89k

You are right, it's just the countOverlaps() function. Thank you anyway, I will try to write a custom function.

ADD REPLYlink written 4.6 years ago by ifreecell170
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1855 users visited in the last hour