Entering edit mode
6.6 years ago
ifreecell
▴
180
Hi there, I have a GRanges object that I wanna merge duplicate ranges and append counts as a metadata columns. I know similar thing can be easily done using uniq -c
, but I want do it in R/Bioconductor. Is there any suggestion?
>orginalGRanges
GRanges with 6 ranges and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> [1] Chr1 [2, 24] + [2] Chr1 [2, 25] + [3] Chr1 [2, 25] + [4] Chr1 [4, 25] + [5] Chr1 [5, 25] + [6] Chr1 [6, 21] + --- seqlengths: Chr1 NA
>expectedGRanges
GRanges with 5 ranges and 1 metadata column: seqnames ranges strand | hits <Rle> <IRanges> <Rle> | <numeric> [1] Chr1 [2, 24] + | 1 [2] Chr1 [2, 25] + | 2 [3] Chr1 [4, 25] + | 1 [4] Chr1 [5, 25] + | 1 [5] Chr1 [6, 21] + | 1 --- seqlengths: Chr1 NA
I am testing your approach on a sorted 5 million ranges object, it took hours before Rstudio encountered a fetal error. BTW, countOverlaps is really RAM consuming, it ate up all RAM (64G) on my computer.
Yeah, I imagine that that could prove pretty memory heavy, since I don't think it assumes that things are sorted. You might split() by chromosome and then use lapply(). Perhaps that'll be a bit more reasonable.
There is only one chromosome in this GRanges, I don't think split() will help. I wrote the GRanges to a bed file, and it took one second for
uniq -c
to compute. I really appreciate it if the entire workflow can be done in Bioconductor, is there other workaround?Is it the unique() step that takes forever or just countOverlaps (or both)? I would guess that it's just the countOverlaps() function. Since you only ever care about exact overlaps/matches, you might be best off writing a custom function to do this. At least if you have to do this more than once or twice.
You are right, it's just the countOverlaps() function. Thank you anyway, I will try to write a custom function.