Question

How to obtain distinct/unique rows from a GenomicRanges object

1

4.8 years ago

gundalav ▴ 380

I have the following GenomicRanges object created with this:

library(GenomicRanges)
gr <- GRanges(seqnames = "chr1", strand = c("+", "-","-", "+"),ranges = IRanges(start = c(1,3,3,5), width = 3))
gr

GRanges object with 4 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1       1-3      +
  [2]     chr1       3-5      -
  [3]     chr1       3-5      -
  [4]     chr1       5-7      +

What I want is to obtain the unique rows from it, yielding this (hand-coded):

GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1       1-3      +
  [2]     chr1       3-5      -
  [3]     chr1       5-7      +

How can I achieve that? In reality, I have around 9 million rows to process.

I can use this method, but it is very slow:

library(tidyverse)
# as_tibble() replaces the deprecated as.tibble()
gr %>%
  as_tibble() %>%
  distinct()

R GenomicRanges bioconductor • 5.3k views

ADD COMMENT • link updated 3 months ago by Ram 43k • written 4.8 years ago by gundalav ▴ 380

score 2 · Answer 1 · 2019-06-26

2

4.8 years ago

zx8754 11k

Use unique as usual (no need for tidyverse):

unique(gr)
# GRanges object with 3 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chr1       1-3      +
#   [2]     chr1       3-5      -
#   [3]     chr1       5-7      +
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Then convert to data.frame if needed:

data.frame(unique(gr))
#     seqnames start end width strand
#   1     chr1     1   3     3      +
#   2     chr1     3   5     3      -
#   3     chr1     5   7     3      +
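
If you also need to know which rows were dropped, duplicated() returns a logical mask over the ranges, and subsetting with its negation should give the same result as unique(gr). A minimal sketch, reusing the gr object from the question (the names dup and gr_unique are only for illustration):

```r
library(GenomicRanges)

# Same toy object as in the question
gr <- GRanges(seqnames = "chr1",
              strand = c("+", "-", "-", "+"),
              ranges = IRanges(start = c(1, 3, 3, 5), width = 3))

# TRUE for every range that repeats an earlier one
# (compared on seqnames, start, end and strand)
dup <- duplicated(gr)

# Keep only the first occurrence of each range
gr_unique <- gr[!dup]

# Number of duplicate rows that were removed
sum(dup)
```

Keeping the mask around is handy when you want to inspect or count the duplicates before discarding them.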

ADD COMMENT • link 4.8 years ago by zx8754 11k

1

Just be aware that unique() ignores the metadata columns (mcols) of the GRanges object, so ranges that differ only in their metadata are collapsed into one:

a_gr <- GRanges(seqnames = 1,
            ranges = IRanges(start=c(1,1),
                             end =c(2,2)), 
            strand=c("+"),
            other=c("a","b"))
a_gr
#GRanges object with 2 ranges and 1 metadata column:
#  seqnames    ranges strand |       other
#   <Rle> <IRanges>  <Rle> | <character>
#[1]        1       1-2      + |           a
#[2]        1       1-2      + |           b

unique(a_gr)
#GRanges object with 1 range and 1 metadata column:
#seqnames    ranges strand |       other
#   <Rle> <IRanges>  <Rle> | <character>
#[1]        1       1-2      + |           a

ADD REPLY • link 2.7 years ago by andreabarghetti ▴ 10

0

Is there a way to do this that takes the metadata into account?

ADD REPLY • link 3 months ago by jon.klonowski ▴ 150

score 0 · Answer 2 · 2023-12-21

For those who want to remove duplicates without ignoring the metadata, one approach is to build a unique identifier for each row and then drop every row whose identifier has already been seen:

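# Assumes library(GenomicRanges) is attached. The key covers seqnames, start, end, strand and the metadata column.
# Use strand(a_gr) rather than a_gr$strand: `$` only looks up metadata columns, so a_gr$strand would be NULL here.
a_gr$key <- paste(as.character(seqnames(a_gr)), start(a_gr), end(a_gr),
                  as.character(strand(a_gr)), a_gr$other, sep = ":")
a_gr[!duplicated(a_gr$key)]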
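
An alternative sketch that avoids building a string key: convert the object to a data.frame, which carries the metadata columns alongside seqnames, start, end, width and strand, and call duplicated() on that. This assumes every metadata column is a plain atomic vector (no list-like columns), and I have not benchmarked it on millions of rows. The b_gr object and the name keep are illustrative only:

```r
library(GenomicRanges)

# Three ranges: rows 1 and 2 differ only in metadata,
# row 3 is an exact copy of row 1 (coordinates, strand and metadata)
b_gr <- GRanges(seqnames = "chr1",
                ranges = IRanges(start = c(1, 1, 1), end = c(2, 2, 2)),
                strand = rep("+", 3),
                other = c("a", "b", "a"))

# duplicated() on the data.frame compares coordinates, strand
# and metadata columns together, row by row
keep <- !duplicated(as.data.frame(b_gr))

# Rows 1 and 2 survive; row 3 is dropped as a full duplicate
b_gr[keep]
```

Compared with the key approach, this picks up every metadata column automatically, so there is no key to update when columns are added.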