How to obtain distinct/unique rows from a GenomicRanges object
4.8 years ago
gundalav ▴ 380

I have the following GenomicRanges object created with this:

library(GenomicRanges)
gr <- GRanges(seqnames = "chr1", strand = c("+", "-","-", "+"),ranges = IRanges(start = c(1,3,3,5), width = 3))
gr

GRanges object with 4 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1       1-3      +
  [2]     chr1       3-5      -
  [3]     chr1       3-5      -
  [4]     chr1       5-7      +

I want to keep only the unique rows, which should give this (written by hand):

GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1       1-3      +
  [2]     chr1       3-5      -
  [3]     chr1       5-7      +

How can I achieve that? In reality, I have around 9 million rows to process.

I can use this method, but it is very slow:

library(tidyverse)
gr %>%
  as_tibble() %>%   # as.tibble() is deprecated in favour of as_tibble()
  distinct()
R GenomicRanges bioconductor • 5.3k views
4.8 years ago
zx8754 11k

Use unique as usual (no need for tidyverse):

unique(gr)
# GRanges object with 3 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chr1       1-3      +
#   [2]     chr1       3-5      -
#   [3]     chr1       5-7      +
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Then convert to data.frame if needed:

data.frame(unique(gr))
#     seqnames start end width strand
#   1     chr1     1   3     3      +
#   2     chr1     3   5     3      -
#   3     chr1     5   7     3      +
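
If you also want to know how many rows are being dropped, duplicated() can be used the same way. The following is only a small sketch on the example object from the question; the variable names dup and gr_unique are made up for illustration:

library(GenomicRanges)

gr <- GRanges(seqnames = "chr1",
              strand = c("+", "-", "-", "+"),
              ranges = IRanges(start = c(1, 3, 3, 5), width = 3))

# TRUE marks a range whose seqname, start, end and strand were already seen
dup <- duplicated(gr)
sum(dup)   # number of ranges that unique() would drop

# Keeping the non-duplicated ranges gives the same ranges as unique(gr)
gr_unique <- gr[!dup]

Like unique(), this compares only the coordinates and strand.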

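Just be aware that unique() ignores the metadata columns (mcols) of the GRanges: ranges that differ only in their metadata are collapsed, and only the first one is kept.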

a_gr <- GRanges(seqnames = 1,
            ranges = IRanges(start=c(1,1),
                             end =c(2,2)), 
            strand=c("+"),
            other=c("a","b"))
a_gr
# GRanges object with 2 ranges and 1 metadata column:
#       seqnames    ranges strand |       other
#          <Rle> <IRanges>  <Rle> | <character>
#   [1]        1       1-2      + |           a
#   [2]        1       1-2      + |           b

unique(a_gr)
# GRanges object with 1 range and 1 metadata column:
#       seqnames    ranges strand |       other
#          <Rle> <IRanges>  <Rle> | <character>
#   [1]        1       1-2      + |           a

Is there a way to do this that takes the metadata into account?

3 months ago
jon.klonowski ▴ 150

For those who want to remove duplicates without ignoring the metadata: build a key from every field that should define uniqueness, then drop the rows whose key has already been seen:

# Key = seqname, start, end, strand and the metadata column "other".
# `$` on a GRanges reads metadata columns, so the strand has to come from strand().
a_gr$key <- paste(as.character(seqnames(a_gr)), start(a_gr), end(a_gr),
                  as.character(strand(a_gr)), a_gr$other, sep = ":")
a_gr[!duplicated(a_gr$key)]
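
A possible alternative that avoids building a string key by hand is to let duplicated() compare whole rows of the data.frame form of the object. This is only a sketch, not something benchmarked on 9 million rows; the three-range example object and the names full_dup and a_gr_distinct are made up for illustration:

library(GenomicRanges)

a_gr <- GRanges(seqnames = "1",
                ranges = IRanges(start = c(1, 1, 1), end = c(2, 2, 2)),
                strand = "+",
                other = c("a", "b", "a"))

# as.data.frame() puts coordinates, strand and metadata columns side by side,
# so duplicated() on it compares complete rows, metadata included.
full_dup <- duplicated(as.data.frame(a_gr))

# Rows 1 and 2 differ in "other" and are both kept; row 3 repeats row 1.
a_gr_distinct <- a_gr[!full_dup]

The data.frame conversion copies the whole object, so on very large inputs the key-based approach above may use less memory.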