More efficient way of lifting over and collapsing CNV GenomicRanges using rtracklayer?
Entering edit mode
11 weeks ago
j.torpy • 0

I have a set of CNV GenomicRanges co-ordinates mapped to hg19 and need to lift them over to hg38. I have a solution to this below, however it seems very inefficient and is taking too long, given my granges objects are ~1 million in length.


Here are my hg19 ranges:

hg19_gr <- GRanges(
  seqnames = Rle(c("chr1", "chr6", "chr13")),
  ranges = IRanges(
    start = c(72776, 81294341, 107384238), 
    end = c(16963313, 94652407, 107384710)
  strand = Rle("*")

GRanges object with 3 ranges and 0 metadata columns:
       seqnames              ranges strand
          <Rle>           <IRanges>  <Rle>
   [1]     chr1      72776-16963313      *
   [2]     chr6   81294341-94652407      *
   [3]    chr13 107384238-107384710      *
   seqinfo: 3 sequences from an unspecified genome; no seqlengths

I lift these over to hg38 using rtracklayer:

chain <- import.chain("hg19ToHg38.over.chain")
hg38_liftover <- liftOver(hg19_gr, chain)

Some of the ranges are of large genomic regions. For each of these I get a Granges list of the co-ordinate of a set of segments, some of which are on different chromosomes:

GRanges object with 1038 ranges and 0 metadata columns:
        seqnames            ranges strand
           <Rle>         <IRanges>  <Rle>
    [1]     chr1      72776-177376      *
    [2]     chr1     257667-297968      *
    [3]     chr1    585989-1630687      *
    [4]     chr1   1630690-1634405      *
    [5]     chr1   1634409-1635542      *
    ...      ...               ...    ...
    [1034]     chr1 13203521-13203532      *
    [1035]     chr1 13203533-13203548      *
    [1036]    chr20 38461961-38462015      *
    [1037]    chr19     242824-242864      *
    [1038]     chr1 12893309-12893335      *
  seqinfo: 5 sequences from an unspecified genome; no seqlengths

To summarise each GRangesList object back into one range:

hg38_gr_list <- lapply(hg38_liftover, function(x) {

  # find the parent chromosome, i.e. one the majority of the ranges belong to:
  main_chr <- seqnames(x)@values[

  # filter out ranges from other chromosomes:
  main_gr <- sort(x[seqnames(x) == main_chr])

  # to collapse into a final range starting with the earliest co-ordinate
  # and ending with the latest, first fill in gaps between the ranges,
  # removing the 'gap' from position 1 to the first range of main_gr,
  # then reduce to collapse into one range:
  new_gr <- reduce(c(main_gr, gaps(main_gr)[-1]))


Lastly, I convert the GRangesList object into a genomic ranges object:

hg38_gr <-"c", hg38_gr_list)

Now I have new ranges with hg38 co-ordinates. As I mentioned above though, this seems very inefficient. Both the lapply command and the command to collapse the list are taking a long time. I was wondering if anyone could suggest a better or quicker way to solve this problem, or whether there is already an easier solution within rtracklayer/GenomicRanges which I am missing? Any help would be greatly appreciated!

R rtracklayer • 175 views

Login before adding your answer.

Traffic: 2300 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6