More efficient way of lifting over and collapsing CNV GenomicRanges using rtracklayer?
0
0
Entering edit mode
11 weeks ago
j.torpy • 0

I have a set of CNV GenomicRanges co-ordinates mapped to hg19 and need to lift them over to hg38. I have a solution to this below, however it seems very inefficient and is taking too long, given my granges objects are ~1 million in length.

library(GenomicRanges)
library(rtracklayer)

Here are my hg19 ranges:

hg19_gr <- GRanges(
  seqnames = Rle(c("chr1", "chr6", "chr13")),
  ranges = IRanges(
    start = c(72776, 81294341, 107384238), 
    end = c(16963313, 94652407, 107384710)
  ),
  strand = Rle("*")
)

hg19_gr
GRanges object with 3 ranges and 0 metadata columns:
       seqnames              ranges strand
          <Rle>           <IRanges>  <Rle>
   [1]     chr1      72776-16963313      *
   [2]     chr6   81294341-94652407      *
   [3]    chr13 107384238-107384710      *
   -------
   seqinfo: 3 sequences from an unspecified genome; no seqlengths

I lift these over to hg38 using rtracklayer:

chain <- import.chain("hg19ToHg38.over.chain")
hg38_liftover <- liftOver(hg19_gr, chain)

Some of the ranges are of large genomic regions. For each of these I get a Granges list of the co-ordinate of a set of segments, some of which are on different chromosomes:

hg38_liftover[[1]]
GRanges object with 1038 ranges and 0 metadata columns:
        seqnames            ranges strand
           <Rle>         <IRanges>  <Rle>
    [1]     chr1      72776-177376      *
    [2]     chr1     257667-297968      *
    [3]     chr1    585989-1630687      *
    [4]     chr1   1630690-1634405      *
    [5]     chr1   1634409-1635542      *
    ...      ...               ...    ...
    [1034]     chr1 13203521-13203532      *
    [1035]     chr1 13203533-13203548      *
    [1036]    chr20 38461961-38462015      *
    [1037]    chr19     242824-242864      *
    [1038]     chr1 12893309-12893335      *
    -------
  seqinfo: 5 sequences from an unspecified genome; no seqlengths

To summarise each GRangesList object back into one range:

hg38_gr_list <- lapply(hg38_liftover, function(x) {

  # find the parent chromosome, i.e. one the majority of the ranges belong to:
  main_chr <- seqnames(x)@values[
    which.max(seqnames(x)@lengths)
  ]

  # filter out ranges from other chromosomes:
  main_gr <- sort(x[seqnames(x) == main_chr])

  # to collapse into a final range starting with the earliest co-ordinate
  # and ending with the latest, first fill in gaps between the ranges,
  # removing the 'gap' from position 1 to the first range of main_gr,
  # then reduce to collapse into one range:
  new_gr <- reduce(c(main_gr, gaps(main_gr)[-1]))

})

Lastly, I convert the GRangesList object into a genomic ranges object:

hg38_gr <- do.call("c", hg38_gr_list)

Now I have new ranges with hg38 co-ordinates. As I mentioned above though, this seems very inefficient. Both the lapply command and the do.call command to collapse the list are taking a long time. I was wondering if anyone could suggest a better or quicker way to solve this problem, or whether there is already an easier solution within rtracklayer/GenomicRanges which I am missing? Any help would be greatly appreciated!

R rtracklayer • 175 views
ADD COMMENT

Login before adding your answer.

Traffic: 2300 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6