More efficient way of lifting over and collapsing CNV GenomicRanges using rtracklayer?
0
0
Entering edit mode
11 weeks ago
j.torpy • 0

I have a set of CNV GenomicRanges co-ordinates mapped to hg19 and need to lift them over to hg38. I have a solution to this below, however it seems very inefficient and is taking too long, given my granges objects are ~1 million in length.

library(GenomicRanges)
library(rtracklayer)


Here are my hg19 ranges:

hg19_gr <- GRanges(
seqnames = Rle(c("chr1", "chr6", "chr13")),
ranges = IRanges(
start = c(72776, 81294341, 107384238),
end = c(16963313, 94652407, 107384710)
),
strand = Rle("*")
)

hg19_gr
GRanges object with 3 ranges and 0 metadata columns:
seqnames              ranges strand
<Rle>           <IRanges>  <Rle>
[1]     chr1      72776-16963313      *
[2]     chr6   81294341-94652407      *
[3]    chr13 107384238-107384710      *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths


I lift these over to hg38 using rtracklayer:

chain <- import.chain("hg19ToHg38.over.chain")
hg38_liftover <- liftOver(hg19_gr, chain)


Some of the ranges are of large genomic regions. For each of these I get a Granges list of the co-ordinate of a set of segments, some of which are on different chromosomes:

hg38_liftover[[1]]
GRanges object with 1038 ranges and 0 metadata columns:
seqnames            ranges strand
<Rle>         <IRanges>  <Rle>
[1]     chr1      72776-177376      *
[2]     chr1     257667-297968      *
[3]     chr1    585989-1630687      *
[4]     chr1   1630690-1634405      *
[5]     chr1   1634409-1635542      *
...      ...               ...    ...
[1034]     chr1 13203521-13203532      *
[1035]     chr1 13203533-13203548      *
[1036]    chr20 38461961-38462015      *
[1037]    chr19     242824-242864      *
[1038]     chr1 12893309-12893335      *
-------
seqinfo: 5 sequences from an unspecified genome; no seqlengths


To summarise each GRangesList object back into one range:

hg38_gr_list <- lapply(hg38_liftover, function(x) {

# find the parent chromosome, i.e. one the majority of the ranges belong to:
main_chr <- seqnames(x)@values[
which.max(seqnames(x)@lengths)
]

# filter out ranges from other chromosomes:
main_gr <- sort(x[seqnames(x) == main_chr])

# to collapse into a final range starting with the earliest co-ordinate
# and ending with the latest, first fill in gaps between the ranges,
# removing the 'gap' from position 1 to the first range of main_gr,
# then reduce to collapse into one range:
new_gr <- reduce(c(main_gr, gaps(main_gr)[-1]))

})


Lastly, I convert the GRangesList object into a genomic ranges object:

hg38_gr <- do.call("c", hg38_gr_list)


Now I have new ranges with hg38 co-ordinates. As I mentioned above though, this seems very inefficient. Both the lapply command and the do.call command to collapse the list are taking a long time. I was wondering if anyone could suggest a better or quicker way to solve this problem, or whether there is already an easier solution within rtracklayer/GenomicRanges which I am missing? Any help would be greatly appreciated!

R rtracklayer • 175 views