I have a set of CNV GenomicRanges co-ordinates mapped to hg19 and need to lift them over to hg38. I have a solution to this below, however it seems very inefficient and is taking too long, given my granges objects are ~1 million in length.
library(GenomicRanges)
library(rtracklayer)
Here are my hg19 ranges:
hg19_gr <- GRanges(
seqnames = Rle(c("chr1", "chr6", "chr13")),
ranges = IRanges(
start = c(72776, 81294341, 107384238),
end = c(16963313, 94652407, 107384710)
),
strand = Rle("*")
)
hg19_gr
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 72776-16963313 *
[2] chr6 81294341-94652407 *
[3] chr13 107384238-107384710 *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
I lift these over to hg38 using rtracklayer:
chain <- import.chain("hg19ToHg38.over.chain")
hg38_liftover <- liftOver(hg19_gr, chain)
Some of the ranges are of large genomic regions. For each of these I get a Granges list of the co-ordinate of a set of segments, some of which are on different chromosomes:
hg38_liftover[[1]]
GRanges object with 1038 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 72776-177376 *
[2] chr1 257667-297968 *
[3] chr1 585989-1630687 *
[4] chr1 1630690-1634405 *
[5] chr1 1634409-1635542 *
... ... ... ...
[1034] chr1 13203521-13203532 *
[1035] chr1 13203533-13203548 *
[1036] chr20 38461961-38462015 *
[1037] chr19 242824-242864 *
[1038] chr1 12893309-12893335 *
-------
seqinfo: 5 sequences from an unspecified genome; no seqlengths
To summarise each GRangesList object back into one range:
hg38_gr_list <- lapply(hg38_liftover, function(x) {
# find the parent chromosome, i.e. one the majority of the ranges belong to:
main_chr <- seqnames(x)@values[
which.max(seqnames(x)@lengths)
]
# filter out ranges from other chromosomes:
main_gr <- sort(x[seqnames(x) == main_chr])
# to collapse into a final range starting with the earliest co-ordinate
# and ending with the latest, first fill in gaps between the ranges,
# removing the 'gap' from position 1 to the first range of main_gr,
# then reduce to collapse into one range:
new_gr <- reduce(c(main_gr, gaps(main_gr)[-1]))
})
Lastly, I convert the GRangesList object into a genomic ranges object:
hg38_gr <- do.call("c", hg38_gr_list)
Now I have new ranges with hg38 co-ordinates. As I mentioned above though, this seems very inefficient. Both the lapply command and the do.call command to collapse the list are taking a long time. I was wondering if anyone could suggest a better or quicker way to solve this problem, or whether there is already an easier solution within rtracklayer/GenomicRanges which I am missing? Any help would be greatly appreciated!