GenomicFeatures: Select exons by transcript. Any faster option than exonsBy?
0
0
Entering edit mode
5.7 years ago

Consider this code:

library(microbenchmark)
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

microbenchmark::microbenchmark(
# Option 1:
GenomicFeatures::exons(
txdb,
vals = list(tx_name = c("uc001aaa.3", "uc010nxq.1")),
columns = list("EXONNAME", "TXNAME", "GENEID")),
# Option 2:
GenomicFeatures::exonsBy(txdb, by = "tx", use.names = TRUE)[c("uc001aaa.3", "uc010nxq.1")],
times = 10
)

# Option 1 takes an average of 1.6 seconds
# Option 2 takes an average of 6.5 seconds


These two options gets me the same data, but option1 is a lot faster. Is there an easy way I can use option1 and still get the structure that is option2? Or is there a way to make exonsBy faster by only extracting the transcripts I'm interested in?

My end goal is to get a GRangesList object with one GRanges object per transcript, OR a single GRanges object with duplicate entries for exons that appear in multiple transcripts (if that's possible). To begin with I only have the transcript names and the TxDb object.

_Edit_

I started working on an example to help explain what I want to accomplish. Below I'm trying two approaches, using exonsBy() (which succeeds but is slow) and using exons() (which fails but retrieves data faster):

library(Gviz)
library(GenomicFeatures)

options(ucscChromosomeNames=FALSE)
txdbhg19 <- GenomicFeatures::makeTxDbFromBiomart(
biomart = "ENSEMBL_MART_ENSEMBL",
dataset = "hsapiens_gene_ensembl",
host = "dec2013.archive.ensembl.org")

transcriptNames <- c("ENST00000370032", "ENST00000420055")

# ----- APPROACH 1

# Get GrangesList
tst <- exonsBy(txdbhg19, by = "tx", use.names=TRUE)[transcriptNames]
# Unlist to get GRanges
tst <- unlist(tst)
elementMetadata(tst)$transcript <- names(tst) # Create track and plot trTrack <- Gviz::GeneRegionTrack(tst) Gviz::plotTracks(trTrack) # This is the plot I want to produce # ----- APPROACH 2 tst2 <- GenomicFeatures::exons( txdbhg19, vals = list(tx_name = transcriptNames), columns = list("EXONNAME", "TXNAME", "GENEID")) # Problem 1: The TXNAME metadata field now contains transcripts that are not in my original transcript list: unique(unlist(elementMetadata(tst2)$TXNAME[!elementMetadata(tst2)$TXNAME %in% transcriptNames])) # [1] "ENST00000370031" "ENST00000402983" # I can hack this out, but it feels unclean to do this: elementMetadata(tst2)$TXNAME <- elementMetadata(tst2)$TXNAME[elementMetadata(tst2)$TXNAME %in% transcriptNames]
# I would much rather have it so that I only get the transcript names that I originally wanted. Anyhow, let's continue..

# Each entry in elementMetadata(tst2)$TXNAME now has 1 or more transcript names in it. This is not what I want. I want to somehow "expand" this GRanges object, so that each row only have one TXNAME value. Here's a very naive way to do it: gr <- GRanges() for (i in 1:length(tst2)) { trNames <- elementMetadata(tst2[i])$TXNAME[[1]]
trNames <- strsplit(trNames, " ")
for (j in 1:length(trNames)) {
newGr <- tst2[i]
elementMetadata(newGr)$TXNAME <- trNames[[j]] gr <- append(gr, newGr) print(paste(elementMetadata(tst2[i])$GENEID, trNames[[j]]))
}
}
elementMetadata(gr)$transcript <- elementMetadata(gr)$TXNAME

trTrack <- Gviz::GeneRegionTrack(gr)
Gviz::plotTracks(gr)
# Error in (function (classes, fdef, mtable)  :
# unable to find an inherited method for function ‘displayPars<-’ for signature ‘"GRanges", "list"’

# Why do I get this error? I was hoping that this gr structure would be similar to the tst structure in approach 1


The goal is to get a structure similar to the one I get with exonsBy() -> unlist(). Any ideas?

GenomicFeatures • 2.5k views