tximport error (medianLengthOverIsoform)
1
0
Entering edit mode
8 weeks ago
LGG ▴ 10

Hi,

I'm trying to do a DTU analysis in R on my RNA-seq data. I ran Salmon in mapping mode, using Homo_sapiens.GRCh37.cdna.all.fa to create the index.

Now I want to run tximport, but I keep getting this error:

Error in medianLengthOverIsoform(length4CFA, tx2gene, ignoreTxVersion,  :
  all(txId %in% tx2gene$tx) is not TRUE
Calls: tximport -> medianLengthOverIsoform -> stopifnot
Execution halted

Here's my code:

library(GenomicFeatures)
library(tximport)

gtf <- "../Homo_sapiens.GRCh37.87.gtf.gz"
txdb <- makeTxDbFromGFF(gtf)
k <- keys(txdb, keytype="TXNAME")
tx2gene <- AnnotationDbi::select(txdb, keys=k, columns=c("TXNAME", "GENEID"), keytype="TXNAME")

colnames(tx2gene) <- c("tx","gene")

salmon_files <- list.files("/path/to/files")

samples <- data.frame(
  sample = salmon_files,
  path = sprintf("/path/to/files/%s", salmon_files)
)

# Import transcript-level Salmon quantifications
# Build files vector: for tximport, point to quant.sf file paths
files <- file.path(samples$path, "quant.sf")
names(files) <- samples$sample

txi <- tximport(files, type="salmon", txOut = TRUE, tx2gene = tx2gene, countsFromAbundance = "dtuScaledTPM", ignoreTxVersion=TRUE)

Here is the tx2gene file and one of my quant.sf files:

> head(tx2gene)

              tx            gene
1 ENST00000456328 ENSG00000223972
2 ENST00000515242 ENSG00000223972
3 ENST00000518655 ENSG00000223972
4 ENST00000450305 ENSG00000223972
5 ENST00000473358 ENSG00000243485
6 ENST00000469289 ENSG00000243485

>  head(quant.sf)

               Name Length EffectiveLength TPM NumReads
1 ENST00000415118.1      8               8   0        0
2 ENST00000434970.2      9               9   0        0
3 ENST00000448914.1     13              13   0        0
4 ENST00000604642.1     23              23   0        0
5 ENST00000603326.1     19              19   0        0
6 ENST00000604950.1     31              31   0        0

If anyone could help me figure this out I would greatly appreciate it :)

DTU RNA-seq tximport • 752 views
ADD COMMENT
0
Entering edit mode

Is it the version number in your data?

e.g. ENST00000415118.1 vs ENST00000415118 could be causing the mismatch?

ADD REPLY
1
Entering edit mode

I believe using ignoreTxVersion = TRUE is supposed to take care of that

ADD REPLY
0
Entering edit mode
17 days ago
Kevin Blighe ★ 90k

The error arises because not all transcript IDs from your Salmon quant.sf files (after version stripping via ignoreTxVersion=TRUE) are present in the tx2gene mapping derived from the TxDb. This can happen with Ensembl annotations if there are subtle mismatches in the transcript sets between the cDNA fasta and GTF, especially for special loci like VDJ segments.

To resolve this, build tx2gene using biomaRt from the GRCh37 archive, including transcript versions to match your quant.sf exactly.

Install biomaRt if needed: BiocManager::install("biomaRt").

Then run:

library(biomaRt)
mart <- useMart(host="grch37.ensembl.org", biomart="ensembl", dataset="hsapiens_gene_ensembl")
tx2gene <- getBM(attributes = c("ensembl_transcript_id_version", "ensembl_gene_id"), mart = mart)
colnames(tx2gene) <- c("tx", "gene")

Now update your tximport call with ignoreTxVersion=FALSE:

txi <- tximport(files, type="salmon", txOut=TRUE, tx2gene=tx2gene, countsFromAbundance="dtuScaledTPM", ignoreTxVersion=FALSE)

This should import successfully for DTU analysis (e.g., with DRIMSeq or DEXSeq). If issues persist, check for missing transcripts via setdiff(sub("\\.\\d+$", "", read.delim(files[1])$Name), tx2gene$tx) and report them.

Kevin

ADD COMMENT

Login before adding your answer.

Traffic: 3414 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6