Error when running tximport on salmon files
3.9 years ago
Lila M ★ 1.1k

Hi guys, I'm trying to analyze some RNA-seq data using salmon as follows:

#create the index:
salmon index -t gencode.v27.transcripts.fa -i human_index

#create the quant.sf files:
salmon quant -i human_index/ -l OSR -1 R1.fastq -2 R2.fastq -o salmon_quant


After that, I want to process all the files (1Q_S1_quant.sf, 2Q_S2_quant.sf ... 16Q_S16_quant.sf) in R for downstream analysis with DESeq2. To do that I've tried:

library(GenomicFeatures)
library(tximport)
library(rjson)

## Create a transcript-to-gene table (tx2gene) that will be used to aggregate
## Salmon transcript quantifications to the gene level

txdb <- makeTxDbFromGFF("gencode.v27.annotation.gtf")
k <- keys(txdb, keytype = "GENEID")
df <- select(txdb, keys = k,  columns = "TXNAME", keytype = "GENEID")
tx2gene <- df[, 2:1]

files <- list.files(pattern = "quant.sf", full.names = TRUE)
names(files) <- paste0("sample", 1:16)
all(file.exists(files))
#TRUE

txi_salmon <- tximport(files = files, type = "salmon", txOut = FALSE, tx2gene = tx2gene)
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Error in summarizeToGene(txi, tx2gene, ignoreTxVersion, countsFromAbundance) :
None of the transcripts in the quantification files are present
in the first column of tx2gene. Check to see that you are using
the same annotation for both.


But that is not true at all, because I looked in both files (quant.sf and tx2gene) and the same transcript for the same gene is present in both, e.g.:

#tx2gene
TXNAME                    GENEID
ENST00000373031.4   ENSG00000000005.5
ENST00000485971.1   ENSG00000000005.5

#1Q_S1.quant.sf
ENST00000373031.4|ENSG00000000005.5|OTTHUMG00000022001.1|OTTHUMT00000057481.1|TNMD-201|TNMD|1339|protein_coding|    1339    1156.86 0   0
ENST00000485971.1|ENSG00000000005.5|OTTHUMG00000022001.1|OTTHUMT00000057482.1|TNMD-202|TNMD|542|processed_transcript|   542 360.895 0   0


Any suggestions about what's going on with this funny error?

Thanks!

Hint: compare the first columns of the two files you posted. You'll note that they're not exactly the same. That's causing the error.
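The comparison hinted at here can also be done mechanically. Below is a minimal sketch on toy files: the file names (diag_quant.sf, diag_tx2gene.tsv) and their one-row contents are made up for illustration, and it assumes tx2gene has been written out as a tab-separated file with a header line.

```shell
#!/bin/sh
# Toy stand-ins for quant.sf and a tab-separated dump of tx2gene (hypothetical file names).
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\n' > diag_quant.sf
printf 'ENST00000373031.4|ENSG00000000005.5|TNMD-201|TNMD|1339|protein_coding|\t1339\t1156.86\t0\t0\n' >> diag_quant.sf
printf 'TXNAME\tGENEID\n' > diag_tx2gene.tsv
printf 'ENST00000373031.4\tENSG00000000005.5\n' >> diag_tx2gene.tsv

# Extract the first column of each file, skipping the header line, and sort for comm.
cut -f1 diag_quant.sf | tail -n +2 | sort > diag_quant_ids.txt
cut -f1 diag_tx2gene.tsv | tail -n +2 | sort > diag_tx2gene_ids.txt

# Count the IDs that appear, character for character, in both lists.
shared=$(comm -12 diag_quant_ids.txt diag_tx2gene_ids.txt | wc -l | tr -d ' ')
echo "shared IDs: $shared"
```

In this toy case no IDs are shared, because the quant.sf name carries the extra |-separated fields while the tx2gene name does not.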

Hi Devon, can you explain how can I solve it? Thanks!

You can probably do something like sed -e 's/|[^[:space:]]*//' 1Q_S1.quant.sf.
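The idea behind the sed suggestion is to delete everything from the first | up to the tab that ends the name column, so that only the bare transcript ID remains. A minimal sketch on a toy file follows. The file names and the shortened example row are hypothetical, and the pattern |[^[:space:]]* is one possible choice that stops at the first whitespace character so the numeric columns are left alone.

```shell
#!/bin/sh
# Build a tiny tab-separated stand-in for quant.sf (hypothetical contents).
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\n' > toy_quant.sf
printf 'ENST00000373031.4|ENSG00000000005.5|TNMD-201|TNMD|1339|protein_coding|\t1339\t1156.86\t0\t0\n' >> toy_quant.sf

# Remove the first "|" and every non-whitespace character after it,
# which leaves the transcript ID followed by the untouched numeric columns.
sed -e 's/|[^[:space:]]*//' toy_quant.sf > toy_quant.clean.sf

cat toy_quant.clean.sf
```

Writing to a new file rather than editing in place keeps the original salmon output available in case the pattern needs adjusting.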

3.9 years ago
e.rempel ★ 1.0k

It looks like they are different after all, since ENST00000373031.4 is not ENST00000373031.4|ENSG00000000005.5|OTTHUMG00000022001.1|OTTHUMT00000057481.1|TNMD-201|TNMD|1339|protein_coding|. You could split the names in the XXX.quant.sf files with limma::strsplit2, using "|" as the separator.

I'm a bit stuck here, can you please let me know how to do that? or at which point? Thanks!

After you have checked that all the files exist, you could do something like

# quant: the table read from 1Q_S1.quant.sf, with the transcript names as row names
rownames(quant) <- limma::strsplit2(rownames(quant), split = "|", fixed = TRUE)[, 1]


meaning that you split the row names on | and keep only the first entry.

I have a follow up question...

If I'm using file.path to import all my quant.sf files into R, is there a way of correcting this space issue for all files? I'm getting the same error message and I know it is because of the lack of a space between my transcript_id and the "|".

dir <- "/mnt/data/BM/Total_RNAseq/salmon/protein_coding"

files <- file.path(dir, samplefile$sampleID, "quant.sf")

annotation_transcript <- elementMetadata(import(gtf_file, feature.type = "transcript"))

tx2gene <- annotation_transcript[,c("transcript_id", "gene_id")]

txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene)
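One way to handle every sample at once is to clean the quant.sf files in the shell before importing them. The sketch below assumes one sub-directory per sample under a common directory. The salmon_demo directory, the sample names, and the toy rows are hypothetical stand-ins for the real layout, and quant.clean.sf is an arbitrary name for the cleaned copy.

```shell
#!/bin/sh
# Toy layout: one directory per sample, each holding a quant.sf (hypothetical names and values).
for s in sampleA sampleB; do
  mkdir -p "salmon_demo/$s"
  printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\n' > "salmon_demo/$s/quant.sf"
  printf 'ENST00000485971.1|ENSG00000000005.5|TNMD-202|TNMD|542|processed_transcript|\t542\t360.895\t0\t0\n' >> "salmon_demo/$s/quant.sf"
done

# Write a cleaned copy next to each original; the originals are left untouched.
for f in salmon_demo/*/quant.sf; do
  sed -e 's/|[^[:space:]]*//' "$f" > "${f%/quant.sf}/quant.clean.sf"
done

head -n 2 salmon_demo/*/quant.clean.sf
```

The file.path call would then point at quant.clean.sf instead of quant.sf. Two alternatives avoid editing files at all: tximport has an ignoreAfterBar argument (ignoreAfterBar = TRUE) that splits the transcript name on | during import, and salmon index has a --gencode flag that trims the names when the index is built.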


3.9 years ago
Lila M ★ 1.1k

problem solved! Thanks!