Question: Error when running tximport on salmon files
Lila M (370, UK) wrote 10 days ago:

Hi guys, I'm trying to analyze some RNA-seq data using salmon as follows:

#create the index:
salmon index -t gencode.v27.transcripts.fa -i human_index

#create the quant.sf files:
salmon quant -i human_index/ -l OSR -1 R1.fastq -2 R2.fastq -o salmon_quant

After that, my idea is to process all the files (1Q_S1_quant.sf, 2Q_S2_quant.sf, ..., 16Q_S16_quant.sf) in R for downstream analysis with DESeq2. To do that, I've tried:

library(GenomicFeatures)
library(tximport)
library(readr)
library(rjson)

## Create a transcript-to-gene matching table (tx2gene) that will be used to aggregate
## Salmon transcript quantifications to the gene level

txdb <- makeTxDbFromGFF("gencode.v27.annotation.gtf")
k <- keys(txdb, keytype = "GENEID")
df <- select(txdb, keys = k,  columns = "TXNAME", keytype = "GENEID")
tx2gene <- df[, 2:1]
head(tx2gene)

## load salmon files
files <- list.files(pattern = "quant.sf", full.names = TRUE)
names(files) <- paste0("sample", 1:16)
all(file.exists(files))
#TRUE

txi_salmon <- tximport(files = files, type = "salmon", txOut = FALSE, tx2gene = tx2gene)
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 
Error in summarizeToGene(txi, tx2gene, ignoreTxVersion, countsFromAbundance) : 
    None of the transcripts in the quantification files are present
  in the first column of tx2gene. Check to see that you are using
  the same annotation for both.

But that does not seem to be true: I looked in both files (quant.sf and tx2gene), and the same transcript for the same gene is present in both, e.g.:

#tx2gene
TXNAME                    GENEID
ENST00000373031.4   ENSG00000000005.5
ENST00000485971.1   ENSG00000000005.5

#1Q_S1.quant.sf
ENST00000373031.4|ENSG00000000005.5|OTTHUMG00000022001.1|OTTHUMT00000057481.1|TNMD-201|TNMD|1339|protein_coding|    1339    1156.86 0   0
ENST00000485971.1|ENSG00000000005.5|OTTHUMG00000022001.1|OTTHUMT00000057482.1|TNMD-202|TNMD|542|processed_transcript|   542 360.895 0   0

Any suggestions about what's going on with this funny error?

Thanks!

modified 10 days ago • written 10 days ago by Lila M (370)

Hint: compare the first columns of the two files you posted. You'll note that they're not exactly the same. That's causing the error.

written 10 days ago by Devon Ryan (71k)
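
To make the hint concrete, here is a minimal shell sketch that counts how many transcript IDs in the first column of a quant.sf file still carry the extra `|`-separated GENCODE fields. It assumes the file is tab-separated with a header line and the transcript name in column one, as in the excerpt in the question. The function name `count_barred_ids` and the toy file are made up for illustration only.

```shell
#!/bin/sh
# Count the data rows whose first column (the transcript name) contains a '|'.
count_barred_ids() {
    # $1: path to a tab-separated quant.sf file; the header line is skipped
    awk -F '\t' 'NR > 1 && index($1, "|") > 0 { n++ } END { print n + 0 }' "$1"
}

# Toy stand-in for a real quant.sf (hypothetical, shortened IDs, two data rows).
demo_dir=$(mktemp -d)
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\n' > "$demo_dir/quant.sf"
printf 'ENST00000373031.4|ENSG00000000005.5|TNMD-201|\t1339\t1156.86\t0\t0\n' >> "$demo_dir/quant.sf"
printf 'ENST00000485971.1|ENSG00000000005.5|TNMD-202|\t542\t360.895\t0\t0\n' >> "$demo_dir/quant.sf"

# Any non-zero count means the names cannot match the plain ENST IDs in tx2gene.
count_barred_ids "$demo_dir/quant.sf"
```

A count above zero is consistent with the tximport error: the first column of tx2gene holds only the bare ENST identifier, so no name matches exactly.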

Hi Devon, can you explain how I can solve it? Thanks!

written 10 days ago by Lila M (370)

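You can probably do something like sed -e 's/|[^[:space:]]*//' 1Q_S1.quant.sf, which removes everything from the first | up to the end of the transcript name and leaves the other columns untouched.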

modified 10 days ago • written 10 days ago by Devon Ryan (71k)
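
As a sketch of the same idea that only ever touches the first column, the snippet below uses awk to cut the transcript name at the first `|` and write the result to a new file. It assumes a tab-separated quant.sf with the name in column one. The function name `strip_after_bar` and the file names are hypothetical, chosen only for this example.

```shell
#!/bin/sh
# Trim column 1 of a tab-separated quant.sf at the first '|'; other columns are kept as-is.
strip_after_bar() {
    # $1: input quant.sf, $2: output path (the input file is not modified)
    awk 'BEGIN { FS = OFS = "\t" } { sub(/[|].*/, "", $1); print }' "$1" > "$2"
}

# Toy input (hypothetical, shortened ID) to show the effect on one data row.
strip_dir=$(mktemp -d)
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\n' > "$strip_dir/in.sf"
printf 'ENST00000373031.4|ENSG00000000005.5|TNMD-201|\t1339\t1156.86\t0\t0\n' >> "$strip_dir/in.sf"

strip_after_bar "$strip_dir/in.sf" "$strip_dir/out.sf"
cat "$strip_dir/out.sf"
```

Writing to a separate output file keeps the original salmon results intact. The same function could be called in a loop over all sixteen quant.sf files before running tximport.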
e.rempel (510; Germany, Heidelberg, COS) wrote 10 days ago:

It looks like they are different after all, since ENST00000373031.4 is not the same as ENST00000373031.4|ENSG00000000005.5|OTTHUMG00000022001.1|OTTHUMT00000057481.1|TNMD-201|TNMD|1339|protein_coding|. You could split the names in the XXX.quant.sf files with limma::strsplit2, using "|" as the separator.

written 10 days ago by e.rempel (510)

I'm a bit stuck here. Can you please let me know how to do that, and at which point? Thanks!

written 10 days ago by Lila M (370)

After you have checked that all files exist, you could do something like

rownames(quant) <- limma::strsplit2(rownames(quant), split = "|", fixed = TRUE)[, 1]

where quant is a placeholder for one quant.sf table read into R with the transcript names as row names (1Q_S1.quant.sf itself is not a valid R object name). This splits your row names, taking | as the separator, and keeps only the first entry.

written 10 days ago by e.rempel (510)
Lila M (370, UK) wrote 10 days ago:

Problem solved! Thanks!

modified 10 days ago • written 10 days ago by Lila M (370)