Question

Discrepancy between abundance.tsv and tx2gene.csv


6.4 years ago

Mozart ▴ 330

I am testing the Kallisto/DESeq2 pipeline and I am struggling with tximport: before launching DESeq2, I need to prepare the tables obtained in the analysis so far. For each sample I have an abundance.tsv file, and I need to combine(?) it with a .csv file that I created ad hoc (with known gene/transcript mappings). There is a discrepancy between the two annotations. For example, in my abundance file I have something like this:

ENSMUST00000103493.2

but I would like to obtain something like this:

ENSMUST00000103493

so that it is recognised in my transcript2gene.csv file.

Here is my code:

dir <- system.file("extdata", package = "tximportData")
list.files(dir)
samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
library(GenomicFeatures)

txdb <- select(org.Mm.eg.db, keys(org.Mm.eg.db), "ACCNUM")
txdb
k <- keys(txdb, keytype = "GENEID")
k
df <- select(txdb, keys = k, keytype = "GENEID", columns = "TXNAME")
df

'select()' returned 1:many mapping between keys and columns

tx2gene <- df[, 2:1]
head(tx2gene)

#  TXNAME             GENEID
#1 ENSMUST00000000001 ENSMUSG00000000001
#2 ENSMUST00000000003 ENSMUSG00000000003
#3 ENSMUST00000114041 ENSMUSG00000000003
#4 ENSMUST00000000028 ENSMUSG00000000028
#5 ENSMUST00000096990 ENSMUSG00000000028
#6 ENSMUST00000115585 ENSMUSG00000000028

Then I write the results to a CSV file:

write.csv(tx2gene, file = "/tx2gene.csv")

files <- file.path(dir, "kallisto", samples$run, "abundance.tsv")
names(files) <- paste0("sample", 1:6)
txi.kallisto.tsv <- tximport(files, type = "kallisto", tx2gene = tx2gene)
head(txi.kallisto.tsv$counts)

Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
1 2 3 4 5 6 
Error in summarizeToGene(txi, tx2gene, ignoreTxVersion, countsFromAbundance) : 

  None of the transcripts in the quantification files are present
  in the first column of tx2gene. Check to see that you are using
  the same annotation for both.

Any useful hints?

RNA-Seq • 3.1k views


score 3 · Accepted Answer · 2017-11-16


6.4 years ago

erwan.scaon ▴ 940

To convert ENSMUST00000103493.2 -> ENSMUST00000103493 in your Kallisto abundance.tsv files, you can do the following:

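for f in *.tsv; do
  # Strip the trailing version suffix (e.g. ".2") from column 1; the header row (NR == 1) is left untouched.
  # File names are quoted so that paths containing spaces still work.
  awk -F '\t' -v OFS='\t' 'NR > 1 {sub(/\.[0-9]+$/, "", $1)} 1' "$f" > "${f%.tsv}_renamed.tsv"
done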
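
To see what this kind of awk substitution does before running it on real data, here is a minimal sketch on a toy file. The scratch directory, the file contents, and the numeric values are made up for illustration; only the column layout (target_id, length, eff_length, est_counts, tpm) follows what Kallisto writes to abundance.tsv.

```shell
# Hypothetical scratch directory holding a toy abundance.tsv (made-up values)
workdir=$(mktemp -d)
printf 'target_id\tlength\teff_length\test_counts\ttpm\n' > "$workdir/abundance.tsv"
printf 'ENSMUST00000103493.2\t1500\t1321.5\t42\t3.7\n' >> "$workdir/abundance.tsv"
printf 'ENSMUST00000000001.4\t3262\t3083.5\t7\t0.3\n' >> "$workdir/abundance.tsv"

# Remove a trailing ".<digits>" version suffix from column 1 on every line except the header
awk -F '\t' -v OFS='\t' 'NR > 1 {sub(/\.[0-9]+$/, "", $1)} 1' \
  "$workdir/abundance.tsv" > "$workdir/abundance_renamed.tsv"

# Print the first column of the result to check the IDs
cut -f1 "$workdir/abundance_renamed.tsv"
```

The last command should print target_id, ENSMUST00000103493 and ENSMUST00000000001, one per line, with the other columns unchanged. Note also that the error message in the question lists an ignoreTxVersion argument to summarizeToGene, so it may be worth checking the tximport documentation to see whether that option makes renaming the files unnecessary.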