Question

Generate gene level TPM from quant.sf file generated in salmon

0

Entering edit mode

4.7 years ago

pyrhana25 • 0

Hi,

I have a quant.sf table generated from salmon:

Name Length EffectiveLength TPM NumReads ENST00000632684.1 12 13.000 0.000000 0.000 ENST00000434970.2 9 10.000 0.000000 0.000 ENST00000448914.1 13 14.000 0.000000 0.000 ENST00000415118.1 8 9.000 0.000000 0.000 ENST00000390567.1 20 21.000 0.000000 0.000 ENST00000439842.1 11 12.000 0.000000 0.000 ENST00000454908.1 17 18.000 0.000000 0.000 ENST00000390583.1 31 32.000 0.000000 0.000 ENST00000390572.1 28 29.000 0.000000 0.000

I need to convert this to the gene level and also to have the identifier as gene symbols.

I realise you can use txtimport ( https://www.hadriengourle.com/tutorials/rna/ ) to do the former. Although the script is set up for multiple samples with a sample table etc. I only have this one table that I need to convert from the transcript level to the gene level.

Any help from anyone who can see an easy way to do this using a gtf file would be appreciated.

Best wishes,

Jason

RNA-Seq salmon txtimport R TPM • 3.7k views

ADD COMMENT • link updated 4.7 years ago by i.sudbery 19k • written 4.7 years ago by pyrhana25 • 0

score 3 · Answer 1 · 2019-09-04

The TPM of a gene is simply the sum of the TPMs of the transcripts.

What you need is a table mapping the transcript_ids to gene symbols, and a simple, group, summerise(sum) operation.

The easiest way to do this would be to use R, dplyr and the "EnsDb" pacakges (for example, EnsDb.Hsapiens.v86. This is a bit out of date now and there is probably a more recent one)

Then you would do something like:

library(EnsDb.Hsapiens.v86)
library(dplyr)
salmon_output = read.delim("quant.sf")
tx2gene = transcripts(EnsDb.Hsapiens.v86, 
                      columns=c("tx_id", "gene_name", "gene_id"), 
                      return.type="DataFrame")
gene_tpms <- salmon_output %>%
    inner_join(tx2gene, by=c("Name"="tx_id")) %>%
    group_by(gene_id, gene_name) %>%
    summerise(TPM=sum(TPM))

If you REALLY wanted to do it from the GTF, then you'd need to extract a list of transcript_ids and symbols from the GTF. I guess some thing like:

  grep gene_name ensembl.gtf \
| grep transcript_id \
| sed -E 's/.+transcript_id \"(ENST[0-9]+)\";.*gene_name \"([^\"]+)\";.+/\1\t\2/' > tx2gene.tev

would do the trick, and then use dplyr to sum it like above.