Question: Generate gene level TPM from quant.sf file generated in salmon
gravatar for pyrhana25
12 months ago by
pyrhana250 wrote:


I have a quant.sf table generated from salmon:

Name Length EffectiveLength TPM NumReads ENST00000632684.1 12 13.000 0.000000 0.000 ENST00000434970.2 9 10.000 0.000000 0.000 ENST00000448914.1 13 14.000 0.000000 0.000 ENST00000415118.1 8 9.000 0.000000 0.000 ENST00000390567.1 20 21.000 0.000000 0.000 ENST00000439842.1 11 12.000 0.000000 0.000 ENST00000454908.1 17 18.000 0.000000 0.000 ENST00000390583.1 31 32.000 0.000000 0.000 ENST00000390572.1 28 29.000 0.000000 0.000

I need to convert this to the gene level and also to have the identifier as gene symbols.

I realise you can use txtimport ( ) to do the former. Although the script is set up for multiple samples with a sample table etc. I only have this one table that I need to convert from the transcript level to the gene level.

Any help from anyone who can see an easy way to do this using a gtf file would be appreciated.

Best wishes,


tpm rna-seq salmon txtimport R • 1.1k views
ADD COMMENTlink modified 12 months ago by i.sudbery9.1k • written 12 months ago by pyrhana250
gravatar for i.sudbery
12 months ago by
Sheffield, UK
i.sudbery9.1k wrote:

The TPM of a gene is simply the sum of the TPMs of the transcripts.

What you need is a table mapping the transcript_ids to gene symbols, and a simple, group, summerise(sum) operation.

The easiest way to do this would be to use R, dplyr and the "EnsDb" pacakges (for example, EnsDb.Hsapiens.v86. This is a bit out of date now and there is probably a more recent one)

Then you would do something like:

salmon_output = read.delim("quant.sf")
tx2gene = transcripts(EnsDb.Hsapiens.v86, 
                      columns=c("tx_id", "gene_name", "gene_id"), 
gene_tpms <- salmon_output %>%
    inner_join(tx2gene, by=c("Name"="tx_id")) %>%
    group_by(gene_id, gene_name) %>%

If you REALLY wanted to do it from the GTF, then you'd need to extract a list of transcript_ids and symbols from the GTF. I guess some thing like:

  grep gene_name ensembl.gtf \
| grep transcript_id \
| sed -E 's/.+transcript_id \"(ENST[0-9]+)\";.*gene_name \"([^\"]+)\";.+/\1\t\2/' > tx2gene.tev

would do the trick, and then use dplyr to sum it like above.

ADD COMMENTlink modified 12 months ago • written 12 months ago by i.sudbery9.1k

When I try solution1, I always get the error

   Error in UseMethod("tbl_vars") :  no applicable method for 'tbl_vars' applied to an object of class "c('DataFrame', 'DataTable', 'SimpleList', 'DataTable_OR_NULL', 'List', 'Vector', 'list_OR_List', 'Annotated')"
ADD REPLYlink written 9 weeks ago by rykerklie710
tx2gene = transcripts(EnsDb.Hsapiens.v86, 
                  columns=c("tx_id", "gene_name", "gene_id"), 

Should be

tx2gene = transcripts(EnsDb.Hsapiens.v86, 
                      columns=c("tx_id", "gene_name", "gene_id"), 

The return type was one that dplyr couldn't operate on.

ADD REPLYlink written 9 weeks ago by i.sudbery9.1k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1198 users visited in the last hour