Generate gene level TPM from quant.sf file generated in salmon
1
0
Entering edit mode
4.7 years ago
pyrhana25 • 0

Hi,

I have a quant.sf table generated from salmon:

Name Length EffectiveLength TPM NumReads ENST00000632684.1 12 13.000 0.000000 0.000 ENST00000434970.2 9 10.000 0.000000 0.000 ENST00000448914.1 13 14.000 0.000000 0.000 ENST00000415118.1 8 9.000 0.000000 0.000 ENST00000390567.1 20 21.000 0.000000 0.000 ENST00000439842.1 11 12.000 0.000000 0.000 ENST00000454908.1 17 18.000 0.000000 0.000 ENST00000390583.1 31 32.000 0.000000 0.000 ENST00000390572.1 28 29.000 0.000000 0.000

I need to convert this to the gene level and also to have the identifier as gene symbols.

I realise you can use txtimport ( https://www.hadriengourle.com/tutorials/rna/ ) to do the former. Although the script is set up for multiple samples with a sample table etc. I only have this one table that I need to convert from the transcript level to the gene level.

Any help from anyone who can see an easy way to do this using a gtf file would be appreciated.

Best wishes,

Jason

RNA-Seq salmon txtimport R TPM • 3.7k views
ADD COMMENT
3
Entering edit mode
4.7 years ago

The TPM of a gene is simply the sum of the TPMs of the transcripts.

What you need is a table mapping the transcript_ids to gene symbols, and a simple, group, summerise(sum) operation.

The easiest way to do this would be to use R, dplyr and the "EnsDb" pacakges (for example, EnsDb.Hsapiens.v86. This is a bit out of date now and there is probably a more recent one)

Then you would do something like:

library(EnsDb.Hsapiens.v86)
library(dplyr)
salmon_output = read.delim("quant.sf")
tx2gene = transcripts(EnsDb.Hsapiens.v86, 
                      columns=c("tx_id", "gene_name", "gene_id"), 
                      return.type="DataFrame")
gene_tpms <- salmon_output %>%
    inner_join(tx2gene, by=c("Name"="tx_id")) %>%
    group_by(gene_id, gene_name) %>%
    summerise(TPM=sum(TPM))

If you REALLY wanted to do it from the GTF, then you'd need to extract a list of transcript_ids and symbols from the GTF. I guess some thing like:

  grep gene_name ensembl.gtf \
| grep transcript_id \
| sed -E 's/.+transcript_id \"(ENST[0-9]+)\";.*gene_name \"([^\"]+)\";.+/\1\t\2/' > tx2gene.tev

would do the trick, and then use dplyr to sum it like above.

ADD COMMENT
0
Entering edit mode

When I try solution1, I always get the error

   Error in UseMethod("tbl_vars") :  no applicable method for 'tbl_vars' applied to an object of class "c('DataFrame', 'DataTable', 'SimpleList', 'DataTable_OR_NULL', 'List', 'Vector', 'list_OR_List', 'Annotated')"
ADD REPLY
1
Entering edit mode
tx2gene = transcripts(EnsDb.Hsapiens.v86, 
                  columns=c("tx_id", "gene_name", "gene_id"), 
                  return.type="DataFrame")

Should be

tx2gene = transcripts(EnsDb.Hsapiens.v86, 
                      columns=c("tx_id", "gene_name", "gene_id"), 
                      return.type="data.frame")

The return type was one that dplyr couldn't operate on.

ADD REPLY

Login before adding your answer.

Traffic: 1635 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6