Question: Generate gene level TPM from quant.sf file generated in salmon
0
gravatar for pyrhana25
12 months ago by
pyrhana250
pyrhana250 wrote:

Hi,

I have a quant.sf table generated from salmon:

Name Length EffectiveLength TPM NumReads ENST00000632684.1 12 13.000 0.000000 0.000 ENST00000434970.2 9 10.000 0.000000 0.000 ENST00000448914.1 13 14.000 0.000000 0.000 ENST00000415118.1 8 9.000 0.000000 0.000 ENST00000390567.1 20 21.000 0.000000 0.000 ENST00000439842.1 11 12.000 0.000000 0.000 ENST00000454908.1 17 18.000 0.000000 0.000 ENST00000390583.1 31 32.000 0.000000 0.000 ENST00000390572.1 28 29.000 0.000000 0.000

I need to convert this to the gene level and also to have the identifier as gene symbols.

I realise you can use txtimport ( https://www.hadriengourle.com/tutorials/rna/ ) to do the former. Although the script is set up for multiple samples with a sample table etc. I only have this one table that I need to convert from the transcript level to the gene level.

Any help from anyone who can see an easy way to do this using a gtf file would be appreciated.

Best wishes,

Jason

tpm rna-seq salmon txtimport R • 1.1k views
ADD COMMENTlink modified 12 months ago by i.sudbery9.1k • written 12 months ago by pyrhana250
3
gravatar for i.sudbery
12 months ago by
i.sudbery9.1k
Sheffield, UK
i.sudbery9.1k wrote:

The TPM of a gene is simply the sum of the TPMs of the transcripts.

What you need is a table mapping the transcript_ids to gene symbols, and a simple, group, summerise(sum) operation.

The easiest way to do this would be to use R, dplyr and the "EnsDb" pacakges (for example, EnsDb.Hsapiens.v86. This is a bit out of date now and there is probably a more recent one)

Then you would do something like:

library(EnsDb.Hsapiens.v86)
library(dplyr)
salmon_output = read.delim("quant.sf")
tx2gene = transcripts(EnsDb.Hsapiens.v86, 
                      columns=c("tx_id", "gene_name", "gene_id"), 
                      return.type="DataFrame")
gene_tpms <- salmon_output %>%
    inner_join(tx2gene, by=c("Name"="tx_id")) %>%
    group_by(gene_id, gene_name) %>%
    summerise(TPM=sum(TPM))

If you REALLY wanted to do it from the GTF, then you'd need to extract a list of transcript_ids and symbols from the GTF. I guess some thing like:

  grep gene_name ensembl.gtf \
| grep transcript_id \
| sed -E 's/.+transcript_id \"(ENST[0-9]+)\";.*gene_name \"([^\"]+)\";.+/\1\t\2/' > tx2gene.tev

would do the trick, and then use dplyr to sum it like above.

ADD COMMENTlink modified 12 months ago • written 12 months ago by i.sudbery9.1k

When I try solution1, I always get the error

   Error in UseMethod("tbl_vars") :  no applicable method for 'tbl_vars' applied to an object of class "c('DataFrame', 'DataTable', 'SimpleList', 'DataTable_OR_NULL', 'List', 'Vector', 'list_OR_List', 'Annotated')"
ADD REPLYlink written 9 weeks ago by rykerklie710
1
tx2gene = transcripts(EnsDb.Hsapiens.v86, 
                  columns=c("tx_id", "gene_name", "gene_id"), 
                  return.type="DataFrame")

Should be

tx2gene = transcripts(EnsDb.Hsapiens.v86, 
                      columns=c("tx_id", "gene_name", "gene_id"), 
                      return.type="data.frame")

The return type was one that dplyr couldn't operate on.

ADD REPLYlink written 9 weeks ago by i.sudbery9.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1198 users visited in the last hour