Question: Generate gene level TPM from quant.sf file generated in salmon
0
gravatar for pyrhana25
5 months ago by
pyrhana250
pyrhana250 wrote:

Hi,

I have a quant.sf table generated from salmon:

Name Length EffectiveLength TPM NumReads ENST00000632684.1 12 13.000 0.000000 0.000 ENST00000434970.2 9 10.000 0.000000 0.000 ENST00000448914.1 13 14.000 0.000000 0.000 ENST00000415118.1 8 9.000 0.000000 0.000 ENST00000390567.1 20 21.000 0.000000 0.000 ENST00000439842.1 11 12.000 0.000000 0.000 ENST00000454908.1 17 18.000 0.000000 0.000 ENST00000390583.1 31 32.000 0.000000 0.000 ENST00000390572.1 28 29.000 0.000000 0.000

I need to convert this to the gene level and also to have the identifier as gene symbols.

I realise you can use txtimport ( https://www.hadriengourle.com/tutorials/rna/ ) to do the former. Although the script is set up for multiple samples with a sample table etc. I only have this one table that I need to convert from the transcript level to the gene level.

Any help from anyone who can see an easy way to do this using a gtf file would be appreciated.

Best wishes,

Jason

tpm rna-seq salmon txtimport R • 588 views
ADD COMMENTlink modified 5 months ago by i.sudbery7.0k • written 5 months ago by pyrhana250
1
gravatar for i.sudbery
5 months ago by
i.sudbery7.0k
Sheffield, UK
i.sudbery7.0k wrote:

The TPM of a gene is simply the sum of the TPMs of the transcripts.

What you need is a table mapping the transcript_ids to gene symbols, and a simple, group, summerise(sum) operation.

The easiest way to do this would be to use R, dplyr and the "EnsDb" pacakges (for example, EnsDb.Hsapiens.v86. This is a bit out of date now and there is probably a more recent one)

Then you would do something like:

library(EnsDb.Hsapiens.v86)
library(dplyr)
salmon_output = read.delim("quant.sf")
tx2gene = transcripts(EnsDb.Hsapiens.v86, 
                      columns=c("tx_id", "gene_name", "gene_id"), 
                      return.type="DataFrame")
gene_tpms <- salmon_output %>%
    inner_join(tx2gene, by=c("Name"="tx_id")) %>%
    group_by(gene_id, gene_name) %>%
    summerise(TPM=sum(TPM))

If you REALLY wanted to do it from the GTF, then you'd need to extract a list of transcript_ids and symbols from the GTF. I guess some thing like:

  grep gene_name ensembl.gtf \
| grep transcript_id \
| sed -E 's/.+transcript_id \"(ENST[0-9]+)\";.*gene_name \"([^\"]+)\";.+/\1\t\2/' > tx2gene.tev

would do the trick, and then use dplyr to sum it like above.

ADD COMMENTlink modified 5 months ago • written 5 months ago by i.sudbery7.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1779 users visited in the last hour