Question: Using transcripts per million (TPM)
Tom Harrop (IRD, Montpellier, France) wrote, 5.3 years ago:

Hi BioStars,

I have two questions about using TPM (transcripts per million). I've read some papers on the calculation and some blog and forum posts, so I have some understanding of what it is. The actual analysis for this experiment was done with raw counts and VST expression values, and I'm basically just having a look at TPM out of interest.

My questions:

1. Is it valid to calculate TPM from DESeq2's normalised counts, i.e. counts(dds, normalized = TRUE), or do I have to use the raw counts? I tried both and there didn't seem to be a great deal of difference (in fact, for the genes I've looked at, my TPM results in either case aren't that different from the normalised counts themselves), but I haven't tested it thoroughly.

2. I understand why one shouldn't compare TPM between samples, since the total expression rates, rRNA component etc. vary from sample to sample. I'm just wondering if this would be less of a problem when data from three biological replicates are available?

Thanks for reading and have a nice Friday,


Tags: rna-seq, tpm, deseq2
karl.stamm (United States) wrote, 5.3 years ago:

For question 1) TPM is not a read count. Normalized read counts scale for the sample's sequencing depth, whereas TPM is about transcripts: it is inferred by a model that accounts for long genes getting more reads and uses spliced reads to infer isoform usage. In that way it's like Cufflinks' Cuffnorm for FPKM. The only tool I know that makes TPM is RSEM.

For question 2) comparing different kinds of samples will suffer bias if the distribution of mRNAs is very different, but biological replicates are as close as possible, so that IS the appropriate place to compare values. 


Hi Karl,

Thanks for the reply.

I'm sorry if my first question wasn't clear. I realise TPM is not read count—I manually calculated TPM from normalised read count (and, separately, from raw read count) using the gene lengths from my GTF file. I don't know whether it's valid to use the normalised counts instead of the raw counts in the TPM calculation.
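
On whether normalised or raw counts can go into the TPM calculation, here is a minimal Python sketch (the thread itself uses R). It assumes the normalised counts are simply the raw counts divided by one size factor per sample, which is what counts(dds, normalized = TRUE) returns when only size factors are in use. Under that assumption the factor cancels when each sample is rescaled to sum to one million, so both inputs should give the same TPM. This would not hold with gene-specific normalisation factors. The function name `tpm`, the toy counts and lengths, and the size factor of 1.37 are all made up for illustration.

```python
import math


def tpm(counts, lengths_bp):
    """TPM for one sample: length-normalise, then rescale to sum to one million."""
    rates = [c / l for c, l in zip(counts, lengths_bp)]  # reads per base
    total = sum(rates)
    return [1e6 * r / total for r in rates]


# Hypothetical toy data for a single sample (not from the thread).
raw_counts = [100, 250, 30, 1200]
gene_lengths = [1000, 2500, 600, 4000]
size_factor = 1.37  # stand-in for one DESeq2 size factor (assumed value)

normalised_counts = [c / size_factor for c in raw_counts]

tpm_raw = tpm(raw_counts, gene_lengths)
tpm_norm = tpm(normalised_counts, gene_lengths)

# The size factor sits in both the numerator and the denominator, so it cancels.
same = all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(tpm_raw, tpm_norm))
print(same)
```

The comparison is per sample: any constant that multiplies every count in one sample drops out of that sample's TPM.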


(reply written 5.3 years ago by Tom Harrop)


Could you please tell me which formula you used to calculate TPM manually?

I'm getting a little confused, since I'm trying to find an "unambiguous" one and I found these 3 links, which don't say exactly the same thing: ..., Kin, and Lynch (2012).pdf

I used RSEM to calculate expression, but I need a TPM estimate for a gene that I can't take from the RSEM output (don't ask, it's complicated :) )

In particular, using the formula from the Dewey presentation, TPM_i = 10^6 * Z * C_i / (L'_i * N), I'm trying to understand what exactly Z stands for. It should be a normalization parameter, so it should be the same for all transcripts (am I right?), but when I try to back out its value from the TPM values in the RSEM output (basically Z = TPM_i / (10^6 * C_i / (L_i * N))), I get different results for Z (the values oscillate a little around a constant number).
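
A rough Python sketch of what Z has to be if the TPM values are required to sum to one million: Z = 1 / sum_j (C_j / (L'_j * N)), the same number for every transcript. The toy counts and effective lengths are invented, N is taken here to be the total count, and the 100 bp offset used to mimic "full" lengths is purely an assumption. The last step illustrates one possible (not confirmed in this thread) reason for a Z that wobbles: backing it out with lengths other than the ones used to compute the TPM.

```python
import math

# Hypothetical toy values (not RSEM output): counts C_i and effective lengths L'_i.
counts = [120.0, 45.0, 300.0]
eff_lengths = [800.0, 1500.0, 2200.0]
n_reads = sum(counts)  # N, assumed here to be the total count

# Requiring sum_i TPM_i = 10^6 in TPM_i = 10^6 * Z * C_i / (L'_i * N)
# gives Z = 1 / sum_j (C_j / (L'_j * N)).
z = 1.0 / sum(c / (l * n_reads) for c, l in zip(counts, eff_lengths))
tpm = [1e6 * z * c / (l * n_reads) for c, l in zip(counts, eff_lengths)]

# Backing Z out with the same lengths recovers one constant for all transcripts.
z_back = [t / (1e6 * c / (l * n_reads))
          for t, c, l in zip(tpm, counts, eff_lengths)]

# Backing Z out with different lengths (assumed: effective length + 100 bp)
# gives a slightly different value per transcript.
full_lengths = [l + 100.0 for l in eff_lengths]
z_mixed = [t / (1e6 * c / (l * n_reads))
           for t, c, l in zip(tpm, counts, full_lengths)]

print(z, z_back, z_mixed)
```

Rounding in a tool's output columns would be another source of small per-transcript differences.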


(reply written 5.1 years ago by biola)

Hi, the formula I used in R was lifted from here (it's the same as the Wagner paper).

(reply written 5.1 years ago by Tom Harrop)

Thanks @biola. I need to normalize my htseq-count data to TPM. I read your code, but in my case I have a matrix with 20000 genes (rows) and 259 samples (columns). How do I apply your TPM function to that matrix?
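
For a genes x samples matrix, the same calculation can be applied to each column separately. Below is a minimal Python sketch, not the R function referred to above: `tpm_matrix` is a hypothetical helper, the matrix is assumed to be a list of rows (one per gene) with one gene length per row, and the numbers are toy values.

```python
import math


def tpm_matrix(count_rows, lengths_bp):
    """Column-wise TPM for a genes x samples matrix given as a list of rows."""
    n_samples = len(count_rows[0])
    # Length-normalise every entry: reads per base for each gene in each sample.
    rates = [[c / l for c in row] for row, l in zip(count_rows, lengths_bp)]
    # One scaling denominator per sample (column).
    col_sums = [sum(row[j] for row in rates) for j in range(n_samples)]
    return [[1e6 * r / col_sums[j] for j, r in enumerate(row)] for row in rates]


# Toy example: 3 genes (rows) x 3 samples (columns); values are made up.
counts = [[10, 0, 25],
          [200, 180, 90],
          [55, 60, 70]]
lengths = [500, 2000, 1100]

result = tpm_matrix(counts, lengths)
col_totals = [sum(row[j] for row in result) for j in range(len(counts[0]))]
print(col_totals)
```

Two things to watch with real data: the summary rows that htseq-count appends (names starting with `__`, such as `__no_feature`) should be dropped before the calculation, and a sample whose counts are all zero would cause a division by zero in this sketch.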

(reply written 5 months ago by modarzi120)

Sorry, if I have extracted a list of differentially expressed genes with edgeR, does it make sense to use Transcripts Per Million (TPM) normalized data for co-expression analysis? That is, I first identified DE genes from raw read counts with edgeR, but since I also had a TPM file, I pulled the TPM values for those DE genes from it and used them for network construction.

(reply written 2.3 years ago by A)

SP wrote, 4.0 years ago:

Just for the sake of putting TPM formula in readable format:

TPM for transcript n = ((tag count for transcript n * read length) / length of transcript n) * 1 million / normalizing term

normalizing term = sum over all transcripts of ((tag count for the transcript * read length) / length of the transcript)
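
The formula above, written out literally as a small Python sketch. `tpm_with_read_length` is a made-up name and the two-transcript example is invented. Because the read length multiplies both the numerator and the normalizing term, changing it should leave the result unchanged.

```python
import math


def tpm_with_read_length(tag_counts, transcript_lengths, read_length):
    """TPM following the formula above, with the read-length term kept explicit."""
    weighted = [n * read_length / l
                for n, l in zip(tag_counts, transcript_lengths)]
    normalizing_term = sum(weighted)
    return [w * 1e6 / normalizing_term for w in weighted]


# Two transcripts with equal tag counts; the second is four times longer.
tags = [50, 50]
lengths = [1000, 4000]

tpm_rl100 = tpm_with_read_length(tags, lengths, read_length=100)
tpm_rl50 = tpm_with_read_length(tags, lengths, read_length=50)

print(tpm_rl100)  # [800000.0, 200000.0]
print(tpm_rl50)   # [800000.0, 200000.0]
```

With equal tag counts, the shorter transcript gets the higher TPM, since the same number of tags is spread over fewer bases.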

For better understanding read RPKM inconsistencies with example and

This might also be useful
