Question: Calculate TPM from cuffnorm count data
2
21 months ago by
galozs20
galozs20 wrote:

Hi,

I am working with RNA-seq data and normalizing with Cufflinks' Cuffnorm, whose output is count and FPKM matrices. since I'm comparing gene expression between samples, I wanted to use TPM rather than FPKM. I was wondering about few issues: 1- Can I calculate TPM values directly from the FPKM values of Cuffnorm? 2- Can I calculate TPM from Cuffnorm's count table? Is it raw enough? 3- What is unique in Cuffnorm's FPKM vs. regular FPKM?

Thanks!

rna-seq cuffnorm tpm cufflinks fpkm • 1.5k views
modified 21 months ago by Macspider3.0k • written 21 months ago by galozs20
0
21 months ago by
Macspider3.0k
Vienna - BOKU
Macspider3.0k wrote:

To calculate TPM from FPKM use the formula proposed by Pachter et al, 2011, here:

$enter image description here$

To fully understand this formula: What the FPKM?

Thanks! Do you know if it is OK to do it with the FPKM output values of Cuffnorm? I used the formula that you mentioned and got very wired results when calculated correlation matrix. So I thought that maybe Cuffnorm's FPKM are already manipulated in a way...

From every software output you compute the FPKM, you should retrieve the same number, if you start from the same counts, otherwise that's some great bug in data science. The sum of the FPKM on the denominator of the fraction is belonging to the sample, not the whole data set. Does that help? Did you multiply by 10 to the power of 6? Many times I forget to do that...

That what I was thinking though the end values where very strange. (I checked that all the steps were done correctly according to the formula) these are pearson correlations of the log2 FPKM data:

and these are the pearson correlations of the log2 TPM: as you can see the TPM looks realy bad. Do you have an idea? where is my mistake?

TPM is not just a rearrangement of FPKM, it contains a sample-specific factor. Therefore, it might mean that your two samples are not that similar after all. However, you should still remove low values (scale the threshold from FPKM=1 to TPM=<val> for each of your samples).

What do you mean by "scale the threshold from FPKM=1 to TPM=<val> for each of your samples" ? both correlation matrix I attached were calculated separately on the FPKM / TPM log2 values. Do you mean that the noise threshold in is different for each one of the samples? For example if I used log2(FPKM) = 5 as a threshold than in TPM I will have a different one for each sample?

Thanks!

Exactly.

• If you look carefully, you'll see that the fraction includes the FPKM value of gene i as numerator, and the sum of all the FPKM values of sample j as denominator.
• This means that, for each sample, you'll have a different sum at the denominator

This being said, to obtain a sample-specific threshold, you will apply 5 as numerator and then calculate the sum of the FPKMs for each sample to divide 5 with. The fraction is then multiplied by 10 to the power of 6, giving you the TPM version of FPKM=5, relativized to the total sample expresssion, for each sample.