Converting TCGA expression data from FPKM to TPM
6.2 years ago

For a given cancer type in the NIH Cancer Genome Atlas, I visit the data portal and download the UNC RNASeqV2 level 3 expression data. Specifically, I grab the files whose names end in *.rsem.genes.normalized_results.

Each file contains one line per gene, with the gene name and (I assume) its normalized FPKM expression value. I am assuming these data are normalized FPKM based on the filename and the UNC RNASeqV2 protocol description hosted on TCGA.

My questions are:

1. Are these expression data really measured in FPKM?
2. If they are, how should I convert from FPKM to TPM, for all the expression values for a given gene?

tcga fpkm tpm expression rna-seq

You can't recover TPMs from gene-level FPKMs. The data on transcripts has already been lost.


I don't understand your comment. I quickly compared the FPKMs for a given gene and its transcripts and noticed (as one would expect) that the gene-level FPKM is the sum of the FPKMs of its transcripts. So I conclude that it does not really make a difference whether you calculate the TPM from gene-level or transcript-level FPKMs. Here is one example:

genes.fpkm_tracking:ENSG00000196092    ENSG00000196092    PAX5    29.5427

isoforms.fpkm_tracking:ENST00000358127    ENSG00000196092    PAX5    5.41329
isoforms.fpkm_tracking:ENST00000520154    ENSG00000196092    PAX5    2.55302e-10
isoforms.fpkm_tracking:ENST00000523241    ENSG00000196092    PAX5    2.06415e-12
isoforms.fpkm_tracking:ENST00000377840    ENSG00000196092    PAX5    8.02239e-16
isoforms.fpkm_tracking:ENST00000377852    ENSG00000196092    PAX5    0.561218
isoforms.fpkm_tracking:ENST00000377853    ENSG00000196092    PAX5    5.90173e-10
isoforms.fpkm_tracking:ENST00000523145    ENSG00000196092    PAX5    2.63708e-10
isoforms.fpkm_tracking:ENST00000446742    ENSG00000196092    PAX5    0.387949
isoforms.fpkm_tracking:ENST00000520281    ENSG00000196092    PAX5    0.00123482
isoforms.fpkm_tracking:ENST00000377847    ENSG00000196092    PAX5    20.8611
isoforms.fpkm_tracking:ENST00000522003    ENSG00000196092    PAX5    0.474374
isoforms.fpkm_tracking:ENST00000414447    ENSG00000196092    PAX5    0.995077
isoforms.fpkm_tracking:ENST00000523493    ENSG00000196092    PAX5    0.663389
isoforms.fpkm_tracking:ENST00000524340    ENSG00000196092    PAX5    1.70926e-82
isoforms.fpkm_tracking:ENST00000522932    ENSG00000196092    PAX5    0
isoforms.fpkm_tracking:ENST00000520083    ENSG00000196092    PAX5    0.185101
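The arithmetic in this example can be checked with a short Python sketch. The FPKM values are copied from the listing above; `total_fpkm` is a made-up sample total, used only to illustrate that dividing by one shared per-sample denominator gives the same TPM whether you sum the isoforms before or after the conversion.

```python
import math

# Gene-level FPKM for PAX5, copied from genes.fpkm_tracking above.
gene_fpkm = 29.5427

# Isoform-level FPKMs for PAX5, copied from isoforms.fpkm_tracking above.
isoform_fpkm = [
    5.41329, 2.55302e-10, 2.06415e-12, 8.02239e-16,
    0.561218, 5.90173e-10, 2.63708e-10, 0.387949,
    0.00123482, 20.8611, 0.474374, 0.995077,
    0.663389, 1.70926e-82, 0.0, 0.185101,
]

isoform_sum = sum(isoform_fpkm)
print(f"gene FPKM: {gene_fpkm}, sum of isoform FPKMs: {isoform_sum:.4f}")

# Hypothetical total FPKM over the whole sample (assumption, not from the thread).
total_fpkm = 250000.0

# Convert before or after summing; both use the same denominator.
gene_tpm = gene_fpkm / total_fpkm * 1e6
isoform_tpm_sum = sum(f / total_fpkm * 1e6 for f in isoform_fpkm)
print(f"gene TPM: {gene_tpm:.3f}, sum of isoform TPMs: {isoform_tpm_sum:.3f}")
```

This only holds if the gene-level and transcript-level totals for the sample agree, which they do when every gene-level FPKM is the sum of its isoform FPKMs.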


Are you sure you need the TPM (Transcripts Per Million) data? If you are fine with gene-level data, you should be OK with the files as they are.


I'd like the TPM data, if possible.


Quick question. If I have between-sample normalized FPKMs, do I sum the FPKMs of all the transcripts for a given gene within one sample, or do I sum them across that gene's transcripts in all the samples? With, say, three transcripts and two samples, those are two different calculations.


You need to post this as a new question and refer back to this thread if necessary. Each thread starts with a question followed by answers - new questions should not be posted in the answer section. That's what makes this site better than others. (Moderation: your answer will be moved to a comment)

6.2 years ago
h.mon 33k

At the end of this blog post, a simple formula is provided to compute TPM from FPKM:

TPM_i = ( FPKM_i / sum_j(FPKM_j) ) * 10^6
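As a minimal Python sketch of that formula, assuming the FPKM values for a single sample are held in a dictionary (the function name `fpkm_to_tpm` and the gene names and values are hypothetical, just for illustration):

```python
def fpkm_to_tpm(fpkm):
    """Rescale one sample's FPKM values so that they sum to one million."""
    total = sum(fpkm.values())
    if total <= 0:
        raise ValueError("FPKM values must sum to a positive number")
    return {name: value / total * 1e6 for name, value in fpkm.items()}


# Hypothetical FPKM values for one sample.
sample = {"geneA": 10.0, "geneB": 30.0, "geneC": 60.0}
tpm = fpkm_to_tpm(sample)

for name, value in tpm.items():
    print(f"{name}\t{value:.1f}")
```

The sum runs over all features in the same sample, so the conversion has to be done per sample, not per gene across samples.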

Edit: judging from the protocol you linked, and also from this wiki, the UNC V2 RNA-Seq workflow uses MapSplice + RSEM, so I guess the values are already given as TPM. Check here and here.


Thanks, the wiki link was a much better summary than what I had found previously.

4.2 years ago
fabio-verdao ▴ 20

Maybe it's a little bit old, but just for future access...

For your first question: 1. Are these expression data really measured in FPKM?

Following the wiki cited by @h.mon, both *.rsem.genes.normalized_results and *.rsem.isoforms.normalized_results contain normalized_count values (upper-quartile-normalized RSEM count estimates), not RPKM, FPKM, or TPM.
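To give an idea of what upper-quartile normalization means, here is a rough Python sketch. It is not the TCGA script: the function name, the scaling target of 1000, the exclusion of zero counts, and the quantile method are all assumptions made for illustration, and the counts are invented.

```python
import statistics


def upper_quartile_normalize(counts, target=1000.0):
    """Scale one sample's counts so the 75th percentile of its nonzero
    values equals `target` (assumed value, not confirmed by the thread)."""
    nonzero = [c for c in counts.values() if c > 0]
    if len(nonzero) < 2:
        raise ValueError("need at least two nonzero counts")
    # Third cut point of the quartiles is the 75th percentile.
    q75 = statistics.quantiles(nonzero, n=4)[2]
    return {gene: c / q75 * target for gene, c in counts.items()}


# Hypothetical RSEM count estimates for one sample.
counts = {"g1": 0.0, "g2": 10.0, "g3": 20.0, "g4": 40.0, "g5": 80.0}
normalized = upper_quartile_normalize(counts)

for gene, value in normalized.items():
    print(f"{gene}\t{value:.2f}")
```

Values scaled this way are not divided by transcript length, which is why they cannot be read as FPKM or TPM.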


Hello, so can I directly use the data in *.rsem.genes.normalized_results to do differential analysis, or pathway enrichment analysis, etc? Thanks!