Question: Can I use TCGA FPKM-UQ values directly to compare across samples without any preprocessing?
robles.daniela (United Kingdom) wrote, 2.4 years ago:

Dear all,

This is a newbie question :)

I'm building a linear model to identify significant predictors of mutation count/types in tumours from TCGA. I want to include expression levels of a couple of genes, but I am quite new to RNA-Seq analyses and best practices. TCGA provides RNA-Seq data at the gene level in three formats: HTSeq-counts, FPKM and FPKM-UQ. I have been reading (tutorials and the questions here) and asking around, and I have concluded that I can use FPKM-UQ values to compare across samples without any further pre-processing. Is this true? Or would you recommend pre-processing these values before comparing?

Thanks so much, Daniela

rna-seq tcga fpkm-uq • 4.2k views
modified 2.4 years ago • written 2.4 years ago by robles.daniela

Thanks so much both! I had seen that chart, Kevin, that is why I thought I could use FPKM-UQ directly. But given your advice and the paper Cindy sent over I will use HTSeq-counts and process through DESeq2 before doing any analyses. I will then compare with the results from using FPKM-UQ directly and post the results here when I have them.

Thanks both again! :)


written 2.4 years ago by robles.daniela

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

If @Kevin's answer is acceptable you can mark it so (green check mark) to provide closure to this thread.

written 2.4 years ago by genomax

Kevin Blighe wrote, 2.4 years ago:

Hi Daniela,

I would actually recommend the HTSeq counts, because these are raw counts. The normalisation method that produces FPKM values has come under criticism in recent years, and some sources no longer recommend it. The main issue is that FPKM involves no cross-sample normalisation: each sample is scaled using only its own sequencing depth, so comparing FPKM values across samples is akin to comparing multiple batches without correcting for batch.
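To make the per-sample scaling concrete, here is a minimal Python sketch of how FPKM and FPKM-UQ are commonly defined: FPKM divides by the sample's total count, FPKM-UQ by the sample's 75th-percentile gene count. The function names, the toy counts and the gene lengths are made up for illustration, and the quartile uses a simple nearest-rank rule, which is an assumption and not necessarily what the GDC pipeline does.

```python
import math


def fpkm(counts, lengths_bp):
    """FPKM per gene: count * 1e9 / (gene length in bp * total count in this sample)."""
    total = sum(counts)
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths_bp)]


def upper_quartile(counts):
    """75th percentile of the counts (nearest-rank rule; an assumption for this sketch)."""
    ordered = sorted(counts)
    rank = math.ceil(0.75 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]


def fpkm_uq(counts, lengths_bp):
    """FPKM-UQ per gene: same as FPKM, but scaled by the upper-quartile count."""
    uq = upper_quartile(counts)
    return [c * 1e9 / (length * uq) for c, length in zip(counts, lengths_bp)]


# Hypothetical toy sample: four genes with raw counts and lengths in base pairs.
toy_counts = [100, 200, 300, 400]
toy_lengths = [1000, 2000, 1000, 4000]

print(fpkm(toy_counts, toy_lengths))
print(fpkm_uq(toy_counts, toy_lengths))
```

Note that both functions only ever look at one sample at a time. Nothing in either formula uses information from the other samples in the cohort, which is what methods like TMM or DESeq2's size factors add.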

Use HTSeq counts and load these into DESeq2 or edgeR for downstream analyses.

I have recently analysed an entire TCGA RNA-seq dataset (>500 samples) and I used HTSeq counts. They work very well.



Update May 2, 2018:

The TCGA states that "To facilitate cross-sample comparison and differential expression analysis, the GDC also provides Upper Quartile normalized FPKM (UQ-FPKM) values and raw mapping count."

My original advice still stands, i.e., it is better to obtain the raw HTSeq counts (where available) and re-process them using an up-to-date normalisation method, such as TMM (edgeR) or the median-of-ratios method based on geometric means (DESeq2). Some TCGA datasets are only available as RSEM counts, which can also be used and imported into DESeq2 via tximport.
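For anyone curious what the "geometric" normalisation does under the hood, below is a minimal pure-Python sketch of median-of-ratios size factors, in the spirit of what DESeq2 estimates. It is an illustration only, not a replacement for the package: the function names and the toy matrix are hypothetical, and the real implementation handles many more edge cases.

```python
import math
import statistics


def size_factors(count_matrix):
    """Median-of-ratios size factors.

    count_matrix: list of genes, each a list of raw counts (one per sample).
    Returns one size factor per sample.
    """
    n_samples = len(count_matrix[0])
    # Keep only genes with a non-zero count in every sample, so that the
    # geometric mean (computed on the log scale) is defined.
    usable = [row for row in count_matrix if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has non-zero counts in all samples")
    # Log of each gene's geometric mean across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Log ratio of this sample's count to the gene's geometric mean.
        log_ratios = [math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(statistics.median(log_ratios)))
    return factors


def normalise(count_matrix, factors):
    """Divide each sample's counts by that sample's size factor."""
    return [[c / f for c, f in zip(row, factors)] for row in count_matrix]


# Hypothetical toy data: sample 2 was sequenced twice as deeply as sample 1.
toy = [[10, 20], [100, 200], [50, 100], [0, 3]]
factors = size_factors(toy)
print(factors)
print(normalise(toy, factors))
```

Because the factor for each sample is a median over many genes, a handful of strongly changing genes has little influence on it, which is the main practical difference from scaling by a single per-sample total.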

modified 5 months ago • written 2.4 years ago by Kevin Blighe

I agree with Kevin. There was a paper in 2012 comparing the state-of-the-art normalization techniques, and it stated that RPKM/FPKM should not be used and that the DESeq and TMM normalization methods worked best. Have a look at the paper:



written 2.4 years ago by cindy.perscheid

Dear Kevin, thanks so much for your answer! This is really helpful. I have one remaining question and would be grateful for your help: I thought FPKM-UQ was a modification of the FPKM normalisation designed precisely to allow cross-sample comparison. Is this not the case, then?

written 2.4 years ago by robles.daniela

Hi Daniela,

Yes, that is correct, and there are some other posts on Biostars about this topic, like: Differences between FPKM and FPKM-UQ files in gene expression analysis

My suggestion to use HTSeq raw counts is based on a few things:

  • by using raw counts, you have more control over the analysis (FPKM and FPKM-UQ should not be used with common differential expression analysis tools like DESeq2, edgeR, and limma, which expect raw counts). If you used FPKM, you would limit the number of tools/programs that you could use for downstream analyses
  • by not using anything derived from FPKM, you save yourself criticism that would undoubtedly come whenever you tried to publish your work (which could go so far as you having to re-analyse all data depending on the reviewers' comments and the journal involved)
  • by using raw counts, you will have a better opportunity to pick up new skills.

One golden rule in data analysis and bioinformatics is to always aim to get the data in its rawest possible form, so that you have the most control over how to analyse it. :)

All of this being said, if, at your institute, there are already defined pipelines for analysing FPKM-UQ data, then this may prove the best 'political' option for you.

I also find this very simple flow-chart quite useful in relation to your question:



modified 14 months ago • written 2.4 years ago by Kevin Blighe

"by not using anything derived from FPKM, you save yourself criticism that would undoubtedly come whenever you tried to publish your work"

That's a little extreme. There are a lot of FPKM-based papers in prestigious journals. It's not ideal, but criticism is unlikely.

Of course, it really depends on exactly what you are doing with these FPKMs.

modified 2.4 years ago • written 2.4 years ago by igor

I agree with you, but only if the reviewers and journal editors are not up to speed with data analysis normalisation methods, which is probably going to be true for clinically-focused journals where the bioinformatics methods may not even be mentioned or may only appear in the supplementary.

It has been stated in published literature and from various sources that FPKM/RPKM is not ideal. It produces unreliable statistics from differential expression analysis.

written 2.4 years ago by Kevin Blighe

An update (6th October 2018):

You should abandon RPKM / FPKM where cross-sample differential expression analysis is your aim; indeed, they render samples incomparable in such an analysis:

Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis

"The Total Count and RPKM [FPKM] normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis."

Also, by Harold Pimentel: What the FPKM? A review of RNA-Seq expression units

"The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one."
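The "relative measurement" point can be shown with a small idealised Python example. Everything here is hypothetical: the gene names, the abundances, and the noise-free `sequence` helper are made up for illustration. Gene length is left out because it is the same for a given gene in both samples and so cancels when that gene is compared across samples.

```python
def sequence(abundances, depth):
    """Idealised sequencing: reads are split in proportion to abundance (no noise)."""
    total = sum(abundances.values())
    return {gene: depth * amount / total for gene, amount in abundances.items()}


def per_million(counts):
    """Scale counts by the sample's own total, as the depth term of FPKM does."""
    total = sum(counts.values())
    return {gene: c * 1e6 / total for gene, c in counts.items()}


# Hypothetical absolute abundances: only gene E differs between the samples.
sample_1 = {"A": 100, "B": 100, "C": 100, "D": 100, "E": 100}
sample_2 = {"A": 100, "B": 100, "C": 100, "D": 100, "E": 600}

cpm_1 = per_million(sequence(sample_1, depth=1_000_000))
cpm_2 = per_million(sequence(sample_2, depth=1_000_000))

for gene in sample_1:
    print(gene, cpm_2[gene] / cpm_1[gene])
```

In this toy setup genes A to D have identical abundance in both samples, yet their depth-scaled values differ between samples, because gene E takes up a larger share of the reads in sample 2. Scaling each sample by its own total cannot undo that, which is why a between-sample normalisation step is needed.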

modified 18 months ago • written 19 months ago by Kevin Blighe