Question: Can I use TCGA FPKM-UQ values directly to compare across samples without any preprocessing?
11 days ago by robles.daniela (United Kingdom):

Dear all,

This is a newbie question :)

I'm building a linear model to identify significant predictors of mutation count/types in tumours from TCGA. I want to include the expression levels of a couple of genes, but I am quite new to RNA-Seq analysis and its best practices. TCGA provides gene-level RNA-Seq data in three formats: HTSeq-counts, FPKM and FPKM-UQ. From reading (tutorials and the questions here) and asking around, I have concluded that I can use FPKM-UQ values to compare across samples without any further pre-processing. Is this true? Or would you recommend pre-processing these values before comparing?

Thanks so much, Daniela

rna-seq tcga fpkm-uq • 223 views
modified 10 days ago • written 11 days ago by robles.daniela

Thanks so much both! I had seen that chart, Kevin, that is why I thought I could use FPKM-UQ directly. But given your advice and the paper Cindy sent over I will use HTSeq-counts and process through DESeq2 before doing any analyses. I will then compare with the results from using FPKM-UQ directly and post the results here when I have them.

Thanks both again! :)

Daniela

written 10 days ago by robles.daniela

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

If @Kevin's answer is acceptable you can mark it so (green check mark) to provide closure to this thread.

written 10 days ago by genomax

11 days ago by Kevin Blighe, Republic of Ireland (Éire):

Hi Daniela,

I would highly recommend the HTSeq counts, actually, because these are raw counts. The FPKM method of normalisation has come under criticism in recent years and some sources no longer recommend it. The main issue with FPKM normalisation is that it does no cross-sample normalisation, so comparing FPKM values across samples is akin to comparing multiple batches without correcting for batch.

Use HTSeq counts and load these into DESeq2 or edgeR for downstream analyses.

I have recently analysed an entire TCGA RNAseq dataset (>500 samples) and I used HTSeq counts. They work very well.
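
For anyone who wants to inspect the HTSeq count files before handing them to DESeq2 or edgeR, here is a minimal Python sketch of combining per-sample files into one gene-by-sample table. It assumes the usual htseq-count layout (one `gene_id<TAB>count` line per gene, followed by summary rows whose names start with `__`). The gene IDs, sample names and the helpers `read_htseq_counts` and `build_count_matrix` are made up for illustration and are not part of any package.

```python
import io


def read_htseq_counts(handle):
    """Parse one htseq-count output stream into {gene_id: count}."""
    counts = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        gene_id, value = line.split("\t")
        if gene_id.startswith("__"):
            continue  # summary rows such as __no_feature are not genes
        counts[gene_id] = int(value)
    return counts


def build_count_matrix(per_sample):
    """Combine {sample: {gene: count}} into (genes, samples, rows)."""
    samples = sorted(per_sample)
    # Keep only genes present in every sample so that all rows are complete.
    genes = sorted(set.intersection(*(set(c) for c in per_sample.values())))
    rows = [[per_sample[s][g] for s in samples] for g in genes]
    return genes, samples, rows


# Hypothetical toy data standing in for two downloaded count files.
sample_a = io.StringIO("ENSG_A\t10\nENSG_B\t0\n__no_feature\t5\n")
sample_b = io.StringIO("ENSG_A\t20\nENSG_B\t3\n__ambiguous\t1\n")
per_sample = {
    "tumour_1": read_htseq_counts(sample_a),
    "tumour_2": read_htseq_counts(sample_b),
}
genes, samples, rows = build_count_matrix(per_sample)
```

With real downloads you would open each file instead of using the `io.StringIO` stand-ins (if the files are gzipped, `gzip.open(path, "rt")` gives a text handle that works with the same parser).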

Kevin

modified 11 days ago • written 11 days ago by Kevin Blighe

I do agree with Kevin. A paper published in 2012 compared the state-of-the-art normalization techniques and concluded that RPKM/FPKM should not be used and that the DESeq and TMM methods worked best (the paper predates DESeq2, which uses the same median-of-ratios normalization as DESeq). Have a look at the paper: https://academic.oup.com/bib/article/14/6/671/189645/A-comprehensive-evaluation-of-normalization
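
To give a feel for what the DESeq-family normalisation mentioned above does, here is a minimal Python sketch of median-of-ratios size factors computed on raw counts. It is a toy re-implementation for illustration only, not the package code: the function name `size_factors` and the count table are hypothetical, and a real analysis should use the package itself.

```python
import math
import statistics


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of gene rows, each row a list of raw counts per sample.
    Raises statistics.StatisticsError if no gene is non-zero in all samples.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if any(c <= 0 for c in row):
            continue  # a zero count makes the geometric mean zero; skip the gene
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # One factor per sample: the median ratio to the per-gene geometric mean.
    return [math.exp(statistics.median(r)) for r in log_ratios]


# Hypothetical counts: sample 2 was sequenced at twice the depth of sample 1.
toy_counts = [[10, 20], [100, 200], [5, 10], [0, 7]]
factors = size_factors(toy_counts)
normalised = [[c / f for c, f in zip(row, factors)] for row in toy_counts]
```

In this toy table the second size factor comes out as twice the first, so dividing by the factors puts the genes that differ only by sequencing depth on the same scale across the two samples.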

Best,

Cindy

written 10 days ago by cindy.perscheid

Dear Kevin, thanks so much for your answer! This is really helpful. I have one remaining question and would be grateful for your help: I thought FPKM-UQ was a modification of FPKM normalisation designed precisely to allow cross-sample comparison. Is this not the case, then?

written 11 days ago by robles.daniela

Hi Daniela,

Yes, that is correct, and there are some other posts on Biostars about this topic, like: Differences between FPKM and FPKM-UQ files in gene expression analysis

My suggestion to use HTSeq raw counts is based on a few things:

  • by using raw counts, you have more control over the analysis (FPKM and FPKM-UQ should not be used with common differential expression analysis tools like DESeq2, edgeR, and limma-voom, which expect raw counts). If you used FPKM, you would limit the number of tools/programs that you could use for downstream analyses
  • by not using anything derived from FPKM, you save yourself criticism that would undoubtedly come whenever you tried to publish your work (which could go so far as you having to re-analyse all data depending on the reviewers' comments and the journal involved)
  • by using raw counts, you will have a better opportunity to pick up new skills.

One golden rule in data analysis and bioinformatics is to always aim to get the data in its rawest form possible such that you have most control over how to analyse it. :)

All of this being said, if, at your institute, there are already defined pipelines for analysing FPKM-UQ data, then this may prove the best 'political' option for you.

I also find this very simple flow-chart quite useful in relation to your question:

[Image: gene_expression_quantification_pipeline]

[source: https://gdc.cancer.gov/about-data/data-harmonization-and-generation/genomic-data-harmonization/high-level-data-generation/rna-seq-quantification]
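
As a rough illustration of what that page describes, here is a small Python sketch of FPKM versus FPKM-UQ. The formulas are how I understand the GDC documentation (FPKM divides by the total reads mapped to protein-coding genes, FPKM-UQ by the 75th-percentile gene count in the sample), so please check the linked page for the current definition. The counts, the gene length, the percentile convention and the function names are all assumptions made for the example, and every toy gene is treated as protein-coding.

```python
import statistics


def upper_quartile(counts):
    """75th percentile of one sample's gene counts (percentile convention assumed)."""
    return statistics.quantiles(counts, n=4, method="inclusive")[2]


def fpkm(count, length_bp, total_protein_coding):
    """Scale by gene length and by total protein-coding read count."""
    return count * 1e9 / (total_protein_coding * length_bp)


def fpkm_uq(count, length_bp, q75):
    """Scale by gene length and by the sample's upper-quartile count."""
    return count * 1e9 / (q75 * length_bp)


# Hypothetical samples: identical except for one very highly expressed gene.
LENGTH_BP = 1000
sample_1 = [10, 20, 30, 40, 50]
sample_2 = [10, 20, 30, 40, 5000]

# Compare the first gene, which has the same raw count in both samples.
fpkm_1 = fpkm(sample_1[0], LENGTH_BP, sum(sample_1))
fpkm_2 = fpkm(sample_2[0], LENGTH_BP, sum(sample_2))
uq_1 = fpkm_uq(sample_1[0], LENGTH_BP, upper_quartile(sample_1))
uq_2 = fpkm_uq(sample_2[0], LENGTH_BP, upper_quartile(sample_2))
```

In this toy case the first gene's FPKM differs between the samples because one dominant gene inflates the total, while its FPKM-UQ is unchanged. That robustness is the motivation for the upper-quartile variant, although it does not make FPKM-UQ suitable as input for tools that expect raw counts.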

written 11 days ago by Kevin Blighe

"by not using anything derived from FPKM, you save yourself criticism that would undoubtedly come whenever you tried to publish your work"

That's a little extreme. There are a lot of FPKM-based papers in prestigious journals. It's not ideal, but criticism is unlikely.

Of course, it really depends on exactly what you are doing with these FPKMs.

modified 8 days ago • written 8 days ago by igor

I agree with you, but only where the reviewers and journal editors are not up to speed with normalisation methods. That is probably true for clinically-focused journals, where the bioinformatics methods may not even be mentioned or may appear only in the supplementary material.

The published literature and various other sources state that FPKM/RPKM is not ideal: it produces unreliable statistics in differential expression analysis.

written 8 days ago by Kevin Blighe