Hi guys, I am planning to perform a pan-cancer gene expression analysis across several cancer types. However, I found that the TCGA data portal has been replaced by GDC. After carefully checking the harmonized data in GDC, I am now wondering which file I should use for gene expression analysis, FPKM or FPKM-UQ? What are the differences between the two file types? Previously, I used the files with suffix "rsem.genes.normalized_results" to perform the gene expression analysis. Is FPKM the same as the "*.rsem.genes.normalized_results" file? If so, when shall we use FPKM-UQ? Any help would be really appreciated. Thanks
FPKM and FPKM-UQ are calculated by GDC mostly for legacy reasons: people are used to working with FPKM data, and UQ (upper quartile) provides a method for normalization. However, for any serious analysis, using count data with DESeq/edgeR is encouraged.
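To make the difference between the two units concrete, here is a minimal sketch in plain Python. It assumes the usual definitions: FPKM divides each gene's count by gene length and by the total count in the sample, while FPKM-UQ swaps the total for the 75th-percentile count. The gene names, counts and lengths are made up, and the exact gene set GDC uses for the denominators is not reproduced here, so check the GDC documentation before relying on the numbers.

```python
def upper_quartile(values):
    """75th percentile, with linear interpolation between ranks."""
    ordered = sorted(values)
    pos = 0.75 * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def fpkm(counts, lengths):
    """Counts scaled by gene length (bp) and by the total count of the sample."""
    total = sum(counts.values())
    return {g: c * 1e9 / (total * lengths[g]) for g, c in counts.items()}


def fpkm_uq(counts, lengths):
    """Same as fpkm(), but scaled by the upper-quartile count instead of the total."""
    uq = upper_quartile(counts.values())
    return {g: c * 1e9 / (uq * lengths[g]) for g, c in counts.items()}


# Hypothetical toy sample: raw counts and gene lengths in bp.
counts = {"geneA": 100, "geneB": 200, "geneC": 300, "geneD": 400, "geneE": 1000}
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 1000, "geneD": 4000, "geneE": 2000}

for gene in counts:
    print(gene, fpkm(counts, lengths)[gene], fpkm_uq(counts, lengths)[gene])
```

Within one sample the two units differ only by a constant factor (total count divided by upper-quartile count), so the choice matters only when comparing across samples.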
Just to add, FPKM can still be really useful. As it normalizes reads to correct for transcript size, it can be useful to correct across potential differences in input RNA quality. Just make sure you know what you are working with. If you are doing some heavy CPM cutoff, obviously FPKM will drastically alter your results in a negative sense. Zhenyu is not wrong, but I felt it worth adding the caveat.
An update (6th October 2018):
You should abandon RPKM / FPKM. They are not ideal where cross-sample differential expression analysis is your aim; indeed, they render samples incomparable via differential expression analysis:
Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis
Also, by Harold Pimentel: What the FPKM? A review of RNA-Seq expression units
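A toy illustration of the comparability problem, as a sketch in plain Python with invented numbers: two samples have identical raw counts for g1 to g3, but g4 is much higher in sample B. Because FPKM divides by the sample total, g1 to g3 look lower in B even though their counts did not change. The `fpkm` helper is my own shorthand for the standard formula, not code from either linked article.

```python
def fpkm(counts, lengths):
    """Counts scaled by gene length (bp) and by the total count of the sample."""
    total = sum(counts.values())
    return {g: c * 1e9 / (total * lengths[g]) for g, c in counts.items()}


# Hypothetical gene lengths (bp) and raw counts for two samples.
lengths = {"g1": 1000, "g2": 1000, "g3": 1000, "g4": 1000}
sample_a = {"g1": 100, "g2": 100, "g3": 100, "g4": 100}
sample_b = {"g1": 100, "g2": 100, "g3": 100, "g4": 700}  # only g4 differs

fpkm_a = fpkm(sample_a, lengths)
fpkm_b = fpkm(sample_b, lengths)

# g1 has the same raw count in both samples, yet its FPKM ratio is not 1.
fold = fpkm_b["g1"] / fpkm_a["g1"]
print("g1 FPKM in A:", fpkm_a["g1"])
print("g1 FPKM in B:", fpkm_b["g1"])
print("apparent fold change for g1 (B/A):", fold)
```

Methods that estimate scaling factors from raw counts (as DESeq and edgeR do) are designed to be less sensitive to a few dominant genes like g4.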
Shouldn't the upper-quartile normalization make FPKM-UQ suitable for cross-sample comparison?
Here they tried different normalization methods, including FPKM-UQ, without finding big differences.
...'they' == a single author? The bias in FPKM expression units exists from the very beginning when these units are created. No further transformation can then mitigate this bias without first reverse-engineering these to raw counts.
They compare methods applied to files that are already UQ-normalized; they do not compare methods that operate on raw counts to produce the normalized counts in the first place.
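For the reverse-engineering step mentioned above, a rough sketch in plain Python: if FPKM = count * 1e9 / (total * length), then count = FPKM * length * total / 1e9. This only works if you know the per-sample total that was used as the denominator and the gene lengths used at quantification. Both are assumptions here (`total_fragments` and `lengths` are hypothetical inputs), and the result is approximate if the FPKM values were rounded.

```python
def fpkm_to_counts(fpkm_values, lengths, total_fragments):
    """Invert FPKM = count * 1e9 / (total_fragments * length) for each gene."""
    return {
        g: v * lengths[g] * total_fragments / 1e9
        for g, v in fpkm_values.items()
    }


# Hypothetical inputs: FPKM values, gene lengths (bp), and the sample total
# that was used as the FPKM denominator.
fpkm_values = {"geneA": 50000.0, "geneB": 50000.0}
lengths = {"geneA": 1000, "geneB": 2000}
total_fragments = 2000

approx_counts = fpkm_to_counts(fpkm_values, lengths, total_fragments)
print(approx_counts)
```

Since GDC also provides the raw count files, downloading those directly is simpler than back-calculating.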