Dear Biostars! I think this is one of the common problems in RNA-Seq expression analysis: which expression units to use, FPKM or RPKM. People who use Cufflinks end up with FPKM, and those who use ERANGE end up with RPKM. Cufflinks has a nice explanation of why FPKM saves us from the skewed expression values reported by other software, especially with paired-end read data:
They're almost the same thing. RPKM stands for Reads Per Kilobase of transcript per Million mapped reads. FPKM stands for Fragments Per Kilobase of transcript per Million mapped reads. In RNA-Seq, the relative expression of a transcript is proportional to the number of cDNA fragments that originate from it. Paired-end RNA-Seq experiments produce two reads per fragment, but that doesn't necessarily mean that both reads will be mappable. For example, the second read is of poor quality. If we were to count reads rather than fragments, we might double-count some fragments but not others, leading to a skewed expression value. Thus, FPKM is calculated by counting fragments, not reads.
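To make the counting difference concrete, here is a minimal Python sketch of the explanation above. It is not how Cufflinks or ERANGE actually compute their values; the function names, the transcript length, and the library totals are all made up for illustration, and the only assumption is the usual "count per kilobase per million mapped" formula.

```python
def per_kilobase_per_million(count, transcript_length_bp, total_mapped):
    """Scale a count by transcript length (in kb) and library size (in millions)."""
    return count * 1e9 / (transcript_length_bp * total_mapped)


def count_reads_and_fragments(pairs):
    """pairs holds one (mate1_mapped, mate2_mapped) tuple per sequenced fragment."""
    # Each mapped mate counts as one read.
    reads = sum(int(m1) + int(m2) for m1, m2 in pairs)
    # A fragment counts once if at least one of its mates mapped.
    fragments = sum(1 for m1, m2 in pairs if m1 or m2)
    return reads, fragments


# Hypothetical transcript: 3 fragments with both mates mapped,
# 2 fragments where only the first mate mapped.
pairs = [(True, True)] * 3 + [(True, False)] * 2
reads, fragments = count_reads_and_fragments(pairs)

# Hypothetical library totals and transcript length (illustrative only).
transcript_length_bp = 2000
total_mapped_reads = 40_000_000
total_mapped_fragments = 22_000_000

rpkm = per_kilobase_per_million(reads, transcript_length_bp, total_mapped_reads)
fpkm = per_kilobase_per_million(fragments, transcript_length_bp, total_mapped_fragments)
print(reads, fragments, rpkm, fpkm)
```

With these made-up numbers the read count (8) is not simply twice the fragment count (5), which is the skew the explanation refers to, yet the two units still come out within the same order of magnitude.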
However, after analyzing paired-end, long, polyA+ RNA-Seq datasets from around 10 tissues (mapped with TopHat and Bowtie), I noticed that the same genes that have an FPKM between 0 and 1 have an RPKM of ~200. I think this difference could cause serious problems in defining accurate expression units and in determining the number of expressed, up-regulated or down-regulated genes.
I would appreciate any answer or comment on using RPKM over FPKM, or vice versa. Gracias! :)
Just to make sure: if I have paired-end reads, and one read is mapped and the other is not, then I count it as one fragment? And if both reads are mapped, I also count it as one fragment? (Otherwise I do not understand how we could double-count some fragments when counting raw reads.) Thank you very much for the explanation.
An update (6th October 2018):
You should abandon RPKM / FPKM. They are not ideal where cross-sample differential expression analysis is your aim; indeed, they render samples incomparable via differential expression analysis:
Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis
Also, by Harold Pimentel: What the FPKM? A review of RNA-Seq expression units
So what should be used?
You could normalise your raw counts using edgeR or DESeq2. If you need to export data for downstream analyses, my preference is always the regularised log or variance-stabilised expression values from DESeq2.
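As a rough illustration of what normalising raw counts means here, the following Python sketch implements the median-of-ratios idea behind DESeq2's size-factor estimation. It is not DESeq2's code (DESeq2 is an R package, and the regularised log and variance-stabilising transformations are not reproduced here); the function name `size_factors` and the toy count table are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts maps gene name -> list of raw counts, one entry per sample.
    Returns one size factor per sample.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        # Genes with a zero in any sample have a zero geometric mean; skip them.
        if any(c == 0 for c in gene_counts):
            continue
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for j, c in enumerate(gene_counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # The size factor is the median ratio of a sample's counts to the gene-wise geometric mean.
    return [math.exp(median(r)) for r in log_ratios]


# Toy raw counts (made up): sample 2 was sequenced twice as deeply as sample 1.
counts = {
    "geneA": [100, 200],
    "geneB": [50, 100],
    "geneC": [30, 60],
    "geneD": [0, 10],
}
sf = size_factors(counts)
normalised = {g: [c / s for c, s in zip(cs, sf)] for g, cs in counts.items()}
print(sf)
print(normalised["geneA"])
```

In this toy table the second sample's size factor is twice the first's, so after dividing by the size factors geneA has the same normalised count in both samples. Unlike RPKM / FPKM, the scaling is estimated from the count distribution across samples rather than from the total number of mapped reads.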
Please, read this article,