Question: FPKM vs raw counts vs RPKM
5.4 years ago by
United States
NHEJ320 wrote:

Could someone please explain to me (in as close to layman's terms as possible, for someone new to the RNA-seq field) the fundamental differences between FPKM, counts, and RPKM? I have heard from some bioinformatics colleagues that raw counts (DESeq) are becoming more popular than FPKMs (Cufflinks) for analyzing transcriptomic data, but I am not sure why (or whether this is always true), other than that I heard FPKMs may "over-normalize" depending on the experiment. Much of the published literature on these topics is a bit specialized, so I was wondering if someone could "bring it down to Earth", so to speak, and explain the differences, the pros and cons, and (if possible) the special use cases where one approach is better than another?



raw counts rna-seq fpkm rpkm • 35k views
modified 4.1 years ago by Daniel3.8k • written 5.4 years ago by NHEJ320

This workflow helped me a lot in getting familiar with RNA-seq data analysis. It imports your raw counts, and you can then analyze them using different packages.

written 5.4 years ago by Parham1.4k
5.4 years ago by
iraun3.7k wrote:

TopHat aligns the reads to the reference genome and classifies them according to whether they aligned with or without splice junctions:

  • With splice junctions -->  anything that jumps regions must span an intron.
  • Without --> anything that maps unspliced must be an exon.

Then Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. Each gene contains one or more transcripts and each transcript usually has multiple exons, but the transcripts within a given gene share exons, and that's why the reads map probabilistically (does not report read counts). FPKM are the "fancy" units that Cufflinks uses specifically to report its probabilistic estimates of isoform abundances.
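To give a feel for what "probabilistic" means here, below is a deliberately simplified expectation-maximisation sketch in Python. It is not Cufflinks' actual algorithm (it ignores transcript length, fragment-size distributions, and bias correction), and the function name, isoform labels, and read counts are all made up for illustration. Reads that fall in a shared exon are split among the compatible isoforms in proportion to the current abundance estimates, and the estimates are then updated until they settle.

```python
def estimate_isoform_fractions(read_groups, transcripts, n_iter=200):
    """Toy EM: estimate the fraction of reads coming from each transcript.

    read_groups: list of (compatible_transcripts, read_count) pairs.
    """
    total_reads = sum(count for _, count in read_groups)
    # Start by assuming every transcript is equally abundant.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        expected = {t: 0.0 for t in transcripts}
        # E-step: share each group of reads among its compatible transcripts
        # in proportion to the current abundance estimates.
        for compatible, count in read_groups:
            weight = sum(theta[t] for t in compatible)
            if weight == 0:
                continue
            for t in compatible:
                expected[t] += count * theta[t] / weight
        # M-step: the new estimate is each transcript's expected share of reads.
        theta = {t: expected[t] / total_reads for t in transcripts}
    return theta


# Hypothetical gene with three isoforms A, B, C that all contain one exon.
transcripts = ["A", "B", "C"]
read_groups = [
    (("A", "B", "C"), 60),  # reads in the shared exon: origin is ambiguous
    (("A",), 30),           # reads that can only have come from isoform A
    (("B",), 10),           # reads that can only have come from isoform B
]
theta = estimate_isoform_fractions(read_groups, transcripts)
for t in transcripts:
    print(t, round(theta[t], 3))
```

With these made-up numbers the estimates settle at roughly 0.75 for A, 0.25 for B, and essentially zero for C. Isoform C has no reads of its own, so the ambiguous reads are gradually handed to A and B. That is the sense in which an isoform can end up with an estimate of zero even though reads map to one of its exons.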

FPKM vs RPKM: the "F" replaces the "R" only to unify the terminology. They switched from "Reads" to "Fragments" to clear up confusion regarding paired-end reads, where the two reads of a pair come from one fragment.
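As a rough illustration of the unit itself (not of how Cufflinks arrives at its estimates), the value is a fragment count divided by the feature length in kilobases and by the number of mapped fragments in millions. For single-end data every fragment is one read, so the same arithmetic gives RPKM. The function name and the toy numbers below are invented for the example.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million mapped fragments."""
    length_kb = length_bp / 1_000
    millions_mapped = total_mapped_fragments / 1_000_000
    return fragments / (length_kb * millions_mapped)


# Hypothetical library with 2,000,000 mapped fragments in total.
total = 2_000_000
short_gene = fpkm(fragments=500, length_bp=1_000, total_mapped_fragments=total)
long_gene = fpkm(fragments=500, length_bp=5_000, total_mapped_fragments=total)
print(short_gene)  # 250.0
print(long_gene)   # 50.0
```

Both toy genes collect the same 500 fragments, but the longer one gets a value five times lower, because a longer transcript is expected to produce more fragments at the same expression level.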

written 5.4 years ago by iraun3.7k

+1 for down to Earth answer. But I don't understand your logic when you say "and that's why the reads map probabilistically (does not report read counts)." Could you please expand on this? Are you saying that FPKM is determined probabilistically according to some sort of algorithmic prediction which estimates what exons will be in what transcripts of, say, gene X? I mean, what if you don't know what transcripts a gene makes ahead of time, not to mention what combinatorial assembly of exons is in each of these transcripts? How can a program predict any such vast complexity? It seems like a shot in the dark, especially at a large scale like the genome. Thanks in advance for your insights!

modified 5.4 years ago • written 5.4 years ago by NHEJ320

I'm also a bit confused by your explanation of splice junctions... You're saying that "without splice junctions" means anything that maps unspliced must be an exon. But what if the read is from a noncoding part of the genome (this would not map to an exon but could potentially span a splice junction)? Perhaps I am misusing the term noncoding here...

written 5.4 years ago by NHEJ320

I'll try to explain more clearly (sorry, it is quite difficult).

Assume that we have one exon which is shared between 3 transcripts of the same gene. Since you cannot know which of the three transcripts a read from that exon came from, Cufflinks cannot report counts for a transcript. Instead, it reports an estimate of the transcript abundances. If you are looking at transcript FPKMs and the gene in question has alternative transcripts, one of the isoforms could get a zero estimate while another (or several others) would get the reads assigned to it/them.

"In RNA-Seq experiments, cDNA fragments are sequenced and mapped back to genes and ideally, individual transcripts. Properly normalized, the RNA-Seq fragment counts can be used as a measure of relative abundance of transcripts, and Cufflinks measures transcript abundances in Fragments Per Kilobase of exon per Million fragments mapped (FPKM), which is analogous to single-read "RPKM"."

written 5.4 years ago by iraun3.7k

Regarding the second question, what do you mean by "noncoding part of genome"?

written 5.4 years ago by iraun3.7k

By noncoding, I mean in the intronic portions of the genome, such as those that may produce lncRNAs.

written 5.4 years ago by NHEJ320

Could you please explain how certain transcripts can have FPKM of 0.0000?  How does this assignment happen and how does Cufflinks calculate this estimate?

modified 5.4 years ago • written 5.4 years ago by NHEJ320
5.4 years ago by
geek_y11k wrote:

RPKM/FPKM are normalised counts. DESeq/edgeR require raw counts as input because they have their own normalisation methods.

DESeq/edgeR are better for exon/gene expression analysis. Cufflinks is for differential isoform analysis. If you just care about differential genes, go for htseq-count --> edgeR/DESeq. If you are interested in isoform-level analysis, go for programs such as the Cufflinks/Cuffdiff packages.
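To make "their own normalisation methods" a bit more concrete, here is a minimal pure-Python sketch of the median-of-ratios idea behind DESeq's size factors. It is a simplified illustration rather than the package's code, and the function name and toy counts are hypothetical. edgeR's default (TMM) is a different method and is not shown. Each gene's counts are compared with that gene's geometric mean across samples, and the median of those ratios within a sample becomes its scaling factor.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping gene -> list of raw counts, one per sample.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        if min(gene_counts) == 0:
            continue  # genes with a zero count are left out of the reference
        # Log of the geometric mean of this gene across all samples.
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for j, c in enumerate(gene_counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # One factor per sample: the median ratio to the per-gene reference.
    return [math.exp(median(r)) for r in log_ratios]


# Toy raw counts: sample 2 was sequenced twice as deeply as sample 1.
counts = {
    "g1": [10, 20],
    "g2": [100, 200],
    "g3": [50, 100],
    "g4": [0, 3],
}
factors = size_factors(counts)
normalised = {g: [c / f for c, f in zip(cs, factors)] for g, cs in counts.items()}
print(factors)
print(normalised["g1"])
```

In this toy case the second factor comes out twice the first, so dividing the raw counts by the factors puts g1, g2, and g3 on the same scale in both samples. The factors have to be estimated from raw counts across all samples, which is why values already converted to RPKM/FPKM are not suitable input.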

modified 4.1 years ago • written 5.4 years ago by geek_y11k
5.4 years ago by
Charles Warden7.7k
Duarte, CA
Charles Warden7.7k wrote:

I think the question has pretty much been answered, but I thought it might also be nice to throw in a link to this blog post that I think provides a nice summary:

written 5.4 years ago by Charles Warden7.7k

Note the line in the piece to which you link:

The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.

That is, it is not recommended to use RPKM / FPKM for cross-sample differential expression. This is also highlighted in the paper linked by Daniel in his answer.
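A small made-up example of what "relative measurement" means in practice. It assumes, as a simplification, that reads are sampled in proportion to molecule number times transcript length; the function name, gene labels, and numbers are invented. Genes B and C have exactly the same number of molecules in both hypothetical samples, and only gene A changes.

```python
def rpkm_from_abundance(molecules, lengths_bp, depth):
    """Expected RPKM when `depth` reads are sampled in proportion to
    molecules * length (a simplification of real library preparation)."""
    weights = {g: molecules[g] * lengths_bp[g] for g in molecules}
    total_weight = sum(weights.values())
    rpkm = {}
    for g in molecules:
        expected_reads = depth * weights[g] / total_weight
        rpkm[g] = expected_reads / (lengths_bp[g] / 1_000) / (depth / 1_000_000)
    return rpkm


lengths = {"A": 1_000, "B": 1_000, "C": 1_000}
sample1 = rpkm_from_abundance({"A": 100, "B": 100, "C": 100}, lengths, depth=1_000_000)
sample2 = rpkm_from_abundance({"A": 1_000, "B": 100, "C": 100}, lengths, depth=1_000_000)

# B is unchanged biologically, yet its RPKM differs between the samples.
print(round(sample1["B"]), round(sample2["B"]))
```

Gene B's RPKM drops about four-fold in the second sample purely because gene A takes up a larger share of the reads. Length and depth scaling cannot correct for that, which is the job of between-sample normalisation.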

written 16 months ago by Kevin Blighe60k
4.1 years ago by
Cardiff University
Daniel3.8k wrote:

I know this is an old question, but I was recently reading the paper "A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis" which I highly recommend. It includes these two graphs, which I think summarise the issues around different normalisations.

Note: TMM is the method used in edgeR

[Figure: RNAseq normalisation methods]

[Figure: False Positive Rate]

written 4.1 years ago by Daniel3.8k

I thought the paper brought up some interesting points, but I would say these plots are a little confusing for a few reasons:

1) The first figure you show above (Figure 1A in the paper) is for the mouse data, which Table 1 says is miRNA-Seq data. If a small-RNA protocol is being used, then I wouldn't expect to use RPKM values over counts-per-million (equivalent to total-count, TC, above).

I'm guessing they did this because the scale of values will otherwise be different for RPKM, as shown in Figure S1:

This is partially a good thing, if you want to know the difference between a short gene with a lot of reads and a long gene with a medium level of coverage. Also, the distributions for the human (and non-human, but still RNA-Seq) RPKM values (where you would expect to see RPKM expression values) are more consistent than the miRNA-Seq RPKM values (where the target gene sizes are roughly similar).

2) The second figure shown above (Figure 2A in the paper) is for simulated data, not the real datasets used previously. While I understand that it would be hard to estimate the false positive rate from those datasets, Table 2 indicates decreased power for the RPKM, RawCount, and TC normalizations, but the gene overlap was always pretty good. This makes sense to me, and it would indicate a decrease in power but not a decrease in false positive rate (although that is the opposite of what the simulated data shows in Figure 2). At least in the DESeq portion of the table, this was also true for the human data in Table S6 (although it brings into question what is due to the normalization versus how the p-value is calculated).

3) The simulated datasets (second figure above) vary the proportion of differentially expressed genes from 0-30%. I would typically want to identify a few hundred differentially expressed genes (so, <5%, in a human or mouse RNA-Seq dataset), so I would probably only pay attention to the first few bars.

4) In absolute terms, all estimated false positive rates for the simulated datasets were less than 0.25, which is not that bad. However, if the first value is the 0% differentially expressed gene group (assuming the bars represent 5% increases in differentially expressed genes), then I don't see how it can have a 5% false positive rate.

written 4.1 years ago by Charles Warden7.7k
4.1 years ago by
Ming Tang2.6k
Houston/MD Anderson Cancer Center
Ming Tang2.6k wrote:

check here

written 4.1 years ago by Ming Tang2.6k