Question: Does FPKM scale incorrectly in case of unequal mapping rates?
5.0 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

I have the impression that FPKM as calculated by cufflinks doesn't correctly normalize for library size in case of unequal mapping rates between samples (e.g. in case of contamination). I get this impression from real data.

As a hypothetical example, say we have two libraries, A and B, where B has 10 times as many reads as A. One might expect (as a rough estimate) a large number of genes in B to have 10 times as many reads as in A, but mostly equal FPKM, except for lowly expressed transcripts. Now assume A has a ~100% mapping rate and B only 50% (due to contamination; 40% would align to other references, if known). Many transcripts in B still seem to have 10X as many reads mapped as in A (maybe because the number of mapped reads is not influenced linearly by contamination?), but the effective aligned library size that cufflinks sees is only 50% of the true library size. That would lead to an average 2-fold difference in the median of all FPKM values.
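To make the arithmetic concrete, here is a minimal Python sketch of the scenario. It assumes the usual definition FPKM = fragments * 1e9 / (transcript length in bp * library size) and that the library size in the denominator is the number of aligned fragments, as described above. All counts, lengths and mapping rates are made-up illustration values, not real data.

```python
def fpkm(fragments, transcript_length_bp, library_size):
    """Fragments per kilobase of transcript per million fragments in the library."""
    return fragments * 1e9 / (transcript_length_bp * library_size)

# Hypothetical numbers chosen only to mirror the example in the text.
length_bp = 2_000
sequenced_a, mapping_rate_a, fragments_a = 1_000_000, 1.0, 100
sequenced_b, mapping_rate_b, fragments_b = 10_000_000, 0.5, 1_000

# Aligned library size: what the quantifier "sees" after contamination is lost.
aligned_a = sequenced_a * mapping_rate_a
aligned_b = sequenced_b * mapping_rate_b

fpkm_a = fpkm(fragments_a, length_bp, aligned_a)
fpkm_b = fpkm(fragments_b, length_bp, aligned_b)
# Same gene in B, but normalized by the sequenced (not aligned) library size.
fpkm_b_sequenced = fpkm(fragments_b, length_bp, sequenced_b)

print(fpkm_a, fpkm_b, fpkm_b_sequenced)
```

With these numbers the gene gets twice the FPKM in B when the aligned library size is used, and the same FPKM as in A when the sequenced library size is used.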

Reference size is not taken into account for FPKM, is it? If I knew all contaminant genomes beforehand, I could just add them to the reference and achieve a much more homogeneous mapping rate and better normalization between samples. On the other hand, this might not be a fair comparison, as the contaminants would never have a chance of getting an FPKM assigned and would not influence the calculated average FPKM.

Is this correct in your experience, and if so, how best to compensate for it?

  1. Add all possible contaminants to reference before alignment?
  2. Use the sequenced library size instead of aligned library size?
  3. Assume equal mapping rate for all samples (e.g. median, would be similar to median scaling of FPKM)?
  4. Apply quantile normalization to FPKM data?
  5. FPKM is already normalized, don't apply double normalization.... 
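As an illustration of what median scaling (option 3) could look like, here is a small Python sketch. The function name `median_scale`, the choice of the mean of the sample medians as the common target, and the FPKM values are all assumptions made for this example; it is not how cufflinks or any other tool does it.

```python
from statistics import median

def median_scale(samples):
    """Rescale each sample so its median non-zero value equals a common target.

    samples: dict mapping sample name -> list of FPKM values.
    The target is the mean of the per-sample medians (an arbitrary choice).
    """
    medians = {
        name: median(v for v in values if v > 0)
        for name, values in samples.items()
    }
    target = sum(medians.values()) / len(medians)
    return {
        name: [v * target / medians[name] for v in values]
        for name, values in samples.items()
    }

# Hypothetical FPKM values: B is inflated 2-fold relative to A.
fpkm_by_sample = {
    "A": [10.0, 50.0, 200.0],
    "B": [20.0, 100.0, 400.0],
}
scaled = median_scale(fpkm_by_sample)
print(scaled)
```

A single scaling factor per sample can only remove a global shift like the one in this toy data; it does nothing if the contamination affects transcripts unevenly.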
modified 5.0 years ago by Istvan Albert ♦♦ 80k • written 5.0 years ago by Michael Dondrup (46k)

What's the goal? Are you trying to get some sort of contaminant-normalized FPKM, such that equal expression between samples would produce different numbers if the samples have differing levels of contamination?

BTW, no, reference size isn't involved in the FPKM calculation. Depending on your goals, you might find something like the way edgeR calculates FPKM useful (i.e., it does the normal library normalization and uses the resulting normalization factor in the FPKM calculation rather than the total mapped reads). Some variant of that might serve your needs better.
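The idea can be sketched in a few lines of Python: the library size in the FPKM denominator is multiplied by a per-sample normalization factor, giving an "effective" library size. This is only an illustration of the principle, not edgeR's code; the function name `fpkm_effective` and all numbers are hypothetical, and the normalization factor is assumed to have been computed beforehand by a between-sample method such as TMM.

```python
def fpkm_effective(counts, lengths_bp, library_size, norm_factor=1.0):
    """FPKM using an effective library size (library_size * norm_factor).

    norm_factor is assumed to come from a separate between-sample
    normalization step; with norm_factor=1.0 this is plain FPKM.
    """
    effective_size = library_size * norm_factor
    return [
        count * 1e9 / (length * effective_size)
        for count, length in zip(counts, lengths_bp)
    ]

# Hypothetical sample: two genes, 5 million mapped fragments.
counts = [100, 400]
lengths_bp = [1_000, 2_000]

plain = fpkm_effective(counts, lengths_bp, 5_000_000)
adjusted = fpkm_effective(counts, lengths_bp, 5_000_000, norm_factor=2.0)
print(plain, adjusted)
```

A normalization factor of 2.0 halves every FPKM value in that sample, which is the kind of correction needed when the mapped read total understates the sample's comparable library size.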

written 5.0 years ago by Devon Ryan (89k)

Hi, the goal is to get comparable expression values between libraries such that the values become independent of contamination/mapping rate, for clustering.

modified 5.0 years ago • written 5.0 years ago by Michael Dondrup (46k)

I am switching to log2-transformed, TMM-normalized CPM values (edgeR). The distributions and quantiles of each library look much more homogeneous and "normal-like" than with RPKM values.
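For readers unfamiliar with the unit, a simplified Python sketch of log2 CPM follows. It is not edgeR's implementation: the pseudo-count handling here (a fixed `pseudo_cpm` added before the log) is a simplification chosen for this example, and the normalization factor is assumed to be supplied from a separate TMM-like step. The counts are made up.

```python
import math

def log2_cpm(counts, norm_factor=1.0, pseudo_cpm=0.5):
    """log2 counts-per-million with a small offset to avoid log2(0).

    The library size is the sum of the counts, scaled by norm_factor.
    pseudo_cpm is an illustrative constant, not edgeR's prior count.
    """
    effective_size = sum(counts) * norm_factor
    return [math.log2(c * 1e6 / effective_size + pseudo_cpm) for c in counts]

# Hypothetical counts for three genes in one library.
values = log2_cpm([0, 250_000, 750_000])
print(values)
```

Unlike FPKM, this unit does not divide by transcript length, which is fine for clustering samples because the length term is the same for a given gene in every sample.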

written 4.9 years ago by Michael Dondrup (46k)
5.0 years ago by
Istvan Albert ♦♦ 80k
University Park, USA
Istvan Albert ♦♦ 80k wrote:

Well, some (including one of the original authors of the paper that introduced it) say that the RPKM/FPKM concept is flawed altogether.

Honestly, I am not quite sure what to make of that statement. If it is indeed true, then it is not just a casual observation: it would mean that ALL papers that use RPKM to compare gene expression are wrong. But then, the observation was made in 2012, and RPKM/FPKM are still heavily used. So there.

and here:


modified 5.0 years ago • written 5.0 years ago by Istvan Albert ♦♦ 80k

Hi, I am accepting this answer because it became clear to me that FPKM is not even a "normalization" method: it obviously doesn't do what normalization is supposed to do, namely make sample distributions comparable by removing systematic bias (when looking at different samples, the mean/median after 'normalization' deviate by a large factor). Therefore, we cannot use FPKM, because it is inherently broken and there is no known fix for it. This became clear to me after reading other critical posts on the topic (I have argued against it myself) as well as reading the article by Dillies et al. (Brief. in Bioinformatics, 2013).

Why is it still used so often (see also 80 posts on BioStar)? Maybe because bad habits die hard? It might also be convenient because some tools provide it easily. I think it is easy to forget all concerns for a moment and just use what is there, until one checks the data a bit more carefully.



modified 5.0 years ago • written 5.0 years ago by Michael Dondrup (46k)

An update (6th October 2018):

You should abandon RPKM / FPKM. They are not ideal where cross-sample differential expression analysis is your aim; indeed, they render samples incomparable via differential expression analysis:

Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis

The Total Count and RPKM [FPKM] normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.

Also, by Harold Pimentel: What the FPKM? A review of RNA-Seq expression units

The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.

modified 6 months ago • written 8 months ago by Kevin Blighe (41k)
5.0 years ago by
Devon Ryan89k
Freiburg, Germany
Devon Ryan89k wrote:

If you want the values to be independent of contamination, then options (1) and (2) are out of the question. Your best bet would be one of two things. The first is quantile normalization (or perhaps loess normalization), since it's quite unlikely that contamination affects transcripts linearly. In fact, it would be extremely surprising if it did, since RNAseq reads are competitive, which is the same reason RPKM values are unstable with differing levels of rRNA. The second is to use only the more highly expressed genes in your clustering, since low expressers will be more affected by contamination.
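For reference, quantile normalization can be sketched in plain Python as follows. This is a minimal version for illustration: ties are broken by input order rather than averaged, the function name `quantile_normalize` is made up for this example, and the expression values are invented. In practice one would use an established implementation rather than this sketch.

```python
def quantile_normalize(samples):
    """Force every sample to share the same distribution of values.

    samples: dict mapping sample name -> list of values, with the same
    genes in the same order in every list.
    """
    names = list(samples)
    n_genes = len(samples[names[0]])
    # Sort each sample, then average across samples at each rank.
    sorted_cols = [sorted(samples[name]) for name in names]
    rank_means = [
        sum(col[i] for col in sorted_cols) / len(names)
        for i in range(n_genes)
    ]
    normalized = {}
    for name in names:
        # Gene indices ordered from lowest to highest value in this sample.
        order = sorted(range(n_genes), key=lambda i: samples[name][i])
        out = [0.0] * n_genes
        for rank, gene_idx in enumerate(order):
            out[gene_idx] = rank_means[rank]
        normalized[name] = out
    return normalized

# Hypothetical expression values for three genes in two samples.
expr = {"A": [5.0, 2.0, 3.0], "B": [40.0, 10.0, 60.0]}
qn = quantile_normalize(expr)
print(qn)
```

After normalization both samples contain exactly the same set of values, and only the assignment of values to genes differs, which is why the method can absorb non-linear differences between libraries.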

written 5.0 years ago by Devon Ryan (89k)