Question: the problem with rpkm (and tpm)
user123456160 wrote:

Can you please explain the core problem with RPKM normalization (as a measure of relative abundance), using a simple example, and why TPM solves it? Different explanations for why the RPKM unit is bad are: (a) it uses length normalization, (b) it normalizes to total library size, (c) the average RPKM value is not maintained across samples, (d) a few transcripts dominate the whole, or (e) it assumes that total RNA is the same across samples. Which is the main correct explanation?

1. This source (http://lectures.molgen.mpg.de/Functional_Genomics_WS1112/RNA-seq1.pdf) gives an example of two equal-length genes. In condition 1, Gene A has 50,000 reads and Gene B has 0. In condition 2, Gene A has 50,000 reads and Gene B has 10,000 reads. The average RPKM is the same in condition 1 and condition 2, so (c) isn't violated. The source says Gene A has different RPKMs in condition 1 and condition 2. Why is that a problem? If reads are sampled in proportion to length and expression, there could not be 50,000 reads for Gene A in both conditions. Is this really a good example of RPKM giving the wrong answer?

2. The Wagner paper (http://lynchlab.uchicago.edu/publications/Wagner,%20Kin,%20and%20Lynch%20%282012%29.pdf) says the problem is the normalization by the total number of reads. They say it leads to different average RPKMs per sample: "The reason for the inconsistency of RPKM across samples arises from the normalization by the total number of reads." In the example above this is not a problem: the average RPKM is the same in both conditions. Only when the gene lengths differ do you get different averages:

If the length of A is 1000 and the length of B is 10, then the average RPKM is 0.5 for condition 1 and 8.75 for condition 2, so the average RPKM is different.
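The arithmetic above can be checked with a short sketch. This is only an illustration, not taken from any of the linked sources: the helper names `rpkm` and `tpm` are made up here, and the scale factor of 10^3 is an assumption chosen because it reproduces the 0.5 and 8.75 quoted above (with the usual 10^9 the values are a million times larger, but the comparison is the same). It also prints the TPM sum, which is 10^6 in both conditions by construction.

```python
def rpkm(counts, lengths, scale=1e9):
    """Reads per kilobase per million: scale * count / (length * total reads)."""
    total = sum(counts)
    return [scale * c / (l * total) for c, l in zip(counts, lengths)]


def tpm(counts, lengths):
    """Transcripts per million: length-normalise first, then rescale to sum to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths)]
    total_rate = sum(rates)
    return [1e6 * r / total_rate for r in rates]


lengths = [1000, 10]  # gene A, gene B (lengths from the post)
conditions = {
    "condition 1": [50_000, 0],
    "condition 2": [50_000, 10_000],
}

mean_rpkm = {}
tpm_sum = {}
for name, counts in conditions.items():
    # scale=1e3 is an assumption, chosen to match the figures quoted in the post
    values = rpkm(counts, lengths, scale=1e3)
    mean_rpkm[name] = sum(values) / len(values)
    tpm_sum[name] = sum(tpm(counts, lengths))
    print(name, "mean RPKM:", mean_rpkm[name], "TPM sum:", tpm_sum[name])
```

The mean RPKM moves between conditions because the sum of count/length changes relative to the total read count, while TPM divides by exactly that sum and so always adds up to the same total.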

3. The same source (http://lectures.molgen.mpg.de/Functional_Genomics_WS1112/RNA-seq1.pdf) above says that RPKM is dominated by a few highly expressed genes, since most of the read counts come from those. How is that different from TPM?

4. This (https://www.ebi.ac.uk/training/sites/ebi.ac.uk.training/files/materials/2012/121029_HTS/ernest_turro_normalising_rna-seq_data.pdf) says the problem with RPKM is the assumption of the same total RNA per sample. It gives the example: "Suppose you have two RNA populations A and B sequenced at same depth. A and B are identical except half of genes in B are unexpressed in A. Only half of reads from B come from shared gene set. Estimates for shared genes differ by factor of ~2". Can someone clarify this with an example?
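One way to put numbers on the quoted scenario is the sketch below. Everything concrete in it is a made-up assumption for illustration (four shared genes, four extra genes, every gene 1000 bases long, 8 million reads per sample, all expressed genes at the same level); the slide itself gives no numbers.

```python
def rpkm(counts, lengths):
    """Reads per kilobase per million mapped reads."""
    total = sum(counts)
    return [1e9 * c / (l * total) for c, l in zip(counts, lengths)]


n_shared = 4        # hypothetical number of genes expressed in both A and B
n_extra = 4         # hypothetical number of genes expressed only in B
length = 1000       # hypothetical length, identical for every gene
depth = 8_000_000   # hypothetical sequencing depth, identical for A and B

lengths = [length] * (n_shared + n_extra)

# Population A: all reads come from the shared genes.
counts_a = [depth // n_shared] * n_shared + [0] * n_extra

# Population B: the same shared genes plus an equal number of extra genes,
# all at the same level, so only half of the reads land on the shared set.
counts_b = [depth // (n_shared + n_extra)] * (n_shared + n_extra)

rpkm_a = rpkm(counts_a, lengths)
rpkm_b = rpkm(counts_b, lengths)

# Ratio of the estimates for the first shared gene.
ratio = rpkm_a[0] / rpkm_b[0]
print("RPKM of a shared gene in A / in B:", ratio)
```

Under these assumptions the shared genes look twice as abundant in A as in B, even though the scenario says they are identical in both populations. Because every gene has the same length here, TPM would show the same factor: it is also a proportion of the total.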

modified 5.8 years ago by Istvan Albert ♦♦ 84k • written 5.8 years ago by user123456160

Lior Pachter, the guy who introduced FPKM after Ali Mortazavi introduced RPKM, gave a very nice talk exactly about this at the Cold Spring Harbor meeting some time ago (I reckon it was 2013). Here is the filmed talk starting at the appropriate time.

There he explains, very carefully and very nicely, why RPKM / FPKM (which are in fact the same unit, just for single-end versus paired-end reads) is not the best unit to use for comparing RNA-seq experiments.

imho it's worth watching the whole video ;)

There's more than one core problem with RPKMs (as you seem to have noticed), but the deal-breaker will depend on your goals. It should also be noted that for standard gene-level differential expression, conversion to RPKM also loses precision information, so you can't as accurately weight samples when you're trying to estimate parameters (e.g., in a linear model).

The link for the EBI training presentation by Ernest Turro in #4 is broken. Here is the corrected link: https://www.ebi.ac.uk/sites/ebi.ac.uk/files/content.ebi.ac.uk/materials/2012/121029_HTS/ernest_turro_normalising_rna-seq_data.pdf

Istvan Albert ♦♦ 84k wrote:

There is a nice blog post by Damian Kao on the subject.

It is my "go-to" source when I want to remind myself of the issue; it breaks down the example above into a more manageable and readable format:

http://blog.nextgenetics.net/?e=51

I read that, but I am trying to reconcile it with other posts. The example in Damian Kao's post has 5 genes, all of different lengths, and the example in my post has 2 genes (equal length), so it's much simpler. If the main problem is normalization by total library size, then there is no need to look at different lengths.

but that could just end up being an oversimplification that won't demonstrate the problem.

the issue at hand is the total transcript length: if that changes, RPKM is an inappropriate measure; if it does not change, the measure still works

so is the simple example I linked to, with two equal-length genes, a failure of rpkm in your view or not? is tpm performing differently in this simple case?

the point I am making is that if the example is too simple it won't show the problem - thus it is not a relevant example.

When we simplify a problem to demonstrate a concept we have to make the simplification so that it still captures the essence of the problem otherwise there is no point to the simplification.

as for your problem - if it does not show the problem with RPKM then it is too simple ...

but this is precisely my question - does this simple example capture the problem or not? the original source claims to demonstrate a problem with rpkm (without having genes with distinct lengths, but still normalizing by library size.) I don't see how tpm would fare differently in that example, so either I am missing something or the example is wrong.

Well, compute it and that will tell you whether it does or doesn't capture the problem. I for one don't feel motivated to work out toy problems for which I already understand the big picture, given the arithmetic and precision required to get the numbers right; I am saving that effort for problems I don't know the answer to.

In this case the rationale is very simple. Does your total transcript length change across conditions? If it does, then RPKM will be inconsistent, and the inconsistency will depend on just how big that variation is. This is the point that the blog post makes so well.

If the total transcript length does not change then RPKM and TPM will simplify to the same quantity.
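For anyone who does want to compute it, here is a minimal sketch for the two equal-length genes from the original question. The gene length of 2000 is a made-up value (the source only says the lengths are equal), and `rpkm` and `tpm` are illustrative helper names, not functions from any library.

```python
def rpkm(counts, lengths):
    """Reads per kilobase per million mapped reads."""
    total = sum(counts)
    return [1e9 * c / (l * total) for c, l in zip(counts, lengths)]


def tpm(counts, lengths):
    """Transcripts per million: length-normalise first, then rescale to sum to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths)]
    total_rate = sum(rates)
    return [1e6 * r / total_rate for r in rates]


lengths = [2000, 2000]  # hypothetical; the example only requires equal lengths
conditions = {
    "condition 1": [50_000, 0],       # gene A, gene B
    "condition 2": [50_000, 10_000],
}

gene_a_rpkm = {}
gene_a_tpm = {}
tpm_over_rpkm = {}
for name, counts in conditions.items():
    gene_a_rpkm[name] = rpkm(counts, lengths)[0]
    gene_a_tpm[name] = tpm(counts, lengths)[0]
    tpm_over_rpkm[name] = gene_a_tpm[name] / gene_a_rpkm[name]
    print(name, "gene A RPKM:", gene_a_rpkm[name],
          "gene A TPM:", gene_a_tpm[name],
          "TPM/RPKM:", tpm_over_rpkm[name])
```

With equal lengths the TPM/RPKM ratio is the same constant (length / 1000) in both conditions, so the two units agree up to scale. Gene A drops between conditions in both units even though its read count is unchanged, so this toy case does not separate RPKM from TPM.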