Question

The problem with RPKM (and TPM)

18

9.5 years ago

user123456 ▴ 180

Can you please explain the core problem with RPKM normalization (as a measure of relative abundance), using a simple example, and why TPM solves it? The different explanations I have seen for why RPKM is a bad unit are: (a) it uses length normalization, (b) it normalizes to total library size, (c) the average RPKM value is not maintained across samples, (d) a few transcripts dominate the total, or (e) it assumes that total RNA is the same across samples. Which is the correct main explanation?

1. This source gives an example of two equal-length genes. In condition 1, Gene A has 50,000 reads and Gene B has 0. In condition 2, Gene A has 50,000 reads and Gene B has 10,000 reads. The average RPKM is the same in condition 1 and condition 2, so (c) isn't violated. The source says Gene A has different RPKMs in condition 1 and condition 2. Why is that a problem? If reads are sampled in proportion to length and expression, there could not be 50,000 reads for Gene A in both conditions. Is this really a good example of RPKM giving the wrong answer?

2. The Wagner paper says the problem is also the normalization by the total number of reads. They say it leads to different average RPKMs per sample: "The reason for the inconsistency of RPKM across samples arises from the normalization by the total number of reads." In the example above this is not a problem: the average RPKM is the same in both conditions. Only when the gene lengths differ do you get different averages:

Condition 1: Gene A: 50,000 reads, Gene B: 0 reads

Condition 2: Gene A: 50,000 reads, Gene B: 10,000 reads

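If the length of A is 1000 and the length of B is 10, then the average RPKM is 0.5 for condition 1 and 8.75 for condition 2 (computed with a scaling constant of 10^3 instead of the usual 10^9, which does not change the comparison), so the average RPKM differs.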

3. The same source as in #1 above says that RPKM is dominated by a few highly expressed genes, since most of the read counts come from those genes. How is that different from TPM?

4. This presentation says the problem with RPKM is the assumption of the same total RNA per sample. It gives the example: "Suppose you have two RNA populations A and B sequenced at same depth. A and B are identical except half of genes in B are unexpressed in A. Only half of reads from B come from shared gene set. Estimates for shared genes differ by factor of ~2". Can someone clarify this with an example?
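
For the scenario quoted in #4, here is a minimal Python sketch of one way to read it. Everything in it is a made-up assumption for illustration: eight genes of equal length, a per-gene abundance of 100 molecules, a depth of one million reads, and expected (noise-free) read counts proportional to abundance times length. The names `pop_a`, `pop_b`, and `expected_counts` are hypothetical.

```python
DEPTH = 1_000_000      # assumed sequencing depth, identical for both samples
GENE_LENGTH = 1000     # assumed: every gene has the same length (bp)

# Hypothetical absolute abundances. Genes g1..g4 are shared and identical in
# both populations; g5..g8 are expressed only in population B.
pop_a = {f"g{i}": (100 if i <= 4 else 0) for i in range(1, 9)}
pop_b = {f"g{i}": 100 for i in range(1, 9)}


def expected_counts(abundance, depth=DEPTH, length=GENE_LENGTH):
    """Expected reads per gene if reads are sampled in proportion to abundance * length."""
    weights = {gene: a * length for gene, a in abundance.items()}
    total = sum(weights.values())
    return {gene: depth * w / total for gene, w in weights.items()}


def rpkm(counts, length=GENE_LENGTH):
    """Reads per kilobase of transcript per million mapped reads."""
    total = sum(counts.values())
    return {gene: c * 1e9 / (length * total) for gene, c in counts.items()}


counts_a = expected_counts(pop_a)
counts_b = expected_counts(pop_b)
rpkm_a = rpkm(counts_a)
rpkm_b = rpkm(counts_b)

# Same absolute abundance for g1 in both populations, different RPKM.
ratio = rpkm_a["g1"] / rpkm_b["g1"]
print("g1 reads:", counts_a["g1"], counts_b["g1"])
print("g1 RPKM :", rpkm_a["g1"], rpkm_b["g1"], "ratio:", ratio)
```

Under these assumptions, g1 gets 250,000 reads in A but only 125,000 in B, so its RPKM differs by a factor of 2 even though its absolute abundance is unchanged. Because all lengths are equal here, TPM would show the same factor of 2: both units measure relative, not absolute, abundance.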

RNA-Seq next-gen gene-expression normalization • 17k views

ADD COMMENT • link updated 10 months ago by Chris ▴ 260 • written 9.5 years ago by user123456 ▴ 180

2

Lior Pachter, the guy who introduced FPKM after Ali Mortazavi introduced RPKM, gave a very nice talk about exactly this at the Cold Spring Harbor meeting some time ago (I reckon it was 2013). Here is the filmed talk starting at the appropriate time.

There he explains, carefully and clearly, why RPKM/FPKM (in effect the same unit, for single-end and paired-end reads respectively) is not the best unit for comparing RNA-seq experiments.

IMHO it's worth watching the whole video ;)

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.5 years ago by Phil S. ▴ 700

0

There's more than one core problem with RPKMs (as you seem to have noticed), but the deal-breaker will depend on your goals. It should also be noted that for standard gene-level differential expression, conversion to RPKM also loses precision information, so you can't as accurately weight samples when you're trying to estimate parameters (e.g., in a linear model).
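
To illustrate the precision point, here is a rough Python sketch. The two genes, their counts, and the library size are invented, and pure Poisson sampling noise is assumed (real RNA-seq counts are typically more variable than that). The helper names `rpkm` and `poisson_cv` are hypothetical.

```python
import math

TOTAL_READS = 20_000_000  # assumed library size

# Hypothetical genes: (read count, length in bp)
genes = {
    "short_gene": (10, 500),
    "long_gene": (1000, 50_000),
}


def rpkm(count, length, total=TOTAL_READS):
    """Reads per kilobase of transcript per million mapped reads."""
    return count * 1e9 / (length * total)


def poisson_cv(count):
    """Coefficient of variation of a Poisson count: sd / mean = 1 / sqrt(mean)."""
    return 1 / math.sqrt(count)


for name, (count, length) in genes.items():
    print(name, "RPKM:", rpkm(count, length), "relative noise:", round(poisson_cv(count), 4))
```

Both genes come out at exactly the same RPKM, but the value backed by 10 reads is about ten times noisier (in relative terms) than the one backed by 1000 reads. Once only the RPKM is kept, that difference can no longer be recovered.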

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.5 years ago by Devon Ryan 104k

0

The link for the EBI training presentation by Ernest Turro in #4 is broken. Here is the corrected link: https://www.ebi.ac.uk/sites/ebi.ac.uk/files/content.ebi.ac.uk/materials/2012/121029_HTS/ernest_turro_normalising_rna-seq_data.pdf

ADD REPLY • link 8.0 years ago by warren-mcgee ▴ 40

Ram · Answer 1 · 2014-10-16

1

9.5 years ago

Istvan Albert 100k

There is a nice blog post by Damian Kao on the subject.

It is my "go-to" source when I want to remind myself of the issue; it breaks down the example above into a more manageable and readable format: http://blog.nextgenetics.net/?e=51

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.5 years ago by Istvan Albert 100k

0

I read that, but I am trying to reconcile it with other posts. The example in Damian Kao's post has 5 genes, all with different lengths, whereas the example in my post has 2 genes of equal length, so it's much simpler. If the main problem is normalization by total library size, then there should be no need to look at different lengths.

ADD REPLY • link 9.5 years ago by user123456 ▴ 180

0

But that could just end up being an oversimplification that won't demonstrate the problem.

The issue at hand is the total transcript length: if that changes, RPKM is an inappropriate measure; if it does not change, the measure still works.

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.5 years ago by Istvan Albert 100k

0

So is the simple example I linked to, with two equal-length genes, a failure of RPKM in your view or not? Does TPM perform differently in this simple case?

ADD REPLY • link 9.5 years ago by user123456 ▴ 180

0

The point I am making is that if the example is too simple it won't show the problem, and then it is not a relevant example.

When we simplify a problem to demonstrate a concept, we have to simplify in a way that still captures the essence of the problem; otherwise there is no point to the simplification.

As for your example: if it does not show the problem with RPKM, then it is too simple.

ADD REPLY • link 9.5 years ago by Istvan Albert 100k

0

But this is precisely my question: does this simple example capture the problem or not? The original source claims to demonstrate a problem with RPKM (without genes of distinct lengths, but still normalizing by library size). I don't see how TPM would fare differently in that example, so either I am missing something or the example is wrong.

ADD REPLY • link 9.5 years ago by user123456 ▴ 180

0

Well, compute it, and that will tell you whether it does or doesn't capture the problem. I for one don't feel motivated to work out toy problems for which I already understand the big picture. Getting the numbers right takes arithmetic and care, and I am saving that effort for problems I don't know the answer to.

In this case the rationale is very simple. Does your total transcript length change across conditions? If it does, then RPKM will be inconsistent, and the size of the inconsistency will depend on how big that variation is. This is the point that the blog post makes so well.

If the total transcript length does not change then RPKM and TPM will simplify to the same quantity.
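
For anyone who does want the arithmetic, here is a small Python sketch of the two-gene example from the question. It assumes the usual definitions (RPKM = count × 10^9 / (length × total reads); TPM = per-base read rate rescaled to sum to 10^6), and a gene length of 1000 bp for the equal-length case, which the question does not specify. The variable and function names are just illustrative.

```python
def rpkm(counts, lengths):
    """Reads per kilobase of transcript per million mapped reads."""
    total = sum(counts)
    return [c * 1e9 / (l * total) for c, l in zip(counts, lengths)]


def tpm(counts, lengths):
    """Transcripts per million: per-base read rates rescaled to sum to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths)]
    total_rate = sum(rates)
    return [r * 1e6 / total_rate for r in rates]


# Read counts for (Gene A, Gene B), taken from the question.
cond1 = [50_000, 0]
cond2 = [50_000, 10_000]

equal = [1000, 1000]   # assumed length for the equal-length case
unequal = [1000, 10]   # lengths from point 2 of the question

for name, lengths in [("equal lengths", equal), ("unequal lengths", unequal)]:
    for label, counts in [("condition 1", cond1), ("condition 2", cond2)]:
        r = rpkm(counts, lengths)
        t = tpm(counts, lengths)
        print(f"{name}, {label}: RPKM={r} (mean {sum(r) / len(r):.1f}), "
              f"TPM={t} (mean {sum(t) / len(t):.1f})")
```

With equal lengths, the RPKM values sum to the same total in both conditions, so RPKM and TPM differ only by a constant factor (with 1000 bp genes they are numerically identical), and TPM tells you nothing that RPKM does not. With unequal lengths, the mean RPKM is 500,000 in condition 1 and 8,750,000 in condition 2 (the same 17.5-fold ratio as the 0.5 and 8.75 quoted in the question), while TPM sums to 10^6 in both. In general TPM_i = 10^6 × RPKM_i / Σ RPKM, and Σ RPKM is inversely proportional to the abundance-weighted mean transcript length, which is one way to read the "transcript length changes" criterion.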

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.5 years ago by Istvan Albert 100k

0

Hi @Istvan. The link seems to be no longer available. Could you share another one?

ADD REPLY • link 10 months ago by Chris ▴ 260