Can you please explain the core problem with RPKM normalization (as a measure of relative abundance), using a simple example, and why TPM solves it? The explanations I have seen for why RPKM is a bad unit are: (a) it uses length normalization, (b) it normalizes to total library size, (c) the average RPKM value is not maintained across samples, (d) a few transcripts dominate the total, or (e) it assumes that total RNA is the same across samples. Which is the main correct explanation?
1. This source (http://lectures.molgen.mpg.de/Functional_Genomics_WS1112/RNA-seq1.pdf) gives an example of two equal-length genes. In condition 1, Gene A has 50,000 reads and Gene B has 0. In condition 2, Gene A has 50,000 reads and Gene B has 10,000 reads. The average RPKM is the same in condition 1 and condition 2, so explanation (c) does not apply here. The source says Gene A has different RPKMs in condition 1 and condition 2. Why is that a problem? If reads are sampled in proportion to length and expression, Gene A could not have 50,000 reads in both conditions. Is this really a good example of RPKM giving the wrong answer?
2. The Wagner paper (http://lynchlab.uchicago.edu/publications/Wagner,%20Kin,%20and%20Lynch%20%282012%29.pdf) also says the problem is normalization by the total number of reads. They say it leads to different average RPKMs per sample: "The reason for the inconsistency of RPKM across samples arises from the normalization by the total number of reads." In the example above this is not a problem, because the average RPKM is the same in both conditions. Only when the gene lengths differ do you get a different average:
Condition 1: Gene A: 50,000 reads, Gene B: 0 reads
Condition 2: Gene A: 50,000 reads, Gene B: 10,000 reads
If the length of A is 1000 and the length of B is 10, then the average RPKM is 0.5 × 10^6 for condition 1 and 8.75 × 10^6 for condition 2, so the average RPKM differs between conditions.
3. The same source as in point 1 (http://lectures.molgen.mpg.de/Functional_Genomics_WS1112/RNA-seq1.pdf) says that RPKM is dominated by a few highly expressed genes, since most of the read counts come from them. How is that different from TPM?
4. This source (https://www.ebi.ac.uk/training/sites/ebi.ac.uk.training/files/materials/2012/121029_HTS/ernest_turro_normalising_rna-seq_data.pdf) says the problem with RPKM is the assumption of the same total RNA per sample. It gives this example: "Suppose you have two RNA populations A and B sequenced at same depth A and B are identical except half of genes in B are unexpressed in A. Only half of reads from B come from shared gene set. Estimates for shared genes differ by factor of ~2". Can someone clarify this with an example?
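
For reference, the arithmetic in points 1 and 2 can be checked with a short Python sketch. It assumes the usual definitions: RPKM = reads × 10^9 / (length × total reads), and TPM = (reads / length) rescaled so that each sample sums to 10^6. The function names, the dictionaries, and the equal-length value of 1000 are my own choices for illustration; neither source gives them.

```python
# Sketch of the two-gene examples from points 1 and 2.
# Assumed definitions (not taken from either source verbatim):
#   RPKM = reads * 1e9 / (length * total_reads)
#   TPM  = (reads / length) / sum(reads / length) * 1e6

def rpkm(reads, lengths):
    """RPKM per gene, normalizing by the sample's total read count."""
    total = sum(reads)
    return [r * 1e9 / (length * total) for r, length in zip(reads, lengths)]

def tpm(reads, lengths):
    """TPM per gene: length-normalize first, then rescale to sum to 1e6."""
    rates = [r / length for r, length in zip(reads, lengths)]
    total_rate = sum(rates)
    return [rate * 1e6 / total_rate for rate in rates]

def mean(values):
    return sum(values) / len(values)

# Read counts for (Gene A, Gene B) in each condition.
conditions = {
    "condition 1": [50_000, 0],
    "condition 2": [50_000, 10_000],
}

# Gene lengths for (Gene A, Gene B); the equal-length value is an assumption.
length_sets = {
    "equal lengths": [1000, 1000],
    "unequal lengths": [1000, 10],
}

for length_name, lengths in length_sets.items():
    for cond_name, reads in conditions.items():
        r = rpkm(reads, lengths)
        t = tpm(reads, lengths)
        print(length_name, cond_name)
        print("  RPKM:", r, "mean:", mean(r))
        print("  TPM: ", t, "mean:", mean(t))
```

With unequal lengths the mean RPKM comes out at 0.5 × 10^6 for condition 1 and 8.75 × 10^6 for condition 2, while the mean TPM is 10^6 divided by the number of genes in every case, since TPM values sum to 10^6 per sample by construction.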