Question

RPKM and FPKM directly comparable?

1

7.6 years ago

james.mcauliffe ▴ 10

Hi there,

Fairly new to this area so will try to explain as clearly as possible. I have two RNA-Seq data sets: one corresponds to a series of cancer cell lines, the other to cell lines we are using as a model of 'normal' epithelium. The expression units of one data set are in FPKM, the other in RPKM. I think I superficially understand the difference between these units, in that one is used for single-end sequencing and the other for paired-end sequencing.

My question is: are these units directly comparable? The analysis I wish to carry out is fairly straightforward. I have a predefined list of around 500 genes, and simply want to compare expression between the non-cancer and cancer backgrounds, both in terms of which of these genes are expressed at all and their relative expression levels. To decide which transcripts are expressed, I had intended to treat any value over 0 (FPKM or RPKM) as denoting expression, but I am unsure whether I can compare relative abundance.

I should add that I only have access to the raw data of one dataset; the other is a results table sent by collaborators.

RNA-Seq • 5.3k views

0

For reference, see Wagner et al. 2012, as well as this blog post from Harold Pimentel. Both are very useful clarifications, and are of particular relevance when comparing between datasets.

ADD REPLY • link 7.6 years ago by bruce.moran ▴ 970

score 7 · Accepted Answer · 2016-12-29

7

7.6 years ago

Brian Bushnell 20k

The difference between RPKM and FPKM is generally going to be much less than the difference between RPKM and RPKM for two different experiments with different methodologies. That is to say, since one is in RPKM and the other FPKM, the processing was probably different and thus they are not safe to compare regardless of the units. I recommend you try to get a hold of the original data and do the processing identically if you wish to compare results. Even then there may be other factors out of your control (wet-lab protocols) that make them incomparable.

You cannot convert between RPKM and FPKM, but in most cases they are extremely similar so if that were the only difference I doubt it would be significant.

0

You can convert them, in principle. The only difference between F and R is that F=R/N where N is the number of reads per fragment. So for single-end data, RPKM and FPKM are identical. For paired-end, FPKM = RPKM/2. In principle.
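To make the F = R/N relationship concrete, here is a minimal Python sketch. The function name and all the numbers are made up for illustration, and it assumes the same M is used for both units, which is exactly the part that is not guaranteed in practice.

```python
def per_kilobase_per_million(count, gene_length_bp, m_total):
    """Count per kilobase of gene length per million of whatever M counts."""
    return count * 1e9 / (gene_length_bp * m_total)


# Hypothetical gene and library; M is assumed identical for both units.
gene_length_bp = 2000
reads_on_gene = 1000
m_total = 10_000_000

rpkm = per_kilobase_per_million(reads_on_gene, gene_length_bp, m_total)

fpkm_by_n = {}
for reads_per_fragment in (1, 2):  # 1 = single-end, 2 = paired-end
    fragments_on_gene = reads_on_gene / reads_per_fragment
    fpkm_by_n[reads_per_fragment] = per_kilobase_per_million(
        fragments_on_gene, gene_length_bp, m_total
    )

print("RPKM:", rpkm)
print("FPKM, single-end (N=1):", fpkm_by_n[1])
print("FPKM, paired-end (N=2):", fpkm_by_n[2])
```

With N=1 the two values coincide and with N=2 the FPKM is half the RPKM, but only because M was held fixed here.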

If you used an FPKM estimator like Cufflinks or RSEM, then it will overall correlate well with manual calculations, but can vary dramatically for a few genes, because the R value is not the actual observed R, but a statistical estimation based on a bunch of other parameters.

But as the others are saying, you generally cannot compare R/FPKM values between datasets, mainly because the definition of M -- the largest driver of R/FPKM magnitude -- is not standardized. Some use total reads, some use mappable reads, some use gene-aligned reads, etc.
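As a rough illustration of how much the choice of M matters, the sketch below computes RPKM for one gene under three denominators. The read totals are invented for the example and do not come from any real dataset.

```python
def rpkm(reads_on_gene, gene_length_bp, m_total):
    """Reads per kilobase of gene per million of whatever M counts."""
    return reads_on_gene * 1e9 / (gene_length_bp * m_total)


# Hypothetical library: same gene, same reads, three definitions of M.
m_definitions = {
    "total reads": 40_000_000,
    "mappable reads": 32_000_000,
    "gene-aligned reads": 20_000_000,
}

values = {
    name: rpkm(reads_on_gene=800, gene_length_bp=2000, m_total=m)
    for name, m in m_definitions.items()
}

for name, value in values.items():
    print(f"M = {name}: RPKM = {value}")
```

In this toy case the same gene comes out at 10, 12.5 or 20 depending only on the definition of M, which is a larger spread than the factor between R and F.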

ADD REPLY • link 7.5 years ago by apa@stowers ▴ 600

score 6 · Accepted Answer · 2016-12-30

In theory RPKM should just be twice FPKM, but that is unlikely to actually be the case. As the other commenters note, differences in processing protocol are likely to mean the two datasets are not comparable, and they would not be even if both were in FPKM or both in RPKM.

You might be able to get some meaning from the dataset if you are just doing presence/absence calls, but I wouldn't use a 0 threshold. Instead, convert your values to TPM (transcripts per million) by normalising each value to the sum of expression for that sample and multiplying by 1 million. E.g. if I had RPKMs from 4 genes of 0, 1, 5 and 4, the TPMs would be (0×10^6)/10, (1×10^6)/10, (5×10^6)/10 and (4×10^6)/10. Transcripts per million is much easier to reason with and more comparable between samples. As a very approximate rule of thumb, a human cell has around 200,000 mRNAs in it at any one time, so a TPM of 5 corresponds to about one molecule per cell and can serve as a handy threshold for calling present/absent.
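The conversion above can be sketched in a few lines of Python. The function names are made up for this example, and it assumes the list holds the values for every gene in the sample; summing over only a subset (such as a 500-gene list) would not give true TPMs.

```python
def rpkm_to_tpm(values):
    """Rescale one sample's RPKM/FPKM values so that they sum to one million."""
    total = sum(values)
    if total == 0:
        raise ValueError("sample has no expression to normalise")
    return [v * 1e6 / total for v in values]


def call_present(tpm_values, threshold=5.0):
    """Presence/absence calls; the default follows the ~1 molecule per cell rule of thumb."""
    return [t >= threshold for t in tpm_values]


# The four-gene toy example from the text (sum of RPKMs = 10).
tpm = rpkm_to_tpm([0, 1, 5, 4])
present = call_present(tpm)

# 1 molecule out of ~200,000 mRNAs per cell, expressed per million.
threshold_from_rule_of_thumb = 1e6 / 200_000

print(tpm)
print(present)
print(threshold_from_rule_of_thumb)
```

With only four genes the TPMs are of course enormous; in a real sample with thousands of genes the threshold of 5 becomes meaningful.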

Be careful though, as differences in processing might lead to false positive differences. E.g. if one set was processed allowing multimapping and the other not, then you will get a present call for paralogs in one sample and an absent call in the other, irrespective of whether there is actually an expression difference.

score 2 · Accepted Answer · 2016-12-30

As already mentioned, RPKM should be equal to 2*FPKM if you are dealing with paired-end reads in both cases.

FPKM/RPKM was intended as an approximate concentration value. Within an experiment, you can make comparisons between transcripts. Between experiments you should rather use TPM for transcripts or raw counts for genes (these values are usually normalised with tools like DESeq2 or edgeR).

Moreover, if you use different types of software to compute your bad numbers, comparisons will become even harder. Cufflinks uses an internal length normalisation and coverage based assumptions which you have to take into consideration when comparing.

Try to get the raw data from your collaborator, or at least the alignments.

Cheers

Michael