What is RPKM/FPKM > 1 or 3 or 5?
2.8 years ago

Hello all,

I have a very basic question. In many papers, analyses are restricted to genes that pass a threshold such as RPKM/FPKM > 1, 3 or 5. What is this threshold? What does it mean, and how do you calculate it? I'm having trouble understanding this and finding papers or articles that explain it. Any help is appreciated.

Thanks, Susmita

RNA-Seq rpkm fpkm normalization ngs • 4.7k views

For a nice explanation, also see StatQuest

2.8 years ago

The threshold itself is pretty arbitrary and should be based on your own data. In general, people use it to restrict the analysis to "expressed" genes, for some hopefully reasonable meaning of expressed.

RPKM/FPKM is computed as follows:

"number of reads" / "length of gene or region in kb" / (total reads in millions)


For paired-end data, substitute "number of fragments" for reads. You can also get these values from a number of programs, such as StringTie (I think RSEM produces them too, but don't quote me on that).
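
As a minimal sketch of the formula above in Python (the function name and the example numbers are made up for illustration, not taken from any particular tool):

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads.

    For paired-end data, pass fragment counts instead of read counts
    and the same arithmetic gives FPKM.
    """
    length_kb = gene_length_bp / 1_000               # gene length in kb
    total_millions = total_mapped_reads / 1_000_000  # library size in millions
    return read_count / length_kb / total_millions


# Hypothetical gene: 500 reads, 2,000 bp long, 20 million mapped reads in the library.
print(rpkm(500, 2_000, 20_000_000))  # 12.5
```

In practice the counts and gene lengths would come from your quantification tool rather than being typed in by hand.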

And how do you decide which ones are the "expressed" genes?

Those which have their RPKM/FPKM above a certain threshold are considered "expressed".

By applying an arbitrary cutoff to these expression values; as you say, typically 1, 3 or 5.

Does this cutoff mean that all the genes kept for a particular sample have an RPKM of at least this value?

Yes. You filter the RPKM values you obtained and keep only the genes whose expression is above that cut-off.
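
A rough sketch of that filtering step in Python, assuming the RPKM values for one sample are already in a dictionary (the gene names, values and the `filter_expressed` helper are all hypothetical):

```python
# Hypothetical RPKM values for a single sample.
rpkm_by_gene = {"geneA": 0.2, "geneB": 1.0, "geneC": 4.7, "geneD": 12.5}


def filter_expressed(values, cutoff=1.0):
    """Keep only genes whose RPKM is strictly above the cutoff."""
    return {gene: v for gene, v in values.items() if v > cutoff}


expressed = filter_expressed(rpkm_by_gene, cutoff=1.0)
print(sorted(expressed))  # ['geneC', 'geneD']
```

Whether the comparison is `>` or `>=`, and which cutoff you pick, is a judgment call; the point is only that genes below the threshold are dropped from the downstream analysis.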

It is important to remember, though, that because of the way these units are derived, the values are not comparable across samples.

To derive RPKM/FPKM expression units, samples are only normalised 'within themselves' - there is no cross-sample normalisation. Thus, due to external factors for which this normalisation method does not control, a value of 10 in one sample is not the same as 10 in another. For this reason, in addition, these units are not suitable for differential expression analysis and you should abandon their usage if your aim is to conduct differential expression.

What would you suggest instead?

Obtain the raw counts, if you can, and then use edgeR or DESeq2 to perform normalisation and differential expression comparisons.

2.8 years ago

As mentioned, the purpose is to set a cutoff for what is considered 'expressed'. This is also where TPM (transcripts per million) started becoming more popular than RPKM/FPKM, since it attempts to quantify expression per complete transcript. What counts as a good cutoff is debated among analysis groups. The Sequencing Quality Control (SEQC) consortium (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4810084/ and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4321899/) is an FDA-led group that was put together because pharmaceutical companies were submitting RNA-Seq results rather than microarray data as evidence of expression. This group did a fairly good assessment of the consistency of RNA-Seq data and of relative cutoffs. They reported that expression as low as 1 FPKM was verifiable by RT-PCR. It is also well known that variability in RNA-Seq data increases greatly at lower expression levels.