Question: RNA-seq RPKM significance cut off
3.7 years ago by
Biocomputing, MRC Harwell Institute, Oxford, UK
YaGalbi1.5k wrote:

Hi all,

I want to identify tissue-specific genes.

I have 2 data sets:

Set 1) RPKM values for all genes in the tissue of interest

Set 2) RPKM values for all genes of many other tissue types

Should I choose a low RPKM cutoff for the genes in the tissue of interest? If so, how do I choose the cutoff?

I have 2 concerns:

A) If I don't choose a cutoff, then any expression (no matter how small) in the tissue that is not found in other tissues indicates tissue specificity. For example: the RPKM value for gene X is 1.0 in set 1 and 0.0 in set 2. Would it be a mistake to say gene X is tissue specific?

B) If I do choose a cutoff, am I incorrectly excluding rarely expressed genes? For example: a cutoff RPKM value of 2.0 would exclude gene X above.

Thanks in advance. Kenneth.
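
Concerns A and B above can be made concrete with a small sketch. The rule used here (above the cutoff in the tissue of interest, at or below it in every other tissue) is only an assumed working definition of "tissue specific", not one given in the thread, and the three other tissues are made-up placeholders for set 2.

```python
def is_tissue_specific(target_rpkm, other_rpkms, cutoff=0.0):
    """Hypothetical rule: expressed above `cutoff` in the tissue of interest
    and at or below `cutoff` in every other tissue."""
    return target_rpkm > cutoff and all(v <= cutoff for v in other_rpkms)


# Gene X from the question: RPKM 1.0 in set 1 and 0.0 in set 2.
gene_x_target = 1.0
gene_x_others = [0.0, 0.0, 0.0]  # illustrative values for three other tissues

# Concern A: without a cutoff, any non-zero expression counts as specific.
no_cutoff = is_tissue_specific(gene_x_target, gene_x_others, cutoff=0.0)

# Concern B: a cutoff of 2.0 excludes gene X.
with_cutoff = is_tissue_specific(gene_x_target, gene_x_others, cutoff=2.0)

print("no cutoff:", no_cutoff)
print("cutoff 2.0:", with_cutoff)
```

The same gene flips from "specific" to "not specific" purely because of the threshold, which is why the choice of cutoff needs some justification from the data.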

rna-seq cutoff rpkm • 5.0k views
modified 3.7 years ago by Amitm1.9k • written 3.7 years ago by YaGalbi1.5k
3.7 years ago by
Amitm1.9k wrote:


I am assuming your samples of interest are non-human (and non-murine), because for human and mouse such information already exists (here & here) and might be useful. Regarding the RPKM cut-off: I would apply it after log transformation. RNA-seq data usually contain a large number of genes detected at very low levels. For any study looking at contrasts (differential expression or classification), filtering out such genes helps. A density plot of the log-transformed RPKM values would give you an idea of where to put the cut-off. Besides, I would guess that a tissue-specific gene that imparts a phenotype has an expression level higher than basal/background. I take a cut-off at a log RPKM value of 2 or 3 (depending on the dataset).

Then if your tissue samples have replicates, you could proceed to do a PCA.
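
A minimal sketch of the "log-transform, look at the distribution, then pick a cut-off" idea, using only the Python standard library. The RPKM values are made up, and the pseudocount of 1 (to avoid log2(0)) is an assumption that is not stated in the answer. Note that with this pseudocount a cut-off of 2 on the log scale corresponds to a raw RPKM of 3.

```python
import math
from collections import Counter


def log2_rpkm(values, pseudocount=1.0):
    """log2-transform RPKM values; the pseudocount is an assumption to handle zeros."""
    return [math.log2(v + pseudocount) for v in values]


def histogram(values, bin_width=1.0):
    """Count values per bin; each key is the lower edge of a bin."""
    counts = Counter(math.floor(v / bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Made-up RPKM values: many genes near zero, a few highly expressed.
rpkm = [0.0, 0.0, 0.1, 0.3, 0.5, 1.0, 3.0, 7.0, 15.0, 120.0, 2500.0]

logged = log2_rpkm(rpkm)
hist = histogram(logged)

# Crude text version of a density plot on the log scale.
for lower_edge, count in hist.items():
    print(f"{lower_edge:5.1f} | {'#' * count}")

# Apply a cut-off on the log scale (2 here, as one of the values suggested above).
log_cutoff = 2.0
kept = [v for v in logged if v >= log_cutoff]
print(f"{len(kept)} of {len(logged)} genes kept")
```

On real data the same histogram (or a proper density plot) would show where the low-expression peak ends, which is the region in which to place the cut-off.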

written 3.7 years ago by Amitm1.9k

Excellent resources, thank you very much.

All data is from mice.

I have been given Dataset 1 which has a single RPKM value for each gene. The value is the mean of the 3 replicate RPKM values. I've got a bad feeling about combining the data like that. Is it worth raising this issue? Or is it a waste of time considering I only want to know if the gene is specific for that tissue, not how much of the gene is expressed by the tissue.

Regarding log transformation: I'm assuming this is necessary to create a normal distribution only if the RPKM range is too large to plot. I raised doing this with my supervisor and was told to just use the raw RPKM. Is this a mistake?

Thanks for your help. Kenneth.

written 3.7 years ago by YaGalbi1.5k

Hi, RPKM values from RNA-seq are not normally distributed. See this - density_plot. As I said, a large number of genes sit at 0 or minimal expression, so you get a massive peak and then a long tail extending to very high values (but with very low density there). That's the characteristic distribution of RNA-seq data. Log transformation (bottom panel of the image) only helps you "see" the peak, which is otherwise not possible with raw RPKM (top panel).

Imp. - 1) Don't collapse replicates into a single value. Let whatever statistic you choose see the mean and, more importantly, the standard deviation, and then make inferences. If you do a PCA where each tissue is represented by only one (averaged) sample, I don't know if anything meaningful would come out.

2) If you do actually run a PCA, working with log2-transformed values would be helpful, as the raw values span several orders of magnitude (i.e. from ~10 to >10k).

EDIT - The plots are made on TPM values (a more comparable version of RPKM), but the scenario would be the same with RPKM/FPKM.
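
To illustrate point 1 above, here is a rough sketch of why keeping the replicates matters. The gene names and RPKM values are made up, and the pseudocount of 1 in the log2 step is an assumption to avoid log2(0).

```python
import math
import statistics

# Hypothetical RPKM values for three replicates of one tissue.
replicates = {
    "geneA": [10.0, 12.0, 11.0],  # consistent across replicates
    "geneB": [0.5, 0.8, 5.0],     # mean is pulled up by a single replicate
}


def summarize(reps, pseudocount=1.0):
    """Per-gene mean, sample standard deviation, and mean of log2 values."""
    log_vals = [math.log2(v + pseudocount) for v in reps]
    return {
        "mean": statistics.mean(reps),
        "sd": statistics.stdev(reps),
        "log2_mean": statistics.mean(log_vals),
    }


summary = {gene: summarize(reps) for gene, reps in replicates.items()}

for gene, stats in summary.items():
    print(f"{gene}: mean={stats['mean']:.2f} sd={stats['sd']:.2f} "
          f"log2_mean={stats['log2_mean']:.2f}")
```

A single averaged value would hide the fact that geneB is driven by one replicate, whereas the standard deviation makes that visible.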

modified 3.7 years ago • written 3.7 years ago by Amitm1.9k

Hi, I have selected an RPKM cut-off of 2 for my data set. How should I apply the cut-off? Should I keep genes with an average RPKM > 2, or genes with RPKM > 2 in all of the replicates?

written 3.4 years ago by Neu10

Hi, I completed the pipeline successfully. In my case I chose a cut-off of 1 RPKM.

Whether you choose to 1) average first and cut off after or 2) cut off first and average after is only a question of practicality (i.e. whichever is easiest in your script), because there is very little difference in the result.

In practice it was easier for me to get the mean RPKM first, then take the log2 of the mean, and later in the pipeline remove any genes where log2(meanrpkm) <= 0, because my cut-off is 1 and log2(1) = 0.

If you were to follow mine, you would remove genes where log2(meanrpkm) <= 1, because your cut-off is 2 and log2(2) = 1.
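
The two ways of applying the cut-off discussed in this exchange can be sketched side by side. The gene names and replicate values are made up. The sketch only shows how the two rules are computed and that they can disagree for a gene with variable replicates; how often that happens depends on the data set.

```python
import math
import statistics


def passes_mean_first(reps, cutoff):
    """Average the replicates, then compare the mean to the cut-off."""
    return statistics.mean(reps) > cutoff


def passes_every_replicate(reps, cutoff):
    """Require every single replicate to exceed the cut-off."""
    return all(v > cutoff for v in reps)


# Hypothetical RPKM values for three replicates per gene.
genes = {
    "stable":   [3.0, 4.0, 5.0],
    "variable": [0.5, 0.8, 5.0],
    "low":      [0.2, 0.1, 0.3],
}

cutoff = 1.0  # RPKM; on the log2 scale this threshold is log2(1) = 0
log_cutoff = math.log2(cutoff)

results = {
    name: (passes_mean_first(reps, cutoff), passes_every_replicate(reps, cutoff))
    for name, reps in genes.items()
}

for name, (by_mean, by_all) in results.items():
    # log2 is monotonic, so thresholding log2(mean) at log2(cutoff)
    # gives the same answer as thresholding the mean at the cut-off.
    log_mean = math.log2(statistics.mean(genes[name]))
    print(f"{name}: mean-first={by_mean} every-replicate={by_all} "
          f"log2(mean)={log_mean:.2f} vs {log_cutoff:.2f}")
```

Here the "variable" gene passes the mean-first rule but fails the every-replicate rule, so it is worth checking how many genes fall into that category before treating the two as interchangeable.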

modified 3.4 years ago • written 3.4 years ago by YaGalbi1.5k