Question

RNA-seq RPKM significance cut off

0


8.2 years ago

BioinfGuru ★ 2.1k

Hi all,

I want to identify tissue-specific genes.

I have 2 data sets:

Set 1) RPKM values for all genes in the tissue of interest

Set 2) RPKM values for all genes in many other tissue types

Should I choose a low RPKM cutoff for the genes in the tissue of interest? If so, how do I choose the cutoff?

I have 2 concerns:

A) If I don't choose a cutoff, then any expression (no matter how small) in the tissue that is not found in other tissues indicates tissue specificity. For example: the RPKM value for gene X is 1.0 in set 1 and 0.0 in set 2. Would it be a mistake to say gene X is tissue-specific?

B) If I do choose a cutoff, am I incorrectly excluding rarely expressed genes? For example: a cutoff RPKM value of 2.0 would exclude gene X above.

Thanks in advance. Kenneth.

RNA-seq RPKM cutoff • 8.3k views

ADD COMMENT • link updated 8.2 years ago by Amitm ★ 2.3k • written 8.2 years ago by BioinfGuru ★ 2.1k

score 2 · Answer 1 · 2016-09-06

2


8.2 years ago

Amitm ★ 2.3k

hi,

I am assuming that your samples of interest are non-human (and non-murine), because such information already exists for those species (here & here) and might be useful. Regarding the RPKM cut-off: I would apply the cut-off after log transformation. RNA-seq data usually contain a large number of genes detected at very low levels. For any study looking at contrasts (differential expression or classification), filtering out such genes helps. A density plot of the log-transformed RPKM values would give you an idea of where to put the cut-off. Besides, I would expect a tissue-specific gene that imparts a phenotype to have an expression level higher than basal/background. I take a cut-off at a log RPKM value of 2 or 3 (depending on the dataset).

Then if your tissue samples have replicates, you could proceed to do a PCA.
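The log-transform-then-filter idea above could be sketched roughly as follows. This is only an illustration: the gene names and RPKM values are made up, the pseudocount of 1 is an assumption added so that genes with RPKM 0 can be logged, and the text histogram is a crude stand-in for a proper density plot.

```python
import math
from collections import Counter

# Hypothetical RPKM values for one tissue (gene -> RPKM); not real data.
rpkm = {
    "geneA": 0.0, "geneB": 0.2, "geneC": 0.5, "geneD": 0.9,
    "geneE": 1.5, "geneF": 6.0, "geneG": 20.0, "geneH": 150.0,
}

PSEUDOCOUNT = 1.0  # assumption: avoids log2(0) for undetected genes


def log2_rpkm(value, pseudocount=PSEUDOCOUNT):
    """Return log2(value + pseudocount)."""
    return math.log2(value + pseudocount)


def text_histogram(values, bin_width=1.0):
    """Count values per bin of the given width; keys are bin lower edges."""
    bins = Counter(math.floor(v / bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))


def filter_by_cutoff(rpkm_by_gene, log_cutoff):
    """Keep genes whose log2(RPKM + pseudocount) exceeds log_cutoff."""
    return {g: v for g, v in rpkm_by_gene.items() if log2_rpkm(v) > log_cutoff}


# Crude "density plot": most genes pile up in the lowest bin.
logged = [log2_rpkm(v) for v in rpkm.values()]
hist = text_histogram(logged)
for lower_edge, count in hist.items():
    print(f"[{lower_edge:4.1f}, {lower_edge + 1.0:4.1f})  {'#' * count}")

# Apply a cut-off chosen by eye from the histogram (2.0 here is illustrative).
kept = filter_by_cutoff(rpkm, log_cutoff=2.0)
print(sorted(kept))
```

On real data the cut-off would be read off the shape of the distribution rather than fixed in advance, and the pseudocount shifts the log values slightly compared with logging the raw RPKM.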

ADD COMMENT • link 8.2 years ago by Amitm ★ 2.3k

0


Excellent resources, thank you very much.

All data is from mice.

I have been given Dataset 1, which has a single RPKM value for each gene. The value is the mean of the 3 replicate RPKM values. I've got a bad feeling about combining the data like that. Is it worth raising this issue? Or is it a waste of time, considering I only want to know if the gene is specific for that tissue, not how much of the gene is expressed by the tissue?

Regarding log transformation: I'm assuming this is necessary to create a normal distribution only if the RPKM range is too large to plot. I raised doing this with my supervisor and was told to just use the raw RPKM values. Is this a mistake?

Thanks for your help. Kenneth.

ADD REPLY • link 8.2 years ago by BioinfGuru ★ 2.1k

0


hi, RPKM values from RNA-seq don't follow a normal distribution. See this: density_plot. As I said, a large number of genes are at 0 or minimal expression, so you get a massive peak and then a long tail extending far out to very high values (but with very little density there). That's the characteristic distribution of RNA-seq data. Log transformation (bottom panel of the image) only helps you "see" the peak, which is otherwise not possible with raw RPKM (top panel).

Important: 1) Don't collapse replicates into a single value. Let whatever statistic you choose see the mean and, more importantly, the standard deviation, and then make inferences. If you do a PCA where each tissue is represented by only one (averaged) sample, I don't know if anything meaningful would come out.

2) If you do actually run a PCA, working with log2-transformed values would be helpful, as the raw values span several orders of magnitude (i.e. from ~10 to >10k).

EDIT - The plots were made from TPM values (a more comparable version of RPKM), but the scenario would be the same with RPKM/FPKM.
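To illustrate points 1) and 2), here is a small sketch with made-up replicate values (the gene names and numbers are hypothetical). It shows two genes with the same mean RPKM but very different spread, which is the information lost when replicates are averaged up front, and a log2 transform with an assumed pseudocount of 1 to handle zeros.

```python
import math
import statistics

# Hypothetical replicate RPKM values (three replicates per gene); not real data.
replicates = {
    "geneX": [0.0, 0.0, 3.0],       # mean 1.0, driven by a single replicate
    "geneY": [1.0, 1.0, 1.0],       # mean 1.0, perfectly consistent
    "geneZ": [90.0, 100.0, 110.0],  # highly expressed
}


def summarise(values):
    """Return (mean, sample standard deviation) of replicate values."""
    return statistics.mean(values), statistics.stdev(values)


def log2_transform(values, pseudocount=1.0):
    """log2(value + pseudocount); the pseudocount is an assumption to handle zeros."""
    return [math.log2(v + pseudocount) for v in values]


for gene, values in replicates.items():
    mean, sd = summarise(values)
    logged = [round(x, 2) for x in log2_transform(values)]
    print(f"{gene}: mean={mean:.2f} sd={sd:.2f} log2={logged}")
```

geneX and geneY are indistinguishable from their means alone; only the standard deviation shows that geneX is supported by just one replicate.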

ADD REPLY • link 8.2 years ago by Amitm ★ 2.3k

0


Hi, I have selected an RPKM cut-off of 2 for my data set. How should I apply the cut-off? Should I keep genes with an average RPKM > 2, or genes with RPKM > 2 in all of the replicates?

ADD REPLY • link 7.8 years ago by Neu ▴ 10

0


Hi, I completed the pipeline successfully. In my case I chose a cut-off of 1 RPKM.

Whether you choose to 1) average first and cut off after or 2) cut off first and average after is only a question of practicality (i.e. whichever is easiest in your script), because in my experience there is very little difference in the result.

In practice, for me it was easier to get the mean RPKM first, then take the log2 of the mean, and then later in the pipeline remove any genes where log2(meanrpkm) <= 0, because my cut-off is 1 and log2(1) = 0.

If you were to follow my approach, you would remove genes where log2(meanrpkm) <= 1, because your cut-off is 2 and log2(2) = 1.
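As a rough sketch of the two rules asked about above (mean above the cut-off versus every replicate above the cut-off), using hypothetical replicate values and the cut-off of 2 from the question. The function names are made up for illustration, and this is not the actual pipeline script.

```python
import math
import statistics

CUTOFF_RPKM = 2.0  # the cut-off from the question above

# Hypothetical replicate RPKM values; not real data.
replicates = {
    "gene1": [3.0, 4.0, 5.0],  # above the cut-off in every replicate
    "gene2": [0.5, 1.0, 7.5],  # mean is 3.0, but only one replicate passes
    "gene3": [0.5, 1.0, 1.5],  # below the cut-off by either rule
}


def passes_mean_rule(values, cutoff=CUTOFF_RPKM):
    """Average first, then compare log2(mean) with log2(cutoff)."""
    mean = statistics.mean(values)
    # Guard against log2(0) when a gene is undetected in all replicates.
    return mean > 0 and math.log2(mean) > math.log2(cutoff)


def passes_all_replicates_rule(values, cutoff=CUTOFF_RPKM):
    """Require every replicate to exceed the cut-off."""
    return all(v > cutoff for v in values)


by_mean = sorted(g for g, v in replicates.items() if passes_mean_rule(v))
by_all = sorted(g for g, v in replicates.items() if passes_all_replicates_rule(v))
print("mean rule:", by_mean)
print("all-replicates rule:", by_all)
```

In this toy example gene2 passes the mean rule but not the all-replicates rule. How often the two rules disagree on a real data set depends on how variable the replicates are, so it may be worth counting the disagreements before settling on one.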

ADD REPLY • link 7.8 years ago by BioinfGuru ★ 2.1k