Question: To What Degree Could We Determine Whether A Gene Is Expressed?
gravatar for liupfskygre
4.0 years ago by
United States
liupfskygre180 wrote:

Hi all, A question came to my mind when analysis gene expression using RNA-seq. After using DEseq2, a large number of genes were listed as differential expressed (e.g. FDR<0.01 or 0.05, absolute of log2 fold change >1). When took look at the FPKM of those genes, lots of them were very small. I think took all of them for further analysis is time consuming and also included a bunch of noises.

I saw a paper use the FPKM of one housekeeping gene they used to do QPCR as the cutoff. (

In my opinion, use the mean FPKM (total FPKM of genes/gene number) of each condition as a cutoff/threshold might be a choice.

Any other thoughts on this issue are greatly appreciated!

fpkm rna-seq • 4.3k views
ADD COMMENTlink modified 3.1 years ago by cankutcubuk150 • written 4.0 years ago by liupfskygre180

Are you removing genes with very low overall read counts from your analysis?

ADD REPLYlink written 3.9 years ago by Biomonika (Noolean)3.0k
gravatar for Ryan Dale
4.0 years ago by
Ryan Dale4.6k
Bethesda, MD
Ryan Dale4.6k wrote:

This 2013 BMC Genomics paper describes zFPKM, which they use to set a single threshold for yes/no expression across a bunch of experiments. Their methods are clear and straightforward to implement -- here's how you might calculate zFPKM in Python:

from scipy.stats import gaussian_kde
import numpy as np

# assume "fpkm" is a NumPy array of log2(fpkm) values.

kernel = gaussian_kde(fpkm)
xi = np.linspace(fpkm.min(), fpkm.max(), 100)
yi = kernel.evaluate(xi)
mu = xi[np.argmax(yi)]
U = fpkm[fpkm > mu].mean()
sigma = (U - mu) * np.sqrt(np.pi / 2)
zFPKM = (fpkm - mu) / sigma

Then you can choose a zFPKM value to use as your threshold. They chose zFPKM = -3 for human RNA-seq after comparing with active/repressive histone mod data.

ADD COMMENTlink written 4.0 years ago by Ryan Dale4.6k

thank you! this is what I am looking for!

ADD REPLYlink written 4.0 years ago by liupfskygre180

Hi Daler! I am new to programming and am having some trouble creating a NumPy array. When I use kernel=gaussian_kde(fpkm), I get this error:

>>> kernel=gaussian_kde(fpkm1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scipy/stats/", line 188, in __init__
    raise ValueError("`dataset` input should have multiple elements.")
ValueError: `dataset` input should have multiple elements.

I was wondering if you could please explain how to create a NumPy array from an Excel, .txt or .csv file? I appreciate your help so much!

ADD REPLYlink written 3.5 years ago by kdixo1000

It's impossible to tell what happened here without knowing how you created fpkm1. But you might want to take a look at the NumPy genfromtxt() docs or the Pandas IO tools documentation.

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by Ryan Dale4.6k

Okay thank you! I'll take a look at those.

ADD REPLYlink written 3.5 years ago by kdixo1000
gravatar for cankutcubuk
3.1 years ago by
cankutcubuk150 wrote:


Can I apply the zFPKM approach by RPKM data(zRPKM)?

Or which kind of normalization methods are compatible with this method?

By the way, here you can find the R script for zFPKM normalization.

I inspired from the python code which has given above.


/* fpkm is an example data */

fpkm <- c(1,2,3,4,5,6,7,8,4,5,6,5,6,5,6,5,5,5,5,6,6,78,8,89,8,8,8,2,2,2,1,1,4,4,4,4,4,4,4,4,4,4,4,3,2,2,3,23,2,3,23,4,2,2,4,23,2,2,



xi = linspace(min(fpkm),max(fpkm),100)

fhat = kde(x=fpkm,gridsize=100,eval.points=xi) 

/* here I put digits=0. if I you do not round the numbers(yi) the results are a little bit changing.*/

yi = round(fhat$estimate,digits=0) 

mu = xi[which.max(yi)]

U = mean(fpkm[fpkm>mu])

sigma = (U-mu)* (sqrt(pi/2))

zFPKM = (fpkm - mu) / sigma


Cankut CUBUK

Computational Genomics Program - Systems Genomics Lab

Centro de Investigación Príncipe Felipe (CIPF)

C/ Eduardo Primo Yúfera nº3
46012 Valencia, Spain

ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by cankutcubuk150


I was trying to use your script in R, but maybe there are some problem with infinite values.
As far as I understood, the idea is use logFPKM as input, isn't it? 

when I runned the script I have this error:

Error en seq.default(x1, x2, length.out = n) :
  'from' cannot be NA, NaN or infinite

Maybe you can help me to resolve it?  thanks!!

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by tialuisi0

I already resolve the problem of infinite values, however the zFPKM that I obtained are all positive... so I consider that there is other problem...

ADD REPLYlink written 2.3 years ago by tialuisi0
gravatar for Charles Warden
3.9 years ago by
Charles Warden5.0k
Duarte, CA
Charles Warden5.0k wrote:

I use log2(RPKM + 0.1) values for differential expression to minimize the number of low coverage genes with unreasonably high differential expression values.

You can see the relevant benchmarks for various rounding values here:

The portions of that paper comparing differential expression algorithms are also summarized in this blog post:

ADD COMMENTlink written 3.9 years ago by Charles Warden5.0k
gravatar for jockbanan
4.0 years ago by
Czech Republic
jockbanan370 wrote:

From my (limited) experience: If you want more conservative results, try the newest approach from EdgeR package - function glmQLFTest(). It always gave me smaller number of significant genes than DEseq2 and classic EdgeR (in the form that is described in manual), as well as other tools like CLCBio. In addition, no or almost no genes reported by glmQLFTest() were tool-specific. It seems to report almost exactly genes reported by all the other tools I tried and no more, which makes me believe in these results a bit more. It may of course not work this way with your data and you will probably need to set some cutoff even if it will, but it is something to start with.

ADD COMMENTlink written 4.0 years ago by jockbanan370

thanks for this suggestion. I will try it later!

ADD REPLYlink written 4.0 years ago by liupfskygre180
gravatar for Michael Love
3.9 years ago by
Michael Love1.6k
United States
Michael Love1.6k wrote:

Can you clarify, is your question how to tell if a gene is expressed, or differentially expressed? 

In the latest release of DESeq2, we have implemented functionality to test for differential expression above a critical threshold, as sometimes users are not interested in all statistically significant genes, but actually wanted to find large fold changes. This can be accomplished in with:

results(dds, lfcThreshold=x, altHypothesis="greaterAbs")

Where x is the critical threshold specified by the user. The fold changes are in the log2 space, so x=1 finds fold changes > 2 and < 1/2, x=2 finds fold changes > 4 and < 1/4, etc.  Note that this will provide p-values testing the null hypothesis that |ß| < x, which is not the same as testing the null of |ß| = 0 and filtering out |ß| < x.

ADD COMMENTlink written 3.9 years ago by Michael Love1.6k

liupfskygre was hoping for a method to determine expression (presumably akin to that used by mas5 on affy arrays), mostly as a method for filtering DE results. Of course there's no great biological definition to go by there, so any answer would be a bit hand wavy. BTW, welcome to biostars.

ADD REPLYlink written 3.9 years ago by Devon Ryan77k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2009 users visited in the last hour