Question: Normalization of Gene Expression Using RNA-Seq RPKM Values
4.9 years ago, J.F.Jiang wrote:

Hi all,

I am working with RNA-Seq data. After mapping the reads to the reference and obtaining RPKM values for each gene, I want to normalize the expression values.

Starting from the RPKM values, I removed the genes with too many zeros, which left about 12K gene expression profiles.

The RPKM values range from 0 to 1e-6 and do not fit a normal distribution.

I tried two methods to normalize the expression profiles:

1) Replace the zeros with the smallest value and then log2-transform the data. The distribution looks roughly normal, but it is not actually normalized (it does not fit an N(0,1) distribution).

2) Transform the ranks of the expression values for each gene to their respective quantiles of an N(0,1) distribution. However, the resulting distribution profiles did not look good enough.
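The two approaches above can be sketched in plain Python as follows. This is only an illustration under assumptions: the function names and the toy values are made up, method 1 is read as "replace zeros with the smallest non-zero value", and method 2 uses the offset (rank - 0.5) / n, which is one common choice among several.

```python
import math
from statistics import NormalDist


def log2_with_floor(values):
    """Method 1: replace zeros with the smallest non-zero value, then take log2."""
    floor = min(v for v in values if v > 0)  # assumes at least one positive value
    return [math.log2(v if v > 0 else floor) for v in values]


def rank_to_normal(values):
    """Method 2: map ranks to N(0, 1) quantiles; ties get their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # extend j to the end of the block of values tied with position i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # (rank - 0.5) / n keeps every probability strictly inside (0, 1)
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]


# Hypothetical RPKM values for one gene across six samples
rpkm = [0.0, 0.5, 2.0, 8.0, 0.0, 32.0]
print(log2_with_floor(rpkm))
print(rank_to_normal(rpkm))
```

Note that tied values (for example, many zeros) all receive the same quantile, so a gene with many ties cannot look perfectly normal after the rank transform.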

Does anyone have a better solution?


modified 4.9 years ago by Sean Davis • written 4.9 years ago by J.F.Jiang

Just curious: why do you need to transform the data into a normal distribution? There are normalization and analysis methods designed specifically for RNA-Seq data; Bullard et al. (2010) is a good reference.

written 4.9 years ago by kristen.dang

Thank you for your comment. The method you refer to is mainly used to detect DE genes, and its authors claim it works better than RPKM values for that purpose. However, I want to use RPKM values to represent gene expression.

I want to transform the data to a normal distribution because the variance of a gene fluctuates too much if I use the RPKM values directly. For arrays, one would use quantile normalization on the expression values and then use scale to get a normal distribution. Is there a comparable or better solution for RNA-Seq?
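For reference, the array-style workflow mentioned above (quantile normalization across samples, then scaling each gene) might look like this in plain Python. It is a minimal sketch, not the implementation of any particular package: the function names and the toy matrix are made up, and ties are broken by order of appearance instead of being averaged.

```python
from statistics import mean, stdev


def quantile_normalize(matrix):
    """Give every sample (column) of a genes x samples matrix the same distribution."""
    n_genes, n_samples = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_samples)]
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across samples
    reference = [mean([col[k] for col in sorted_cols]) for k in range(n_genes)]
    result = [[0.0] * n_samples for _ in range(n_genes)]
    for j, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = reference[rank]  # simplification: ties are not averaged
    return result


def scale_rows(matrix):
    """Center each gene to mean 0 and scale it to sample standard deviation 1."""
    scaled = []
    for row in matrix:
        m, s = mean(row), stdev(row)  # stdev needs at least two samples per gene
        scaled.append([(x - m) / s if s > 0 else 0.0 for x in row])
    return scaled


# Hypothetical expression matrix: 4 genes (rows) x 3 samples (columns)
expr = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.0, 6.0],
    [4.0, 2.0, 8.0],
]
normalized = quantile_normalize(expr)
standardized = scale_rows(normalized)
print(normalized)
print(standardized)
```

Scaling only fixes the mean and standard deviation of each gene; it does not by itself make the values normally distributed.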

written 4.9 years ago by J.F.Jiang

When you mention that the rank transformation (method 2) didn't produce good enough results, can you talk more about what exactly you mean by that?

written 4.9 years ago by Devon Ryan

The ranges of RPKM are 0 to 1e-6... That seems like a very narrow range. Did you mean 0 to 1e6 instead?

written 4.9 years ago by polarise

I ran into the same question. Some papers state in their methods sections that they applied a log2 transformation followed by mean-centering, for example Yue Li (2014) and TCGA-AML (2013). This normalization was already common in the microarray era; the Bioconductor R package affy provides log2-transformed expression values.

The other question is how to do this step. Many RPKM values are 0, and the log2 of 0 is -Inf. Log2-transformed RNA-Seq expression data also contains negative values. One discussion about log2-transformed RPKM says that you can use log2(x+1) or log2(x+0.25).
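As a small illustration of the pseudocount idea followed by mean-centering, here is a sketch in plain Python. The function names and the toy values are made up, and the pseudocounts 1 and 0.25 are simply the two mentioned above, not a recommendation.

```python
import math
from statistics import mean


def log2_pseudocount(values, pseudocount=1.0):
    """log2(x + pseudocount), so that zeros do not become -Inf."""
    return [math.log2(v + pseudocount) for v in values]


def mean_center(values):
    """Subtract the mean so that the values are centered on 0."""
    m = mean(values)
    return [v - m for v in values]


# Hypothetical RPKM values for one gene across four samples
rpkm = [0.0, 1.0, 3.0, 7.0]

logged = log2_pseudocount(rpkm)      # [0.0, 1.0, 2.0, 3.0]
centered = mean_center(logged)       # [-1.5, -0.5, 0.5, 1.5]
print(logged, centered)

# With the smaller pseudocount, a zero maps to log2(0.25) = -2.0 instead of 0.0
print(log2_pseudocount([0.0], pseudocount=0.25))
```

With log2(x+1) a zero stays at 0 and all values are non-negative; with log2(x+0.25) zeros and small values become negative, which spreads out the low end of the range.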

My idea is simple. Hope it's helpful.


modified 3.5 years ago • written 3.5 years ago by zju.whw
4.9 years ago, Sean Davis (National Institutes of Health, Bethesda, MD) wrote:

See the voom() function in the Bioconductor limma package or the vst functionality in DESeq2.

written 4.9 years ago by Sean Davis