Hi all,

I am working with RNA-seq data. After mapping the reads to the reference and obtaining RPKM values for each gene, I want to normalize the expression values.

Starting from the RPKM values, I removed the genes with too many zeros and ended up with 12K gene expression profiles.

The ranges of RPKM are 0 to 1e-6, which cannot be fit to a normal distribution.

I tried two methods to normalize the expression profiles:

1) Replace the zeros with the smallest value and then log2-transform the data. The distribution looks roughly normal, but it is not actually standardized (it does not fit an N(0,1) distribution).

2) Transform the ranks of the expression values for each gene to their respective quantiles of an N(0,1) distribution. However, the resulting distribution profiles did not seem good enough.
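For reference, method 2 (a rank-based inverse normal transformation) can be sketched as below. This is only a minimal illustration using the Python standard library, not the code used in the original post: the function name `rank_to_normal`, the `offset` of 0.5, the average-rank handling of ties, and the example values are all assumptions. It also shows one possible reason the result may look poor: genes with many zeros produce tied ranks, and all tied values collapse onto a single quantile.

```python
from statistics import NormalDist


def rank_to_normal(values, offset=0.5):
    """Map values to N(0,1) quantiles by rank; ties share their average rank.

    The offset (0.5 here, an assumption) keeps the probabilities strictly
    inside (0, 1) so the inverse CDF is always finite.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf((r - offset) / n) for r in ranks]


# Hypothetical RPKM values for one gene across five samples.
rpkm = [0.0, 0.0, 3.2, 15.7, 120.4]
z = rank_to_normal(rpkm)
print(z)
```

The two zeros receive the same transformed value, so a gene with many zero samples ends up with a spike rather than a bell shape.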

Does anyone have a better solution?

Thanks.

Just curious: why do you need to transform the data into a normal distribution? There are normalization and analysis methods designed specifically for RNA-seq data; Bullard et al. (2010) is a good reference.

Thank you for your comment. The method you referred to is mainly used to detect DE genes, which they claim works better than RPKM values. However, I want to use RPKM values to represent the expression of genes.

The aim of transforming the data to a normal distribution is that the variance of a gene fluctuates too much if I use RPKM values. With arrays, one would use quantile normalization to normalize the expression values and then scale them to get a normal distribution. Is there a better solution for RNA-seq?
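To make the array-style workflow mentioned here concrete, the following is a rough sketch of quantile normalization across samples followed by per-gene scaling. It is a simplified illustration under several assumptions: the function names `quantile_normalize` and `scale_rows` and the small matrix are made up for the example, ties within a sample are not treated specially, and it is not meant to reproduce any particular Bioconductor implementation.

```python
from statistics import mean, stdev


def quantile_normalize(matrix):
    """matrix[g][s] is the expression of gene g in sample s.

    Returns a new matrix in which every sample (column) has the same set of
    values: the mean across samples at each rank. Ties are broken by sort
    order, which is a simplification.
    """
    n_genes = len(matrix)
    n_samples = len(matrix[0])
    columns = [[matrix[g][s] for g in range(n_genes)] for s in range(n_samples)]
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: average of the r-th smallest value in each sample.
    reference = [mean(sc[r] for sc in sorted_cols) for r in range(n_genes)]
    result = [[0.0] * n_samples for _ in range(n_genes)]
    for s, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda g: col[g])
        for rank, g in enumerate(order):
            result[g][s] = reference[rank]
    return result


def scale_rows(matrix):
    """Center each gene to mean 0 and scale it to sample standard deviation 1."""
    scaled = []
    for row in matrix:
        m = mean(row)
        sd = stdev(row)
        # Genes with no variation are set to 0 rather than divided by zero.
        scaled.append([(x - m) / sd if sd > 0 else 0.0 for x in row])
    return scaled


# Hypothetical log-scale expression values: 4 genes (rows) x 3 samples (columns).
expr = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 5.0, 6.0],
    [4.0, 2.0, 8.0],
]
qn = quantile_normalize(expr)
scaled = scale_rows(qn)
```

Note that scaling gives each gene mean 0 and unit variance, but it does not by itself make the values normally distributed.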

When you mention that the rank transformation (method 2) didn't produce good enough results, can you talk more about what exactly you mean by that?

"The ranges of RPKM are 0 to 1e-6..."

That seems like a very narrow range. Did you mean 0 to 1e6 instead?

I ran into the same question. Some papers state in their methods sections that they applied a log2 transformation followed by mean-centering, for example Yue Li (2014) and TCGA-AML (2013). This normalization was already common in the microarray era. The Bioconductor R package affy provides log2-transformed expression values.

Another question is how to do this step. Many RPKM values are 0, and the log2 of 0 is -Inf. This post about log2-transformed RPKM says that you can use `log2(x+1)` or `log2(x+0.25)`.

My idea is simple. Hope it's helpful.
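As a small sketch of the two steps described above (log2 with a pseudocount, then mean-centering), using only the Python standard library. The function names, the example values, and the choice to center each gene across samples are assumptions for illustration; the cited papers may differ in the details.

```python
import math
from statistics import mean


def log2_pseudocount(values, pseudocount=1.0):
    """log2(x + pseudocount); the pseudocount keeps zeros finite."""
    return [math.log2(v + pseudocount) for v in values]


def mean_center(values):
    """Subtract the mean so the values are centered on 0."""
    m = mean(values)
    return [v - m for v in values]


# Hypothetical RPKM values for one gene across four samples.
rpkm = [0.0, 1.0, 3.0, 7.0]
logged = log2_pseudocount(rpkm)   # [0.0, 1.0, 2.0, 3.0]
centered = mean_center(logged)    # [-1.5, -0.5, 0.5, 1.5]
print(logged, centered)

# With the smaller pseudocount, a zero maps to log2(0.25) = -2.0.
print(log2_pseudocount([0.0], pseudocount=0.25))
```

The pseudocount matters mostly for low values: with 1, zeros map to 0, while with 0.25 they map to -2, which spreads the low end of the distribution further apart.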

Thanks