Question: How to normalize RNA-seq quantification data with negative values?
5.5 years ago, Luyi Tian wrote:

I am analyzing an RNA-seq data set. It has been processed to remove technical variation, so unlike raw RPKM values, which start from 0, it contains some negative values. I am reluctant to add a constant to make every value positive so that I can apply a log transformation. Are there any other ways to transform this highly skewed distribution into a standard normal distribution? I am a newbie in statistics.

Thanks in advance

normalization rna-seq rpkm • 3.7k views
written 5.5 years ago by Luyi Tian

Do you still have the raw data? It sounds like the processing was simply done incorrectly (or it's already on a log scale).

written 5.5 years ago by Devon Ryan

The data is from a recent Nature article:

They used a method called PEER to detect batch effects and experimental confounders:

No, the processed data is heavily skewed and should not be on a log scale.

I am now reading the article about PEER and trying to understand why it generates such data.

written 5.5 years ago by Luyi Tian

Interesting, I'll have to read about how PEER differs from SVA/ComBat.

written 5.5 years ago by Devon Ryan

The PEER method is very similar to SVA. It tries to identify 'hidden' confounders and regress them out of your expression values. The only difference is that it uses a Bayesian approach to identify these hidden confounders. The resulting residuals will have both positive and negative numbers and represent relative expression. It's important to understand that these values are quite different from raw read counts: they only contain information on RELATIVE expression WITHIN a gene between samples. Don't add constants; if you are looking for differential expression between treatment groups, just use these values directly in your statistical analysis.
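
To make the "regress the confounder out and work with the residuals" idea concrete, here is a minimal sketch on simulated data for a single gene. It is not PEER: PEER infers hidden factors with a Bayesian model, whereas this toy example assumes the confounder (a made-up 0/1 `batch` label) is already known and removes it with ordinary least squares. The names `regress_out`, `welch_t`, `batch`, `treated` and all the numbers are assumptions for illustration only.

```python
import random
import statistics

def regress_out(values, confounder):
    """Residuals of a least-squares fit: values ~ intercept + slope * confounder."""
    mean_y = statistics.mean(values)
    mean_c = statistics.mean(confounder)
    sxx = sum((c - mean_c) ** 2 for c in confounder)
    sxy = sum((c - mean_c) * (y - mean_y) for c, y in zip(confounder, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_c
    return [y - (intercept + slope * c) for c, y in zip(confounder, values)]

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se

rng = random.Random(0)  # fixed seed so the toy data are reproducible
n = 20
batch = [i % 2 for i in range(n)]               # hypothetical known confounder
treated = [int(i >= n // 2) for i in range(n)]  # samples 10..19 are "treated"

# Simulated expression of one gene: baseline + batch effect + treatment effect + noise.
expression = [50 + 30 * b + 5 * t + rng.gauss(0, 0.5) for b, t in zip(batch, treated)]

# Residuals are centered on zero, so some are necessarily negative.
residuals = regress_out(expression, batch)

control_res = [r for r, t in zip(residuals, treated) if t == 0]
treated_res = [r for r, t in zip(residuals, treated) if t == 1]
effect = statistics.mean(treated_res) - statistics.mean(control_res)
t_stat = welch_t(treated_res, control_res)

print(f"min residual: {min(residuals):.2f}, max residual: {max(residuals):.2f}")
print(f"treated - control (residual scale): {effect:.2f}, Welch t: {t_stat:.1f}")
```

The residuals no longer carry the gene's absolute level (the baseline of 50 is gone), but the difference between the treatment groups is still there, which is why the comparison can be run on them directly without adding a constant.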

written 5.5 years ago by lkmklsmn

How was the RPKM computed?

written 5.5 years ago by Bharat Iyengar

It would not be correct to deliberately force a distribution to be normal. If you plot the expression values of different genes, you usually get a power-law kind of distribution rather than a normal distribution.

Not all random variables are normally distributed. Statistically speaking, the suitably standardized sum (or, equivalently, the mean) of n independent and identically distributed (IID) random variables with finite variance converges to a normal distribution as n goes to infinity; that is the central limit theorem (CLT).
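
A small simulation may help illustrate what the CLT does and does not say. The sketch below draws from an exponential distribution (chosen here only as a convenient skewed distribution with finite variance; it is not a model of gene expression) and compares the skewness of single draws with the skewness of means of 100 IID draws. The helper `skewness` and the sample sizes are assumptions for illustration.

```python
import random

def skewness(xs):
    """Sample skewness: mean cubed z-score (0 for a symmetric distribution)."""
    n = len(xs)
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return sum(((x - mean) / sd) ** 3 for x in xs) / n

rng = random.Random(1)  # fixed seed for reproducibility
n_repeats = 2000        # how many times the experiment is repeated
n_per_mean = 100        # IID draws averaged in each repeat

# Individual draws from a skewed distribution (exponential, rate 1).
raw = [rng.expovariate(1.0) for _ in range(n_repeats)]

# Mean of n_per_mean IID draws, repeated n_repeats times.
means = [sum(rng.expovariate(1.0) for _ in range(n_per_mean)) / n_per_mean
         for _ in range(n_repeats)]

skew_raw = skewness(raw)
skew_means = skewness(means)
print(f"skewness of single draws: {skew_raw:.2f}")
print(f"skewness of means of {n_per_mean} draws: {skew_means:.2f}")
```

In theory the exponential distribution has skewness 2, while the mean of n IID draws has skewness 2/sqrt(n), i.e. 0.2 for n = 100. The single draws stay skewed however many of them are collected; it is only their average that approaches a normal shape.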

Different genes are not IID random variables. They are different variables of a multivariate function and moreover they are not independent; if they were independent it would mean that there is no gene regulation.

Note: a random variable is not really a variable, nor is it random. An RV is a function.

If you measure the expression of gene-x 100 times, the mean of those measurements will be approximately normally distributed, in accordance with the CLT.

written 5.5 years ago by Bharat Iyengar