Question

How to normalize RNA-seq quantification data with negative values?

10.5 years ago

Luyi Tian ▴ 120

I am analyzing an RNA-seq data set. It has been processed to remove technical variation, so unlike raw RPKM values, which start from 0, it contains some negative values. I am reluctant to add a constant to make every value positive just so that I can apply a log transformation. Are there other ways to transform this highly skewed distribution into a standard normal distribution? I am a newbie in statistics.

Thanks in advance

RNA-Seq RPKM normalization • 6.4k views

ADD COMMENT • link updated 3.1 years ago by Ram 44k • written 10.5 years ago by Luyi Tian ▴ 120

Do you still have the raw data? It sounds like the processing was simply done incorrectly (or it's already on a log scale).

ADD REPLY • link 10.5 years ago by Devon Ryan 104k

The data is from a recent Nature article: http://www.nature.com/nature/journal/v501/n7468/full/nature12531.html

They used a method called PEER to detect batch effects and experimental confounders: https://www.sanger.ac.uk/resources/software/peer/

No, the processed data is heavily skewed, so it should not be on a log scale.

I am now reading the PEER article and trying to understand why it generates such data.

ADD REPLY • link updated 4.8 years ago by Ram 44k • written 10.5 years ago by Luyi Tian ▴ 120

Interesting, I'll have to read about how PEER differs from SVA/ComBat.

ADD REPLY • link 10.5 years ago by Devon Ryan 104k

The PEER method is very similar to SVA. It tries to identify 'hidden' confounders and regress them out of your expression values. The only difference is that it uses a Bayesian approach to identify these hidden confounders. The resulting residuals contain both positive and negative numbers and represent relative expression. It's important to understand that these values are quite different from raw read counts: they only contain information on RELATIVE expression WITHIN a gene between samples. Don't add constants; just use these values directly in your statistical analysis if you are looking for differential expression between treatment groups.
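To make the point about residuals concrete, here is a minimal sketch. It is not PEER itself: it swaps in ordinary least-squares regression on a single, already known confounder, whereas PEER estimates hidden factors with a Bayesian model. All data are simulated, and the names `regress_out`, `welch_t`, `batch`, and `group` are made up for illustration.

```python
import random
import statistics

def regress_out(values, confounder):
    """Residuals of a simple least-squares fit of values on one confounder."""
    mean_x = statistics.fmean(confounder)
    mean_y = statistics.fmean(values)
    sxx = sum((x - mean_x) ** 2 for x in confounder)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(confounder, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(confounder, values)]

def welch_t(a, b):
    """Welch's t statistic for two groups (no p-value is computed here)."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = (var_a / len(a) + var_b / len(b)) ** 0.5
    return (statistics.fmean(a) - statistics.fmean(b)) / standard_error

random.seed(0)
n = 40
group = [0] * (n // 2) + [1] * (n // 2)          # hypothetical treatment labels
batch = [random.gauss(0, 1) for _ in range(n)]   # simulated confounder
# One simulated gene: baseline 5, treatment effect 1, confounder effect 2, noise.
expr = [5 + 1.0 * g + 2.0 * b + random.gauss(0, 0.5) for g, b in zip(group, batch)]

residuals = regress_out(expr, batch)
treated = [r for r, g in zip(residuals, group) if g == 1]
control = [r for r, g in zip(residuals, group) if g == 0]

print(min(residuals) < 0)                       # residuals include negative values
print(abs(statistics.fmean(residuals)) < 1e-9)  # centred on zero within the gene
print(welch_t(treated, control))                # group comparison, no constant added
```

Because the fit includes an intercept, the residuals for a gene average to zero, so roughly half of them are negative by construction. The comparison between groups works on them as they are, which is why shifting them to be positive buys nothing.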

ADD REPLY • link 10.5 years ago by lkmklsmn ▴ 980

How was the RPKM computed?

ADD REPLY • link 10.5 years ago by Bharat Iyengar ▴ 330

It is not correct to deliberately force a distribution into a normal distribution. If you plot the gene expression values for different genes, you usually get a power-law kind of distribution rather than a normal distribution.

Not all random variables are normally distributed. Statistically speaking, the suitably standardized sum of independent and identically distributed (IID) random variables with finite variance converges to a normal distribution as n goes to infinity (that is the central limit theorem, CLT).

Different genes are not IID random variables. They are different variables of a multivariate function and moreover they are not independent; if they were independent it would mean that there is no gene regulation.

Note: a random variable is not really a variable, nor is it random. An RV is a function.

If you measure the expression of gene-x 100 times, the mean of those measurements will be approximately normally distributed, in accordance with the CLT.
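A small simulation can illustrate what the CLT does and does not say here. This is only a sketch on simulated data: the exponential distribution stands in for a skewed measurement and is an assumption, not a model of real expression values, and `skewness` is a helper written for this example.

```python
import random
import statistics

def skewness(values):
    """Sample skewness (third standardized moment); about 0 for symmetric data."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

random.seed(1)

# Single simulated measurements from a skewed (exponential) distribution.
single = [random.expovariate(1.0) for _ in range(20000)]

# Means of 100 such measurements, repeated 2000 times.
means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(100))
    for _ in range(2000)
]

print(skewness(single))  # stays strongly skewed (theoretical skewness is 2)
print(skewness(means))   # much closer to 0 (theoretical skewness is 0.2)
```

The individual draws remain skewed no matter how many are collected; it is the averages of many draws that look close to normal.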

ADD REPLY • link 10.5 years ago by Bharat Iyengar ▴ 330