Question: How to normalize RPKM values for use in regression models?
5.3 years ago by
jack790 wrote:


I have expression values in RPKM and RPM. The features on the RPKM scale range from 0 to 10000, so before training a model I think it is necessary to normalize the data to a smaller range, such as 0-1, to avoid the masking effect of features with larger values.

I think people use log(intensities) for microarrays. What is a reasonable approach for RNA-seq data, which contains lots of zero values?

Modified 5.3 years ago • written 5.3 years ago by jack790
5.3 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

I think there is no accepted consensus on how best to do this. The only way to find out definitively is to evaluate and compare different methods, e.g. by n-fold cross-validation. Please share your experiences with us; this will allow us to give better advice in the future.

Possibly you should not transform at all in this case. This paper contains some interesting arguments; even though it deals with counts in ecology, it should also be applicable to RNA-seq counts. The recommended alternative to a linear model on log-transformed data (possibly after adding a pseudocount or using vsn) would be to fit a negative binomial generalized linear model to the raw counts. Theoretically, this should be a good method, as the current consensus seems to be to model RNA-seq variance with the negative binomial distribution. We have already collected a lot of evidence against the use of RPKM/FPKM elsewhere. RPKM combines length and library size normalization in a single value, which introduces a library-specific bias instead of removing it. In theory, it might be better to include gene length explicitly as a model parameter if you think it is important; you can then check whether it has a significant impact by comparing models.

glm.nb (in the R package MASS) is an R function for fitting this model.

An alternative might be to use a variance-stabilizing transformation (vsn in R) on RPM values.
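
For reference, here is a minimal sketch of how RPM and RPKM are usually computed from a raw read count. It shows that RPKM applies the library size and the gene length correction in one step. The function names and the example numbers are made up for illustration.

```python
def rpm(count, total_mapped_reads):
    """Reads per million mapped reads (library size normalization only)."""
    return count * 1e6 / total_mapped_reads


def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene length per million mapped reads."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads, 2000 bp long, library of 10 million mapped reads
example_rpm = rpm(500, 10_000_000)
example_rpkm = rpkm(500, 2000, 10_000_000)
print(example_rpm, example_rpkm)
```

Keeping the raw count, the library size and the gene length as separate columns makes it possible to put them into a model as separate terms instead of folding them into one value.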

This PLOS ONE article compares different transformations for regression.




Modified 5.3 years ago • written 5.3 years ago by Michael Dondrup46k

That's a nice paper. I really appreciate that someone finally looked at this question with some worthwhile simulations.

Reply written 5.3 years ago by Devon Ryan92k

Yes, it's one of the more enjoyable papers I have read. Thanks Michael ;-)

Reply written 5.3 years ago by jack790

Thanks Michael, sure. Could you please give the link to the first paper (the one called "This paper")? When I click on it, the link is broken.


Reply written 5.3 years ago by jack790

I have corrected the link. Here is the DOI: 10.1111/j.2041-210X.2010.00021.x (it's not PubMed indexed). I don't think it is by any means an authoritative answer, but it contains interesting arguments, and it addresses your question in the Introduction.

Reply modified 5.3 years ago • written 5.3 years ago by Michael Dondrup46k

@Michael Dondrup, it was a very interesting paper, and it seems that the rank-based Blom transformation works better. But doesn't ranking lose a lot of information? Also, do you know where I can find code for the different transformations?
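
For the rank-based Blom transformation, here is a minimal sketch using only the Python standard library. It replaces each value by its rank r among n values and maps (r - 3/8) / (n + 1/4) through the inverse standard normal CDF, which is Blom's usual formula. The function names and the example values are made up for illustration, and sharing the average rank among ties is an assumption; this is not the code used in the paper.

```python
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def blom_transform(values):
    """Rank-based inverse normal transformation with Blom's constants."""
    n = len(values)
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - 0.375) / (n + 0.25))
            for r in average_ranks(values)]


# Hypothetical RPKM values for one gene across five samples, including zeros
rpkm_values = [0.0, 0.0, 3.2, 150.0, 9800.0]
blom_scores = blom_transform(rpkm_values)
print(blom_scores)
```

The output keeps only the ordering of the input, so the extreme value 9800 ends up about as far from the centre as the smallest values do on the other side, and the zeros need no pseudocount.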


Reply written 5.2 years ago by jack790
5.3 years ago by
Freiburg, Germany
Devon Ryan92k wrote:

Just add a small value (0.01, perhaps) and then take the log. You never get 0 values in microarrays due to background fluorescence and binding, which adding a small value somewhat mimics.
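
A minimal sketch of this, assuming a pseudocount of 0.01 and log base 2 (the base is an assumption; any base works as long as it is used consistently). The function name and the example values are made up for illustration.

```python
import math


def log_transform(values, pseudocount=0.01):
    """Add a small constant so that zeros stay finite, then take log2."""
    return [math.log2(v + pseudocount) for v in values]


# Hypothetical RPKM values, including a zero
rpkm_values = [0.0, 0.5, 12.0, 10000.0]
logged = log_transform(rpkm_values)
print(logged)
```

Note that the choice of pseudocount determines how far the zeros sit below the smallest non-zero values, so it is worth trying more than one value.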

Written 5.3 years ago by Devon Ryan92k
5.3 years ago by
Damian Kao15k wrote:

This might sound too simple, but I think ranking is probably the best way to transform the data.

Depending on how you handled multi-mapped reads in your tag counting, I don't think comparisons of expression between genes within a sample are valid. For example, I don't think you can compare the expression value of gene A vs gene B within one sample. If you only used uniquely mapped reads for your tag count, then you are introducing a "sequence redundancy" bias into your expression values. Transcripts that share common domains (thus generating multi-mapped reads) will be artificially under-counted vs transcripts that are totally unique.

If the differences between genes within a sample can't really be trusted, then a simple ranking is all that really matters.

But if you used some kind of multi-mapping strategy (RSEM, eXpress), maybe standardization or variance stabilization will be more valid?
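
A minimal sketch of a simple rank transform, assuming there are no ties: each value is replaced by its rank divided by the number of values. Every rank from 1 to n occurs exactly once, so the result is evenly spaced between 1/n and 1 whatever the original distribution looked like. The function name and the example values are made up for illustration.

```python
def scaled_ranks(values):
    """Replace each value by rank / n, giving evenly spaced values in (0, 1].

    Ties are broken by order of appearance here; with many zeros,
    average ranks may be a better choice.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    for position, index in enumerate(order, start=1):
        ranks[index] = position / n
    return ranks


# Hypothetical expression values spanning several orders of magnitude
expression = [9800.0, 0.4, 35.0, 2.1, 410.0]
uniform_like = scaled_ranks(expression)
print(uniform_like)  # [1.0, 0.2, 0.6, 0.4, 0.8]
```

The size of the gaps between values is discarded, which is the price of the transform: 9800 and 410 end up only one step apart.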

Written 5.3 years ago by Damian Kao15k

How does the ranking work? I couldn't understand why it leads to a uniform distribution.

Reply written 5.2 years ago by jack790

How should I do the ranking, and why does it lead to a uniform distribution?


Reply written 5.2 years ago by jack790