How to normalize RPKM values for use in regression models?
3
2
10.3 years ago
jack ▴ 980

Hi,

I have expression values as RPKM and RPM. The features on the RPKM scale range from 0 to 10,000. Before training a model, I think it is necessary to normalize the data to a smaller range, such as 0-1, so that features with larger values do not mask the others.
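For reference, the 0-1 rescaling described above can be sketched in a few lines of Python. The RPKM values are made up for illustration, and `min_max_scale` is a hypothetical helper name, not part of any library.

```python
def min_max_scale(values):
    """Linearly rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical RPKM values for one gene across six samples.
rpkm = [0.0, 2.5, 10.0, 40.0, 150.0, 10000.0]
scaled = min_max_scale(rpkm)
print(scaled)
```

Note that a single very large value pushes all other samples close to 0, so min-max scaling alone does not deal with the skew of the data.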

I think people use log(intensities) for microarrays. What is a reasonable approach for RNA-seq data, which contain lots of zero values?

count-data rpkm RNA-Seq regression transformation • 7.3k views
ADD COMMENT
3
10.3 years ago
Michael 55k

I think there is no accepted consensus on how best to do this. Please share your experiences with us. The only way to find out definitively is to evaluate and compare different methods, e.g. by n-fold cross-validation. This will allow us to give better advice in the future.
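A minimal sketch of such a comparison, in plain Python: k-fold cross-validation of a one-feature linear regression under two candidate transformations. Everything here is an assumption for illustration: the data are simulated so that the response depends on the log of expression (so the log transform wins by construction), and `fit_line` and `cv_mse` are hypothetical helper names. With real data you would plug in your own features, response, and model.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return slope, my - slope * mx


def cv_mse(x, y, transform, k=5, seed=0):
    """Mean squared prediction error over k cross-validation folds."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    tx = [transform(v) for v in x]  # element-wise, so no leakage between folds
    errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        slope, intercept = fit_line([tx[i] for i in train], [y[i] for i in train])
        errors.extend((y[i] - (slope * tx[i] + intercept)) ** 2 for i in fold)
    return sum(errors) / len(errors)


# Simulated data: the response is linear in log(expression + 1) plus noise.
rng = random.Random(1)
latent = [rng.uniform(0, 9) for _ in range(60)]
expr = [math.expm1(u) for u in latent]            # skewed, RPKM-like values
y = [2 * u + rng.gauss(0, 0.5) for u in latent]

candidates = {"identity": lambda v: v, "log1p": math.log1p}
results = {name: cv_mse(expr, y, fn) for name, fn in candidates.items()}
for name, mse in results.items():
    print(name, round(mse, 3))
```

A transformation that depends on the whole data set (ranking, standardization) would have to be fitted on the training folds only, which this sketch does not handle.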

Possibly you should not transform at all in this case. This paper contains some interesting arguments; even though it deals with counts in ecology, it should also apply to RNA-seq counts. The recommended alternative to a linear model on log-transformed data (possibly after adding a pseudocount or using vsn) would be a negative-binomial generalized linear model fitted to the raw counts. Theoretically, this should be a good method, as the current consensus seems to be to model RNA-seq variance with the negative binomial distribution. We have already collected a lot of evidence against the use of RPKM/FPKM elsewhere. RPKM is a convolution of length and library size normalization, which introduces a library-specific bias instead of removing it. In theory, it might be better to model gene length explicitly as a model parameter if you think it is important; you can then check whether it has a significant impact by comparing models.

glm.nb (in the MASS package) is an R function for fitting this model.
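To illustrate what the negative binomial adds over a Poisson model, here is a small Python sketch (not a replacement for glm.nb, and not a model fit). It uses the negative binomial mean-variance relationship Var = mu + alpha * mu^2 and a simple method-of-moments estimate of the dispersion alpha; glm.nb reports theta, which corresponds to 1/alpha. The counts are invented and `nb_dispersion_moments` is a hypothetical helper name.

```python
from statistics import mean, variance


def nb_dispersion_moments(counts):
    """Method-of-moments estimate of alpha in Var = mu + alpha * mu**2."""
    mu = mean(counts)
    var = variance(counts)  # sample variance (n - 1 in the denominator)
    if mu == 0:
        return 0.0
    # Clamp at 0: variance at or below the mean means no overdispersion.
    return max((var - mu) / mu ** 2, 0.0)


# Hypothetical raw read counts for one gene across eight samples.
counts = [12, 30, 7, 55, 21, 90, 15, 40]
alpha = nb_dispersion_moments(counts)
print("mean:", mean(counts), "variance:", round(variance(counts), 1))
print("estimated dispersion alpha:", round(alpha, 3))
```

For Poisson data the variance would be about equal to the mean and alpha close to 0; here the variance is far larger than the mean, which is the overdispersion the negative binomial is meant to capture.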

An alternative might be to use a variance-stabilizing transformation (vsn in R) on RPM values.

This PLoS ONE article compares different transformations for regression.

ADD COMMENT
1

That's a nice paper. I really appreciate that someone finally looked at this question with some worthwhile simulations.

ADD REPLY
0

Yes, it's one of the more enjoyable papers I have read. Thanks Michael ;-)

ADD REPLY
0

Thanks Michael. Could you please give the link to the first paper (the one called "This paper")? The link is broken when I click on it.

ADD REPLY
1

I have corrected the link; here is the DOI: 10.1111/j.2041-210X.2010.00021.x. It's not PubMed-indexed. I don't think it is by any means an authoritative answer, but it contains interesting arguments. (And it addresses your question in the Introduction.)

ADD REPLY
0

@Michael Dndrp, it was a very interesting paper, and it seems that the rank-based Blom transformation works better. But doesn't ranking lose a lot of information? Also, do you know where I can find the code for the different transformations?
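For what it's worth, a rank-based Blom transformation is short enough to sketch by hand. The Python below is not the paper's code; it assumes the usual Blom formula, which maps the rank r of each of n values to the standard normal quantile of (r - 3/8) / (n + 1/4), with tied values sharing their average rank. `average_ranks` and `blom_transform` are hypothetical helper names.

```python
from statistics import NormalDist


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def blom_transform(values):
    """Rank-based inverse normal transformation with Blom's offsets."""
    n = len(values)
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - 0.375) / (n + 0.25))
            for r in average_ranks(values)]


# Hypothetical expression values for one gene, including tied zeros.
expr = [0.0, 0.0, 3.0, 10.0, 250.0]
z = blom_transform(expr)
print([round(v, 3) for v in z])
```

The result keeps only the ordering of the samples, so the size of the gap between 10 and 250 is lost, which is the information loss asked about above.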

ADD REPLY
1
10.3 years ago

Just add a small value (0.01, perhaps) and then take the log. You never get 0 values in microarrays due to background fluorescence and binding, which adding a small value somewhat mimics.
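A minimal sketch of that in Python, with made-up RPKM values; the pseudocount of 0.01 is the value suggested above, while the base-2 logarithm and the name `log_pseudocount` are assumptions for illustration.

```python
import math


def log_pseudocount(values, pseudocount=0.01, base=2):
    """Log-transform after adding a small constant so zeros stay finite."""
    return [math.log(v + pseudocount, base) for v in values]


# Hypothetical RPKM values, including a zero.
rpkm = [0.0, 0.5, 10.0, 10000.0]
logged = log_pseudocount(rpkm)
print([round(v, 2) for v in logged])
```

The choice of pseudocount decides how far zeros end up from small non-zero values, so it is worth trying more than one value.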

ADD COMMENT
1
10.3 years ago

This might sound too simple, but I think ranking is probably the best way to transform the data.

Depending on how you handled multi-mapped reads in your tag counting, I don't think inter-sample comparisons of gene expression are valid. For example, I don't think you can compare the expression value of gene A vs gene B within one sample. If you only used uniquely mapped reads for your tag count, then you are introducing a "sequence redundancy" bias into your expression values. Transcripts that share common domains (and thus generate multi-mapped reads) will be artificially under-counted compared with transcripts that are totally unique.

If the inter-sample differences can't really be trusted, then a simple ranking is all that really matters.

But if you used some kind of multi-mapping strategy (RSEM, eXpress), maybe standardization or variance stabilization will be more valid?

ADD COMMENT
0

How does the ranking work? I couldn't understand why it leads to a uniform distribution.

ADD REPLY
0

How should I do the ranking, and why does it lead to a uniform distribution?
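A small Python sketch may help with this question. It replaces each value by its rank divided by n; the values are invented and `rank_to_unit_interval` is a hypothetical helper name. Ties are broken by input order here, which is only one of several possible choices.

```python
def rank_to_unit_interval(values):
    """Replace each value by rank / n (ties broken by input order)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scaled = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scaled[i] = rank / n
    return scaled


# Hypothetical, heavily skewed expression values for one gene.
skewed = [0.1, 3.0, 0.0, 9500.0, 42.0]
print(rank_to_unit_interval(skewed))  # [0.4, 0.6, 0.2, 1.0, 0.8]
```

With n distinct values the ranks are always 1, 2, ..., n, each occurring exactly once, so the transformed values are evenly spaced no matter how skewed the input was. That is why ranking gives a (discrete) uniform distribution. With many tied zeros, as in RNA-seq, this only holds approximately.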

ADD REPLY
