Question

How can I find the correct transformation for a continuous covariate in a gene expression linear model?

8.9 years ago

Ryan Thompson ★ 3.6k

Sometimes, in an experiment, I want to model RNA expression as a function of a continuous variable such as age, dose, or time after treatment, using a linear model. Fitting the model is easy enough, but, as the name suggests, the log expression is modelled as a linear function of the covariate in question. How do I know that a linear relationship is the correct one? What if the covariate needs to be log-transformed or square-root transformed? How would I figure that out? Obviously I could try a bunch of common transformations and see which one works "best", but that constitutes data snooping. Plotting expression against the covariate of interest might work when there is only one covariate, but it becomes less effective when there are several.

So, is there a statistically principled way to determine the appropriate transformation for a continuous covariate in a linear model?

RNA-Seq linear-modeling covariates • 3.0k views


Are we ruling out running a pilot experiment, or splitting the data so that the snooping and the testing are done on different subsets? I strongly suspect that those are the only really reliable methods that avoid data snooping (assuming no a priori knowledge of what the covariate relationship might reasonably look like).

8.9 years ago by Devon Ryan 104k
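
As an illustration of the data-splitting suggestion above, here is a minimal sketch, not a definitive implementation. It assumes a small illustrative set of candidate transformations (identity, log, square root), a single covariate, and synthetic data; the names `TRANSFORMS`, `total_rss`, and `choose_on_subset` are hypothetical. The transformation is chosen on one random half of the samples, and the other half is left untouched for the actual per-gene tests.

```python
import math
import random
from statistics import fmean

# Illustrative candidate transformations of the covariate (an assumed set).
TRANSFORMS = {"identity": lambda v: v, "log": math.log, "sqrt": math.sqrt}


def rss(x, y):
    """Residual sum of squares of the least-squares line y ~ a + b*x."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    return syy - sxy ** 2 / sxx


def total_rss(expr, covariate, transform):
    """Sum the per-gene RSS of log expression ~ transform(covariate)."""
    x = [transform(v) for v in covariate]
    return sum(rss(x, gene) for gene in expr)


def choose_on_subset(expr, covariate, explore_idx):
    """Pick the lowest-RSS transformation using only the exploration samples."""
    cov = [covariate[i] for i in explore_idx]
    sub = [[gene[i] for i in explore_idx] for gene in expr]
    return min(TRANSFORMS, key=lambda name: total_rss(sub, cov, TRANSFORMS[name]))


# Synthetic data (hypothetical): 50 genes whose log expression is linear in log(dose).
rng = random.Random(0)
dose = [rng.uniform(1, 100) for _ in range(40)]
expr = []
for _ in range(50):
    slope = rng.gauss(0, 1)
    expr.append([slope * math.log(d) + rng.gauss(0, 0.1) for d in dose])

# Split the samples at random: one half for choosing, the other half for testing.
idx = list(range(len(dose)))
rng.shuffle(idx)
explore_idx, test_idx = idx[:20], idx[20:]

chosen = choose_on_subset(expr, dose, explore_idx)
print("Transformation chosen on the exploration half:", chosen)
# The per-gene tests would then be run on test_idx only, using `chosen`.
```

The cost of this design is power: only half of the samples remain for the final tests, which is why a separate pilot experiment is the other option mentioned.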

Yes, I'm asking if there's a way to determine the appropriate transformation from the data itself. Perhaps by discovering the globally optimal transformation across all genes, so that any one gene only contributes a tiny fraction and data snooping is minimized?

8.9 years ago by Ryan Thompson ★ 3.6k
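
The "globally optimal transformation across all genes" idea above could be sketched roughly as follows. This is one possible reading, not an established method from the thread: it assumes a Box-Cox-style power family applied to the covariate, a coarse grid of exponents, and the pooled residual sum of squares over all genes as the criterion. All function names and the synthetic data are hypothetical. Because a straight-line fit with an intercept is unaffected by shifting or rescaling the covariate, the pooled RSS values are directly comparable across exponents.

```python
import math
import random
from statistics import fmean


def power_transform(v, lam):
    """Box-Cox-style power transform of a positive covariate; lam = 0 means log."""
    return math.log(v) if lam == 0 else (v ** lam - 1) / lam


def rss(x, y):
    """Residual sum of squares of the least-squares line y ~ a + b*x."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    return syy - sxy ** 2 / sxx


def pooled_rss(expr, covariate, lam):
    """Total RSS over all genes for one shared exponent lam."""
    x = [power_transform(v, lam) for v in covariate]
    return sum(rss(x, gene) for gene in expr)


def best_global_lambda(expr, covariate, grid):
    """Exponent on the grid that minimises the RSS pooled over all genes."""
    return min(grid, key=lambda lam: pooled_rss(expr, covariate, lam))


# Synthetic data (hypothetical): 100 genes, log expression linear in sqrt(age).
rng = random.Random(1)
age = [rng.uniform(1, 80) for _ in range(30)]
expr = []
for _ in range(100):
    slope = rng.gauss(0, 1)
    expr.append([slope * math.sqrt(a) + rng.gauss(0, 0.1) for a in age])

grid = [-1, -0.5, 0, 0.5, 1, 2]  # lam = 1 is the untransformed covariate
best = best_global_lambda(expr, age, grid)
print("Exponent with the lowest pooled RSS:", best)
```

Since each gene contributes only a small share of the pooled criterion, the selection is only weakly tied to any single gene, but the same data are still used twice, so this reduces data snooping rather than removing it.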

I suspect the answer will be that the closest you can get is to interpret a PCA plot. The data snooping there is about as low as you're going to get. You might want to post this to Cross Validated and see what the statistics folks think; hopefully they know of a better option.

8.9 years ago by Devon Ryan 104k