How to check which genes affect a continuous phenotype
5 months ago
Sam ▴ 170

I want to test which genes affect a change in a condition. The condition is measured on a continuous scale. The data come from microarrays, and there are two batch effects to account for.

I thought of performing a multiple linear regression on each gene separately. The continuous treatment would be the response variable, and the explanatory variables would be the gene expression and the two batch effects. I could then take the p-value of the gene expression term and adjust it for multiple comparisons.

Is this a reasonable procedure? Are there much better ones?
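
The per-gene procedure described above could be sketched in base R roughly as follows. This is only a minimal illustration on simulated data: the objects `expr`, `pheno`, `batch1`, `batch2` and the helper `gene_pvalue` are hypothetical stand-ins, not names from any package.

```r
set.seed(1)
n_samples <- 40
n_genes <- 200

# Simulated stand-ins for the real data (all names here are hypothetical)
batch1 <- factor(rep(c("a", "b"), each = n_samples / 2))
batch2 <- factor(rep(c("x", "y"), times = n_samples / 2))
expr <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
# Phenotype driven by gene1, plus a batch shift and noise
pheno <- 2 * expr["gene1", ] + 0.5 * (batch1 == "b") + rnorm(n_samples)

# Fit pheno ~ gene + batch1 + batch2 for one gene and
# return the p-value of the gene expression term
gene_pvalue <- function(g) {
  fit <- lm(pheno ~ g + batch1 + batch2)
  summary(fit)$coefficients["g", "Pr(>|t|)"]
}

p_raw <- apply(expr, 1, gene_pvalue)     # one p-value per gene (row)
p_adj <- p.adjust(p_raw, method = "BH")  # Benjamini-Hochberg adjustment

head(sort(p_adj))
```

With real data, `expr` would be the genes-by-samples expression matrix and the batch variables would come from the sample annotation.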

3
5 months ago

Hi Sam,

This is both a reasonable and standard approach to test a gene's expression against a continuous response variable. To use a linear regression, obviously, the residuals of your input parameters have to follow a normal distribution.

We developed RegParallel for this purpose, but I am sure that you can code it yourself: https://bioconductor.org/packages/release/data/experiment/vignettes/RegParallel/inst/doc/RegParallel.html

I have also posted code here, which I am confident will be of interest to you:

Regarding batch, you can include this as a covariate, of course, or you could treat one dataset as training and the other as validation, thus ignoring batch. Another approach would be to construct the models on one dataset and then use predict() on the new dataset.
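
As a rough illustration of the last option (fit on one dataset, then call `predict()` on the other), a sketch for a single gene might look like this. The data frames `train` and `valid` are simulated and hypothetical; in practice they would be the two datasets.

```r
set.seed(2)
# Hypothetical training and validation sets for one gene
train <- data.frame(gene = rnorm(30))
train$pheno <- 1.5 * train$gene + rnorm(30)
valid <- data.frame(gene = rnorm(20))
valid$pheno <- 1.5 * valid$gene + rnorm(20)

# Build the model on the training set only
fit <- lm(pheno ~ gene, data = train)

# Predict the phenotype in the held-out set
pred <- predict(fit, newdata = valid)

# Agreement between predicted and observed phenotype in the held-out data
cor(pred, valid$pheno)
```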

Kevin

0

Hi. The most relevant approach for me is the one explained in Resources for gene signature creation (continuous outcome).

• How should I search for articles that use this modeling approach? Perhaps you could suggest one? More generally, if you can point me to learning resources on this approach, that would be great.

Some specific questions I have (and feel free to refer me to learning resources instead of answering):

1. My continuous outcome, strictly speaking, occurred before the gene expression was measured. I think that, when performing the regression, it does not matter which variable is the response and which is the explanatory variable. I am interested in the p-value (which is the same regardless of the direction). Does it matter for publication's sake? Conceptually it is nicer for me to put the continuous phenotype as the response, despite being earlier in time.

2. After finding individual statistically significant genes, is it necessary to build a "final" model including all of the significant (after BH correction) genes? Why is it necessary?

3. Afaik, if I am interested in finding genes that react differently to the continuous phenotype between two conditions, I should add a dummy variable for the conditions, and then the interaction variable (between the gene and the condition) would tell me that (if significant). I.e. Pheno ~ Gene + Condition + Gene:Condition. Is that correct? Thanks.

1

> I am interested in the p-value (which is the same regardless of the direction). Does it matter for publication's sake? Conceptually it is nicer for me to put the continuous phenotype as the response, despite being earlier in time.

Indeed, in this case, for a simple univariable model of the form lm(y ~ x), the direction is not too important if you only want the p-value. The beta coefficient will change, though.
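
A quick way to convince yourself of this is to fit both directions on the same data and compare. The sketch below uses simulated values; `x` and `y` are hypothetical stand-ins for one gene's expression and the phenotype.

```r
set.seed(3)
x <- rnorm(50)            # hypothetical gene expression
y <- 0.4 * x + rnorm(50)  # hypothetical phenotype

fit_yx <- summary(lm(y ~ x))$coefficients
fit_xy <- summary(lm(x ~ y))$coefficients

p_yx <- fit_yx["x", "Pr(>|t|)"]
p_xy <- fit_xy["y", "Pr(>|t|)"]
b_yx <- fit_yx["x", "Estimate"]
b_xy <- fit_xy["y", "Estimate"]

all.equal(p_yx, p_xy)  # the slope's p-value is the same in both directions
c(b_yx, b_xy)          # the slope estimates themselves differ
```

Both fits test the same correlation between `x` and `y`, which is why the p-values agree while the slopes are on different scales.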

> After finding individual statistically significant genes, is it necessary to build a "final" model including all of the significant (after BH correction) genes? Why is it necessary?

It is not necessary. The 'final' model comprising multiple variables (more than one gene) simply tests the combined effect of these genes in relation to the response.

> Afaik, if I am interested in finding genes that react differently to the continuous phenotype between two conditions, I should add a dummy variable for the conditions, and then the interaction variable (between the gene and the condition) would tell me that (if significant). I.e. Pheno ~ Gene + Condition + Gene:Condition. Is that correct? Thanks.

I guess that is one way to do it, yes. This would seem to be the most obvious way.
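
For one gene, the interaction model could be sketched as below. The data are simulated, and `gene`, `condition` and `pheno` are hypothetical names; the coefficient label `gene:conditiontreated` follows from R's default treatment contrasts with `ctrl` as the reference level.

```r
set.seed(4)
n <- 60
condition <- factor(rep(c("ctrl", "treated"), each = n / 2))
gene <- rnorm(n)
# Simulated slopes: 0.2 in ctrl and 0.2 + 1.5 in treated
pheno <- 0.2 * gene + 1.5 * gene * (condition == "treated") + rnorm(n)

# Pheno ~ Gene + Condition + Gene:Condition
fit <- lm(pheno ~ gene + condition + gene:condition)
coefs <- summary(fit)$coefficients

# p-value of the interaction term: does the gene's slope differ by condition?
p_interaction <- coefs["gene:conditiontreated", "Pr(>|t|)"]
p_interaction
```

Run per gene, the interaction p-values would then be adjusted for multiple comparisons in the same way as before.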

0

> To use a linear regression, obviously, the residuals of your input parameters have to follow a normal distribution.

I will perform the linear regression for the continuous variable versus each gene separately, and then look at the p-value. To assess the model assumptions, it seems that I need to check them for each gene separately. How can this be done for thousands of genes?

1

Try my Bioconductor package, RegParallel

0

Sorry, but I don't understand. RegParallel enables one to perform a large number of tests simultaneously. How do I test the assumptions of those tests?

In linear regression, as far as I understand, I need to test the assumptions of normality and homoscedasticity of the error terms. Usually this is done via plots (something RegParallel is not able to help with). Should I look at those plots for all genes that turn out to be significant?
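
One possible way to screen many genes without looking at every plot (an assumption on my part, not something suggested elsewhere in this thread) is to compute numeric diagnostics per gene and only plot the genes that are flagged or significant. The sketch below uses base R on simulated data; `check_gene`, `expr` and `pheno` are hypothetical names, the variance check is a crude Breusch-Pagan-style regression of squared residuals on fitted values, and the 0.01 cut-off is arbitrary.

```r
set.seed(5)
n_samples <- 40
n_genes <- 100
expr <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
pheno <- rnorm(n_samples)

# For one gene: Shapiro-Wilk test on the residuals (normality) and a crude
# constant-variance check (squared residuals regressed on fitted values)
check_gene <- function(g) {
  fit <- lm(pheno ~ g)
  res <- residuals(fit)
  res2 <- res^2
  fitted_vals <- fitted(fit)
  het <- summary(lm(res2 ~ fitted_vals))$coefficients
  c(shapiro_p = shapiro.test(res)$p.value,
    hetero_p = het["fitted_vals", "Pr(>|t|)"])
}

# One row per gene, one column per diagnostic
diagnostics <- t(apply(expr, 1, check_gene))
flagged <- rownames(diagnostics)[diagnostics[, "shapiro_p"] < 0.01 |
                                 diagnostics[, "hetero_p"] < 0.01]

# Then inspect the usual diagnostic plots only for flagged or significant genes:
# plot(lm(pheno ~ expr[flagged[1], ]), which = 1:2)
```

These tests are only a screening aid; with thousands of genes some will be flagged by chance, so the plots for the genes of interest are still worth a look.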