I want to test which genes are associated with a change in a condition. The condition is measured on a continuous scale. The data come from microarrays, and there are two batch effects to account for.
I thought of performing a multiple linear regression on each gene separately. The continuous condition would be the response variable, and the explanatory variables would be the gene expression and the two batch effects. Then I could take the p-value of the gene expression term and adjust it for multiple comparisons.
Is it a reasonable procedure? Are there much better ones?
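For concreteness, a minimal sketch of the per-gene procedure described above, in base R. The data are simulated and every object name (`expr`, `pheno`, `batch1`, `batch2`) is a hypothetical stand-in, so this only illustrates the mechanics and is not a recommendation over dedicated microarray tools.

```r
set.seed(1)
n_samples <- 40
n_genes   <- 200

# Simulated stand-ins for real data (all names are hypothetical)
expr   <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
pheno  <- rnorm(n_samples)                                 # continuous condition
batch1 <- factor(rep(c("a", "b"), length.out = n_samples))
batch2 <- factor(rep(c("x", "y", "z"), length.out = n_samples))

# Fit one model per gene and keep the p-value of the gene expression term
gene_p <- apply(expr, 1, function(g) {
  fit <- lm(pheno ~ g + batch1 + batch2)
  summary(fit)$coefficients["g", "Pr(>|t|)"]
})

# Benjamini-Hochberg adjustment across all genes
gene_padj <- p.adjust(gene_p, method = "BH")
head(sort(gene_padj))
```

Since the simulated expression is pure noise, few or no genes should survive the adjustment here; with real data the same loop would be run on the normalised expression matrix.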
Hi. The most relevant approach for me is the one explained in Resources for gene signature creation (continuous outcome).
Some specific questions I have (feel free to refer me to learning resources instead of answering):
My continuous outcome, strictly speaking, occurred before the gene expression was measured. I think it does not matter, when performing the regression, which variable is the response and which is the explanatory variable. I am interested in the p-value (which is the same regardless of the direction). Does it matter for publication's sake? Conceptually it is nicer for me to put the continuous phenotype as the response, despite it being earlier in time.
After finding individual statistically significant genes, is it necessary to build a "final" model including all of the genes that are significant after BH correction? Why is it necessary?
AFAIK, if I am interested in finding genes whose relationship with the continuous phenotype differs between two conditions, I should add a dummy variable for the condition, and then the interaction term (between the gene and the condition) would tell me that, if significant. That is, Pheno ~ Gene + Condition + Gene:Condition. Is that correct? Thanks.
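A small sketch of the interaction model for a single gene, on simulated data. The variable names and the effect sizes are made up for illustration only.

```r
set.seed(2)
n <- 60
condition <- factor(rep(c("ctrl", "case"), each = n / 2),
                    levels = c("ctrl", "case"))
gene <- rnorm(n)

# Simulated phenotype whose slope on the gene differs between conditions
pheno <- 0.2 * gene + 1.0 * gene * (condition == "case") + rnorm(n)

# gene * condition expands to gene + condition + gene:condition
fit   <- lm(pheno ~ gene * condition)
coefs <- summary(fit)$coefficients

# The interaction row tests whether the gene slope differs between conditions
coefs["gene:conditioncase", ]
```

The `gene` row on its own then describes the slope in the reference level (`ctrl` here), so it should not be read as an overall gene effect once the interaction is in the model.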
Indeed, in this case, for a simple univariable model of the form
lm(y ~ x)
the direction is not too important if you only want the p-value. The beta coefficient will change, though.
Building a 'final' model is not necessary. A 'final' multivariable model (with more than one gene) just tests the combined effect of those genes in relation to the response.
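A quick way to check the point about direction on simulated data (names and effect sizes are hypothetical). The same holds when the same covariates, such as a batch factor, appear in both models, because both t-tests are tests of the same partial correlation.

```r
set.seed(3)
n <- 50
batch <- factor(rep(c("a", "b"), length.out = n))
gene  <- rnorm(n)
pheno <- 0.5 * gene + rnorm(n, sd = 2)

# Phenotype as response vs. gene expression as response, same covariate
fwd <- summary(lm(pheno ~ gene + batch))$coefficients["gene", ]
bwd <- summary(lm(gene ~ pheno + batch))$coefficients["pheno", ]

# The p-values agree; the slope estimates do not
p_values  <- c(forward = unname(fwd["Pr(>|t|)"]), reverse = unname(bwd["Pr(>|t|)"]))
estimates <- c(forward = unname(fwd["Estimate"]), reverse = unname(bwd["Estimate"]))
print(p_values)
print(estimates)
```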
I guess that is one way to do it, yes. This would seem to be the most obvious way.
I will regress the continuous variable on each gene separately and then look at the p-value. To check the model assumptions, it seems that I need to examine them for each gene separately. How can this be done for thousands of genes?
Try my Bioconductor package, RegParallel.
Sorry, but I don't understand. RegParallel enables one to perform a large amount of tests simultaneously. How do I test the assumptions of those tests?
In linear regression, as far as I understand, I need to check the assumptions of normality and homoscedasticity of the error terms. Usually this is done via plots (something RegParallel is not able to help with). Should I look at those plots for all genes that turn out to be significant?
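One possible way to screen many genes without looking at every plot is to compute numerical diagnostics per gene and only plot the ones that get flagged. The sketch below is an assumption on my part, not something RegParallel provides: it uses base R only, simulated data, hypothetical names, a Shapiro-Wilk test on the residuals, and a hand-computed Breusch-Pagan style test (Koenker's studentized version). The 0.01 cut-off is arbitrary.

```r
set.seed(4)
n_samples <- 40
n_genes   <- 100
expr  <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
                dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
pheno <- rnorm(n_samples)
batch <- factor(rep(c("a", "b"), length.out = n_samples))

diagnose_gene <- function(g) {
  fit <- lm(pheno ~ g + batch)
  res <- residuals(fit)
  # Normality of residuals: Shapiro-Wilk test
  p_normal <- shapiro.test(res)$p.value
  # Homoscedasticity: regress squared residuals on the same predictors;
  # n * R^2 is approximately chi-squared with (number of predictors) df
  aux  <- lm(res^2 ~ g + batch)
  stat <- length(res) * summary(aux)$r.squared
  p_homosc <- pchisq(stat, df = aux$rank - 1, lower.tail = FALSE)
  c(p_normal = p_normal, p_homosc = p_homosc)
}

# One row per gene, one column per diagnostic p-value
diag_tab <- t(apply(expr, 1, diagnose_gene))
flagged  <- rownames(diag_tab)[diag_tab[, "p_normal"] < 0.01 |
                               diag_tab[, "p_homosc"] < 0.01]

# Inspect plots only for flagged (or significant) genes, for example:
# plot(lm(pheno ~ expr[flagged[1], ] + batch), which = 1:2)
head(diag_tab)
```

With thousands of genes some will be flagged by chance alone, so these p-values are better treated as a way to rank genes for visual inspection than as a formal pass/fail rule.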