**Edit September 22, 2019:**

Do not be naïve and assume that you can quickly learn these methods. I learned these while working with a PhD-level statistician. Consult a statistician in your lab / department, if possible.

## ---------

If you have already derived a panel of genes that are statistically significantly differentially expressed, then there are various tools to use.

# continuous outcome

If you are modelling a *continuous* trait (e.g. blood levels of inflammatory markers, metabolite levels, etc.), then build a linear model using `lm()`

:

For example:

```
lm(MyContinuousTrait ~ gene1 + gene2 + gene3 + ... + geneX, data=MyData)
```

You can then do stepwise regression to refine the model further based on the AIC or BIC using `stepAIC()`

from *MASS* package.

If you have many genes (e.g. > 30), then first test each gene independently, followed by building a combined model (like above) with the statistically significant genes:

```
lm(MyContinuousTrait ~ gene1, data=MyData) # p=0.8
lm(MyContinuousTrait ~ gene2, data=MyData) # p=0.03
lm(MyContinuousTrait ~ gene3, data=MyData) # p=0.67
lm(MyContinuousTrait ~ gene4, data=MyData) # p=0.02
MyModel <- lm(MyContinuousTrait ~ gene2 + gene4, data=MyData)
```

With this final model, you can still do stepwise regression to refine it even further.

For performing QC on this model::

```
library(car)
# check for outliers based on Cook's distances (2 different plots)
plot(cooks.distance(MyModel))
cutoff <- 4 / ((nrow(MyData) - length(MyModel$coefficients) - 1))
plot(MyModel, which=4, cook.levels=cutoff)
# Bonferonni p-value for most extreme observations
outlierTest(MyModel)
# check for non-normality via QQ plot for 'studentised' residuals
qqPlot(MyModel, main="QQ Plot")
# evaluate homoscedasticity
# non-constant error variance test
ncvTest(MyModel)
# multi-collinearity via variance inflation factors
vif(MyModel)
# a VIF > 2.5 indicates that the R^2 of a predictor to all other predictors is 0.6, i.e., collinear
sqrt(vif(MyModel)) > 2.5
# non-linearity via component + residual plot
crPlots(MyModel)
# test for Autocorrelated Errors
durbinWatsonTest(MyModel)
# global test of model assumptions
library(gvlma)
gvmodel <- gvlma(MyModel)
summary(gvmodel)
# k-fold cross-validation (here 3-fold)
library(DAAG)
cv.lm(MyData[!is.na(MyData$MyOutcomeTrait),], MyModel, m=3)
# R^2 shrinkage using 10-fold cross-validation
library(bootstrap)
theta.fit <- function(x, y) { lsfit(x,y) }
theta.predict <- function(fit, x) { cbind(1,x) %*% fit$coef }
x <- as.matrix(MyData[,which(colnames(MyData) %in% names(summary(MyModel)$coefficients[,1]))])
y <- as.matrix(MyData[,c("MyOutcomeTrait")])
results <- crossval(x, y, theta.fit, theta.predict, ngroup=10)
# Raw R^2
cor(y, MyModel$fitted.values)**2
# cross-validated R^2
cor(y, results$cv.fit)**2
```

Finally, if you have built a linear model, you can make a nice plot with *ggplot2* using the following code:

```
MyData$MyContinuousTrait.pred = predict(MyModel)
ggplot(MyData, aes(x=MyContinuousTrait, y=MyContinuousTrait.pred)) +
geom_point() + xlim(0,5) + ylim(0,5) +
geom_smooth(method='lm',formula=y~x) +
stat_smooth(method="lm", fullrange=TRUE) +
ggtitle("")
```

- X-axis: real values for the outcome variable
- Y-axis: model predicted values

# categorical outcome

If your outcome is *categorical*, i.e., a binary trait or has greater than 2 categories, then first build a binary or multinomial logistic regression model using `glm(..., family="binomial")`

with your predictors (genes). You can then choose the best predictors again based on the AIC or BIC using stepwise regression.

The checks for a `glm()`

differ from those for a `lm()`

:

```
# McFadden R^2 index used to assess the model fit
library(pscl)
pR2(MyModel)
# perform Chi^2 ANOVA
anova(MyModel, test="Chisq")
# perform cross validation
# the delta values should not greatly differ
# K=number of samples, i.e., leave-one-out CV.
library(boot)
cv.glm(MyData, MyModel, K=nrow(MyData))$delta
```

## ---------

For both `lm()`

and `glm()`

- based models, after you have chosen your best predictors and defined a final model, you can derive odds ratios (with confidence intervals) with:

```
# get ORs and confidence intervals
exp(cbind(OR=coef(MyModel), confint(MyModel)))
```

You can also perform ROC analysis using the *pROC* package.

Other options that can be used instead of `lm()`

/`glm()`

are lasso-penalized, elastic-net, and ridge regression (*glmnet* package).