**Edit: answer updated 20th July 2018**

If you have already derived a panel of genes that are, I assume, statistically significantly differentially expressed, then there are various tools to use.

If your outcome is a binary trait (or has more than two categories), then first build a binary or multinomial logistic regression model using `glm()` with your predictors (genes). You can then choose the best predictors based on the AIC or BIC using stepwise regression (`stepAIC()` from the *MASS* package). After you have chosen your best predictors, you can perform ROC analysis using the *pROC* package. Other options for binary traits, or traits with more than two categories, are lasso-penalized, elastic-net, and ridge regression (*glmnet* package). These may take a while to get the hang of, but it is worth knowing how to use them properly.
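A minimal sketch of that workflow (the data frame `MyData`, the outcome column `MyBinaryTrait`, and the gene columns are hypothetical names; the outcome is assumed to be coded 0/1):

```
library(MASS)   #stepAIC()
library(pROC)   #roc(), auc()
library(glmnet) #penalized regression

#Full logistic model with all candidate genes
full <- glm(MyBinaryTrait ~ gene1 + gene2 + gene3, data=MyData, family=binomial)

#Stepwise selection on the AIC (use k=log(nrow(MyData)) for the BIC)
best <- stepAIC(full, direction="both", trace=FALSE)

#ROC analysis on the fitted probabilities of the reduced model
roc.obj <- roc(MyData$MyBinaryTrait, predict(best, type="response"))
auc(roc.obj)
plot(roc.obj)

#Alternatively, cross-validated lasso (alpha=1; alpha=0 gives ridge)
X <- as.matrix(MyData[, c("gene1", "gene2", "gene3")])
cvfit <- cv.glmnet(X, MyData$MyBinaryTrait, family="binomial", alpha=1)
coef(cvfit, s="lambda.min") #genes shrunk to zero are dropped
```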

If you are modelling a continuous trait (e.g. blood levels of inflammatory markers, protein expression, metabolite levels, etc.), then do the same as above but build a linear model using `lm()`.

For example:

```
lm(MyContinuousTrait ~ gene1 + gene2 + gene3 + ... + geneX, data=MyData)
```

You can then do stepwise regression to refine the model further.
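For example, a stepwise refinement of the linear model above might look like this (again using the hypothetical `MyData` column names):

```
library(MASS) #stepAIC()

full <- lm(MyContinuousTrait ~ gene1 + gene2 + gene3, data=MyData)
#Stepwise selection in both directions on the AIC;
#use k=log(nrow(MyData)) to select on the BIC instead
reduced <- stepAIC(full, direction="both", trace=FALSE)
summary(reduced)
```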

If you have many genes (e.g. > 30), then first test each gene independently and then build a combined model (like above) with the statistically significant genes (remembering to correct for multiple testing, e.g. via `p.adjust()`):

```
lm(MyContinuousTrait ~ gene1, data=MyData) #not statistically significant
lm(MyContinuousTrait ~ gene2, data=MyData) #statistically significant
lm(MyContinuousTrait ~ gene3, data=MyData) #not statistically significant
lm(MyContinuousTrait ~ gene4, data=MyData) #statistically significant
lm(MyContinuousTrait ~ gene5, data=MyData) #statistically significant
MyModel <- lm(MyContinuousTrait ~ gene2 + gene4 + gene5, data=MyData)
```

You can still run stepwise regression on this combined model to refine it before any downstream testing.

Whatever your final model, you need to check its assumptions to ensure that it is actually robust.

```
library(car)
#Check for outliers and influential samples based on Cook's distances (2 different plots)
plot(cooks.distance(MyModel))
cutoff <- 4/((nrow(MyData)-length(MyModel$coefficients)-1))
plot(MyModel, which=4, cook.levels=cutoff)
#Bonferroni p-value for the most extreme observations
outlierTest(MyModel)
#Check for non-normality via QQ plot for studentized residuals
qqPlot(MyModel, main="QQ Plot")
#Evaluate homoscedasticity
#Non-constant error variance test
ncvTest(MyModel)
#Multi-collinearity via variance inflation factors
vif(MyModel)
#VIF > 2.5 corresponds to an R-squared of ~0.6 between that predictor and all others
vif(MyModel) > 2.5
#Non-linearity via component + residual plot
crPlots(MyModel)
#Test for Autocorrelated Errors
durbinWatsonTest(MyModel)
#Global test of model assumptions
library(gvlma)
gvmodel <- gvlma(MyModel)
summary(gvmodel)
#K-fold cross-validation (here 3-fold)
library(DAAG)
pdf("MyModel.CrossValidated.pdf")
cv.lm(MyData[!is.na(MyData$MyOutcomeTrait),], MyModel, m=3)
dev.off()
#Assessing R^2 shrinkage using 10-Fold Cross-Validation
library(bootstrap)
theta.fit <- function(x,y){lsfit(x,y)}
theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef}
x <- as.matrix(MyData[,which(colnames(MyData) %in% names(summary(MyModel)$coefficients[,1]))])
y <- as.matrix(MyData[,c("MyOutcomeTrait")])
results <- crossval(x, y, theta.fit, theta.predict, ngroup=10)
#Raw R2
cor(y, MyModel$fitted.values)**2
#Cross-validated R2
cor(y, results$cv.fit)**2
```

Finally, if you have built a linear model, you can make a nice plot with `ggplot2` using the following code:

```
library(ggplot2)
MyData$MyContinuousTrait.pred <- predict(MyModel)
p1 <- ggplot(MyData, aes(x=MyContinuousTrait, y=MyContinuousTrait.pred)) +
  geom_point() + xlim(0,5) + ylim(0,5) +
  #fullrange=TRUE extends the fitted line across the whole axis range,
  #even where no data points exist (such as 0 or 5)
  stat_smooth(method="lm", formula=y~x, fullrange=TRUE) +
  ggtitle("My linear model")
print(p1)
```

By working with this code a fair bit, you can produce a clean scatter plot of observed versus predicted values: the x-axis designates the real values for the outcome variable, whilst the y-axis indicates the model-predicted values.