15 months ago by

University College London

Hey,

I will try to be as brief as possible and give you general points. Firstly, you may find this previous answer an interesting read: What is the best way to combine machine learning algorithms for feature selection such as Variable importance in Random Forest with differential expression analysis?

# Univariate

This obviously just involves testing each variable (gene) as an independent predictor of the outcome. You have Affymetrix microarrays. For processing these, you should default to the *oligo* package. *affy* is another package but it cannot work with the more modern 'ST' Affymetrix arrays.

Limma is still used to fit the regression model independently to each gene / probe-set. A simple workflow may be (you will have to adapt this to your own situation):

```
library("limma")
# 'oligo' is more suited for the Gene ST Affymetrix arrays
library("oligo")
# Read in the data
# readTargets will by default look for the 'FileName' column in the specified file
targetinfo <- readTargets("Targets.txt", sep="\t")
CELFiles <- list.celfiles("SampleFiles/", full.names = TRUE)
# Raw intensity data
project <- read.celfiles(CELFiles)
# Background correct, normalize, and calculate gene expression
project.bgcorrect.norm.avg <- rma(project, target="probeset")
# Create the study design and comparison model
design <- paste(targetinfo$CellPopulation, sep="")
design <- factor(design)
design
comparisonmodel <- model.matrix(~0+design)
colnames(comparisonmodel) <- levels(design)
comparisonmodel
# Extract the control probes from the chip
ControlProbes <- grep("AFFX",featureNames(project.bgcorrect.norm.avg))
ControlProbes
# Fit the linear model on the study's data
# lmfit uses generalized least squares (GLS), which is better for heteroscedastic data (i.e., data where the variances of sub-populations are different)
project.fitmodel <- lmFit(exprs(project.bgcorrect.norm.avg), comparisonmodel)
# Applying the empirical Bayes method to the fitted values acts as an extra normalization step and aims to bring the different probe-wise variances to common values
# This ultimately reduces the degrees of freedom of variances (i.e., less individual, different, variances).
project.fitmodel.eBayes <- eBayes(project.fitmodel)
names(project.fitmodel.eBayes)
# Make individual contrasts and fit the new model
Q1_contrastmodel <- makeContrasts(Q1="NM-NL", levels=comparisonmodel)
Q1_contrastmodel.fitmodel <- contrasts.fit(project.fitmodel.eBayes, Q1_contrastmodel)
Q1_contrastmodel.fitmodel.eBayes <- eBayes(Q1_contrastmodel.fitmodel)
topTable(Q1_contrastmodel.fitmodel.eBayes, adjust="fdr", coef="Q1", number=9999999, p.value=1)
```

# multivariate

Once you have identified statistically significant independent predictors (genes), you may then want to build a multivariate model using these statistically significant genes. This is where we enter the realm of multivariate logistic regression. I have posted a few answers:

Essentially, you may build a single combined model using `glm()`

(*MASS* package) and then use some procedure to further reduce the number of predictors. Techniques used here are varied but include:

- stepwise regression
- Lasso-penalised regression
- other classifiers like RandomForest™

In these models, you can include other variables, such as `age`

and `weight`

, and anything else that you want. The best model may well consist of a mixture of genes and clinical parameters.

When you do this, you must ensure that your variables are encoded correctly, i.e., as continuous or categorical variables. If they are categorical, then the categories (factors) must be ordered as per your own decision. The reference level is always positioned first in the ordered list.

# ROC analysis

Once you have settled on a final model(s), you can test the AUC. This can be done with pROC, with examples here:

## ------------------------------------

Hey, why don't you go and impress your new colleagues by also calculating sensitivity, specificity, precision, and accuracy, too? - A: How to calculate the sensitivity and specificity of set of expression dataset u

Kevin