Question: Performing univariate and multivariate logistic regression in gene expression data
dodausp wrote, 15 months ago:

Hi everyone, I have recently moved to a clinical setting in a hospital, and my first assignment is working with microarray data (Affymetrix) from cancer patients. I am facing a big challenge: how do I adapt my basic-research experience to clinical application?

Like, I presume, a lot of people here, I am quite familiar with using limma to fit linear models to microarray and RNA-seq data for finding differentially expressed genes (DEGs). The design matrix is quite straightforward. However, I have been introduced to the methodology used in our clinical setting, and this is the pipeline they use for discovering DEGs:

  1. Univariate logistic regression (P < 0.05)
     • to find DEGs that are associated with the cancer type
     • less stringent, to allow more genes to be discovered
  2. Multivariate logistic regression (P < 0.01 or less)
     • performed only on those genes that passed the first analysis
     • more stringent, to (in a sense) validate what was found in the first step
  3. Multivariate analysis (P < 0.05 or less)
     • performed only on those genes that passed the second analysis
     • multivariate de facto: also considering other covariates such as biomarker levels (e.g. CA-125), age and weight
  4. ROC/AUC curves
     • performed only on those genes that passed analysis 3

Quite frankly, I am a bit lost with all those steps (e.g. why not start from step 2 directly?) and, consequently, with which R packages and functions to use. I have tried this thread, and this one, but I believe they do not answer my questions fully. So any help in that regard (i.e. which tools to use, how to design the contrast matrix) would be extremely appreciated!

One last thing: for microarray data, would one use gene expression as the response variable and the trait (i.e. normal vs. cancer) as explanatory, or the other way around? Doesn't limma automatically assume the former?

Thanks a lot in advance!

Kevin Blighe (University College London) wrote, 15 months ago:


I will try to be as brief as possible and give you some general points. Firstly, you may find this previous answer an interesting read: What is the best way to combine machine learning algorithms for feature selection, such as variable importance in Random Forest, with differential expression analysis?


Univariate logistic regression

This just involves testing each variable (gene) as an independent predictor of the outcome. You have Affymetrix microarrays; for processing these, you should default to the oligo package. affy is an alternative, but it cannot work with the more modern 'ST' Affymetrix arrays.
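As a hedged illustration of what the univariate step amounts to: one logistic model per gene, with the gene as the sole predictor. Everything below (the simulated matrix `expr`, the vector `status`, the gene names) is hypothetical, purely to show the mechanics:

```r
# Hypothetical example: 'expr' is a genes x samples expression matrix,
# 'status' is the binary outcome (0 = normal, 1 = cancer)
set.seed(1)
expr <- matrix(rnorm(50 * 20), nrow = 50,
               dimnames = list(paste0("gene", 1:50), paste0("s", 1:20)))
status <- rep(c(0, 1), each = 10)

# One logistic regression per gene; keep the Wald test p-value for the gene term
pvals <- apply(expr, 1, function(g) {
  fit <- glm(status ~ g, family = binomial(link = "logit"))
  summary(fit)$coefficients["g", "Pr(>|z|)"]
})

# Genes passing the (deliberately lenient) P < 0.05 filter go forward to step 2
step1_genes <- names(pvals)[pvals < 0.05]
```

In practice limma does this fitting for you, gene-wise and with moderated statistics, as in the workflow below.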

Limma is still used to fit the regression model independently to each gene / probe-set. A simple workflow might look like this (you will have to adapt it to your own situation):


# 'oligo' is more suited to the Gene ST Affymetrix arrays
library(oligo)
library(limma)

# Read in the data
# readTargets will by default look for the 'FileName' column in the specified file
targetinfo <- readTargets("Targets.txt", sep="\t")
CELFiles <- list.celfiles("SampleFiles/", full.names = TRUE)
CELFiles <- list.celfiles("SampleFiles/", full.names = TRUE)

# Raw intensity data
project <- read.celfiles(CELFiles)

# Background correct, normalize, and calculate gene expression
project.bgcorrect.norm.avg <- rma(project, target="probeset")

# Create the study design and comparison model
design <- paste(targetinfo$CellPopulation, sep="")
design <- factor(design)

comparisonmodel <- model.matrix(~0+design)
colnames(comparisonmodel) <- levels(design)

# Extract the control probes from the chip
ControlProbes <- grep("AFFX",featureNames(project.bgcorrect.norm.avg))

# Fit the linear model on the study's data
# lmFit fits a separate linear model to each probe-set (by default via least squares)
project.fitmodel <- lmFit(exprs(project.bgcorrect.norm.avg), comparisonmodel)

# Applying the empirical Bayes method to the fitted values moderates the probe-wise variances,
# shrinking them towards a common value (borrowing information across genes stabilizes each variance estimate)
project.fitmodel.eBayes <- eBayes(project.fitmodel)

# Make individual contrasts and fit the new model
Q1_contrastmodel <- makeContrasts(Q1="NM-NL", levels=comparisonmodel)
Q1_contrastmodel.fitmodel <- contrasts.fit(project.fitmodel.eBayes, Q1_contrastmodel)
Q1_contrastmodel.fitmodel.eBayes <- eBayes(Q1_contrastmodel.fitmodel)

topTable(Q1_contrastmodel.fitmodel.eBayes, adjust="fdr", coef="Q1", number=9999999, p.value=1)


Multivariate logistic regression

Once you have identified statistically significant independent predictors (genes), you may then want to build a multivariate model from them. This is where we enter the realm of multivariate logistic regression. I have posted a few answers on this previously.

Essentially, you may build a single combined model using glm() (from the base stats package) and then use some procedure to further reduce the number of predictors. Techniques used here vary, but include:

  • stepwise regression
  • Lasso-penalised regression
  • other classifiers, such as Random Forest
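As a sketch of the lasso option via glmnet (the data here are simulated; in practice `x` would hold the genes that passed the univariate filter, one column per gene, one row per sample):

```r
library(glmnet)

# Hypothetical data: 40 samples x 30 candidate genes, binary outcome
set.seed(1)
x <- matrix(rnorm(40 * 30), nrow = 40,
            dimnames = list(NULL, paste0("gene", 1:30)))
y <- factor(rep(c("normal", "cancer"), each = 20), levels = c("normal", "cancer"))

# alpha = 1 gives the lasso; cross-validation chooses the penalty strength
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Predictors with non-zero coefficients at the selected penalty are retained
coefs <- coef(cvfit, s = "lambda.min")
retained <- rownames(coefs)[as.vector(coefs) != 0]
```

The lasso shrinks uninformative coefficients to exactly zero, so variable selection and model fitting happen in one step, which scales to many more predictors than glm() can handle.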

In these models, you can include other variables, such as age and weight, and anything else that you want. The best model may well consist of a mixture of genes and clinical parameters.

When you do this, you must ensure that your variables are encoded correctly, i.e., as continuous or categorical variables. If they are categorical, then the categories (factors) must be ordered as per your own decision. The reference level is always positioned first in the ordered list.
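A minimal base-R illustration of how the reference level is set (the variable names are made up):

```r
# The first level of a factor is the reference (baseline) in a regression model
status <- factor(c("cancer", "normal", "cancer", "normal"),
                 levels = c("normal", "cancer"))   # 'normal' is the reference

# relevel() changes the reference level after the fact
status2 <- relevel(status, ref = "cancer")         # now 'cancer' is the reference
```

With 'normal' as reference, the fitted odds ratios are interpreted as cancer vs. normal; reversing the reference flips that interpretation, so it is worth setting explicitly rather than relying on alphabetical order.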

ROC analysis

Once you have settled on a final model (or models), you can test the AUC. This can be done with pROC.
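A minimal pROC sketch, with a simulated outcome and a hypothetical risk score standing in for your model's predictions:

```r
library(pROC)

# Hypothetical data: binary outcome and a continuous predictor (e.g. fitted probabilities)
set.seed(1)
outcome    <- rep(c(0, 1), each = 25)
risk_score <- outcome + rnorm(50)   # stand-in for predict(fit, type = "response")

roc_obj <- roc(outcome, risk_score, quiet = TRUE)
auc(roc_obj)      # area under the ROC curve
ci.auc(roc_obj)   # 95% confidence interval for the AUC
```

In a real analysis, `risk_score` would come from predict() on your fitted logistic model, ideally on held-out samples rather than the training data.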


Hey, why don't you go and impress your new colleagues by also calculating sensitivity, specificity, precision, and accuracy? - A: How to calculate the sensitivity and specificity of set of expression dataset u
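For reference, those quantities all fall straight out of the confusion matrix; a base-R sketch with made-up predictions:

```r
# Hypothetical predicted classes vs. truth (1 = cancer, 0 = normal)
truth     <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 0, 0, 0, 0, 1, 1)

tp <- sum(predicted == 1 & truth == 1)   # true positives
tn <- sum(predicted == 0 & truth == 0)   # true negatives
fp <- sum(predicted == 1 & truth == 0)   # false positives
fn <- sum(predicted == 0 & truth == 1)   # false negatives

sensitivity <- tp / (tp + fn)               # 3/4  = 0.75
specificity <- tn / (tn + fp)               # 4/6  ~ 0.667
precision   <- tp / (tp + fp)               # 3/5  = 0.6
accuracy    <- (tp + tn) / length(truth)    # 7/10 = 0.7
```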


written 15 months ago by Kevin Blighe

Wow, this is just mesmerizing, Kevin! I immensely appreciate your help. In fact, this was more than that. It was a lecture. And that bonus in the end, for the sens. and spec., was just to top that all up. Thank you so much for taking the time to address my questions so thoroughly. I will go over it, together with the links you added, just as carefully. And I hope it is ok to come back if there's something I am stuck at. This community never ceases to amaze me. A thousand thanks!

written 15 months ago by dodausp

Sure thing - no problem.

written 15 months ago by Kevin Blighe

I believe I am dealing with a complete or quasi-complete separation case. When trying to run the model for each variable separately, following the example, I am getting this:

glm(Serous ~ ., data = df, family = binomial(link = 'logit'), control = list(maxit = 50))

Warning message: fitted probabilities numerically 0 or 1 occurred

I read some threads recommending bayesglm() or glmnet(). But first, I would like to know if there is anything I can do to optimize my data for your proposed "Building a predictive model by a list of genes"?

PS1: the same issue appears when I run the combined model

PS2: the data being used is straight away from rma() normalization

PS3: I added the "control" argument to the glm() function because I was getting the following warning message:

In = X, y = Y, weights = weights, start = start, etastart = etastart, ... : algorithm did not converge

Any help is much appreciated again.

written 15 months ago by dodausp

How many variables are in the glm()? bayesglm and glmnet (lasso-penalised regression) can work with a large number of variables; glm() cannot.
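A sketch of the bayesglm() route on made-up data. Its weakly informative priors pull the coefficients towards zero just enough to keep the estimates finite where plain glm() would diverge under (quasi-)complete separation (the data frame and column names here are hypothetical):

```r
library(arm)

# Hypothetical data frame: binary outcome 'Serous' plus 5 continuous predictors
set.seed(1)
df <- data.frame(Serous = rep(c(0, 1), each = 15),
                 matrix(rnorm(30 * 5), nrow = 30))

# bayesglm() replaces glm()'s maximum likelihood fit with a penalised fit
# using weakly informative priors on the coefficients
fit <- bayesglm(Serous ~ ., data = df, family = binomial(link = "logit"))
coef(fit)
```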

written 15 months ago by Kevin Blighe

I believe that was exactly my problem. I started with around 1,200 variables (genes), going down to 47 by cutting off at a p-value; it worked in the latter case. In that case, would you recommend keeping my list of selected variables as low as possible, or would it be worth trying the other methods, bayesglm and glmnet? I know this is briefly discussed in this post. I am just afraid of losing some good predictors by excluding them to fit the glm() function. Many thanks again!

written 15 months ago by dodausp

I think that it is worth trying various things, to be honest, and then comparing the AUC that you obtain for each. The key variables should be identified by each method.

When you get your 47 variables, by the way, that would be ideal for stepwise regression to reduce the set further. Stepwise regression will likely receive some frowns from the statisticians with whom you work, though. Just always check the models that it chooses and be acutely aware of the standard errors.
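A sketch of that reduction step with MASS::stepAIC() on simulated data (variable names and sizes are made up):

```r
library(MASS)

# Hypothetical data: binary outcome plus 4 candidate predictors
set.seed(1)
df <- data.frame(outcome = rep(c(0, 1), each = 20),
                 matrix(rnorm(40 * 4), nrow = 40))

# Start from the full model, then add/drop terms to minimise AIC
full    <- glm(outcome ~ ., data = df, family = binomial)
reduced <- stepAIC(full, direction = "both", trace = FALSE)

# Always inspect the chosen model and its standard errors
summary(reduced)$coefficients
```

Note that stepAIC() optimises AIC on the data used for fitting, so the selected model should still be sanity-checked, ideally on independent samples.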

If you receive AUCs of 1, or AUCs with very wide confidence intervals, be very suspicious. The same is true for the confidence intervals from your models even before you generate ROC curves: if you see interval bounds like 0.000004 or 12342.77, then that is almost certainly an error.

written 15 months ago by Kevin Blighe