Question: Resources for gene signature creation
gravatar for mforde84
2.5 years ago by
mforde841.2k wrote:


I have a three groups with 20-30 samples each, and I'd like to generate predictive gene signatures for each. Do any of you have suggestions on R packages that you've used in the past and that I can look into to generate these types of signatures?

Kindly, Martin

signatures gene expression • 2.6k views
ADD COMMENTlink modified 2.5 years ago by Kevin Blighe56k • written 2.5 years ago by mforde841.2k
gravatar for Kevin Blighe
2.5 years ago by
Kevin Blighe56k
Kevin Blighe56k wrote:

Edit September 22, 2019:

Do not be naïve and assume that you can quickly learn these methods. I learned these while working with a PhD-level statistician. Consult a statistician in your lab / department, if possible.


If you have already derived a panel of genes that are statistically significantly differentially expressed, then there are various tools to use.

continuous outcome

If you are modelling a continuous trait (e.g. blood levels of inflammatory markers, metabolite levels, etc.), then build a linear model using lm():

For example:

lm(MyContinuousTrait ~ gene1 + gene2 + gene3 + ... + geneX, data=MyData)

You can then do stepwise regression to refine the model further based on the AIC or BIC using stepAIC() from MASS package.

If you have many genes (e.g. > 30), then first test each gene independently, followed by building a combined model (like above) with the statistically significant genes:

lm(MyContinuousTrait ~ gene1, data=MyData) # p=0.8
lm(MyContinuousTrait ~ gene2, data=MyData) # p=0.03
lm(MyContinuousTrait ~ gene3, data=MyData) # p=0.67
lm(MyContinuousTrait ~ gene4, data=MyData) # p=0.02

MyModel <- lm(MyContinuousTrait ~ gene2 + gene4, data=MyData)

With this final model, you can still do stepwise regression to refine it even further.

For performing QC on this model::


# check for outliers based on Cook's distances (2 different plots)
cutoff <- 4 / ((nrow(MyData) - length(MyModel$coefficients) - 1)) 
plot(MyModel, which=4, cook.levels=cutoff)

# Bonferonni p-value for most extreme observations

# check for non-normality via QQ plot for 'studentised' residuals
qqPlot(MyModel, main="QQ Plot")

# evaluate homoscedasticity
# non-constant error variance test

# multi-collinearity via variance inflation factors 
# a VIF > 2.5 indicates that the R^2 of a predictor to all other predictors is 0.6, i.e., collinear
sqrt(vif(MyModel)) > 2.5

# non-linearity via component + residual plot

# test for Autocorrelated Errors

# global test of model assumptions
gvmodel <- gvlma(MyModel) 

# k-fold cross-validation (here 3-fold)
cv.lm(MyData[!$MyOutcomeTrait),], MyModel, m=3)

# R^2 shrinkage using 10-fold cross-validation
library(bootstrap) <- function(x, y) { lsfit(x,y) }
theta.predict <- function(fit, x) { cbind(1,x) %*% fit$coef }
x <- as.matrix(MyData[,which(colnames(MyData) %in% names(summary(MyModel)$coefficients[,1]))])
y <- as.matrix(MyData[,c("MyOutcomeTrait")]) 
results <- crossval(x, y,, theta.predict, ngroup=10)
# Raw R^2 
cor(y, MyModel$fitted.values)**2
# cross-validated R^2
cor(y, results$**2

Finally, if you have built a linear model, you can make a nice plot with ggplot2 using the following code:

MyData$MyContinuousTrait.pred = predict(MyModel)
ggplot(MyData, aes(x=MyContinuousTrait, y=MyContinuousTrait.pred)) +
  geom_point() + xlim(0,5) + ylim(0,5) +
  geom_smooth(method='lm',formula=y~x) +
  stat_smooth(method="lm", fullrange=TRUE) +


  • X-axis: real values for the outcome variable
  • Y-axis: model predicted values

categorical outcome

If your outcome is categorical, i.e., a binary trait or has greater than 2 categories, then first build a binary or multinomial logistic regression model using glm(..., family="binomial") with your predictors (genes). You can then choose the best predictors again based on the AIC or BIC using stepwise regression.

The checks for a glm() differ from those for a lm():

# McFadden R^2 index used to assess the model fit

# perform Chi^2 ANOVA
anova(MyModel, test="Chisq")

# perform cross validation
# the delta values should not greatly differ
# K=number of samples, i.e., leave-one-out CV.
cv.glm(MyData, MyModel, K=nrow(MyData))$delta


For both lm() and glm() - based models, after you have chosen your best predictors and defined a final model, you can derive odds ratios (with confidence intervals) with:

# get ORs and confidence intervals
exp(cbind(OR=coef(MyModel), confint(MyModel)))

You can also perform ROC analysis using the pROC package.

Other options that can be used instead of lm()/glm() are lasso-penalized, elastic-net, and ridge regression (glmnet package).

ADD COMMENTlink modified 6 months ago • written 2.5 years ago by Kevin Blighe56k

Thanks for the suggestions, Kevin.

It's a categorical factor. We integrated mRNA and and miRNA data with SNFtool to generate three clusters that show DEG, pathway, and survival differences. However for validation purposes, it's difficult to find independent datasets in the wild which are of our specific tissue type and which have mRNA, miRNA, and survival data. So to work around this, they want to generate mRNA signatures for each and perform single sample GSVA on an independent dataset and determine if enrichment scores correlate with outcome.

For the glm, and excuse me if this is a silly question, are you suggesting something like a multivariate regression model (e.g., gene_1 + gene_n <- factor), or multiple regressions (e.g., gene_1 <- factor, gene_2 <- factor)? I assume this is already addressed within the stepAIC() documentation, maybe?

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by mforde841.2k

Hey, yes, I was suggesting a multivariate model such as:

glm(factor ~ gene1 + gene2 + gene3 + gene4 + ... geneN)

With stepAIC you first define your full model, minimal model (which can be just glm(factor ~ 1)) and then run either a forward, backward, or both stepwise regression. This then returns a final model that it deems to be most predictive of your factor of interest, based on either the AIC or BIC criteria. It may return just a few of your genes - who knows. Based on the fact that your genes are already DEGs, I would hope that your final model is pretty good. A ROC curve using the pROC package should also show this with a high AUC.

Also, stepwise regression is not great for very large numbers of variables. If you have around 20 or 30 DEGs, then that's optimal, I believe. Also, some statisticians will probably frown if you just let the stepwise function do everything for you. You should check the model summaries with the summary() function.

Also, now that I know that it's a multinomial logistic regression model, the QC on the model differs. Here is some further code that may assist in assessing the model (NB - I have updated my initial answer, above, to include this):

[see original answer]

Obviously your choice of factor reference level is key too.

Trust that this assists you.

ADD REPLYlink modified 14 months ago • written 2.5 years ago by Kevin Blighe56k

Oh ok, I see. This makes a lot of sense.

Unfortunately the number of DEG is high (~2k unique across all three groups). They are tumor samples, so heterogeneity is expected. At the very least it makes sense to go back and double check the #DEG I have is accurate, and contrast the limma results with an alternate method like DESeq. I could also take the top 10 ranked DEG by logFC in each SNF group, or apply more conservative limits on logFC and/or adjPvalue.

Worse case scenario, assuming I end up with a large number of possible predictors, would it be helpful to do a random forest instead of the stepwise regression and address potential over-fitting by limiting the max number of randomly selected predictors (e.g., 20-30) included in each random decision tree? The issue I see with this is that the number of permutations I'd have to run to cover possible combinations of predictors would be astronomical. Maybe instead of doing this, I could take the DEG and do a cox regression for each and select top 30 with largest absolute hazard ratios? Ultimately what we are most interested in is whats driving the survival differences, so this seems to be a logical data driven approach to reduce the number of potential genes.

Again thanks for your thoughts on this, this is very helpful for me.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by mforde841.2k

You seem to have a good grasp of the options available already. Stepwise regression is definitely not suitable with that many variables; however, lasso-penalized regression would work and may give you only a couple dozen variables in return. With the lasso regression (alpha=1.0), you can additionally adjust the alpha 'penalty' parameter in order to change the number of predictors that are thrown out (and thus the number that remain). Alpha of 0 is ridge regression, where no variables are thrown out of the model and each is assigned a beta value; whereas anything in between 1.0 and 0.0 is elastic regression.

In light of the fact that it's cancer data and that your interest is in survival, the cox proportional hazards regression idea is good and I assume you meant that you'd test each independently and then pick the top 30? That's certainly another acceptable way to do this type of thing. Edit 5th February 2019: I later wrote a tutorial for just this purpose: Survival analysis with gene expression

Random Forests, I cannot comment too much - it's one area that I'm yet to cover. I had an American colleague in Boston who did some great work on those but I did not get a chance to learn from his ideas/efforts.

ADD REPLYlink modified 14 months ago • written 2.5 years ago by Kevin Blighe56k

Great suggestion. I think I'm going to try the lasso regression and see what I pull out.

ADD REPLYlink written 2.5 years ago by mforde841.2k

Hi, I was wondering how you implemented this in the end? I am faced with a similar task at the moment and I am wondering how to do this properly. Thanks!

ADD REPLYlink written 19 months ago by JJ500

You mean lasso regression? I share some code/ideas here:

There are also tutorials online, if you search via a search engine.

ADD REPLYlink written 19 months ago by Kevin Blighe56k

Thanks so much for the help! I was wondering if you have any recommendations concerning sample size (RNA-seq) for this to properly work?

ADD REPLYlink modified 19 months ago • written 19 months ago by JJ500

'The more, the merrier', basically! What sample size have you got, currently?

Regarding variables, you had mentioned stepwise regression in your other thread. From my experience, when the variables gets toward 40 or 50, one does then require lasso regression.

ADD REPLYlink modified 19 months ago • written 19 months ago by Kevin Blighe56k

Well, we are designing the experiment at the moment - having said that I am restricted by money and can't sequence too many samples. Hence I am wondering what a minimum and good sample size would be for training and test set...

concerning stepwise regression - I was thinking about preselecting 40-50 genes by testing the DEGs independently with cox proportional hazards regression and then pick the top 40-50 genes. I would then feed them into stepwise regression - does this make sense? Something similar was suggested in this thread.

Thank so much for your input!

ADD REPLYlink modified 19 months ago • written 19 months ago by JJ500

Yes, you could do that. I've done that in the past, i.e., differential expression analysis (or the Cox model in your case), followed by stepwise regression and model building of your top hits.

Sample sizes in RNA-seq / expression studies don't have to be as large as GWAS / genetic studies. Anecdotally, they say that you should have minimum 3 replicates per group, so, n=6 in a simple case-control study. Just do as many as you can but don't leave the groups unbalanced. That is, if you can afford 20 RNA-seq samples, then do 10 per group.

ADD REPLYlink written 19 months ago by Kevin Blighe56k

naive question what is "Mymodel" is it the input matrix ?

ADD REPLYlink written 20 months ago by krushnach80690

MyModel is neither a matrix nor data-frame. It is the regression model, created with either lm() or lm(). Hope that this helps!

ADD REPLYlink written 20 months ago by Kevin Blighe56k it helped me to get a hang of it i was seeing the code but unable to figure it out let me try and let you know and this is the first error i encounter actually im using your code which you made regression

so the model is listModels not sure if im doing it correct or not

Error in UseMethod("cooks.distance") : 
  no applicable method for 'cooks.distance' applied to an object of class "list"

figured out the error the object which i should give is model not list model..some of now

what is this "cv.lm(MyData[!$MyOutcomeTrait),], MyModel, m=3)" the MyOutcomeTrait

ADD REPLYlink modified 20 months ago • written 20 months ago by krushnach80690

You figured it out? - that's great, dear friend

ADD REPLYlink modified 20 months ago • written 20 months ago by Kevin Blighe56k

not fully but im moving at a snail's pace....i updated my doubt

ADD REPLYlink modified 20 months ago • written 20 months ago by krushnach80690

Okay. MyOutcomeTrait is the variable that you're modelling, i.e., the y variable.

For example:

lm(MyOutcomeTrait ~ gene1)
ADD REPLYlink written 20 months ago by Kevin Blighe56k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 846 users visited in the last hour