Question: Resources for gene signature creation
Asked 16 months ago by mforde84:


I have a three groups with 20-30 samples each, and I'd like to generate predictive gene signatures for each. Do any of you have suggestions on R packages that you've used in the past and that I can look into to generate these types of signatures?

Kindly, Martin

Tags: signatures, gene expression • 1.4k views
Answer by Kevin Blighe (Republic of Ireland), 16 months ago:

Edit: answer updated 20th July 2018

If you have already derived a panel of genes that are, I assume, statistically significantly differentially expressed, then there are various tools to use.

If your outcome is a binary trait, first build a logistic regression model using glm() with your predictors (genes); for a trait with more than 2 categories, glm() only handles two levels, so use a multinomial model instead (e.g. multinom() from the nnet package). You can then choose the best predictors based on the AIC or BIC using stepwise regression (stepAIC() from the MASS package). After you have chosen your best predictors, you can perform ROC analysis using the pROC package. Other options for binary traits or traits with >2 categories are lasso-penalized, elastic-net, and ridge regression (glmnet package). These may take a while to get the hang of, but it is worth knowing how to use them properly.
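
As a minimal sketch of that workflow (the data, gene names, and effect sizes below are simulated purely for illustration):

```r
# Hedged sketch: binary logistic model refined by stepwise selection.
# All data and gene names here are simulated, not from the question.
library(MASS)  # provides stepAIC()

set.seed(1)
MyData <- data.frame(
  outcome = factor(rep(c("case", "control"), each = 25)),
  gene1 = rnorm(50), gene2 = rnorm(50), gene3 = rnorm(50))
# make gene1 informative: higher expression in cases
MyData$gene1 <- MyData$gene1 + ifelse(MyData$outcome == "case", 1.5, 0)

full <- glm(outcome ~ gene1 + gene2 + gene3, data = MyData, family = binomial)
null <- glm(outcome ~ 1, data = MyData, family = binomial)
MyModel <- stepAIC(null, scope = list(lower = null, upper = full),
                   direction = "both", trace = FALSE)
summary(MyModel)
# ROC analysis could then follow with, e.g., pROC::roc(MyData$outcome, fitted(MyModel))
```
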

If you are modelling a continuous trait (e.g. blood levels of inflammatory markers, protein expression, metabolite levels, etc.), then do the same as above but build a linear model using lm().

For example:

lm(MyContinuousTrait ~ gene1 + gene2 + gene3 + ... + geneX, data=MyData)

You can then do stepwise regression to refine the model further.

If you have many genes (e.g. > 30), then first test each gene independently and then build a combined model (like above) with the statistically significant genes:

lm(MyContinuousTrait ~ gene1, data=MyData) #not statistically significant
lm(MyContinuousTrait ~ gene2, data=MyData) #statistically significant
lm(MyContinuousTrait ~ gene3, data=MyData) #not statistically significant
lm(MyContinuousTrait ~ gene4, data=MyData) #statistically significant
lm(MyContinuousTrait ~ gene5, data=MyData) #statistically significant

MyModel <- lm(MyContinuousTrait ~ gene2 + gene4 + gene5, data=MyData)
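
The per-gene screening above can be automated; here is a hedged sketch on simulated data (the gene names, effect size, and 0.05 cut-off are illustrative):

```r
# Hedged sketch: fit one univariate model per gene, keep the nominally
# significant genes, then fit the combined model. Data are simulated.
set.seed(1)
genes <- paste0("gene", 1:5)
MyData <- data.frame(matrix(rnorm(50 * 5), ncol = 5,
                            dimnames = list(NULL, genes)))
MyData$MyContinuousTrait <- 2 * MyData$gene2 + rnorm(50)

pvals <- sapply(genes, function(g) {
  fit <- lm(reformulate(g, response = "MyContinuousTrait"), data = MyData)
  summary(fit)$coefficients[2, 4]  # p-value for the gene term
})
keep <- genes[pvals < 0.05]
MyModel <- lm(reformulate(keep, response = "MyContinuousTrait"), data = MyData)
summary(MyModel)
```
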

With this final model, you can still run stepwise regression to refine it further. Before any downstream use, you then need to check the model to ensure that it is actually robust:


#Most of these diagnostics come from the car package
library(car)

#Check for outliers and influential samples based on Cook's distances
cutoff <- 4/(nrow(MyData)-length(MyModel$coefficients)-1)
plot(MyModel, which=4, cook.levels=cutoff)

#Bonferroni p-value for most extreme observations
outlierTest(MyModel)

#Check for non-normality via QQ plot for studentized residuals
qqPlot(MyModel, main="QQ Plot")

#Evaluate homoscedasticity
#Non-constant error variance test
ncvTest(MyModel)

#Multi-collinearity via variance inflation factors
#VIF>2.5 indicates R-squared of a predictor to all other predictors of 0.6
vif(MyModel) > 2.5

#Non-linearity via component + residual plot
crPlots(MyModel)

#Test for autocorrelated errors
durbinWatsonTest(MyModel)

#Global test of model assumptions
library(gvlma)
gvmodel <- gvlma(MyModel)
summary(gvmodel)

#K-fold cross-validation (here 3-fold) with the DAAG package
library(DAAG)
cv.lm(MyData[!$MyOutcomeTrait),], MyModel, m=3)

#Assessing R^2 shrinkage using 10-fold cross-validation
library(bootstrap) <- function(x,y){lsfit(x,y)}
theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef}
x <- as.matrix(MyData[,which(colnames(MyData) %in% names(summary(MyModel)$coefficients[,1]))])
y <- as.matrix(MyData[,c("MyOutcomeTrait")])
results <- crossval(x, y,, theta.predict, ngroup=10)
#Raw R2
cor(y, MyModel$fitted.values)**2
#Cross-validated R2
cor(y, results$cv.fit)**2

Finally, if you have built a linear model, you can make a nice plot with ggplot2 using the following code:

library(ggplot2)

MyData$MyContinuousTrait.pred <- predict(MyModel)
p1 <- ggplot(MyData, aes(x=MyContinuousTrait, y=MyContinuousTrait.pred)) +
  geom_point() + xlim(0,5) + ylim(0,5) +
  #fullrange=TRUE extends the fitted line to points where no values exist
  stat_smooth(method="lm", formula=y~x, fullrange=TRUE) +
  ggtitle("My linear model")
p1

By working with this code a fair bit, you can produce nice graphics. [screenshot of an example plot omitted]

In the plot, the x-axis designates the real values of the outcome variable, whilst the y-axis indicates the model-predicted values.

modified 6 months ago • written 16 months ago by Kevin Blighe

Thanks for the suggestions, Kevin.

It's a categorical factor. We integrated mRNA and miRNA data with SNFtool to generate three clusters that show DEG, pathway, and survival differences. However, for validation purposes, it's difficult to find independent datasets in the wild that are of our specific tissue type and that have mRNA, miRNA, and survival data. So, to work around this, they want to generate mRNA signatures for each cluster, perform single-sample GSVA on an independent dataset, and determine whether the enrichment scores correlate with outcome.

For the glm, and excuse me if this is a silly question, are you suggesting something like a multivariate regression model (e.g., factor ~ gene_1 + gene_n), or multiple regressions (e.g., factor ~ gene_1, factor ~ gene_2)? I assume this is already addressed within the stepAIC() documentation, maybe?

modified 16 months ago • written 16 months ago by mforde84

Hey, yes, I was suggesting a multivariate model such as glm(factor ~ gene1 + gene2 + gene3 + gene4 + ... + geneN). With stepAIC() you first define your full model and your minimal model (which can be just glm(factor ~ 1)), and then run a forward, backward, or both-directions stepwise regression. This returns a final model that it deems to be most predictive of your factor of interest, based on either the AIC or BIC criterion. It may return just a few of your genes - who knows. Based on the fact that your genes are already DEGs, I would hope that your final model is pretty good. A ROC curve using the pROC package should also show this with a high AUC.
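
One caveat worth noting: since the factor here has three groups, glm() (which only fits two-level outcomes) would not apply directly; a hedged sketch on simulated data using multinom() from the nnet package, which ships with R:

```r
# Hedged sketch: multinomial logistic regression for a 3-level factor.
# Groups, genes, and effect sizes are simulated for illustration.
library(nnet)

set.seed(1)
MyData <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 20)),
  gene1 = rnorm(60), gene2 = rnorm(60))
# shift gene1 so that it separates the groups
MyData$gene1 <- MyData$gene1 + c(A = 0, B = 1, C = 2)[MyData$group]

MyModel <- multinom(group ~ gene1 + gene2, data = MyData, trace = FALSE)
summary(MyModel)
```
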

Also, stepwise regression is not great for very large numbers of variables. If you have around 20 or 30 DEGs, then that's optimal I believe. Also, some statisticians will probably frown if you just let the stepwise function do everything for you. You should check the model summaries with the summary() function.

Also, now that I know that it's a multinomial logistic regression model, the QC on the model differs. Here is some further code that may assist in assessing the model (with comments from my own code):

#While no exact equivalent to the R2 of linear regression exists,
#the McFadden R2 index can be used to assess the model fit
library(pscl)
pR2(MyModel)

#Get ORs and confidence intervals
exp(cbind(OR=coef(MyModel), confint(MyModel)))

#Get Chi-squared ANOVA P values between your groups
anova(MyModel, test="Chisq")

#Perform cross-validation (CV) analysis with cv.glm() from the boot package
#The delta values should not greatly differ
#K=number of samples, i.e., leave-one-out CV
library(boot)
cv.glm(MyData, MyModel, K=nrow(MyData))$delta

Obviously your choice of factor reference level is key too.

Trust that this assists you.

modified 16 months ago • written 16 months ago by Kevin Blighe

Oh ok, I see. This makes a lot of sense.

Unfortunately, the number of DEGs is high (~2k unique across all three groups). They are tumor samples, so heterogeneity is expected. At the very least it makes sense to go back and double-check that the number of DEGs I have is accurate, and contrast the limma results with an alternate method like DESeq. I could also take the top 10 ranked DEGs by logFC in each SNF group, or apply more conservative limits on logFC and/or adjusted p-value.

Worst-case scenario, assuming I end up with a large number of possible predictors, would it be helpful to do a random forest instead of the stepwise regression, and address potential over-fitting by limiting the max number of randomly selected predictors (e.g., 20-30) included in each random decision tree? The issue I see with this is that the number of permutations I'd have to run to cover the possible combinations of predictors would be astronomical. Maybe instead of doing this, I could take the DEGs, do a Cox regression for each, and select the top 30 with the largest absolute hazard ratios? Ultimately, what we are most interested in is what's driving the survival differences, so this seems to be a logical, data-driven approach to reduce the number of potential genes.

Again thanks for your thoughts on this, this is very helpful for me.

modified 16 months ago • written 16 months ago by mforde84

You seem to have a good grasp of the options available already. Stepwise regression is definitely not suitable with that many variables; however, lasso-penalized regression would work and may give you only a couple of dozen variables in return. In glmnet you can additionally adjust the alpha 'penalty' parameter in order to change how aggressively predictors are thrown out (and thus how many remain): alpha of 1.0 is the lasso; alpha of 0 is ridge regression, where no variables are thrown out of the model and each is assigned a beta value; and anything in between 1.0 and 0.0 is elastic-net regression.
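
A hedged sketch of the lasso with glmnet (data are simulated; lambda is chosen by cross-validation, and package defaults may differ across versions):

```r
# Hedged sketch: lasso-penalized multinomial regression with glmnet.
# Samples, genes, and the informative gene are simulated for illustration.
library(glmnet)

set.seed(1)
x <- matrix(rnorm(90 * 50), nrow = 90)   # 90 samples x 50 genes
colnames(x) <- paste0("gene", 1:50)
y <- factor(rep(c("A", "B", "C"), each = 30))
x[, 1] <- x[, 1] + as.numeric(y)         # make gene1 informative

# alpha = 1 is the lasso; cv.glmnet picks the penalty strength lambda
cvfit <- cv.glmnet(x, y, family = "multinomial", alpha = 1)
# coefficients at the CV-optimal lambda; genes shrunk to zero are dropped
coef(cvfit, s = "lambda.min")
```
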

In light of the fact that it's cancer data and that your interest is in survival, the Cox proportional hazards regression idea is good, and I assume you mean that you'd test each gene independently and then pick the top 30? That's certainly another acceptable way to do this type of thing.
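
That per-gene screen could look like the following hedged sketch using the survival package (which ships with R); the simulated data and the cut-off of 5 genes (rather than 30) are illustrative only:

```r
# Hedged sketch: one univariate Cox model per gene, ranked by the
# absolute log hazard ratio. Survival data are simulated for illustration.
library(survival)

set.seed(1)
n <- 60
expr <- matrix(rnorm(n * 10), nrow = n,
               dimnames = list(NULL, paste0("gene", 1:10)))
time <- rexp(n, rate = exp(0.8 * expr[, 1]))  # gene1 drives the hazard
status <- rbinom(n, 1, 0.8)                   # ~80% events
MyData <- data.frame(time, status, expr)

logHR <- sapply(colnames(expr), function(g) {
  fit <- coxph(as.formula(paste("Surv(time, status) ~", g)), data = MyData)
  coef(fit)  # log hazard ratio for this gene
})
top <- names(sort(abs(logHR), decreasing = TRUE))[1:5]  # keep the top genes
top
```
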

On random forests I cannot comment too much - it's one area that I've yet to cover. I had an American colleague in Boston who did some great work on those, but I did not get a chance to learn from his ideas/efforts.

written 16 months ago by Kevin Blighe

Great suggestion. I think I'm going to try the lasso regression and see what I pull out.

written 16 months ago by mforde84

Hi, I was wondering how you implemented this in the end? I am faced with a similar task at the moment and I am wondering how to do this properly. Thanks!

written 4 months ago by JJ

You mean lasso regression? I share some code/ideas here:

There are also tutorials online, if you search via a search engine.

written 4 months ago by Kevin Blighe

Thanks so much for the help! I was wondering if you have any recommendations concerning sample size (RNA-seq) for this to properly work?

modified 4 months ago • written 4 months ago by JJ

'The more, the merrier', basically! What sample size have you got, currently?

Regarding variables, you had mentioned stepwise regression in your other thread. From my experience, when the number of variables gets toward 40 or 50, one does then require lasso regression.

modified 4 months ago • written 4 months ago by Kevin Blighe

Well, we are designing the experiment at the moment; having said that, I am restricted by money and can't sequence too many samples. Hence I am wondering what a minimum and a good sample size would be for the training and test sets...

Concerning stepwise regression: I was thinking about preselecting 40-50 genes by testing the DEGs independently with Cox proportional hazards regression and then picking the top 40-50 genes. I would then feed them into stepwise regression - does this make sense? Something similar was suggested in this thread.

Thanks so much for your input!

modified 4 months ago • written 4 months ago by JJ

Yes, you could do that. I've done that in the past, i.e., differential expression analysis (or the Cox model in your case), followed by stepwise regression and model building of your top hits.

Sample sizes in RNA-seq / expression studies don't have to be as large as in GWAS / genetic studies. Anecdotally, they say that you should have a minimum of 3 replicates per group, so n=6 in a simple case-control study. Just do as many as you can, but don't leave the groups unbalanced. That is, if you can afford 20 RNA-seq samples, then do 10 per group.

written 4 months ago by Kevin Blighe

Naive question: what is "MyModel"? Is it the input matrix?

written 6 months ago by krushnach80

MyModel is neither a matrix nor a data frame. It is the regression model, created with either lm() or glm(). Hope that this helps!

written 6 months ago by Kevin Blighe

It helped me to get the hang of it. I was reading the code but unable to figure it out; let me try and I'll let you know. This is the first error I encounter; actually, I'm using the regression code you made, and the model I pass is listModels, so I'm not sure if I'm doing it correctly or not:

Error in UseMethod("cooks.distance") : 
  no applicable method for 'cooks.distance' applied to an object of class "list"

I figured out the error: the object I should pass is the model, not the list of models. One more question: in "cv.lm(MyData[!$MyOutcomeTrait),], MyModel, m=3)", what is MyOutcomeTrait?

modified 6 months ago • written 6 months ago by krushnach80

You figured it out? That's great, dear friend.

modified 6 months ago • written 6 months ago by Kevin Blighe

Not fully, but I'm moving at a snail's pace... I updated my doubt above.

modified 6 months ago • written 6 months ago by krushnach80

Okay. MyOutcomeTrait is the variable that you're modelling, i.e., the y variable.

For example:

lm(MyOutcomeTrait ~ gene1)

written 6 months ago by Kevin Blighe