Question: Resources for gene signature creation
mforde84 wrote (11 months ago):

Hi,

I have three groups with 20-30 samples each, and I'd like to generate predictive gene signatures for each. Do any of you have suggestions on R packages that you've used in the past and that I can look into to generate these types of signatures?

Kindly, Martin

Tags: signatures, gene expression
Answer: Kevin Blighe (USA / Europe / Brazil) wrote (11 months ago):

Edit: answer updated 20th July 2018

If you have already derived a panel of genes that are, I assume, statistically significantly differentially expressed, then there are various tools to use.

If your outcome is a binary trait (or has more than 2 categories), then first build a binary or multinomial logistic regression model using glm() with your genes as predictors. You can then choose the best predictors based on the AIC or BIC using stepwise regression (stepAIC() from the MASS package). After you have chosen your best predictors, you can perform ROC analysis using the pROC package. Other options for binary traits or traits with >2 categories are lasso-penalized, elastic-net, and ridge regression (glmnet package). These may take a while to get the hang of, but it is worth knowing how to use them properly.
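As a minimal sketch of that workflow, assuming a data frame MyData with a binary outcome and illustrative gene names gene1-gene3 (all simulated here; the pROC step is shown commented out because it needs the pROC package):

```r
# Simulated stand-in for your real data: a binary outcome plus 3 genes
set.seed(1)
MyData <- data.frame(
  MyOutcome = rbinom(60, 1, 0.5),
  gene1 = rnorm(60), gene2 = rnorm(60), gene3 = rnorm(60))

# Full binary logistic regression model
full <- glm(MyOutcome ~ gene1 + gene2 + gene3,
            data = MyData, family = binomial)

# Stepwise selection on the AIC (use k = log(nrow(MyData)) for the BIC)
library(MASS)
best <- stepAIC(full, direction = "both", trace = FALSE)
summary(best)

# ROC analysis on the chosen model (requires the pROC package):
# library(pROC)
# roc(MyData$MyOutcome, predict(best, type = "response"))
```

The same pattern applies to a multinomial outcome, except the model-fitting function differs.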

If you are modelling a continuous trait (e.g. blood levels of inflammatory markers, protein expression, metabolite levels, etc.), then do the same as above but build a linear model using lm().

For example:

lm(MyContinuousTrait ~ gene1 + gene2 + gene3 + ... + geneX, data=MyData)

You can then do stepwise regression to refine the model further.

If you have many genes (e.g. > 30), then first test each gene independently and then build a combined model (like above) with the statistically significant genes:

lm(MyContinuousTrait ~ gene1, data=MyData) #not statistically significant
lm(MyContinuousTrait ~ gene2, data=MyData) #statistically significant
lm(MyContinuousTrait ~ gene3, data=MyData) #not statistically significant
lm(MyContinuousTrait ~ gene4, data=MyData) #statistically significant
lm(MyContinuousTrait ~ gene5, data=MyData) #statistically significant

MyModel <- lm(MyContinuousTrait ~ gene2 + gene4 + gene5, data=MyData)

With this final model, you can still do stepwise regression to refine it further before further testing.
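If you have many genes, the per-gene screen above can be scripted rather than typed out line by line; a sketch with simulated data and placeholder gene names:

```r
# Simulated stand-in: a continuous trait plus 5 candidate genes
set.seed(1)
MyData <- data.frame(MyContinuousTrait = rnorm(40), replicate(5, rnorm(40)))
genes <- paste0("gene", 1:5)
colnames(MyData)[2:6] <- genes

# Fit each gene independently and record its p-value
pvals <- sapply(genes, function(g) {
  fit <- lm(reformulate(g, response = "MyContinuousTrait"), data = MyData)
  summary(fit)$coefficients[g, "Pr(>|t|)"]
})

# Combine the statistically significant genes into one model
keep <- genes[pvals < 0.05]
if (length(keep) > 0) {
  MyModel <- lm(reformulate(keep, response = "MyContinuousTrait"), data = MyData)
  print(summary(MyModel))
}
```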

Whichever final model you arrive at, you then need to check it to ensure that it is actually robust.

library(car)

#Check for outliers and influential samples based on Cook's distances (2 different plots)
plot(cooks.distance(MyModel))
cutoff <- 4/((nrow(MyData)-length(MyModel$coefficients)-1)) 
plot(MyModel, which=4, cook.levels=cutoff)

#Bonferroni p-value for the most extreme observations
outlierTest(MyModel)

#Check for non-normality via QQ plot for studentized residuals
qqPlot(MyModel, main="QQ Plot")

#Evaluate homoscedasticity
#Non-constant error variance test
ncvTest(MyModel)

#Multi-collinearity via variance inflation factors 
vif(MyModel)
#A VIF > 2.5 corresponds to an R-squared of 0.6 for that predictor regressed on all other predictors
sqrt(vif(MyModel)) > 2.5

#Non-linearity via component + residual plot
crPlots(MyModel)

#Test for Autocorrelated Errors
durbinWatsonTest(MyModel)

#Global test of model assumptions
library(gvlma)
gvmodel <- gvlma(MyModel) 
summary(gvmodel)

#K-fold cross-validation (here 3-fold)
library(DAAG)
pdf("MyModel.CrossValidated.pdf")
    cv.lm(MyData[!is.na(MyData$MyOutcomeTrait),], MyModel, m=3)
dev.off()

#Assessing R^2 shrinkage using 10-Fold Cross-Validation 
library(bootstrap)
theta.fit <- function(x,y){lsfit(x,y)}
theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef}
x <- as.matrix(MyData[,which(colnames(MyData) %in% names(summary(MyModel)$coefficients[,1]))])
y <- as.matrix(MyData[,c("MyOutcomeTrait")]) 
results <- crossval(x, y, theta.fit, theta.predict, ngroup=10)
#Raw R2 
cor(y, MyModel$fitted.values)**2
#Cross-validated R2
cor(y, results$cv.fit)**2

Finally, if you have built a linear model, you can make a nice plot with ggplot2 using the following code:

library(ggplot2)
MyData$MyContinuousTrait.pred = predict(MyModel)
p1 <- ggplot(MyData, aes(x=MyContinuousTrait, y=MyContinuousTrait.pred)) +
geom_point() + xlim(0,5) + ylim(0,5) +
geom_smooth(method='lm',formula=y~x) +
stat_smooth(method="lm", fullrange=TRUE) +  #Extend the fitted line for points where no values exist (such as 0 or 5)
ggtitle("My linear model")
plot(p1)

By working with this code a fair bit, you can produce nice graphics like this: [screenshot of an example scatter plot with fitted regression line]

In the plot, the x-axis designates the real values of the outcome variable, whilst the y-axis indicates the model-predicted values.


Thanks for the suggestions, Kevin.

It's a categorical factor. We integrated mRNA and miRNA data with SNFtool to generate three clusters that show DEG, pathway, and survival differences. However, for validation purposes, it's difficult to find independent datasets in the wild that are of our specific tissue type and that have mRNA, miRNA, and survival data. So, to work around this, they want to generate mRNA signatures for each cluster, perform single-sample GSVA on an independent dataset, and determine whether the enrichment scores correlate with outcome.

For the glm (and excuse me if this is a silly question), are you suggesting something like a multivariate regression model (e.g., factor ~ gene_1 + ... + gene_n), or multiple separate regressions (e.g., factor ~ gene_1; factor ~ gene_2)? I assume this is already addressed within the stepAIC() documentation, maybe?

— written 11 months ago by mforde84

Hey, yes, I was suggesting a multivariate model such as glm(factor ~ gene1 + gene2 + gene3 + gene4 + ... + geneN). With stepAIC() you first define your full model and your minimal model (which can be just glm(factor ~ 1)), and then run either a forward, backward, or both-directions stepwise regression. This returns a final model that it deems to be most predictive of your factor of interest, based on either the AIC or BIC criterion. It may return just a few of your genes - who knows. Based on the fact that your genes are already DEGs, I would hope that your final model is pretty good. A ROC curve using the pROC package should also show this with a high AUC.
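As a sketch of that stepAIC() set-up, with a simulated binary group label and illustrative gene names:

```r
# Simulated stand-in: a binary group label plus 4 genes (names illustrative)
set.seed(1)
MyData <- data.frame(group = rbinom(50, 1, 0.5),
                     gene1 = rnorm(50), gene2 = rnorm(50),
                     gene3 = rnorm(50), gene4 = rnorm(50))

library(MASS)

# Full and minimal models
full    <- glm(group ~ gene1 + gene2 + gene3 + gene4,
               data = MyData, family = binomial)
minimal <- glm(group ~ 1, data = MyData, family = binomial)

# Stepwise search in both directions between the minimal and full models
final <- stepAIC(minimal, direction = "both",
                 scope = list(lower = ~ 1,
                              upper = ~ gene1 + gene2 + gene3 + gene4),
                 trace = FALSE)
formula(final)
```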

Also, stepwise regression is not great for very large numbers of variables. If you have around 20 or 30 DEGs, then that's optimal, I believe. Also, some statisticians will probably frown if you just let the stepwise function do everything for you: you should check the model summaries with the summary() function.

Also, now that I know that it's a multinomial logistic regression model, the QC on the model differs. Here is some further code that may assist in assessing the model (with comments from my own code):

#While no exact equivalent to the R2 of linear regression exists, the McFadden R2 index can be used to assess the model fit.
library(pscl)
pR2(MyModel)

#Get ORs and confidence intervals
exp(cbind(OR=coef(MyModel), confint(MyModel)))

#Get Chi-squared ANOVA P values between your groups
anova(MyModel, test="Chisq")

#Perform cross validation (CV) analysis
#The delta values should not greatly differ
#K=number of samples, i.e., leave-one-out CV.
library(boot)
cv.glm(MyData, MyModel, K=nrow(MyData))$delta

Obviously your choice of factor reference level is key too.
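For example, the reference level can be set with relevel() (hypothetical cluster labels shown):

```r
# Hypothetical three-level grouping factor; levels default to alphabetical order
MyData <- data.frame(group = factor(c("cluster1", "cluster2", "cluster3")))
levels(MyData$group)   # "cluster1" is the default reference level

# Make "cluster2" the reference level instead
MyData$group <- relevel(MyData$group, ref = "cluster2")
levels(MyData$group)
```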

Trust that this assists you.

— written 11 months ago by Kevin Blighe

Oh ok, I see. This makes a lot of sense.

Unfortunately, the number of DEGs is high (~2k unique across all three groups). They are tumor samples, so heterogeneity is expected. At the very least, it makes sense to go back and double-check that the number of DEGs I have is accurate, and to contrast the limma results with an alternative method like DESeq. I could also take the top 10 DEGs ranked by logFC in each SNF group, or apply more conservative limits on logFC and/or adjusted p-value.

Worst-case scenario, assuming I end up with a large number of possible predictors, would it be helpful to do a random forest instead of the stepwise regression, and address potential over-fitting by limiting the maximum number of randomly selected predictors (e.g., 20-30) included in each random decision tree? The issue I see with this is that the number of permutations I'd have to run to cover the possible combinations of predictors would be astronomical. Maybe, instead of doing this, I could take the DEGs, do a Cox regression for each, and select the top 30 with the largest absolute hazard ratios? Ultimately, what we are most interested in is what's driving the survival differences, so this seems to be a logical, data-driven approach to reduce the number of potential genes.
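The per-gene Cox screen described here could be sketched like this (survival package; data and gene names are simulated placeholders):

```r
library(survival)

# Simulated stand-in: survival times/status plus 100 candidate genes
set.seed(1)
n <- 50
MyData <- data.frame(time = rexp(n, 0.1), status = rbinom(n, 1, 0.7),
                     replicate(100, rnorm(n)))
genes <- paste0("gene", 1:100)
colnames(MyData)[3:102] <- genes

# Univariate Cox model per gene; record the absolute log hazard ratio
abs_loghr <- sapply(genes, function(g) {
  fit <- coxph(as.formula(paste("Surv(time, status) ~", g)), data = MyData)
  abs(coef(fit))
})

# Keep the 30 genes with the largest absolute (log) hazard ratios
top30 <- names(sort(abs_loghr, decreasing = TRUE))[1:30]
```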

Again thanks for your thoughts on this, this is very helpful for me.

— written 11 months ago by mforde84

You seem to have a good grasp of the options available already. Stepwise regression is definitely not suitable with that many variables; however, lasso-penalized regression would work and may give you only a couple of dozen variables in return. Beyond the lasso (alpha=1.0), you can additionally adjust the alpha 'penalty' parameter in order to change the number of predictors that are thrown out (and thus the number that remain). An alpha of 0 is ridge regression, where no variables are thrown out of the model and each is assigned a beta value, whereas anything in between 0.0 and 1.0 is elastic-net regression.
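A hedged sketch of that selection with the glmnet package (simulated matrix; gene and cluster names are placeholders):

```r
library(glmnet)

# Simulated stand-in: 60 samples x 200 genes, three cluster labels
set.seed(1)
x <- matrix(rnorm(60 * 200), nrow = 60,
            dimnames = list(NULL, paste0("gene", 1:200)))
y <- factor(sample(c("cluster1", "cluster2", "cluster3"), 60, replace = TRUE))

# alpha = 1 is the lasso; alpha = 0 is ridge; in between is elastic net
cvfit <- cv.glmnet(x, y, family = "multinomial", alpha = 1)

# Genes with non-zero coefficients at the cross-validated lambda
coefs <- coef(cvfit, s = "lambda.min")
selected <- unique(unlist(lapply(coefs, function(m) rownames(m)[m[, 1] != 0])))
setdiff(selected, "(Intercept)")
```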

In light of the fact that it's cancer data and that your interest is in survival, the Cox proportional hazards regression idea is good, and I assume you meant that you'd test each gene independently and then pick the top 30? That's certainly another acceptable way to do this type of thing.

Random Forests, I cannot comment too much - it's one area that I'm yet to cover. I had an American colleague in Boston who did some great work on those but I did not get a chance to learn from his ideas/efforts.

— written 11 months ago by Kevin Blighe

Great suggestion. I think I'm going to try the lasso regression and see what I pull out.

— written 11 months ago by mforde84

Naive question: what is "MyModel"? Is it the input matrix?

— written 29 days ago by krushnach80

MyModel is neither a matrix nor a data frame. It is the regression model, created with either lm() or glm(). Hope that this helps!

— written 29 days ago by Kevin Blighe

Okay, now it has helped me get the hang of it. I was looking at the code but unable to figure it out; let me try it and I'll let you know. This is the first error I encounter, actually (I'm using the code from your regression example).

So the model is listModels; not sure if I'm doing it correctly or not:

Error in UseMethod("cooks.distance") : 
  no applicable method for 'cooks.distance' applied to an object of class "list"

I figured out the error: the object that I should give is the model, not the list of models. Some success, as of now.

What is the MyOutcomeTrait in "cv.lm(MyData[!is.na(MyData$MyOutcomeTrait),], MyModel, m=3)"?

— written 29 days ago by krushnach80

You figured it out? That's great, dear friend.

— written 29 days ago by Kevin Blighe

Not fully, but I'm moving at a snail's pace... I updated my question above.

— written 29 days ago by krushnach80

Okay. MyOutcomeTrait is the variable that you're modelling, i.e., the y variable.

For example:

lm(MyOutcomeTrait ~ gene1)

— written 28 days ago by Kevin Blighe