Hi. You probably all know about the PAM50 gene signature for breast cancer subtyping. What I'm looking for is a small set of genes whose expression levels (expressed or not) would help me exclude some of the breast cancer subtypes. For example, by looking at the expression of gene A and gene B, I could say something like: "this cannot be Her2". Does anyone have an idea which genes are informative in this regard?
In order to achieve this, you will have to model your expression data using the PAM50 classifier genes, and then allow the modelling to decide which genes best predict the breast cancer subtypes. Keep in mind that PAM50 is already a relatively small classifier that can statistically segregate the different subtypes.
You could aim to do multinomial logistic regression, because your aim is to use gene expression to predict a single categorical variable with multiple levels (the subtype). I posted some short code for this here: Resources for gene signature creation
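As a quick illustration of the multinomial approach, here is a minimal sketch using `nnet::multinom` (the nnet package ships with R). The two "genes" and the subtype labels are simulated stand-ins, invented purely for illustration:

```r
library(nnet)

# simulate 90 samples: three subtypes with shifted expression of two fake genes
set.seed(1)
subtype <- factor(rep(c("Her2", "LuminalA", "LuminalB"), each = 30))
geneA <- rnorm(90, mean = rep(c(8, 10, 12), each = 30))
geneB <- rnorm(90, mean = rep(c(12, 10, 8), each = 30))
df <- data.frame(subtype, geneA, geneB)

# fit a multinomial logistic regression of subtype on the two genes
fit <- multinom(subtype ~ geneA + geneB, data = df, trace = FALSE)

# per-class predicted probabilities for each sample (rows sum to 1)
head(predict(fit, type = "probs"))
```

With real data, `subtype` would be your sample annotation and the predictors would be the PAM50 gene columns.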
I would also recommend trying out lasso-penalised regression (α=1), which is what I go over in this answer, below. You will need:
- A gene expression dataset filtered to the PAM50 genes (if you don't know the genes, look here: Where To Download Pam50 Gene Set? )
- Breast cancer subtypes for samples
```r
require(glmnet)

# Get your data like this:
modellingLasso

           SubType   MYC  EGFR KIF2C CDC20
Sample 1  LuminalA 11.39 10.62  9.75 10.34
Sample 2  LuminalB 10.16  8.63  8.68  9.08
Sample 3  LuminalA  9.29 10.24  9.89 10.11
Sample 4      Her2 11.53  9.22  9.35  9.13
Sample 5      Her2  8.35 10.62 10.25 10.01
Sample 6      Her2 11.71 10.43  8.87  9.44
...
```
The SubType variable should be encoded as a factor:
```r
modellingLasso$SubType
...
Levels: Her2 LuminalA LuminalB ...
```
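If the subtype column is read in as plain character strings, a base-R conversion is all that is needed. A minimal self-contained sketch (the toy data frame below is invented just to show the conversion):

```r
# a toy stand-in for modellingLasso, just to demonstrate the conversion
modellingLasso <- data.frame(
  SubType = c("LuminalA", "LuminalB", "Her2"),
  MYC = c(11.39, 10.16, 11.53))

# ensure SubType is a factor so glmnet treats it as categorical
modellingLasso$SubType <- factor(modellingLasso$SubType)
levels(modellingLasso$SubType)
```

Note that factor levels default to alphabetical order (Her2, LuminalA, LuminalB), matching the output above.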
Now we can build the models:
build a single-pass (no cross-validation) lasso-penalised model
```r
lassoModel <- glmnet(
  x = data.matrix(modellingLasso[,-1]),
  y = modellingLasso$SubType,
  standardize = TRUE,
  alpha = 1.0,
  family = "multinomial")

# plot coefficient paths vs. shrinkage parameter, λ (lambda)
plot(lassoModel, xvar = "lambda")
```
perform 10-fold cross-validation
```r
cv.lassoModel <- cv.glmnet(
  x = data.matrix(modellingLasso[,-1]),
  y = modellingLasso$SubType,
  standardize = TRUE,
  alpha = 1.0,
  nfolds = 10,
  family = "multinomial",
  parallel = TRUE)  # parallel=TRUE requires a registered back-end, e.g. via doParallel

# plot cross-validated deviance vs. shrinkage parameter, λ (lambda)
plot(cv.lassoModel)
```
identify best predictors
Now we can identify the best predictors via two metrics: the λ (lambda) that minimises the cross-validated error (lambda.min), and the largest λ within 1 standard error of that minimum (lambda.1se):
```r
idealLambda <- cv.lassoModel$lambda.min
idealLambda1se <- cv.lassoModel$lambda.1se
print(idealLambda); print(idealLambda1se)

# derive coefficients for each gene at the chosen λ
co <- coef(cv.lassoModel, s = idealLambda, exact = TRUE)
co

co.se <- coef(cv.lassoModel, s = idealLambda1se, exact = TRUE)
co.se
```
The best predictors will generally be those that retain non-zero coefficients: the lasso shrinks the coefficients of uninformative genes to exactly zero, so the genes left with non-zero (especially large absolute) coefficients are the informative ones.
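For a multinomial fit, `coef()` returns one sparse coefficient matrix per class, so the non-zero entries can be filtered per class. A sketch with the Matrix package (which ships with R): the matrix below fakes one class's coefficients, and the gene names and values are invented for illustration.

```r
library(Matrix)

# fake sparse coefficient matrix for one class, as coef() would return
# (values and gene names are made up for this example)
co.her2 <- Matrix(c(0.8, 0, -0.45, 0, 1.2), ncol = 1, sparse = TRUE,
                  dimnames = list(c("(Intercept)", "MYC", "EGFR", "KIF2C", "CDC20"), NULL))

# keep only predictors whose coefficient was not shrunk to zero
nonzero <- co.her2[co.her2[, 1] != 0, , drop = FALSE]
rownames(nonzero)
```

With a real `cv.lassoModel`, you would apply the same filter to each element of the list returned by `coef(cv.lassoModel, s = idealLambda)`.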
For downstream applications, we could build a new model with our best predictors:
```r
# NB: binomial logistic regression models a two-level outcome; with three or
# more subtypes, first collapse SubType to two levels (e.g. Her2 vs. rest)
finalLasso <- glm(SubType ~ MYC + ERBB2 + CCND1,
  data = modellingLasso,
  family = binomial(link = "logit"))

summary(finalLasso)
```
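Since a binomial glm needs a two-level outcome, one way to collapse the multi-level subtype is a one-vs-rest recoding. A base-R sketch (the labels below are invented for illustration):

```r
# collapse the multi-level subtype to a binary Her2 vs. rest label
subtype <- factor(c("Her2", "LuminalA", "LuminalB", "Her2"))
isHer2 <- factor(ifelse(subtype == "Her2", "Her2", "Other"))
table(isHer2)
```

The binary `isHer2` factor can then serve as the response in the `glm()` call above.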
We can also perform ROC analysis:
```r
# ROC analysis
require(pROC)

roc <- roc(modellingLasso$SubType, fitted(finalLasso), smooth = FALSE)

plot.roc(
  roc,
  grid = TRUE,
  grid.lwd = 2,
  col = "royalblue",
  main = "Lasso model")

text(0.3, 0.45, col = "royalblue",
  paste("AUC (L95, AUC, U95):",
    paste(round(ci.auc(roc)[1:3], 3), sep = " ", collapse = ","),
    sep = " "),
  cex = 0.7)
```
If you run into difficulty with the lasso-penalised regression, you can relax the stringency of the penalty by lowering α (alpha):
```r
# elastic-net regression (here α=0.5)
cv.ElasticModel <- cv.glmnet(..., alpha = 0.5, ...)

# ridge regression (α=0)
cv.RidgeModel <- cv.glmnet(..., alpha = 0.0, ...)
```
Also see Professor Hastie's own in-depth tutorial: Glmnet Vignette