Hi Jack,

In order to achieve this, you will have to model your expression data using the PAM50 classifier genes, and then allow the modelling to decide which genes can be best used to predict the breast cancer sub-types. **Keep in mind that PAM50 is already a relatively small classifier that can statistically segregate the different sub-types.**

You could use multinomial logistic regression, given that your aim is to predict multiple categorical outcomes (sub-types) from gene expression. I posted some short code for this here: Resources for gene signature creation
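As a minimal sketch of the multinomial approach, `multinom()` from the `nnet` package can be used; `modellingData` here is a hypothetical data frame (one row per sample, a factor column `SubType`, and one numeric column per gene), standing in for your own data:

```
require(nnet)

# fit a multinomial model: sub-type as outcome, gene expression as predictors
# 'modellingData' is a placeholder for your own data frame
multinomModel <- multinom(SubType ~ ., data = modellingData)
summary(multinomModel)

# predicted sub-type probabilities, one column per sub-type
head(predict(multinomModel, type = "probs"))
```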

I would also recommend trying out lasso-penalised regression (α=1), which is what I go over in this answer (below).

You'll require:

- A gene expression dataset with PAM50 genes filtered in (if you don't know the genes, look here: Where To Download Pam50 Gene Set? )
- Breast cancer subtypes for samples
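To restrict your expression data to the PAM50 genes, a simple subset on the row names works; `exprMat`, `pam50Genes`, and `sampleSubTypes` below are placeholder names for your own expression matrix, PAM50 gene vector, and sub-type labels:

```
# 'exprMat' is a hypothetical genes-x-samples expression matrix with gene
# symbols as row names; 'pam50Genes' is a character vector of PAM50 symbols
pam50Expr <- exprMat[rownames(exprMat) %in% pam50Genes, ]

# glmnet expects samples as rows and predictors (genes) as columns,
# so transpose before modelling; 'sampleSubTypes' is a factor of labels
modellingLasso <- data.frame(SubType = sampleSubTypes, t(pam50Expr))
```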

## -------------------------------

# Data setup

```
require(glmnet)

# Your data should look like this:
modellingLasso
           SubType   MYC  EGFR KIF2C CDC20
Sample 1  LuminalA 11.39 10.62  9.75 10.34
Sample 2  LuminalB 10.16  8.63  8.68  9.08
Sample 3  LuminalA  9.29 10.24  9.89 10.11
Sample 4      Her2 11.53  9.22  9.35  9.13
Sample 5      Her2  8.35 10.62 10.25 10.01
Sample 6      Her2 11.71 10.43  8.87  9.44
...
```

The `SubType` variable should be encoded as a factor:

```
modellingLasso$SubType
...
Levels: Her2 LuminalA LuminalB ...
```

Now we can build the models:

# build a single pass (single fold) lasso-penalised model

```
lassoModel <- glmnet(
  x = data.matrix(modellingLasso[,-1]),
  y = modellingLasso$SubType,
  standardize = TRUE,
  alpha = 1.0,
  family = "multinomial")

plot(lassoModel, xvar = "lambda")
```

# perform 10-fold cross validation

```
# note: parallel=TRUE requires a registered parallel backend,
# e.g. via doParallel::registerDoParallel()
cv.lassoModel <- cv.glmnet(
  x = data.matrix(modellingLasso[,-1]),
  y = modellingLasso$SubType,
  standardize = TRUE,
  alpha = 1.0,
  nfolds = 10,
  family = "multinomial",
  parallel = TRUE)

# plot cross-validated deviance vs. the shrinkage parameter, λ (lambda)
plot(cv.lassoModel)
```

# identify best predictors

Now we can identify the best predictors via two metrics: the λ (lambda) that minimises the cross-validated error (`lambda.min`), and the largest λ within one standard error of that minimum (`lambda.1se`):

```
idealLambda <- cv.lassoModel$lambda.min
idealLambda1se <- cv.lassoModel$lambda.1se
print(idealLambda); print(idealLambda1se)

# derive coefficients for each gene at each choice of λ
# (note: with exact=TRUE, recent glmnet versions may also require
# passing the original x and y to coef())
co <- coef(cv.lassoModel, s = idealLambda, exact = TRUE)
co

co.se <- coef(cv.lassoModel, s = idealLambda1se, exact = TRUE)
co.se
```

The best predictors are those that retain non-zero coefficients: the lasso penalty shrinks the coefficients of uninformative genes to exactly zero, removing them from the model.
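To list the retained genes programmatically, you can pull the non-zero entries out of the per-class coefficient matrices (for a multinomial fit, `coef()` returns one sparse matrix per sub-type):

```
# 'co' is the list of sparse coefficient matrices from coef() above;
# for each sub-type, keep predictors with non-zero coefficients
selectedGenes <- lapply(co, function(mat) {
  genes <- rownames(mat)[as.vector(mat != 0)]
  setdiff(genes, "(Intercept)")   # drop the intercept term
})
selectedGenes
```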

# Downstream

For downstream applications, we could build a new model with our best predictors:

```
# note: binomial logistic regression requires a two-level outcome, so either
# subset the data to two sub-types of interest or fit a multinomial model;
# MYC, ERBB2, and CCND1 stand in for whichever genes your lasso retained
finalLasso <- glm(SubType ~ MYC + ERBB2 + CCND1,
  data = modellingLasso,
  family = binomial(link = "logit"))
summary(finalLasso)
```

We can also perform ROC analysis:

```
# ROC analysis
require(pROC)
# note: roc() likewise expects a two-level response
roc <- roc(modellingLasso$SubType, fitted(finalLasso), smooth = FALSE)

plot.roc(
  roc,
  grid = TRUE,
  grid.lwd = 2,
  col = "royalblue",
  main = "Lasso model")

text(0.3, 0.45,
  col = "royalblue",
  paste("AUC (L95, AUC, U95):",
    paste(round(ci.auc(roc)[1:3], 3), sep = " ", collapse = ","), sep = " "),
  cex = 0.7)
```

## ---------------------------

# Further info:

If you run into difficulty with the lasso penalty (for example, too few genes retained), you can relax its stringency by lowering α (alpha) towards an elastic-net or ridge model:

```
# elastic-net regression (here α=0.5)
cv.ElasticModel <- cv.glmnet(..., alpha=0.5, ...)
# ridge regression (α=0)
cv.RidgeModel <- cv.glmnet(..., alpha=0.00, ...)
```

Also see Professor Hastie's own in-depth tutorial: Glmnet Vignette