**Edit March 5th, 2020: there is no definitive answer to the OP's question. My original answer (below) is to merely foment ideas on how to address the question.**

Hi Jack,

In order to achieve this, you will have to model your expression data using the PAM50 classifier genes, and then allow the modelling to decide which genes can be best used to predict the breast cancer sub-types.

*Keep in mind that PAM50 is already a relatively small classifier that can segregate the different sub-types.*

You could aim to do multinomial logistic regression because your aim is to use gene expression to predict multiple categorical variables (sub-types). I would also recommend trying out lasso-penalised regression, which is what I go over in this answer (below).

You'll require:

- A gene expression dataset with PAM50 genes filtered in (if you don't know the genes, look here: Where To Download Pam50 Gene Set? )
- Breast cancer subtypes for samples

# Data setup

```
require(glmnet)
# Get your data like this:
modellingLasso
SubType MYC EGFR KIF2C CDC20
Sample 1 LuminalA 11.39 10.62 9.75 10.34
Sample 2 LuminalB 10.16 8.63 8.68 9.08
Sample 3 LuminalA 9.29 10.24 9.89 10.11
Sample 4 Her2 11.53 9.22 9.35 9.13
Sample 5 Her2 8.35 10.62 10.25 10.01
Sample 6 Her2 11.71 10.43 8.87 9.44
...
```

The `SubType`

variable can be a factor:

```
modellingLasso$SubType
...
Levels: Her2 LuminalA LuminalB ...
```

Now we can build the models:

# build a single pass (single fold) lasso-penalised model

```
lassoModel <- glmnet(
x=data.matrix(modellingLasso[,-1]),
y=modellingLasso$SubType,
standardize=TRUE,
alpha=1.0,
family="multinomial")
plot(lassoModel, xvar="lambda")
```

# perform 10-fold cross validation

```
cv.lassoModel <- cv.glmnet(
x=data.matrix(modellingLasso[,-1]),
y=modellingLasso$SubType,
standardize=TRUE,
alpha=1.0,
nfolds=10,
family="multinomial",
parallel=TRUE)
# plot variable deviances vs. shrinkage parameter, λ (lambda)
plot(cv.lassoModel)
```

# identify best predictors

Now we can identify best predictors via 2 metrics: min λ (lamba); 1 standard error of λ

```
idealLambda <- cv.lassoModel$lambda.min
idealLambda1se <- cv.lassoModel$lambda.1se
print(idealLambda); print(idealLambda1se)
# derive coefficients for each gene
co <- coef(cv.lassoModel, s=idealLambda, exact=TRUE)
co
co.se <- coef(cv.lassoModel, s=idealLambda1se, exact=TRUE)
co.se
# identify predictors for each sub-type
rownames(co.se$LuminalA)[which(co.se$LuminalA != 0)]
rownames(co.se$LuminalB)[which(co.se$LuminalB != 0)]
rownames(co.se$Her2)[which(co.se$Her2 != 0)]
```

# Downstream

For downstream applications, we could build a new model with our best predictors:

```
finalLasso <- glm(modellingLasso$SubType ~ MYC + ERBB2 + CCND1,
data=modellingLasso,
family=binomial(link="logit"))
summary(finalLasso)
```

We can also perform ROC analysis:

```
# ROC analysis
require(pROC)
roc <- roc(modellingLasso$SubType, fitted(finalLasso), smooth=FALSE)
plot.roc(
roc,
grid=TRUE,
grid.lwd=2,
col="royalblue",
main="Lasso model")
text(0.3, 0.45,
col="royalblue",
paste("AUC (L95, AUC, U95):",
paste(round(ci.auc(roc)[1:3], 3), sep=" ", collapse=","), sep=" "),
cex=0.7)
```

# Further info:

If you run into difficulty with the lasso-penalised regression, you can modify the stringency by relaxing α (alpha):

```
# elastic-net regression (here α=0.5)
cv.ElasticModel <- cv.glmnet(..., alpha=0.5, ...)
# ridge regression (α=0)
cv.RidgeModel <- cv.glmnet(..., alpha=0.00, ...)
```

Also see Professor Hastie's own in-depth tutorial: Glmnet Vignette

Hi Kevin, I have read your comment on this page and I also have a batch of RNA-seq count to predict survival When I see your data 'modellingLasso' I have two questions: 1.how to define the endpoint: if there was two patients, one have survived for 3 years and then died, the other survived for 1 year and is still alive and under follow-up. Then the first one should be died and the other should be alive?but how can we know the later one could survived longer than the first one? Or should we define a survival duration,namely 2 year. and then the first one is 2 year positive and the later one is 2 year negative?

2.does the count matric been normalized and log transformation? which method do you adapted? You use the PAM50 genes and could I use my own datasets with 500+ DEGs?

Hello, if you only have 2 patients, then it is not a great situation to have. However, if you have multiple biopsies of the same patients, then it may work - do you have multiple biopsies at different time-points?; or, do you literally just have 2 samples, both taken at the same time-point?

Yes, you can use any number of genes with the lasso regression. The method has previously been used with genome-wide genetic variant data in breast cancer.

Hello,kevin, thank you for your reply. but for the question 1,I just gave an example ,actually ,I have 300+patients ,and each one do not have biopsy replication. I am just wondering how could I define the endpoints?

For question 2,I'd like to ask that if the expression matrix of my own DEGs should be normalized and log transformation? There is another question,if the DEGs I choose from have high correlations between themselves ,could I choose this method of LASSO? Ridge or plasticnet is better? And I have 150 patients ,there is 1000-6000(varing with my chosen standard log2fc,padj or pvalue) DEGs as input,does this situation work?

Hello!

In that case, I would recommend multiple end-points (and, therefore, multiple analyses). Your end-points can be something like:

These will obviously be encoded as 0 and 1

## --------------------------------

The data that you use should already be normalised. If you have microarray data, then it will already be on Log2 scale.

## -------------------------------------

The cross-validation part of LASSO regression is partly aimed to mitigate (reduce) the effects of inter-correlated variables. So, be sure that you do cross-validation.

LASSO will fail if it cannot determine differences between your end-points, in which case a less severe regression penalty (like ridge or elastic-net) can be done.

Hi,

Thanks a lot, this has been truly useful.

Notwithstanding, I have a theoretical question.

Why is SubType the response variable? Is not suppose that variable is a factor (explanatory variable) and the expression values the response variable?

Thanks for your collaboration.

Best regards,

Juan

It depends on your point of view and the question that you are asking. I think that both ways are valid.

Thanks for your help.

I have an expression matrix as I show you n the next table.

I have no information related to this matrix, but I have to create a predictive model. As I know, probably the matrix was created using the "voom" function, which gives negative measures. Then, it is possible to do the model just like the before example.

Note: The Y variable indicates whether a patient was diagnosed with the disease (y=1) or not (y=0).

I truly appreciate your comments.

Best regards,

Juan

Yes, you can still use the same code for this type of data. What is the distribution of your data, though? You can check with the

`hist()`

function (be sure to not include the`y`

variable when you generate the histogram via`hist()`

)Hi.

I shared the histogram and the overlap normal curve:

Thanks for your help.

Best regards,

Juan

Hola de nuevo. No puedo ver la figura. Se puede encontrar información para cargar una figura aquí: How to add images to a Biostars post

Sorry.

Here a new attempt.

Looks like great data! It seems suitable for the default functionality of

`cv.glmnet`

, which expects data to be normally distributed.Hi Kevin, Thanks for your Help, It's such a great resource. I have built model LASSO for my data set, I am grateful for your tutorials. Thanks again! Secondly, when I got the best predictors I am not able to print it out it's saying (dgCMatrix) Sparse matrix error. I did use as.matrix() to print, but it's of no use. Please help me to save the predictors in a .csv file. Thanks again, Sincerely, Dave

Hey Dave, it is possible that some of the functionality of these commands has changed since I posted this answer (above). Which command, specifically, are you trying to run?

Hi Kevin, Sorry for the delayed reply, my Lab is in remotest part and to use the Internet I have to travel back to the city. Sorry again! I am using the following command

Thanks again! Dave

Hey, why do you set alpha = 2.0? Usually it is between 0.0 and 1.0

Okay, so, I will hear from you again when you are next in the city. Do you not have a mobile / cellular network?

Hi Kevin, I have expression profile of the list of genes (say ~100 genes) and i want to select best features out of total list. I have came across this link and tried executing the code with glmnet. However, i am getting error at the step where i want identify best predictors for each class (binomial categories; pCR and RD) the below code giving i am using

modellingLasso$Response

lassoModel <- glmnet( x=data.matrix(modellingLasso[,-1]), y=modellingLasso$Response, standardize=TRUE, alpha=0.5, family="binomial") plot(lassoModel, xvar="lambda") print(lassoModel)

cv.lassoModel <- cv.glmnet( x=data.matrix(modellingLasso[,-1]), y=modellingLasso$Response, nfolds=10, family="binomial")

## plot variable deviances vs. shrinkage parameter, λ (lambda)

plot(cv.lassoModel) print(cv.lassoModel)

idealLambda <- cv.lassoModel$lambda.min

idealLambda1se <- cv.lassoModel$lambda.1se

print(idealLambda); print(idealLambda1se)

## derive coefficients for each gene

co <- coef(cv.lassoModel, s=idealLambda, exact=TRUE) co

co.se <- coef(cv.lassoModel, s=idealLambda1se, exact=TRUE) co.se

## identify predictors for each sub-type

(rownamesco.se$pCR)[whichco.se$pCR != 0]

Error: object 'rownamesco.se' not foundHi, you probably need

Hi Kevin,

Thanks for your prompt response, I tried running that earlier also, but getting error

**rownames(co.se$pCR)[whichco.se$pCR != 0]Error in co.se$pCR : $ operator not defined for this S4 class**Then i searched S4 class error and found that i could not access the slots of an S4 object using $ operator. Instead it was suggested to use @ operator. I tried with @ operotor as well, but getting new error

rownames(co.se@pCR)[whichco.se@pCR != 0]Error in rownames(co.se@pCR) : no slot of name "pCR" for this object of class "dgCMatrix"I am not how to fix this.

Best

Inayat

Hi Inayat. I think that there is an issue with rendering of my code in my post (above). The commands shoudl be:

Please note the locations of the parentheses,

`(`

Thanks Kevin, I will try this

Best Inayat

Hi Dr. Kevin, I selected 40 prognostic genes having significant survival difference (survival R package) using univariate cox regression that had 25,000 genes. I want to narrow down further genes, using Lasso regression. By using the above method, shall I use the survival time (alive and dead) in y variable as a factor and keep x as expression data. Is it fine? I am a bit confused how to add survival information in y column.

And do I need to use the entire RNA-seq data or only use 40 genes data for lasso regression. Will an expression dataset of 40 genes would be fine. My goal is to build a prognostic model. Will really appreciate your comment.