Question: How to exclude some of breast cancer subtypes just by looking at gene expression?
gravatar for jack
15 months ago by
jack750 wrote:

Hi, Probably you all know about PAM50 gene signature for breast cancer sub typing. What I'm looking for is the some of small sets of the genes that their expression levels(expressed or not) helps me to exclude some of the breast cancer subtypes (For example, by looking at gene A and gene B expression levels I can tell something like: this can not be Her2) Does anyone have idea which genes are informative in this regard ?

cancer rna-seq gene genome • 1.3k views
ADD COMMENTlink modified 15 months ago by Kevin Blighe39k • written 15 months ago by jack750
gravatar for Kevin Blighe
15 months ago by
Kevin Blighe39k
Republic of Ireland
Kevin Blighe39k wrote:

Hi Jack,

In order to achieve this, you will have to model your expression data using the PAM50 classifier genes, and then allow the modelling to decide which genes can be best used to predict the breast cancer sub-types. Keep in mind that PAM50 is already a relatively small classifier that can statistically segregate the different sub-types.

You could aim to do multinomial logistic regression because your aim is to use gene expression to predict multiple categorical variables (sub-types). I posted some short code for this, here: Resources for gene signature creation

I would also recommend trying out lasso-penalised regression (α=1), which is what I go over in this answer (below).

You'll require:

  • A gene expression dataset with PAM50 genes filtered in (if you don't know the genes, look here: Where To Download Pam50 Gene Set? )
  • Breast cancer subtypes for samples


Data setup


#Get your data like this:
           SubType   MYC  EGFR KIF2C CDC20
Sample 1  LuminalA 11.39 10.62  9.75 10.34
Sample 2  LuminalB 10.16  8.63  8.68  9.08
Sample 3  LuminalA  9.29 10.24  9.89 10.11
Sample 4      Her2 11.53  9.22  9.35  9.13
Sample 5      Her2  8.35 10.62 10.25 10.01
Sample 6      Her2 11.71 10.43  8.87  9.44

The SubType variable can be a factor:

Levels: Her2 LuminalA LuminalB ...

Now we can build the models:

build a single pass (single fold) lasso-penalised model

lassoModel <- glmnet(
plot(lassoModel, xvar="lambda")


perform 10-fold cross validation

cv.lassoModel <- cv.glmnet(

# plot variable deviances vs. shrinkage parameter, λ (lambda)


identify best predictors

Now we can identify best predictors via 2 metrics: min λ (lamba); 1 standard error of λ

idealLambda <- cv.lassoModel$lambda.min
idealLambda1se <- cv.lassoModel$lambda.1se
print(idealLambda); print(idealLambda1se)

# derive coefficients for each gene
co <- coef(cv.lassoModel, s=idealLambda, exact=TRUE)
co <- coef(cv.lassoModel, s=idealLambda1se, exact=TRUE)

The best predictors will generally be those that have the smallest (possibly zero) coefficient values


For downstream applications, we could build a new model with our best predictors:

finalLasso <- glm(modellingLasso$SubType ~ MYC + ERBB2 + CCND1,

We can also perform ROC analysis:

# ROC analysis

roc <- roc(modellingLasso$SubType, fitted(finalLasso), smooth=FALSE)
  main="Lasso model")
text(0.3, 0.45,
  paste("AUC (L95, AUC, U95):",
    paste(round(ci.auc(roc)[1:3], 3), sep=" ", collapse=","), sep=" "),



Further info:

If you run into difficulty with the lasso-penalised regression, you can modify the stringency by relaxing α (alpha):

# elastic-net regression (here α=0.5)
cv.ElasticModel <- cv.glmnet(..., alpha=0.5, ...)

# ridge regression (α=0)
cv.RidgeModel <- cv.glmnet(..., alpha=0.00, ...)

Also see Professor Hastie's own in-depth tutorial: Glmnet Vignette

ADD COMMENTlink modified 6 weeks ago • written 15 months ago by Kevin Blighe39k

Hi Kevin, I have read your comment on this page and I also have a batch of RNA-seq count to predict survival When I see your data 'modellingLasso' I have two questions: to define the endpoint: if there was two patients, one have survived for 3 years and then died, the other survived for 1 year and is still alive and under follow-up. Then the first one should be died and the other should be alive?but how can we know the later one could survived longer than the first one? Or should we define a survival duration,namely 2 year. and then the first one is 2 year positive and the later one is 2 year negative?

2.does the count matric been normalized and log transformation? which method do you adapted? You use the PAM50 genes and could I use my own datasets with 500+ DEGs?

ADD REPLYlink written 11 months ago by 15617759110

Hello, if you only have 2 patients, then it is not a great situation to have. However, if you have multiple biopsies of the same patients, then it may work - do you have multiple biopsies at different time-points?; or, do you literally just have 2 samples, both taken at the same time-point?

Yes, you can use any number of genes with the lasso regression. The method has previously been used with genome-wide genetic variant data in breast cancer.

ADD REPLYlink written 11 months ago by Kevin Blighe39k

Hello,kevin, thank you for your reply. but for the question 1,I just gave an example ,actually ,I have 300+patients ,and each one do not have biopsy replication. I am just wondering how could I define the endpoints?
For question 2,I'd like to ask that if the expression matrix of my own DEGs should be normalized and log transformation? There is another question,if the DEGs I choose from have high correlations between themselves ,could I choose this method of LASSO? Ridge or plasticnet is better? And I have 150 patients ,there is 1000-6000(varing with my chosen standard log2fc,padj or pvalue) DEGs as input,does this situation work?

ADD REPLYlink written 11 months ago by 15617759110


In that case, I would recommend multiple end-points (and, therefore, multiple analyses). Your end-points can be something like:

  • survived up to end of year 1
  • survived up to end of year 3

These will obviously be encoded as 0 and 1


The data that you use should already be normalised. If you have microarray data, then it will already be on Log2 scale.


The cross-validation part of LASSO regression is partly aimed to mitigate (reduce) the effects of inter-correlated variables. So, be sure that you do cross-validation.

LASSO will fail if it cannot determine differences between your end-points, in which case a less severe regression penalty (like ridge or elastic-net) can be done.

ADD REPLYlink written 11 months ago by Kevin Blighe39k


Thanks a lot, this has been truly useful.

Notwithstanding, I have a theoretical question.

Why is SubType the response variable? Is not suppose that variable is a factor (explanatory variable) and the expression values the response variable?

Thanks for your collaboration.

Best regards,


ADD REPLYlink written 5 weeks ago by judhenaosa0

It depends on your point of view and the question that you are asking. I think that both ways are valid.

ADD REPLYlink written 5 weeks ago by Kevin Blighe39k

Thanks for your help.

I have an expression matrix as I show you n the next table.

Name        y    GAPDH          PGK1            BRACA1          PPIA
Sample_1    1   -3.862188587    4.0051135284    5.3343408926    -1.5971560514
Sample_2    1   -4.3324333866   4.7910247638    6.9940989042    1.3791251185
Sample_3    1   -5.4226451187   3.4368084632    3.4502391931    -5.2178824077
Sample_4    1   -6.1021958608   4.2138262027    5.1661516867    -1.146564008
Sample_5    1   -4.4832777421   4.9770826641    5.0067170526    0.7268057385
Sample_6    0   -7.7504208928   3.7658388974    2.513504636     1.1494149229
Sample_7    0   -7.9455082982   4.848660554     2.3662448291    -1.4117052604
Sample_8    0   -7.8904193457   4.6539708906    7.7405673621    -1.8672276649
Sample_9    0   -5.7225768492   2.4094299737    2.3454939119    -1.3644475688
Sample_10   0   -3.4185800705   4.3094231329    5.9014253872    -1.862157346

I have no information related to this matrix, but I have to create a predictive model. As I know, probably the matrix was created using the "voom" function, which gives negative measures. Then, it is possible to do the model just like the before example.

Note: The Y variable indicates whether a patient was diagnosed with the disease (y=1) or not (y=0).

I truly appreciate your comments.

Best regards,


ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by judhenaosa0

Yes, you can still use the same code for this type of data. What is the distribution of your data, though? You can check with the hist() function (be sure to not include the y variable when you generate the histogram via hist())

ADD REPLYlink written 4 weeks ago by Kevin Blighe39k


I shared the histogram and the overlap normal curve:

data.table$y <- NULL
row.names(data.table) <- data.table$Name
data.table$Name <- NULL

head(data.table) <- unlist(data.table)

hist(,probability = T,ylim = c(0,0.12))
curve(dnorm(x,mean = mean(,sd = sqrt(var(,add = T)

enter image description here

Thanks for your help.

Best regards,


ADD REPLYlink written 4 weeks ago by judhenaosa0

Hola de nuevo. No puedo ver la figura. Se puede encontrar información para cargar una figura aquí: How to add images to a Biostars post

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Kevin Blighe39k


Here a new attempt.

enter image description here

ADD REPLYlink written 4 weeks ago by judhenaosa0

Looks like great data! It seems suitable for the default functionality of cv.glmnet, which expects data to be normally distributed.

ADD REPLYlink written 4 weeks ago by Kevin Blighe39k

Hi Kevin, Thanks for your Help, It's such a great resource. I have built model LASSO for my data set, I am grateful for your tutorials. Thanks again! Secondly, when I got the best predictors I am not able to print it out it's saying (dgCMatrix) Sparse matrix error. I did use as.matrix() to print, but it's of no use. Please help me to save the predictors in a .csv file. Thanks again, Sincerely, Dave

ADD REPLYlink written 4 weeks ago by David_emir310

Hey Dave, it is possible that some of the functionality of these commands has changed since I posted this answer (above). Which command, specifically, are you trying to run?

ADD REPLYlink written 4 weeks ago by Kevin Blighe39k

Hi Kevin, Sorry for the delayed reply, my Lab is in remotest part and to use the Internet I have to travel back to the city. Sorry again! I am using the following command

lassoModel <- glmnet(x=data.matrix(step1[,-1]), y=step1$SubType, standardize=TRUE, alpha=2.0, family="multinomial")
plot(lassoModel, xvar="lambda" , label = TRUE)
plot(lassoModel, xvar="dev" , label = TRUE)

Thanks again! Dave

ADD REPLYlink modified 27 days ago • written 27 days ago by David_emir310

Hey, why do you set alpha = 2.0? Usually it is between 0.0 and 1.0

Okay, so, I will hear from you again when you are next in the city. Do you not have a mobile / cellular network?

ADD REPLYlink written 26 days ago by Kevin Blighe39k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1108 users visited in the last hour