How to exclude some of breast cancer subtypes just by looking at gene expression?
1
1
Entering edit mode
4.0 years ago
jack ▴ 920

Hi, Probably you all know about PAM50 gene signature for breast cancer sub typing. What I'm looking for is the some of small sets of the genes that their expression levels(expressed or not) helps me to exclude some of the breast cancer subtypes (For example, by looking at gene A and gene B expression levels I can tell something like: this can not be Her2) Does anyone have idea which genes are informative in this regard ?

RNA-Seq genome cancer gene • 4.2k views
14
Entering edit mode
4.0 years ago

Edit March 5th, 2020: there is no definitive answer to the OP's question. My original answer (below) is to merely foment ideas on how to address the question.

Hi Jack,

In order to achieve this, you will have to model your expression data using the PAM50 classifier genes, and then allow the modelling to decide which genes can be best used to predict the breast cancer sub-types.

Keep in mind that PAM50 is already a relatively small classifier that can segregate the different sub-types.

You could aim to do multinomial logistic regression because your aim is to use gene expression to predict multiple categorical variables (sub-types). I would also recommend trying out lasso-penalised regression, which is what I go over in this answer (below).

You'll require:

• A gene expression dataset with PAM50 genes filtered in (if you don't know the genes, look here: Where To Download Pam50 Gene Set? )
• Breast cancer subtypes for samples

# Data setup

require(glmnet)

# Get your data like this:
modellingLasso
SubType   MYC  EGFR KIF2C CDC20
Sample 1  LuminalA 11.39 10.62  9.75 10.34
Sample 2  LuminalB 10.16  8.63  8.68  9.08
Sample 3  LuminalA  9.29 10.24  9.89 10.11
Sample 4      Her2 11.53  9.22  9.35  9.13
Sample 5      Her2  8.35 10.62 10.25 10.01
Sample 6      Her2 11.71 10.43  8.87  9.44
...


The SubType variable can be a factor:

modellingLasso$SubType ... Levels: Her2 LuminalA LuminalB ...  Now we can build the models: # build a single pass (single fold) lasso-penalised model lassoModel <- glmnet( x=data.matrix(modellingLasso[,-1]), y=modellingLasso$SubType,
standardize=TRUE,
alpha=1.0,
family="multinomial")
plot(lassoModel, xvar="lambda")


# perform 10-fold cross validation

cv.lassoModel <- cv.glmnet(
x=data.matrix(modellingLasso[,-1]),
idealLambda1se <- cv.lassoModel$lambda.1se print(idealLambda); print(idealLambda1se) # derive coefficients for each gene co <- coef(cv.lassoModel, s=idealLambda, exact=TRUE) co co.se <- coef(cv.lassoModel, s=idealLambda1se, exact=TRUE) co.se # identify predictors for each sub-type rownames(co.se$LuminalA)[which(co.se$LuminalA != 0)] rownames(co.se$LuminalB)[which(co.se$LuminalB != 0)] rownames(co.se$Her2)[which(co.se$Her2 != 0)]  # Downstream For downstream applications, we could build a new model with our best predictors: finalLasso <- glm(modellingLasso$SubType ~ MYC + ERBB2 + CCND1,
data=modellingLasso,
summary(finalLasso)


We can also perform ROC analysis:

# ROC analysis
require(pROC)

roc <- roc(modellingLasso$SubType, fitted(finalLasso), smooth=FALSE) plot.roc( roc, grid=TRUE, grid.lwd=2, col="royalblue", main="Lasso model") text(0.3, 0.45, col="royalblue", paste("AUC (L95, AUC, U95):", paste(round(ci.auc(roc)[1:3], 3), sep=" ", collapse=","), sep=" "), cex=0.7)  # Further info: If you run into difficulty with the lasso-penalised regression, you can modify the stringency by relaxing α (alpha): # elastic-net regression (here α=0.5) cv.ElasticModel <- cv.glmnet(..., alpha=0.5, ...) # ridge regression (α=0) cv.RidgeModel <- cv.glmnet(..., alpha=0.00, ...)  Also see Professor Hastie's own in-depth tutorial: Glmnet Vignette ADD COMMENT 0 Entering edit mode Hi Kevin, I have read your comment on this page and I also have a batch of RNA-seq count to predict survival When I see your data 'modellingLasso' I have two questions: 1.how to define the endpoint: if there was two patients, one have survived for 3 years and then died, the other survived for 1 year and is still alive and under follow-up. Then the first one should be died and the other should be alive?but how can we know the later one could survived longer than the first one? Or should we define a survival duration,namely 2 year. and then the first one is 2 year positive and the later one is 2 year negative? 2.does the count matric been normalized and log transformation? which method do you adapted? You use the PAM50 genes and could I use my own datasets with 500+ DEGs? ADD REPLY 0 Entering edit mode Hello, if you only have 2 patients, then it is not a great situation to have. However, if you have multiple biopsies of the same patients, then it may work - do you have multiple biopsies at different time-points?; or, do you literally just have 2 samples, both taken at the same time-point? Yes, you can use any number of genes with the lasso regression. The method has previously been used with genome-wide genetic variant data in breast cancer. ADD REPLY 0 Entering edit mode Hello,kevin, thank you for your reply. but for the question 1,I just gave an example ,actually ,I have 300+patients ,and each one do not have biopsy replication. I am just wondering how could I define the endpoints? For question 2,I'd like to ask that if the expression matrix of my own DEGs should be normalized and log transformation? There is another question,if the DEGs I choose from have high correlations between themselves ,could I choose this method of LASSO? Ridge or plasticnet is better? And I have 150 patients ,there is 1000-6000(varing with my chosen standard log2fc,padj or pvalue) DEGs as input,does this situation work? ADD REPLY 0 Entering edit mode Hello! In that case, I would recommend multiple end-points (and, therefore, multiple analyses). Your end-points can be something like: • survived up to end of year 1 • survived up to end of year 3 These will obviously be encoded as 0 and 1 ## -------------------------------- The data that you use should already be normalised. If you have microarray data, then it will already be on Log2 scale. ## ------------------------------------- The cross-validation part of LASSO regression is partly aimed to mitigate (reduce) the effects of inter-correlated variables. So, be sure that you do cross-validation. LASSO will fail if it cannot determine differences between your end-points, in which case a less severe regression penalty (like ridge or elastic-net) can be done. ADD REPLY 0 Entering edit mode Hi, Thanks a lot, this has been truly useful. Notwithstanding, I have a theoretical question. Why is SubType the response variable? Is not suppose that variable is a factor (explanatory variable) and the expression values the response variable? Thanks for your collaboration. Best regards, Juan ADD REPLY 0 Entering edit mode It depends on your point of view and the question that you are asking. I think that both ways are valid. ADD REPLY 0 Entering edit mode Thanks for your help. I have an expression matrix as I show you n the next table. Name y GAPDH PGK1 BRACA1 PPIA Sample_1 1 -3.862188587 4.0051135284 5.3343408926 -1.5971560514 Sample_2 1 -4.3324333866 4.7910247638 6.9940989042 1.3791251185 Sample_3 1 -5.4226451187 3.4368084632 3.4502391931 -5.2178824077 Sample_4 1 -6.1021958608 4.2138262027 5.1661516867 -1.146564008 Sample_5 1 -4.4832777421 4.9770826641 5.0067170526 0.7268057385 Sample_6 0 -7.7504208928 3.7658388974 2.513504636 1.1494149229 Sample_7 0 -7.9455082982 4.848660554 2.3662448291 -1.4117052604 Sample_8 0 -7.8904193457 4.6539708906 7.7405673621 -1.8672276649 Sample_9 0 -5.7225768492 2.4094299737 2.3454939119 -1.3644475688 Sample_10 0 -3.4185800705 4.3094231329 5.9014253872 -1.862157346  I have no information related to this matrix, but I have to create a predictive model. As I know, probably the matrix was created using the "voom" function, which gives negative measures. Then, it is possible to do the model just like the before example. Note: The Y variable indicates whether a patient was diagnosed with the disease (y=1) or not (y=0). I truly appreciate your comments. Best regards, Juan ADD REPLY 0 Entering edit mode Yes, you can still use the same code for this type of data. What is the distribution of your data, though? You can check with the hist() function (be sure to not include the y variable when you generate the histogram via hist()) ADD REPLY 0 Entering edit mode Hi. I shared the histogram and the overlap normal curve: data.table$y <- NULL
row.names(data.table) <- data.table$Name data.table$Name <- NULL

all.data <- unlist(data.table)

hist(all.data,probability = T,ylim = c(0,0.12))
curve(dnorm(x,mean = mean(all.data),sd = sqrt(var(all.data))),add = T)


Best regards,

Juan

0
Entering edit mode

Hola de nuevo. No puedo ver la figura. Se puede encontrar información para cargar una figura aquí: How to add images to a Biostars post

0
Entering edit mode

Sorry.

Here a new attempt.

0
Entering edit mode

Looks like great data! It seems suitable for the default functionality of cv.glmnet, which expects data to be normally distributed.

0
Entering edit mode

Hi Kevin, Thanks for your Help, It's such a great resource. I have built model LASSO for my data set, I am grateful for your tutorials. Thanks again! Secondly, when I got the best predictors I am not able to print it out it's saying (dgCMatrix) Sparse matrix error. I did use as.matrix() to print, but it's of no use. Please help me to save the predictors in a .csv file. Thanks again, Sincerely, Dave

0
Entering edit mode

Hey Dave, it is possible that some of the functionality of these commands has changed since I posted this answer (above). Which command, specifically, are you trying to run?

0
Entering edit mode

Hi Kevin, Sorry for the delayed reply, my Lab is in remotest part and to use the Internet I have to travel back to the city. Sorry again! I am using the following command

lassoModel <- glmnet(x=data.matrix(step1[,-1]), y=step1$SubType, standardize=TRUE, alpha=2.0, family="multinomial") plot(lassoModel, xvar="lambda" , label = TRUE) plot(lassoModel, xvar="dev" , label = TRUE) print(lassoModel)  Thanks again! Dave ADD REPLY 0 Entering edit mode Hey, why do you set alpha = 2.0? Usually it is between 0.0 and 1.0 Okay, so, I will hear from you again when you are next in the city. Do you not have a mobile / cellular network? ADD REPLY 0 Entering edit mode Hi Kevin, I have expression profile of the list of genes (say ~100 genes) and i want to select best features out of total list. I have came across this link and tried executing the code with glmnet. However, i am getting error at the step where i want identify best predictors for each class (binomial categories; pCR and RD) the below code giving i am using modellingLasso$Response

lassoModel <- glmnet( x=data.matrix(modellingLasso[,-1]), y=modellingLasso$Response, standardize=TRUE, alpha=0.5, family="binomial") plot(lassoModel, xvar="lambda") print(lassoModel) cv.lassoModel <- cv.glmnet( x=data.matrix(modellingLasso[,-1]), y=modellingLasso$Response, nfolds=10, family="binomial")

# plot variable deviances vs. shrinkage parameter, λ (lambda)

plot(cv.lassoModel) print(cv.lassoModel)

idealLambda <- cv.lassoModel$lambda.min idealLambda1se <- cv.lassoModel$lambda.1se

print(idealLambda); print(idealLambda1se)

# derive coefficients for each gene

co <- coef(cv.lassoModel, s=idealLambda, exact=TRUE) co

co.se <- coef(cv.lassoModel, s=idealLambda1se, exact=TRUE) co.se

# identify predictors for each sub-type

(rownamesco.se$pCR)[whichco.se$pCR != 0]

0
Entering edit mode

Hi, you probably need

rownames(co.se$pCR)[whichco.se$pCR != 0]

0
Entering edit mode

Hi Kevin,

Thanks for your prompt response, I tried running that earlier also, but getting error

**rownames(co.se$pCR)[whichco.se$pCR != 0]

Error in co.se$pCR :$ operator not defined for this S4 class**

Then i searched S4 class error and found that i could not access the slots of an S4 object using $operator. Instead it was suggested to use @ operator. I tried with @ operotor as well, but getting new error rownames(co.se@pCR)[whichco.se@pCR != 0] Error in rownames(co.se@pCR) : no slot of name "pCR" for this object of class "dgCMatrix" I am not how to fix this. Best Inayat ADD REPLY 0 Entering edit mode Hi Inayat. I think that there is an issue with rendering of my code in my post (above). The commands shoudl be: rownames(co.se$LuminalA)[which(co.se$LuminalA != 0)] rownames(co.se$LuminalB)[which(co.se$LuminalB != 0)] rownames(co.se$Her2)[which(co.se\$Her2 != 0)]


Please note the locations of the parentheses, (

0
Entering edit mode

Thanks Kevin, I will try this

Best Inayat