I have raw read counts for 56 patients who have been graded by the Mandard score (tumour regression grade, TRG) as responders and non-responders.

I have performed differential expression analysis with DESeq2 and obtained a list of DEGs between responders and non-responders. How can I build a model from these genes that is able to predict responders and non-responders? That is, how can I train a model on these genes to stratify patients into responder and non-responder groups?

```
Gene Fold Change TRG
CHGA -1.5652029 TRG 4-5 Signature
IL1B -1.3159235 TRG 4-5 Signature
CXCL8 -1.231194003 TRG 4-5 Signature
ALB -1.2116047 TRG 4-5 Signature
PTGS2 -1.1674647 TRG 4-5 Signature
CSF3 -1.12757675 TRG 4-5 Signature
OSM -1.113489387 TRG 4-5 Signature
IL6 -1.107777024 TRG 4-5 Signature
ISG15 -1.070607929 TRG 4-5 Signature
IFIT1 -1.0214389 TRG 4-5 Signature
CT45_family -1.0097384 TRG 4-5 Signature
CXCL6 -1.003206221 TRG 4-5 Signature
MAGEA1 -1.001533309 TRG 4-5 Signature
S100A7A 1.0501062 TRG 1-2 Signature
KIF5C 1.146357 TRG 1-2 Signature
HLA-DQA2 1.1829847 TRG 1-2 Signature
KRT14 1.338032 TRG 1-2 Signature
MPPED1 1.3464259 TRG 1-2 Signature
CDK6 1.43895365 TRG 1-2 Signature
KLK5 1.470414 TRG 1-2 Signature
ANKRD30A 1.4728634 TRG 1-2 Signature
CALML5 1.4947388 TRG 1-2 Signature
```

For example, to build a regression model I would presumably need a hypothesis such as: if the expression of gene A in a patient is higher than some cut-off, that patient is graded as a responder; but I don't have any threshold.

Please look at the mean expression of the top 20 differentially expressed genes in each group. Their expression levels seem quite close to each other, so I am not sure how I could report a strong signature of genes able to predict responders and non-responders.

```
> rowMeans(TRG12)
ALB CALML5 CDK6 CHGA CSF3 CXCL6 CXCL8 IL1B IL6 ISG15 KLK5 KRT14
6.234833 7.638252 10.358273 7.116288 7.646012 9.482543 9.047842 9.803639 7.811549 11.651569 9.054484 11.636721
MAGEA1 OSM PTGS2 S100A7A
6.043237 8.052714 9.228111 7.885017
> rowMeans(TRG45)
ALB CALML5 CDK6 CHGA CSF3 CXCL6 CXCL8 IL1B IL6 ISG15 KLK5 KRT14
6.531966 7.148282 9.771275 8.061407 7.976051 9.833902 9.530818 10.199655 8.471593 12.278263 8.762199 10.925208
MAGEA1 OSM PTGS2 S100A7A
6.209878 8.223946 9.703970 7.565160
```
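The threshold question above can be sidestepped with logistic regression, which learns a decision boundary from the data rather than requiring a pre-specified expression cut-off. A minimal sketch (the data frame `df` and the gene columns used here are hypothetical placeholders):

```r
# df: hypothetical data frame with one row per patient, gene expression
# columns, and a two-level factor column TRG (levels: TRG12, TRG45)
fit <- glm(TRG ~ CDK6 + ISG15, data = df,
           family = binomial(link = "logit"))

# Predicted probability of the second factor level for each patient;
# classification uses a probability threshold (e.g. 0.5),
# not an expression cut-off on any single gene
prob <- predict(fit, type = "response")
pred <- ifelse(prob > 0.5, levels(df$TRG)[2], levels(df$TRG)[1])
```

The fitted coefficients effectively encode the "cut-off" jointly across genes, so no per-gene threshold needs to be chosen in advance.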

Thanks a lot, very helpful as always.

It means that the function cannot find those genes (`CDK6` and `OAZ1`) in your data-frame, `df`. Please check the column names of your data-frame.

Hi Kevin,

I tested each of my genes independently in a separate model, and kept 5 genes as the best predictors using the lowest p-values and AIC, from which I built a final model.
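The per-gene screen described above can be sketched roughly as follows (assuming a data frame `df` with a binary factor column `TRG` and one column per gene; all names are hypothetical):

```r
genes <- setdiff(colnames(df), "TRG")

# Fit one univariable logistic model per gene and extract the
# p-value for that gene's coefficient
pvals <- sapply(genes, function(g) {
  fit <- glm(reformulate(g, response = "TRG"), data = df,
             family = binomial(link = "logit"))
  summary(fit)$coefficients[2, 4]  # p-value of the gene term
})

# Keep genes with p < 0.05 as candidates for the combined model
keep <- names(pvals)[pvals < 0.05]
```

Note that univariable screening like this ignores correlation between genes, which is one reason the combined model below can disagree with the individual fits.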

I obtained this ROC curve from the final model:

Then I used your tutorial on the lasso model in this post: A: How to exclude some of breast cancer subtypes just by looking at gene expressio, from which I obtained 3 significant genes, of which only CDK6 is common between the two models.

I obtained this ROC

My question for you, please, is: how can I tell which model's significant predictors are more correct? Also, is it usual that I am obtaining only 3 significant predictors from 23 differentially expressed genes?

To be honest, without your tutorials I would have needed months to get my head around these ideas.

Thank you man

Without rigorous testing (on other datasets and on real-life samples), it is difficult to know which model is best. However, in your results, you can present multiple different models and state the AUC from ROC analysis for each.
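Reporting the AUC for each candidate model can be sketched as follows (assuming two fitted `glm` objects `fit1` and `fit2` and the observed outcome vector `y`; names are hypothetical, and the pROC package is one common choice for this):

```r
library(pROC)

# Predicted probabilities from each candidate model
p1 <- predict(fit1, type = "response")
p2 <- predict(fit2, type = "response")

# ROC curve and AUC for each model, for side-by-side reporting
roc1 <- roc(y, p1)
roc2 <- roc(y, p2)
auc(roc1)
auc(roc2)
```

Presenting both AUCs (ideally with confidence intervals, e.g. via `ci.auc()`) lets the reader compare the models without claiming either is definitively "correct".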

You should also be applying cross-validation to these models (`cv.glm()`) and then checking the `delta` (Δ) values. For a 'good' model, the change in Δ should be small; however, if the model is 'bad', then the change in Δ will be large. `K` here is the number of folds ('nfolds'). You may want to read more about cross-validation in order to understand this. Generally, the minimum value for `K` should be 3. When `K = number of samples in dataset`, it is called 'leave-one-out' cross-validation.

## --------------------------------

Edit: if you used `cv.glmnet()`, then your lasso model is already cross-validated (this may be why there are only 3 predictors). For your `glm()` model from stepwise regression, however, you need to cross-validate this with `cv.glm()`, as I explain above.

More here: A: Multinomial elastic net implementation on microarray dataset , error at lognet a
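The cross-validation step can be sketched as follows (assuming a fitted `glm` object `fit` and its data frame `df`; both names are hypothetical):

```r
library(boot)  # provides cv.glm()

# 3-fold cross-validation of a fitted logistic model;
# cv$delta[1] is the raw CV estimate of prediction error,
# cv$delta[2] is the bias-adjusted estimate
set.seed(1)
cv <- cv.glm(df, fit, K = 3)
cv$delta

# Leave-one-out CV: set K to the number of samples
cv_loo <- cv.glm(df, fit, K = nrow(df))
cv_loo$delta
```

A small difference between the two delta values (and a low error overall) suggests the model generalises reasonably; a large difference suggests overfitting.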

I could not thank you enough. When I am building my data table as you suggest and I put TRG as the dependent variable with the two levels `TRG12` and `TRG45`, I get a result totally different from when I put `1` and `0` as the levels of my TRG. I guessed that, for logistic regression, numeric or integer values like `1, 0` or `YES, NO` would return the same thing. Please correct me if possible.

When you convert to `1` and `0`, it could be that these are coded numerically; they must be encoded as factors. Also, the reference level has to be the same in each case.

Sorry Kevin, as I mentioned, the individual glm fits gave me 5 genes as significant, and I put them in a final model.
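The factor-encoding point above (the outcome must be a factor with a consistent reference level) can be sketched as follows (the vector and level names are illustrative):

```r
# Hypothetical outcome vector
trg <- c("TRG12", "TRG45", "TRG12", "TRG45")

# Encode as a factor and fix the reference level explicitly;
# glm(..., family = binomial) models the probability of the
# second level relative to the first (the reference)
trg <- factor(trg, levels = c("TRG12", "TRG45"))

# An equivalent 0/1 factor must use the SAME reference:
# 0 = TRG12 (reference), 1 = TRG45
trg01 <- factor(ifelse(trg == "TRG45", 1, 0), levels = c(0, 1))
```

If the 0/1 version is left numeric, or its reference level is flipped, the coefficients (and their signs) will differ from the `TRG12`/`TRG45` model.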

But the summary is frustrating: it says that only one gene is significant, while all 5 genes had been selected individually as significant :(

A gene does not require a statistically significant p-value in order for it to provide useful information in a regression model. From the information above, I would say that ISG15 shows a 'trend toward statistical significance'. Did the forward, backward, and 'both' stepwise regression procedures produce the same output?

No Kevin, I only put each of the 23 genes individually into a model and selected the ones returning p-value < 0.05, which gave me 5 genes that I then put into a final model, where the summary says that only CDK6 is statistically significant.

To be honest, I am not sure how to judge whether a gene should remain in the model as a good predictor. I just look for a lower AIC and p-value, and for AIC I am not sure on which threshold to base the decision of whether a gene should stay in the model.

Sorry, when you create the model with 23 genes, you must perform stepwise regression on that. Please review the code in my original answer:

If you are not sure about this procedure, then please consult a statistician.
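The stepwise procedure on the full model can be sketched as follows (assuming a data frame `df` whose columns are the binary `TRG` outcome plus the 23 candidate genes; names are hypothetical):

```r
# Full logistic model with all candidate genes
full <- glm(TRG ~ ., data = df, family = binomial(link = "logit"))

# Stepwise selection by AIC in both directions;
# step() adds/drops terms until AIC no longer improves,
# so no manual AIC threshold is needed
reduced <- step(full, direction = "both", trace = FALSE)
summary(reduced)
```

This differs from the univariable screen: `step()` evaluates genes jointly, so correlated genes that each look significant alone may be dropped from the combined model.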

Thank you very much :)

Hi @Kevin. Thank you very much for all your helpful answers. I built a combined model for my data

I got these warnings:

I searched and came across this link, and I changed the code to this:

The first warning has gone but the second one still remains. Could you please tell me how I can fix it?

In layman's terms, it means that there are too many parameters in your model. It could be mitigated by increasing your sample size, n. Are you aiming to reduce the model parameters? Do you need all of these genes?

Thank you @Kevin! Yes, I want to reduce the predictors via stepwise regression as you said. I have 66 samples (46 non-relapse and 20 relapse). These 45 genes are the ones that I got from differential gene expression analysis. I reduced the number of genes to 24, but I still get the same warning. What do you think I should do?

In that case, you may consider using, instead, a regularised regression, i.e., with the lasso penalty. Have you used that previously?

Many thanks @Kevin for taking the time to address my questions. No, I've never used it before. I would really appreciate it if you could help me with that.

Hey, there is a really great tutorial, here: https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html

I have also posted some material on Biostars, so, if you search, you should find it.

Thank you Kevin, I'll go through that. Could you please explain why I got this warning, and how the lasso penalty will solve it?

Basically, your previous model was likely 'too good' a fit for the data that you have. This issue can occur when you have too many parameters in the model and/or when you have a small dataset.

Regression with the lasso penalty introduces a penalty / weighting for each parameter, which allows it to work with such problematic models as these. I give a very quick overview here: A: How to exclude some of breast cancer subtypes just by looking at gene expressio
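A minimal lasso sketch with glmnet (assuming a numeric matrix `x` of gene expression with samples in rows and genes in columns, and a two-level factor outcome `y`; both names are hypothetical):

```r
library(glmnet)

# Cross-validated lasso-penalised logistic regression
# (alpha = 1 selects the lasso penalty)
set.seed(1)
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 3)

# Coefficients at the lambda minimising cross-validated error.
# The result prints as a sparse matrix: a '.' marks a coefficient
# shrunk exactly to zero, i.e. a gene dropped from the model
coef(cvfit, s = "lambda.min")
```

Genes whose coefficients survive as non-zero at `lambda.min` are the predictors the lasso retains; the penalty itself is what prevents the perfect-fit warnings seen with plain `glm()` on many correlated predictors.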

I saw your comprehensive post here about how to build a lasso-penalised model. I did it based on your instructions and I got this coefficient for each gene:

These are the results based on the min λ (lambda) metric. Sorry for the naive question; I am new to this field. Could you please tell me what is the meaning of those dots? In your post you mentioned that "The best predictors will generally be those that have the smallest (possibly zero) coefficient values". You provided two coefficients, "co" and "co_se"; based on which one should we select the best predictors? Many thanks!