Question: Building a predictive model by a list of genes
0
gravatar for F
9 weeks ago by
F3.4k
Iran
F3.4k wrote:

I have raw read counts of 56 patients who have been ranked with Mandard score as responders and non responders;

I have done differential expression by DESeq2 and I obtained a list of DEGs between responders and non responders; How I can build a model by these genes that is able to predict responders and non responders? I mean How I can train a model by these genes to stratify patients into groups of responders or non-responses.

Gene    Fold Change TRG
    CHGA    -1.5652029  TRG 4-5 Signature
    IL1B    -1.3159235  TRG 4-5 Signature
    CXCL8   -1.231194003    TRG 4-5 Signature
    ALB -1.2116047  TRG 4-5 Signature
    PTGS2   -1.1674647  TRG 4-5 Signature
    CSF3    -1.12757675 TRG 4-5 Signature
    OSM -1.113489387    TRG 4-5 Signature
    IL6 -1.107777024    TRG 4-5 Signature
    ISG15   -1.070607929    TRG 4-5 Signature
    IFIT1   -1.0214389  TRG 4-5 Signature
    CT45_family -1.0097384  TRG 4-5 Signature
    CXCL6   -1.003206221    TRG 4-5 Signature
    MAGEA1  -1.001533309    TRG 4-5 Signature
    S100A7A 1.0501062   TRG 1-2 Signature 
    KIF5C   1.146357    TRG 1-2 Signature 
    HLA-DQA2    1.1829847   TRG 1-2 Signature 
    KRT14   1.338032    TRG 1-2 Signature 
    MPPED1  1.3464259   TRG 1-2 Signature 
    CDK6    1.43895365  TRG 1-2 Signature 
    KLK5    1.470414    TRG 1-2 Signature 
    ANKRD30A    1.4728634   TRG 1-2 Signature 
    CALML5  1.4947388   TRG 1-2 Signature

For example for building a linear regression model I should have hypothesis that could be if the expression of gene A in a patient gets higher than this cut-off, this patient would graded as responders but I don't have any threshould

Please look at the mean of expression of top 20 differentially expressed genes in each group; Their expression seems pretty close to each other so I am not sure how to report a strong signature of genes that are able to predict responders and non-responders

> rowMeans(TRG12)
      ALB    CALML5      CDK6      CHGA      CSF3     CXCL6     CXCL8      IL1B       IL6     ISG15      KLK5     KRT14 
 6.234833  7.638252 10.358273  7.116288  7.646012  9.482543  9.047842  9.803639  7.811549 11.651569  9.054484 11.636721 
   MAGEA1       OSM     PTGS2   S100A7A 
 6.043237  8.052714  9.228111  7.885017 
> rowMeans(TRG45)
      ALB    CALML5      CDK6      CHGA      CSF3     CXCL6     CXCL8      IL1B       IL6     ISG15      KLK5     KRT14 
 6.531966  7.148282  9.771275  8.061407  7.976051  9.833902  9.530818 10.199655  8.471593 12.278263  8.762199 10.925208 
   MAGEA1       OSM     PTGS2   S100A7A 
 6.209878  8.223946  9.703970  7.565160
rna-seq machine learning R • 306 views
ADD COMMENTlink modified 9 weeks ago • written 9 weeks ago by F3.4k
5
gravatar for Kevin Blighe
9 weeks ago by
Kevin Blighe39k
Republic of Ireland
Kevin Blighe39k wrote:

With DESeq2, transform your normalised counts via the vst or rlog transformations, and then use the resulting data for the model building. For the modelling, your response (y) variable is then TRG (encoded as a factor) and the predictors (x) are the DEGs.

That is, get your data like this:

df
              TRG    MYC   EGFR  KIF2C CDC20
    Sample 1  TRG12  11.39 10.62  9.75 10.34
    Sample 2  TRG12  10.16  8.63  8.68  9.08
    Sample 3  TRG12   9.29 10.24  9.89 10.11
    Sample 4  TRG45  11.53  9.22  9.35  9.13
    Sample 5  TRG45   8.35 10.62 10.25 10.01
    Sample 6  TRG45  11.71 10.43  8.87  9.44
    ...

Then, either test each gene independently in a separate model and keep the best predictors:

glm(TRG ~ MYC, data = df, family = binomial(link = 'logit'))
glm(TRG ~ EGFR, data = df, family = binomial(link = 'logit'))
*et cetera*

...or build a combined model and reduce the predictors via stepwise regression.

require(MASS)

null <- glm(TRG ~ 1, data = df, family = binomial(link = 'logit'))
full <- glm(TRG ~ MYC + EGFR + ... + geneZ, data = df, family = binomial(link = 'logit'))

#Step through the variables for stepwise regression
both <- stepAIC(null, scope=list(lower=null, upper=full), direction="both")
forward <- stepAIC(null, scope=list(lower=null, upper=full), direction="forward")
backward <- stepAIC(full, direction="backward")

Once you have your final list of best predictors, build a final model that contains the best predictors from the steps above:

final <- glm(TRG ~ CDK6 + CHGA + CSF3 + CXCL8, data = df, family = binomial(link = 'logit'))

With the final model, you should perform cross validation and ROC analysis. You can also test sensitivity, specificity, precision, and accuracy, and perform model predictions on independent datasets. See here:

Kevin

ADD COMMENTlink modified 8 weeks ago • written 9 weeks ago by Kevin Blighe39k

Thanks a lot, very helpful as always;

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by F3.4k
1

It means that the function cannot find those genes (CDK6 and OAZ1) in your data-frame, df. Please check the column names of your data-frame:

colnames(df)
ADD REPLYlink written 9 weeks ago by Kevin Blighe39k

Hi Kevin,

I tested each of my genes independently in a separate model like

 glm(TRG ~ CDK6, data = df, family = binomial(link = 'logit'))

and kept 5 genes as best predictors usin the least p value and AIC by which I built final model using

final <- glm(TRG ~ CDK6 + CHGA + CSF3 + CXCL8 + IL6, data = df, family = binomial(link = 'logit'))

I obtained this ROC curve from final model

enter image description here

Then I used your tutorial on lasso model in this post

A: How to exclude some of breast cancer subtypes just by looking at gene expressio by which I obtained 3 significant genes from which only CDK6 is common between two models

I obtained this ROC

enter image description here

enter image description here

My question of you please is; How I could know significant predictors from which model is more correct? Also, is it ususl that I am obtaining only 3 significant predictors from 23 differentially genes?

To be honest, without your tutorials I would need months to head around to get idea

Thank you man

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by F3.4k
1

Without rigorous testing (on other datasets and on real-life samples), it is difficult to know which model is best. However, in your results, you can present multiple different models and state the AUC from ROC analysis for each.

You should also be applying cross validation to these models (cv.glm()) and then checking the Delta (Δ) values. For a 'good' model, change in Δ should be small; however, if the model is 'bad', then change in Δ will be large.

#Perform cross validation (CV) analysis
#The delta values should not greatly differ
#K=number of samples, i.e., leave-one-out CV.
library(boot)
cv.glm(MyData, MyModel, K=nrow(MyData))$delta

K here is the number of 'nfolds'. You may want to read more about cross validation in order to understand this. Generally, the minimum value for K should be 3. When K = number of samples in dataset, it is called 'leave-one-out' cross validation.

--------------------------------

Edit: if you used cv.glmnet(), then your lasso model is already cross-validated (this may be why there are only 3 predictors). For your glm() model from stewpwise regression, however, you need to cross-validate this with cv.glm(), as I explain above.

More here: A: Multinomial elastic net implementation on microarray dataset , error at lognet a

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by Kevin Blighe39k

I could not thank you enough. When I am building my data table as you suggest and I put TRG as dependent variable with two levels TRG12 and TRG45 I am getting a result totally different with when I am putting 1 and 0 as levels of my TRG. I guessed for logistic regression numeric or integer values like 1 , 0 or YES and NO would return the same thing. Please correct me if possible

ADD REPLYlink written 9 weeks ago by F3.4k
1

When you convert to 1 and 0, it could be that these are coded numerically - they must be encoded as factors. Also, the reference level has to be the same in each case.

ADD REPLYlink written 9 weeks ago by Kevin Blighe39k

Sorry Kevin, as I mentioned individual glm gave me 5 genes as significant, I put them in a final model like

final <- glm(TRG ~ CDK6 + CXCL8 + IL6 + ISG15 + PTGS2 , data = df, family = binomial(link = 'logit'))

But summary is frustrating, says that only one gene is significant while all 5 genes have been selected individually as significant :(

> summary(final)

Call:
glm(formula = TRG ~ CDK6 + CXCL8 + IL6 + ISG15 + PTGS2, family = binomial(link = "logit"), 
    data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5696  -0.9090   0.2233   0.7125   2.0131  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)   3.2258    13.3877   0.241   0.8096  
CDK6         -2.6101     1.0860  -2.403   0.0162 *
CXCL8         0.8702     0.9783   0.889   0.3737  
IL6           0.9591     0.8883   1.080   0.2803  
ISG15         0.9574     0.5101   1.877   0.0605 .
PTGS2        -0.4623     1.1005  -0.420   0.6744  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 77.347  on 55  degrees of freedom
Residual deviance: 53.917  on 50  degrees of freedom
AIC: 65.917

Number of Fisher Scoring iterations: 6

>
ADD REPLYlink written 9 weeks ago by F3.4k
1

A gene does not require a statistically significant p-value in order for it to provide useful information in a regression model. From the information above, I would say that ISG15 shows a 'trend toward statistical significance'. Did the forward, backward, and 'both' stepwise regression procedures produce the same output?

ADD REPLYlink written 9 weeks ago by Kevin Blighe39k

No Kevin, I only put each of 23 genes individually in

glm(TRG ~ MYC, data = df, family = binomial(link = 'logit'))

And select ones returning p-value < 0.05 that gave me 5 genes that I put them in a final model

final <- glm(TRG ~ CDK6 + CXCL8 + IL6 + ISG15 + PTGS2 , data = df, family = binomial(link = 'logit'))

where the summary says only CDK6 is statistically significant

To be honest I am not sure how to judge if a gene could remain as a good predictor or not, I just read lower AIC and p-value and for AIC I am not sure based on which threshold I judge this gene could remain in model or not

ADD REPLYlink written 9 weeks ago by F3.4k
1

Sorry, when you create the model with 23 genes, you must perform stepwise regression on that. Please review the code in my original answer:

require(MASS)

null <- glm(TRG ~ 1, data = df, family = binomial(link = 'logit'))
full <- glm(TRG ~ MYC + EGFR + ... + geneZ, data = df, family = binomial(link = 'logit'))

#Step through the variables for stepwise regression
both <- stepAIC(null, scope=list(lower=null, upper=full), direction="both")
forward <- stepAIC(null, scope=list(lower=null, upper=full), direction="forward")
backward <- stepAIC(full, direction="backward")

If you are not sure about this procedure, then please consult a statistician.

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by Kevin Blighe39k
1

Thank you very much :)

ADD REPLYlink written 9 weeks ago by F3.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1981 users visited in the last hour