Question

Building a predictive model by a list of genes

5

Entering edit mode

5.3 years ago

zizigolu ★ 4.3k

I have raw read counts of 56 patients who have been ranked with Mandard score as responders and non responders;

I have done differential expression by DESeq2 and I obtained a list of DEGs between responders and non responders; How I can build a model by these genes that is able to predict responders and non responders? I mean How I can train a model by these genes to stratify patients into groups of responders or non-responses.

Gene    Fold Change TRG
    CHGA    -1.5652029  TRG 4-5 Signature
    IL1B    -1.3159235  TRG 4-5 Signature
    CXCL8   -1.231194003    TRG 4-5 Signature
    ALB -1.2116047  TRG 4-5 Signature
    PTGS2   -1.1674647  TRG 4-5 Signature
    CSF3    -1.12757675 TRG 4-5 Signature
    OSM -1.113489387    TRG 4-5 Signature
    IL6 -1.107777024    TRG 4-5 Signature
    ISG15   -1.070607929    TRG 4-5 Signature
    IFIT1   -1.0214389  TRG 4-5 Signature
    CT45_family -1.0097384  TRG 4-5 Signature
    CXCL6   -1.003206221    TRG 4-5 Signature
    MAGEA1  -1.001533309    TRG 4-5 Signature
    S100A7A 1.0501062   TRG 1-2 Signature 
    KIF5C   1.146357    TRG 1-2 Signature 
    HLA-DQA2    1.1829847   TRG 1-2 Signature 
    KRT14   1.338032    TRG 1-2 Signature 
    MPPED1  1.3464259   TRG 1-2 Signature 
    CDK6    1.43895365  TRG 1-2 Signature 
    KLK5    1.470414    TRG 1-2 Signature 
    ANKRD30A    1.4728634   TRG 1-2 Signature 
    CALML5  1.4947388   TRG 1-2 Signature

For example for building a linear regression model I should have hypothesis that could be if the expression of gene A in a patient gets higher than this cut-off, this patient would graded as responders but I don't have any threshould

Please look at the mean of expression of top 20 differentially expressed genes in each group; Their expression seems pretty close to each other so I am not sure how to report a strong signature of genes that are able to predict responders and non-responders

> rowMeans(TRG12)
      ALB    CALML5      CDK6      CHGA      CSF3     CXCL6     CXCL8      IL1B       IL6     ISG15      KLK5     KRT14 
 6.234833  7.638252 10.358273  7.116288  7.646012  9.482543  9.047842  9.803639  7.811549 11.651569  9.054484 11.636721 
   MAGEA1       OSM     PTGS2   S100A7A 
 6.043237  8.052714  9.228111  7.885017 
> rowMeans(TRG45)
      ALB    CALML5      CDK6      CHGA      CSF3     CXCL6     CXCL8      IL1B       IL6     ISG15      KLK5     KRT14 
 6.531966  7.148282  9.771275  8.061407  7.976051  9.833902  9.530818 10.199655  8.471593 12.278263  8.762199 10.925208 
   MAGEA1       OSM     PTGS2   S100A7A 
 6.209878  8.223946  9.703970  7.565160

R RNA-Seq Machine learning • 6.4k views

ADD COMMENT • link updated 2.9 years ago by Kevin Blighe 87k • written 5.3 years ago by zizigolu ★ 4.3k

GenoMax · Accepted Answer · 2019-01-10

17

Entering edit mode

5.3 years ago

Kevin Blighe 87k

With DESeq2, import your raw counts, normalise them, and then transform these normalised counts via the vst or rlog transformations, and then use the resulting data for the model building. For the modelling, your response (y) variable is then TRG (encoded as a factor) and the predictors (x) are the DEGs.

That is, get your data like this:

df
              TRG    MYC   EGFR  KIF2C CDC20
    Sample 1  TRG12  11.39 10.62  9.75 10.34
    Sample 2  TRG12  10.16  8.63  8.68  9.08
    Sample 3  TRG12   9.29 10.24  9.89 10.11
    Sample 4  TRG45  11.53  9.22  9.35  9.13
    Sample 5  TRG45   8.35 10.62 10.25 10.01
    Sample 6  TRG45  11.71 10.43  8.87  9.44
    ...

Then, either test each gene independently in a separate model and keep the best predictors:

glm(TRG ~ MYC, data = df, family = binomial(link = 'logit'))
glm(TRG ~ EGFR, data = df, family = binomial(link = 'logit'))
*et cetera*

...or build a combined model and reduce the predictors via stepwise regression.

require(MASS)

null <- glm(TRG ~ 1, data = df, family = binomial(link = 'logit'))
full <- glm(TRG ~ MYC + EGFR + ... + geneZ, data = df, family = binomial(link = 'logit'))

#Step through the variables for stepwise regression
both <- stepAIC(null, scope=list(lower=null, upper=full), direction="both")
forward <- stepAIC(null, scope=list(lower=null, upper=full), direction="forward")
backward <- stepAIC(full, direction="backward")

Once you have your final list of best predictors, build a final model that contains the best predictors from the steps above:

final <- glm(TRG ~ CDK6 + CHGA + CSF3 + CXCL8, data = df, family = binomial(link = 'logit'))

With the final model, you should perform cross validation and ROC analysis. You can also test sensitivity, specificity, precision, and accuracy, and perform model predictions on independent datasets. See here:

Kevin

ADD COMMENT • link 4.0 years ago by Kevin Blighe 87k

0

Entering edit mode

Thanks a lot, very helpful as always;

ADD REPLY • link 5.3 years ago by zizigolu ★ 4.3k

1

Entering edit mode

It means that the function cannot find those genes (CDK6 and OAZ1) in your data-frame, df. Please check the column names of your data-frame:

colnames(df)

ADD REPLY • link 5.3 years ago by Kevin Blighe 87k

0

Entering edit mode

Hi Kevin,

I tested each of my genes independently in a separate model like

 glm(TRG ~ CDK6, data = df, family = binomial(link = 'logit'))

and kept 5 genes as best predictors usin the least p value and AIC by which I built final model using

final <- glm(TRG ~ CDK6 + CHGA + CSF3 + CXCL8 + IL6, data = df, family = binomial(link = 'logit'))

I obtained this ROC curve from final model

enter image description here

Then I used your tutorial on lasso model in this post

A: How to exclude some of breast cancer subtypes just by looking at gene expressio by which I obtained 3 significant genes from which only CDK6 is common between two models

I obtained this ROC

enter image description here

My question of you please is; How I could know significant predictors from which model is more correct? Also, is it ususl that I am obtaining only 3 significant predictors from 23 differentially genes?

To be honest, without your tutorials I would need months to head around to get idea

Thank you man

ADD REPLY • link 5.3 years ago by zizigolu ★ 4.3k

1

Entering edit mode

Without rigorous testing (on other datasets and on real-life samples), it is difficult to know which model is best. However, in your results, you can present multiple different models and state the AUC from ROC analysis for each.

You should also be applying cross validation to these models (cv.glm()) and then checking the Delta (Δ) values. For a 'good' model, change in Δ should be small; however, if the model is 'bad', then change in Δ will be large.

#Perform cross validation (CV) analysis
#The delta values should not greatly differ
#K=number of samples, i.e., leave-one-out CV.
library(boot)
cv.glm(MyData, MyModel, K=nrow(MyData))$delta

K here is the number of 'nfolds'. You may want to read more about cross validation in order to understand this. Generally, the minimum value for K should be 3. When K = number of samples in dataset, it is called 'leave-one-out' cross validation.

--------------------------------

Edit: if you used cv.glmnet(), then your lasso model is already cross-validated (this may be why there are only 3 predictors). For your glm() model from stewpwise regression, however, you need to cross-validate this with cv.glm(), as I explain above.

More here: A: Multinomial elastic net implementation on microarray dataset , error at lognet a

ADD REPLY • link 5.3 years ago by Kevin Blighe 87k

0

Entering edit mode

I could not thank you enough. When I am building my data table as you suggest and I put TRG as dependent variable with two levels TRG12 and TRG45 I am getting a result totally different with when I am putting 1 and 0 as levels of my TRG. I guessed for logistic regression numeric or integer values like 1 , 0 or YES and NO would return the same thing. Please correct me if possible

ADD REPLY • link 5.3 years ago by zizigolu ★ 4.3k

1

Entering edit mode

When you convert to 1 and 0, it could be that these are coded numerically - they must be encoded as factors. Also, the reference level has to be the same in each case.

ADD REPLY • link 5.3 years ago by Kevin Blighe 87k

0

Entering edit mode

Sorry Kevin, as I mentioned individual glm gave me 5 genes as significant, I put them in a final model like

final <- glm(TRG ~ CDK6 + CXCL8 + IL6 + ISG15 + PTGS2 , data = df, family = binomial(link = 'logit'))

But summary is frustrating, says that only one gene is significant while all 5 genes have been selected individually as significant :(

> summary(final)

Call:
glm(formula = TRG ~ CDK6 + CXCL8 + IL6 + ISG15 + PTGS2, family = binomial(link = "logit"), 
    data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5696  -0.9090   0.2233   0.7125   2.0131  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)   3.2258    13.3877   0.241   0.8096  
CDK6         -2.6101     1.0860  -2.403   0.0162 *
CXCL8         0.8702     0.9783   0.889   0.3737  
IL6           0.9591     0.8883   1.080   0.2803  
ISG15         0.9574     0.5101   1.877   0.0605 .
PTGS2        -0.4623     1.1005  -0.420   0.6744  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 77.347  on 55  degrees of freedom
Residual deviance: 53.917  on 50  degrees of freedom
AIC: 65.917

Number of Fisher Scoring iterations: 6

>

ADD REPLY • link 5.3 years ago by zizigolu ★ 4.3k

1

Entering edit mode

A gene does not require a statistically significant p-value in order for it to provide useful information in a regression model. From the information above, I would say that ISG15 shows a 'trend toward statistical significance'. Did the forward, backward, and 'both' stepwise regression procedures produce the same output?

ADD REPLY • link 5.3 years ago by Kevin Blighe 87k

0

Entering edit mode

No Kevin, I only put each of 23 genes individually in

glm(TRG ~ MYC, data = df, family = binomial(link = 'logit'))

And select ones returning p-value < 0.05 that gave me 5 genes that I put them in a final model

final <- glm(TRG ~ CDK6 + CXCL8 + IL6 + ISG15 + PTGS2 , data = df, family = binomial(link = 'logit'))

where the summary says only CDK6 is statistically significant

To be honest I am not sure how to judge if a gene could remain as a good predictor or not, I just read lower AIC and p-value and for AIC I am not sure based on which threshold I judge this gene could remain in model or not

ADD REPLY • link 5.3 years ago by zizigolu ★ 4.3k

1

Entering edit mode

Sorry, when you create the model with 23 genes, you must perform stepwise regression on that. Please review the code in my original answer:

require(MASS)

null <- glm(TRG ~ 1, data = df, family = binomial(link = 'logit'))
full <- glm(TRG ~ MYC + EGFR + ... + geneZ, data = df, family = binomial(link = 'logit'))

#Step through the variables for stepwise regression
both <- stepAIC(null, scope=list(lower=null, upper=full), direction="both")
forward <- stepAIC(null, scope=list(lower=null, upper=full), direction="forward")
backward <- stepAIC(full, direction="backward")

If you are not sure about this procedure, then please consult a statistician.

ADD REPLY • link 5.3 years ago by Kevin Blighe 87k

1

Entering edit mode

Thank you very much :)

ADD REPLY • link 5.3 years ago by zizigolu ★ 4.3k

0

Entering edit mode

Hi @Kevin. Thank you very much for all your helpful answers. I built a combined model for my data

null <- glm(Group ~ 1, data = df, family = binomial(link = 'logit'))
full <- glm(Group ~ ASPH + ATP5A1 + ATP5G3 + COL24A1 + DCAF10 + DES + DNMT1 +
              EIF2B2 + ETS2 + FLJ38773 + FLNA + HCAR1 + HK2 + HMGB2 + HNRNPDL +
              HSPA1A + HSPE1 + ILF3 + IQGAP3 + IRF2BPL + JCHAIN +
              KPNA2 + LOC102546294 + MAP7D2 + MED31 + MSL1 + OLFM4 +
              PCK1 + PHYKPL + PROSC + PTRF + REG3A + RNF7 + SEC22B +
              SF1 + SLC2A3 + SOX9 + SRI + STRN3 + TM2D1 + TPT1 + UQCRFS1 +
              ZCCHC8 + ZNF638 + ZNF761, data = df, family = binomial(link = 'logit'))

I got these warnings:

Warning messages: 1: glm.fit: algorithm did not converge 2: glm.fit: fitted probabilities numerically 0 or 1 occurred

I searched and came across with this link I changed the code to this:

full <- glm(Group ~ ASPH + ATP5A1 + ATP5G3 + COL24A1 + DCAF10 + DES + DNMT1 +
              EIF2B2 + ETS2 + FLJ38773 + FLNA + HCAR1 + HK2 + HMGB2 + HNRNPDL +
              HSPA1A + HSPE1 + ILF3 + IQGAP3 + IRF2BPL + JCHAIN +
              KPNA2 + LOC102546294 + MAP7D2 + MED31 + MSL1 + OLFM4 +
              PCK1 + PHYKPL + PROSC + PTRF + REG3A + RNF7 + SEC22B +
              SF1 + SLC2A3 + SOX9 + SRI + STRN3 + TM2D1 + TPT1 + UQCRFS1 +
              ZCCHC8 + ZNF638 + ZNF761, data = df, family=binomial,
            control = list(maxit = 50))

The first warning gone but the second one still remained. Could you please tell how I can fix it?

ADD REPLY • link 4.6 years ago by Raheleh ▴ 260

0

Entering edit mode

In layman's terms, it means that there are too many parameters in your model. It could be mitigated by increasing your sample n. Are you aiming to reduce the model parameters?; do you need all of these genes?

ADD REPLY • link 4.6 years ago by Kevin Blighe 87k

0

Entering edit mode

Thank you @Kevin! Yes I want to reduce the predictors via stepwise regression as you said. I have 66 samples (46 non-relapse and 20 relapse). These 45 genes are the genes that I got from differential gene expression analysis. I reduced the number of genes to 24 genes, but still get the same warning. What do you think I should do?

ADD REPLY • link 4.6 years ago by Raheleh ▴ 260

0

Entering edit mode

In that case, you may consider using, instead, a regularised regression, i.e., with the lasso penalty. Have you used that previously?

ADD REPLY • link 4.6 years ago by Kevin Blighe 87k

0

Entering edit mode

Many thanks @Kevin for taking the time to address my questions. No, I've never used it before. I really appreciate if you help me with that.

ADD REPLY • link 4.6 years ago by Raheleh ▴ 260

1

Entering edit mode

Hey, there is a really great tutorial, here: https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html

I have also posted some material on Biostars, so, if you search, you should find it

ADD REPLY • link 4.6 years ago by Kevin Blighe 87k

0

Entering edit mode

Thank you Kevin. I'll go through that. Could you please what is the reason that I got this warning and how lasso penalty will solve it?

ADD REPLY • link 4.6 years ago by Raheleh ▴ 260

1

Entering edit mode

Basically your previous model was likely 'too good' for the data that you have. This issue can occur when you have too many parameters in the model and / or when you have a small dataset.

Regression with the lasso penalty introduces a penalty / weighting for each parameter, which allows it to work with such problematic models as these. I give a very quick overview here: A: How to exclude some of breast cancer subtypes just by looking at gene expressio

ADD REPLY • link 4.6 years ago by Kevin Blighe 87k

0

Entering edit mode

I saw your comprehensive post here about how to build a lasso-penalised model. I did based on your instruction and I got this coefficient for each gene:

$`non-Relapse`
46 x 1 sparse Matrix of class "dgCMatrix"
                       1
(Intercept)   9.97551314
ASPH          .         
ATP5A1        .         
ATP5G3        .         
COL24A1       0.05181047

DCAF10        .         
DES           .         
DNMT1         .         
EIF2B2        .         
ETS2          .         
FLJ38773      .         
FLNA         -0.72130732

HCAR1        -0.08396882

HK2           .         
HMGB2         .         
HNRNPDL       .         
HSPA1A        0.05883314

HSPE1         .         
ILF3          .         
IQGAP3        .         
IRF2BPL       .         
JCHAIN        0.05657491

KPNA2         .         

LOC102546294 -0.09527344

MAP7D2       -0.10777351

MED31         .         
MSL1          .         
OLFM4         .         
PCK1          .         
PHYKPL        .         
PROSC         .         
PTRF          .         
REG3A         .         
RNF7          .         
SEC22B        .         
SF1           .         
SLC2A3       -0.15121214

SOX9          0.02238456

SRI           .         
STRN3         .         
TM2D1         .         
TPT1          .         
UQCRFS1       .         
ZCCHC8        .         
ZNF638        .         
ZNF761        .         

$Relapse
46 x 1 sparse Matrix of class "dgCMatrix"
                       1

(Intercept)  -9.97551314

ASPH          .         
ATP5A1        .         
ATP5G3        .         
COL24A1      -0.05181047

DCAF10        .         
DES           .         
DNMT1         .         
EIF2B2        .         
ETS2          .         
FLJ38773      .         

FLNA          0.72130732

HCAR1         0.08396882

HK2           .         
HMGB2         .         
HNRNPDL       .         
HSPA1A       -0.05883314

HSPE1         .         
ILF3          .         
IQGAP3        .         
IRF2BPL       .         
JCHAIN       -0.05657491

KPNA2         .         
LOC102546294  0.09527344

MAP7D2        0.10777351

MED31         .         
MSL1          .         
OLFM4         .         
PCK1          .         
PHYKPL        .         
PROSC         .         
PTRF          .         
REG3A         .         
RNF7          .         
SEC22B        .         
SF1           .         
SLC2A3      0.15121214

SOX9         -0.02238456

SRI           .         
STRN3         .         
TM2D1         .         
TPT1          .         
UQCRFS1       .         
ZCCHC8        .         
ZNF638        .         
ZNF761        .

These are the result based on metrics: min λ (lamba) . sorry for naive question. I am new to this field. could you please tell me what is the meaning of those dots? In your post you mentioned that "The best predictors will generally be those that have the smallest (possibly zero) coefficient values". You provided two coefficients "co" and "co_se". based on which one we have to select the best predictor? Many thanks!

ADD REPLY • link updated 4.6 years ago by GenoMax 141k • written 4.6 years ago by Raheleh ▴ 260

0

Entering edit mode

Hi Kevin Blighe

I have one question, related to the above answer. After cross-validation If I don't have any independent datasets to perform model prediction, will the prediction be valuable and publishable?

Thanks

ADD REPLY • link 2.9 years ago by sontiroy • 0

1

Entering edit mode

Should be able to publish it... somewhere. If you're in a reputable group, you can publish even dud [yet sensationalistic] stuff with Nature Publishing Group. In the case where no external dataset is available, one should split the current dataset into training and validation, although the cross validation procedure is essentially doing this for you anyway.

ADD REPLY • link 2.9 years ago by Kevin Blighe 87k