Question

Building a predictive model by a list of genes and survival information

22 months ago

enriqp02 ▴ 30

Hello everybody. I'm doing my PhD and I'm fairly new to this field. I have gene expression data for a group of patients before and after treatment. I performed a differential expression analysis and found 1740 overexpressed genes.

I then extracted the expression of those genes from the "before treatment" samples, with the idea of seeing which of the overexpressed genes have an effect on patient survival.

For this I have this kind of table:

  Patient   OS(months)   Death   P2RY8     BRAF      ...
  P1        12           1       7.96891   6.9009
  P2        32           0       7.51238   6.39389
  P3        22           1       7.51238   7.39389
  P4        32           0       6.96891   4.9009
  P5        24           1       5.96891   3.9009
  P6        33           1       6.96891   6.700

Those numbers correspond to the expression levels after normalization with RMA.

I have seen this post by Kevin Blighe, but I don't want to look at the effect of each gene individually; I want a multivariate analysis.

Survival analysis with gene expression

Do you have any idea how I can do this or if there are any tutorials?

Thank you in advance.

predictive genes model survival • 876 views

updated 22 months ago by Jeremy ▴ 910 • written 22 months ago by enriqp02 ▴ 30

score 3 · Answer 1 · 2022-06-22

To make a predictive model, you will first want to split your data into train and test sets. You can then use the training data to try random forest, regression, gradient boosting, and parametric survival regression models.
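A minimal sketch of such a split in base R. The toy data frame below is a hypothetical stand-in for the real table (only the column names follow the question), and the 70/30 ratio is an arbitrary choice, not something the question prescribes.

```r
set.seed(22)

# Toy stand-in for the real table: hypothetical values, real column layout
n <- 40
survival <- data.frame(
  Patient = paste0('P', seq_len(n)),
  OS      = round(runif(n, 6, 36)),
  Death   = rbinom(n, 1, 0.6),
  P2RY8   = rnorm(n, 7, 0.7),
  BRAF    = rnorm(n, 6, 1.2)
)

# Randomly hold out 30% of the patients as the test set
test.idx <- sample(nrow(survival), size = round(0.3 * nrow(survival)))
train <- survival[-test.idx, ]
test  <- survival[test.idx, ]

dim(train)
dim(test)
```

caret's createDataPartition() is an alternative if you want the split stratified on the outcome.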

In R:

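library(randomForest)
library(gbm)
library(caret)      # provides varImp()
library(survival)   # provides Surv() and survreg()

set.seed(22)

# Assumes the columns are named Patient, OS, Death, then one column per gene
survival = read.csv('survival.csv')

# Random forest regression of OS on gene expression
rf.model = randomForest(OS ~ . - Patient - Death, data = survival)
varImp(rf.model)

# Linear regression (glm defaults to family = gaussian)
reg.model = glm(OS ~ . - Patient - Death, data = survival)
varImp(reg.model)

# Gradient boosting
gbm.model = gbm(OS ~ . - Patient - Death, data = survival, distribution = 'gaussian', n.minobsinnode = 1)
summary(gbm.model)

# Parametric survival regression (Weibull by default).
# Surv(OS) treats every patient as having died at OS;
# Surv(OS, Death) would instead treat Death = 0 as censored.
surv.model = survreg(Surv(OS) ~ . - Patient - Death, data = survival)
summary(surv.model)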

Note that I excluded Patient and Death from the OS model since Patient is irrelevant to OS and Death is too similar. You just want to use gene expression to predict survival. You can use the same methods to predict Death, just indicate family = 'binomial' in the glm model. In that case, you will want to exclude OS from the model. n.minobsinnode in the gbm model will probably need to be optimized.
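As a rough sketch of the Death model described above, the first call below fits a binomial glm with OS excluded. The second call is a variation on the answer's code rather than part of it: Surv() with a single argument treats every OS value as an observed death, so passing the event indicator as Surv(OS, Death) lets patients with Death = 0 be treated as censored. The data are simulated and purely hypothetical.

```r
library(survival)

set.seed(22)

# Simulated stand-in for the real table (hypothetical values)
n <- 60
BRAF  <- rnorm(n, 6, 1.2)
P2RY8 <- rnorm(n, 7, 0.7)
survival <- data.frame(
  Patient = paste0('P', seq_len(n)),
  OS      = round(runif(n, 6, 36)),
  Death   = rbinom(n, 1, plogis(BRAF - 6)),
  P2RY8   = P2RY8,
  BRAF    = BRAF
)

# Logistic regression for Death, leaving out Patient and OS
death.model <- glm(Death ~ . - Patient - OS, data = survival, family = 'binomial')
summary(death.model)

# Fitted probability of death for each patient
death.prob <- predict(death.model, type = 'response')

# Censoring-aware parametric survival model: Death is the event indicator.
# Patient is dropped from the data so that '.' expands to the genes only.
cens.model <- survreg(Surv(OS, Death) ~ ., data = survival[, names(survival) != 'Patient'])
summary(cens.model)
```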

Once you have your models, you can use the test data to make predictions, which you can then compare to the actual values. You can use error (e.g. root mean square error (RMSE) or mean absolute error (MAE)) or a confusion matrix to compare the four models. RMSE would be for the OS models, and a confusion matrix would be for the Death models.
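The evaluation step could look roughly like this, using glm for both outcomes to keep the example in base R. The data are simulated, the 0.5 probability cut-off is an arbitrary assumption, and the Patient column is dropped before fitting because a character ID column left in the data can make predict() fail on patients that were not in the training set.

```r
set.seed(22)

# Simulated stand-in for the real table (hypothetical values)
n <- 100
BRAF  <- rnorm(n, 6, 1.2)
P2RY8 <- rnorm(n, 7, 0.7)
survival <- data.frame(
  Patient = paste0('P', seq_len(n)),
  OS      = pmax(1, round(40 - 3 * BRAF + rnorm(n, 0, 4))),
  Death   = rbinom(n, 1, plogis(BRAF - 6)),
  P2RY8   = P2RY8,
  BRAF    = BRAF
)

# Train/test split without the Patient ID column
model.data <- survival[, names(survival) != 'Patient']
test.idx <- sample(n, 30)
train <- model.data[-test.idx, ]
test  <- model.data[test.idx, ]

# OS model: compare predicted and actual survival times
os.model <- glm(OS ~ . - Death, data = train)
os.pred  <- predict(os.model, newdata = test)
rmse <- sqrt(mean((test$OS - os.pred)^2))
mae  <- mean(abs(test$OS - os.pred))

# Death model: classify with a 0.5 cut-off and tabulate against the truth
death.model <- glm(Death ~ . - OS, data = train, family = 'binomial')
death.prob  <- predict(death.model, newdata = test, type = 'response')
death.pred  <- as.integer(death.prob > 0.5)
conf.mat <- table(Predicted = factor(death.pred, levels = 0:1),
                  Actual    = factor(test$Death, levels = 0:1))

rmse
mae
conf.mat
```

The same rmse, mae and conf.mat calculations apply to predictions from the other model types, so all four can be compared on the same test set.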

After picking random forest, regression, gradient boosting, or parametric survival regression, you can then optimize that model.
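As one possible illustration of that optimization step, here is a sketch that tries a few values of n.minobsinnode for gbm and keeps the one with the lowest RMSE on a validation set carved out of the training data. The candidate grid, n.trees = 200 and the simulated data are all arbitrary assumptions; caret's train() offers a more complete tuning framework.

```r
library(gbm)

set.seed(22)

# Simulated stand-in for the training data (hypothetical values, genes + OS only)
n <- 120
BRAF  <- rnorm(n, 6, 1.2)
P2RY8 <- rnorm(n, 7, 0.7)
train.all <- data.frame(
  OS    = pmax(1, round(40 - 3 * BRAF + rnorm(n, 0, 4))),
  P2RY8 = P2RY8,
  BRAF  = BRAF
)

# Carve a validation set out of the training data; the test set stays untouched
valid.idx <- sample(n, 36)
train <- train.all[-valid.idx, ]
valid <- train.all[valid.idx, ]

# Candidate values for the minimum number of observations per terminal node
candidates <- c(1, 3, 5, 10)

# Fit one model per candidate and record its validation RMSE
rmse <- sapply(candidates, function(k) {
  fit <- gbm(OS ~ ., data = train, distribution = 'gaussian',
             n.trees = 200, n.minobsinnode = k)
  pred <- predict(fit, newdata = valid, n.trees = 200)
  sqrt(mean((valid$OS - pred)^2))
})
names(rmse) <- candidates
rmse

# Value with the lowest validation error
best <- candidates[which.min(rmse)]
best
```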

For information on looking at interaction effects between variables, see the previous post below:

A first trial...

I'm not sure about online tutorials, but the book "An Introduction to Statistical Learning" by James et al. should tell you everything you need to know. It has an entire chapter dedicated to survival analysis.