Building a predictive model by a list of genes and survival information
4 days ago
enriqp02 ▴ 30

Hello everybody. I'm doing my PhD and I'm fairly new to this field. I have gene expression data for a group of patients before and after treatment. I performed a differential expression analysis and found 1740 overexpressed genes.

I then selected those genes in the "Before the treatment" samples, with the idea of seeing which of the overexpressed genes have an effect on patient survival.

For this I have this kind of table:

  Patient | OS (months) | Death | P2RY8   | BRAF    | ...
  P1      | 12          | 1     | 7.96891 | 6.9009  |
  P2      | 32          | 0     | 7.51238 | 6.39389 |
  P3      | 22          | 1     | 7.51238 | 7.39389 |
  P4      | 32          | 0     | 6.96891 | 4.9009  |
  P5      | 24          | 1     | 5.96891 | 3.9009  |
  P6      | 33          | 1     | 6.96891 | 6.700   |

The values in the gene columns are expression levels after RMA normalization.

I have seen this post by Kevin Blighe, but I don't want to look at the effect of each gene individually; I want a multivariate analysis.

Survival analysis with gene expression

Do you have any idea how I can do this or if there are any tutorials?

Thank u in advance.

4 days ago
Jeremy ▴ 340

To make a predictive model, you will first want to split your data into train and test sets. You can then use the training data to try random forest, regression, gradient boosting, and parametric survival regression models.
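As a rough illustration of the split, here is a minimal base R sketch. The data frame is a made-up stand-in for the real table (the gene columns GENE_A and GENE_B, the 20 patients, and the 75/25 ratio are all assumptions for the example, not values from the question).

```r
# Toy stand-in for the real table: 20 made-up patients, 2 made-up gene columns
set.seed(42)                      # fixed seed so the split is reproducible
n = 20
survival = data.frame(
  Patient = paste0('P', 1:n),
  OS      = round(runif(n, 6, 36)),
  Death   = rbinom(n, 1, 0.6),
  GENE_A  = rnorm(n, 7, 1),
  GENE_B  = rnorm(n, 6, 1)
)

# Randomly assign 75% of the patients to training and the rest to testing
train.idx = sample(nrow(survival), size = floor(0.75 * nrow(survival)))
train = survival[train.idx, ]
test  = survival[-train.idx, ]

nrow(train)   # 15 patients for fitting
nrow(test)    # 5 patients held out for evaluation
```

The models would then be fitted with data = train, and test would only be used for predictions.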

In R:



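library(randomForest)
library(gbm)
library(survival)

# Assumes the survival-time column in the CSV is named OS
survival = read.csv('survival.csv')

# Random forest regression on OS
rf.model = randomForest(OS ~ . - Patient - Death, data = survival)

# Linear regression (glm defaults to family = gaussian)
reg.model = glm(OS ~ . - Patient - Death, data = survival)

# Gradient boosting; distribution is set explicitly instead of being guessed
gbm.model = gbm(OS ~ . - Patient - Death, data = survival, distribution = 'gaussian', n.minobsinnode = 1)

# Parametric survival regression; Death (1 = died) is the event indicator
surv.model = survreg(Surv(OS, Death) ~ . - Patient - Death, data = survival)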

Note that I excluded Patient and Death from the predictors of the OS models, since Patient is just an identifier and Death is too closely tied to OS. You just want to use gene expression to predict survival. You can use the same methods to predict Death; just specify family = 'binomial' in the glm model. In that case, you will want to exclude OS from the predictors instead. n.minobsinnode in the gbm model will probably need to be tuned.
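For the Death variant, a minimal sketch with glm could look like the following. The data are simulated purely for illustration (GENE_A, GENE_B, and the assumed link between GENE_A and Death are made up), and the predictors are listed explicitly rather than with the ". - Patient" shorthand, which is just one way to keep Patient and OS out of the model.

```r
# Simulated stand-in data: 40 made-up patients, 2 made-up gene columns
set.seed(1)
n = 40
toy = data.frame(
  Patient = paste0('P', 1:n),
  OS      = round(runif(n, 6, 36)),
  GENE_A  = rnorm(n, 7, 1),
  GENE_B  = rnorm(n, 6, 1)
)
# Made-up relationship: higher GENE_A raises the chance of death
toy$Death = rbinom(n, 1, plogis(toy$GENE_A - 7))

# Use only the gene columns as predictors (Patient and OS are left out)
predictors = setdiff(names(toy), c('Patient', 'OS', 'Death'))
death.formula = reformulate(predictors, response = 'Death')

# Logistic regression for Death
death.model = glm(death.formula, data = toy, family = 'binomial')

# Predicted probability of death for each patient
death.prob = predict(death.model, newdata = toy, type = 'response')
head(death.prob)
```

With the real data, the model would be fitted on the training set and predict() would be called on the test set.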

Once you have your models, you can use the test data to make predictions, which you can then compare to the actual values. You can use error (e.g. root mean square error (RMSE) or mean absolute error (MAE)) or a confusion matrix to compare the four models. RMSE would be for the OS models, and a confusion matrix would be for the Death models.
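To make the comparison concrete, here is a small sketch of the three metrics in base R. The helper functions rmse and mae are hypothetical names defined here (not from a package), and all actual and predicted values are made up.

```r
# Hypothetical helper functions, defined here for the example
rmse = function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  = function(actual, predicted) mean(abs(actual - predicted))

# Made-up OS values (months) and model predictions for five test patients
os.actual    = c(12, 32, 22, 32, 24)
os.predicted = c(15, 30, 20, 28, 27)

rmse(os.actual, os.predicted)   # about 2.9 months
mae(os.actual, os.predicted)    # 2.8 months

# Made-up Death labels and predicted probabilities, thresholded at 0.5
death.actual = c(1, 0, 1, 0, 1)
death.prob   = c(0.8, 0.3, 0.4, 0.2, 0.9)
death.pred   = as.integer(death.prob > 0.5)

# Confusion matrix: rows are actual outcomes, columns are predictions
cm = table(actual = death.actual, predicted = death.pred)
cm
```

In this toy case the confusion matrix shows two correctly predicted deaths, two correctly predicted survivors, and one death that the model missed.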

After picking random forest, regression, gradient boosting, or parametric survival regression, you can then optimize that model.

For information on looking at interaction effects between variables, see the previous post below:

A first trial...

I'm not sure about online tutorials, but the book "An Introduction to Statistical Learning" by James et al. should tell you everything you need to know. It has an entire chapter dedicated to survival analysis.


Thank u so much for your reply. I'll try it with my data and let you know ;)


You're welcome! You might also want to account for censoring in your data. See the blog post below for more information:
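As a small illustration of what censoring means here, the sketch below uses the six patients from the table in the question. It assumes that Death = 0 means the patient was still alive at last follow-up, which the question does not state explicitly.

```r
library(survival)   # provides Surv() and survfit()

# The six patients from the question's table
OS    = c(12, 32, 22, 32, 24, 33)   # follow-up time in months
Death = c(1, 0, 1, 0, 1, 1)         # assumed: 1 = died, 0 = alive at last follow-up

# Surv() pairs each time with its event indicator, so P2 and P4 are
# treated as censored at 32 months rather than as deaths at 32 months
y = Surv(OS, Death)
print(y)            # censored times are shown with a trailing "+"

# Kaplan-Meier estimate that takes the censored patients into account
km = survfit(y ~ 1)
summary(km)
```

A model that uses OS alone as the outcome would treat the 32 months of P2 and P4 as their full survival time, which understates how long they actually lived.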

Basics of Survival Analysis


Here are some online workflows that might help. The first one uses the caret package in R.

Prediction of Cancer Survival

Random Forest: METABRIC

