Question: How to do survival analysis?
18 months ago, wenbinm wrote:

Hi there,

I would like to find genes correlated with poor prognosis. I am doing a simple survival analysis:

  1. divide patients into two groups by gene expression (using median as cutoff).
  2. find genes significantly correlated with overall survival time (using coxph function in R).
  3. check whether each gene in my list is up- or down-regulated in cancer samples compared to normal samples.
  4. find genes with a hazard ratio larger than 1 (the low expression group lives longer) that are up-regulated in cancer samples, and also genes with a hazard ratio smaller than 1 that are down-regulated in tumors.
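
Steps 1 and 2 can be sketched for a single gene as follows. This is only an illustration in plain Python, not the R `coxph` call used in the thread: it swaps in a two-group log-rank test, which is the test that `summary(coxph(...))` in R reports as "Score (logrank) test". The function names `median_split` and `logrank_test` are made up for this example, and the event coding (1 = death observed, 0 = censored) is an assumption.

```python
import math
from statistics import median


def median_split(expression):
    """Step 1: label each patient 1 (above the median) or 0 (at or below it)."""
    cutoff = median(expression)
    return [1 if value > cutoff else 0 for value in expression]


def logrank_test(times, events, groups):
    """Step 2 (swapped-in technique): two-group log-rank test.

    times  : follow-up time per patient
    events : 1 if death was observed, 0 if censored (assumed coding)
    groups : 0/1 labels, e.g. from median_split
    Returns (chi-square statistic with 1 df, p-value).
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        # Patients still under observation just before time t.
        at_risk = [(tt, e, g) for tt, e, g in zip(times, events, groups) if tt >= t]
        n = len(at_risk)
        n1 = sum(1 for _, _, g in at_risk if g == 1)
        d = sum(1 for tt, e, _ in at_risk if e == 1 and tt == t)
        d1 = sum(1 for tt, e, g in at_risk if e == 1 and tt == t and g == 1)
        # Expected deaths in group 1 if both groups shared the same hazard.
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("both groups must be at risk at some event time")
    chi2 = obs_minus_exp ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

In a real analysis this would be run once per gene, and the resulting p-values would still need a multiple-testing correction before any gene is called significant.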

Am I doing it right? Is the 4th step necessary? Must the genes with hazard ratio larger than 1 be up regulated in tumor compared to normal tissue (or the hazard ratio won't make any sense)?
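
If step 4 is kept, the filter itself is simple. The sketch below assumes that each gene already has a hazard ratio for high versus low expression (low expression as the reference group), a multiple-testing-adjusted p-value, and a tumor-versus-normal log2 fold change. The gene names, the numbers, and the dictionary keys are all hypothetical.

```python
def concordant_genes(results, alpha=0.05):
    """Keep significant genes whose survival and expression directions agree.

    Each entry of `results` is a dict with the (hypothetical) keys:
    gene, hr (high vs. low expression), adj_p, log2fc (tumor vs. normal).
    """
    picked = []
    for r in results:
        if r["adj_p"] >= alpha:
            continue  # not significant after correction, skip
        risk_and_up = r["hr"] > 1 and r["log2fc"] > 0
        protective_and_down = r["hr"] < 1 and r["log2fc"] < 0
        if risk_and_up or protective_and_down:
            picked.append(r["gene"])
    return picked


# Made-up values, for illustration only.
example_results = [
    {"gene": "GENE_A", "hr": 1.8, "adj_p": 0.01, "log2fc": 1.2},
    {"gene": "GENE_B", "hr": 1.6, "adj_p": 0.02, "log2fc": -0.9},
    {"gene": "GENE_C", "hr": 0.6, "adj_p": 0.03, "log2fc": -1.1},
    {"gene": "GENE_D", "hr": 2.0, "adj_p": 0.40, "log2fc": 2.0},
]
print(concordant_genes(example_results))
```

Genes such as the hypothetical GENE_B (significant hazard ratio above 1 but lower expression in tumors) are dropped by this filter; whether that is desirable depends on the biological question.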

Thank you!

11 months ago, Kevin Blighe wrote:

With survival analysis using gene expression data, there are many possible ways to do it. Your method seems to be fine, generally.

Just some words of advice: you cannot focus only on whether a gene's HR is greater or less than 1. Each HR also has to be accompanied by a statistically significant p-value; usually the log-rank p-value is chosen. You should also check the lower and upper bounds of the confidence interval (CI), at least at the 95% confidence level. If you have HR = 1.5, for example, but the lower bound of the 95% CI is 0.7, then the interval includes 1 (no effect) and the p-value will almost certainly not be statistically significant.
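
To make the link between the CI and the p-value concrete, here is a small worked sketch in plain Python. It assumes the interval is a Wald interval, symmetric on the log scale (the kind that `summary(coxph(...))` prints in R), and recovers the standard error and the Wald p-value from the HR and its lower bound. The log-rank p-value mentioned above comes from a different test and is usually close but not identical. The function name `wald_p_from_ci` is made up for this example.

```python
import math
from statistics import NormalDist


def wald_p_from_ci(hr, lower, level=0.95):
    """Recover the Wald z, two-sided p-value and upper bound from an HR and its lower bound.

    Assumes the interval is symmetric around log(HR) (a Wald interval).
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    se = (math.log(hr) - math.log(lower)) / z_crit  # standard error of log(HR)
    z = math.log(hr) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))      # two-sided normal tail
    upper = math.exp(math.log(hr) + z_crit * se)
    return z, p_value, upper


z, p_value, upper = wald_p_from_ci(hr=1.5, lower=0.7)
print(f"z = {z:.2f}, p = {p_value:.2f}, upper bound = {upper:.2f}")
```

For HR = 1.5 with a lower bound of 0.7, the p-value comes out at roughly 0.3, far from significant. Under the same assumption, a lower bound of exactly 1 corresponds to p = 0.05.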

Also, describing genes as 'up-regulated' on the basis of their HRs is not common. Up-regulation and down-regulation belong to the realm of differential expression analysis. With survival, you can just say things like 'high expression of the gene is associated with a higher risk of MyDisease (HR (95% CI): X (Y, Z); p=0.0005)'.

I posted a tutorial that will likely assist you: Survival analysis with gene expression



I agree with the answer, but I would add that you need to take multiple testing into account! If you are testing all genes to see whether they correlate with survival, you are doing about 20k hypothesis tests. The probability of finding something "statistically significant" just by chance, without a real relationship between the gene and survival, is very high. You need to correct for multiple testing to take that into account.
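
To put a number on it: with 20,000 tests at a 0.05 threshold, about 1,000 genes would be expected to pass by chance alone if none were truly associated with survival. One common correction is the Benjamini-Hochberg false discovery rate procedure, which is what `p.adjust(p, method = "BH")` computes in R. The plain-Python sketch below is one way to write it; the function name and the example p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

Genes would then be called significant on the adjusted values rather than on the raw p-values.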

written 11 months ago by bernatgel

That is indeed correct, bernatgel. Thanks! ¡Gracias!

written 11 months ago by Kevin Blighe