How to do survival analysis?
3.3 years ago
wenbinm ▴ 40

Hi there,

I would like to find genes correlated with poor prognosis. I am doing a simple survival analysis:

1. Divide patients into two groups by gene expression (using the median as the cutoff).
2. Find genes significantly associated with overall survival time (using the coxph function in R).
3. Check whether the genes in my list are up- or down-regulated in cancer samples compared to normal samples.
4. Find genes with a hazard ratio larger than 1 (the low-expression group lives longer) that are up-regulated in cancer samples, and genes with a hazard ratio smaller than 1 that are down-regulated in tumors.
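
Steps 1 and 2 can be sketched for a single gene roughly as follows. This is only an illustration in plain Python, not the coxph call itself: it swaps in a two-group log-rank test, which is closely related to the score test of a Cox model with one binary covariate. The function names (`median_split`, `logrank_test`) and the eight-patient toy data are made up for the example.

```python
import math
from statistics import median

def median_split(expression):
    """Return True for patients whose expression is above the cohort median."""
    cutoff = median(expression)
    return [x > cutoff for x in expression]

def logrank_test(times, events, high):
    """Two-group log-rank test; events[i] is 1 for death, 0 for censored."""
    observed_high = 0.0
    expected_high = 0.0
    variance = 0.0
    # Visit each distinct time at which at least one death occurred
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        at_risk = [i for i, ti in enumerate(times) if ti >= t]
        deaths = [i for i in at_risk if times[i] == t and events[i]]
        n, d = len(at_risk), len(deaths)
        n_high = sum(1 for i in at_risk if high[i])
        observed_high += sum(1 for i in deaths if high[i])
        expected_high += d * n_high / n
        if n > 1:
            # Hypergeometric variance of the death count in the high group
            variance += d * (n_high / n) * (1 - n_high / n) * (n - d) / (n - 1)
    chi2 = (observed_high - expected_high) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square upper tail, 1 df
    return observed_high, expected_high, chi2, p_value

# Hypothetical toy data for one gene (not real patients)
expression = [2.1, 8.4, 3.3, 9.0, 1.5, 7.7, 2.9, 8.8]
times = [60, 12, 48, 8, 72, 15, 55, 10]   # months of follow-up
events = [0, 1, 1, 1, 0, 1, 0, 1]         # 1 = died, 0 = censored

high = median_split(expression)
observed, expected, chi2, p_value = logrank_test(times, events, high)
print(f"deaths in high group: {observed:.0f} observed vs {expected:.2f} expected")
print(f"log-rank chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

More deaths observed than expected in the high-expression group points in the same direction as a hazard ratio above 1 for high versus low expression. In a real analysis this would be repeated per gene, and the hazard ratio itself would still come from the Cox model.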

Am I doing it right? Is the 4th step necessary? Must the genes with hazard ratio larger than 1 be up regulated in tumor compared to normal tissue (or the hazard ratio won't make any sense)?

Thank you!

prognosis survival analysis cancer • 1.8k views
2.8 years ago

With survival analysis using gene expression data, there are many possible ways to do it. Your method seems to be fine, generally.

A few words of advice: you cannot focus only on whether a gene's HR is greater or less than 1. The HR also has to come with a statistically significant p-value; the log-rank p-value is usually chosen. You should also check the lower and upper bounds of the confidence interval (CI), at least at the 95% confidence level. If you have HR = 1.5, for example, but the lower bound of the CI is 0.7, then the interval includes 1 and the result will likely not have a statistically significant p-value.
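
To make the HR = 1.5 example concrete, here is a small sketch of how a hazard ratio, its 95% CI, and a p-value relate. It assumes the usual Wald construction (a normal approximation on the log hazard ratio), so the p-value shown is a Wald p-value, not a log-rank one, and the standard error is simply backed out from the numbers in the example. The helper name `hazard_ratio_summary` is made up for illustration.

```python
import math

def hazard_ratio_summary(beta, se, z_crit=1.96):
    """Summarise a Cox coefficient (log hazard ratio) and its standard error."""
    hr = math.exp(beta)
    lower = math.exp(beta - z_crit * se)
    upper = math.exp(beta + z_crit * se)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return hr, lower, upper, p_value

# Standard error implied by HR = 1.5 with a lower 95% bound of 0.7
beta = math.log(1.5)
se = (beta - math.log(0.7)) / 1.96

hr, lower, upper, p_value = hazard_ratio_summary(beta, se)
print(f"HR = {hr:.2f} (95% CI {lower:.2f}, {upper:.2f}); Wald p = {p_value:.3f}")
```

Because the interval and the Wald p-value come from the same normal approximation, a 95% CI that includes 1 goes hand in hand with a Wald p-value above 0.05.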

Also, describing genes as 'up-regulated' on the basis of their HRs is not common. Up-regulation and down-regulation belong more to the realm of differential expression analysis. With survival, you can just say things like 'high expression of the gene is associated with a higher risk of MyDisease (HR (95% CI): X (Y, Z); p=0.0005)'.

I posted a tutorial that will likely assist you: Survival analysis with gene expression

Kevin

I agree with the answer, but I would just add that you need to take multiple testing into account! If you are testing all genes to see if they correlate with survival, you are doing around 20k hypothesis tests. The probability of finding something "statistically significant" just by chance, without a real relationship between the gene and survival, is very high. You need to correct for multiple testing to take that into account.
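
As a rough illustration: with 20,000 tests at a 0.05 threshold, about 1,000 genes would pass by chance alone even if none were truly related to survival. One common correction is the Benjamini-Hochberg false discovery rate procedure (the same adjustment as `p.adjust(p, method = "BH")` in R). Below is a minimal plain-Python sketch; the function name and the five example p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-gene p-values from the survival tests
raw_p = [0.0002, 0.01, 0.02, 0.2, 0.5]
adj_p = benjamini_hochberg(raw_p)
significant = [p <= 0.05 for p in adj_p]

for raw, adj, keep in zip(raw_p, adj_p, significant):
    print(f"raw p = {raw:<7} adjusted p = {adj:.4f} significant = {keep}")
```

Genes would then be selected on the adjusted p-values instead of the raw ones.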

That is indeed correct, bernatgel. Thanks! ¡Gracias!