Question: TCGA survival analysis: continuous vs discrete expression values
Mike1.3k wrote:

Hi ,

I am trying to run a survival analysis for a gene using TCGA data. I did this both ways: with the continuous expression value, and with discrete values (Low and High, split at the median expression value). The two approaches give hugely different p-values. Can anyone tell me which way is better for survival analysis?

my command:

```
library(survival)  # provides coxph() and Surv()
coxph(Surv(time, status) ~ expression, data = survdata)
```

results:

```
HR = 0.82,  logrankP = 0.02     (discrete model)
HR = 0.87,  logrankP = 0.00001  (continuous model)
```
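For readers who want to reproduce the comparison, here is a minimal sketch of how the two models might be fitted. It runs on simulated data, not TCGA: the column names `time`, `status` and `expression` follow the command above, while `group`, the effect size and the censoring rate are assumptions made up for illustration.

```r
library(survival)

set.seed(1)
n <- 452
# Simulated stand-in for the poster's data; none of these values come from TCGA.
expression <- rnorm(n, mean = 8.7, sd = 1)
# Toy assumption: higher expression lowers the hazard (log-HR of -0.15 per unit).
event_time  <- rexp(n, rate = 0.05 * exp(-0.15 * (expression - 8.7)))
censor_time <- rexp(n, rate = 0.02)
survdata <- data.frame(
  time       = pmin(event_time, censor_time),
  status     = as.integer(event_time <= censor_time),
  expression = expression
)

# Continuous model: the HR is per one-unit increase in expression.
fit_cont <- coxph(Surv(time, status) ~ expression, data = survdata)

# Discrete model: the HR compares High against Low, split at the median.
survdata$group <- factor(
  ifelse(survdata$expression > median(survdata$expression), "High", "Low"),
  levels = c("Low", "High")
)
fit_disc <- coxph(Surv(time, status) ~ group, data = survdata)

# Score (log-rank) test p-value reported by summary() for each model
p_cont <- summary(fit_cont)$sctest["pvalue"]
p_disc <- summary(fit_disc)$sctest["pvalue"]
```

Note that the two hazard ratios are not on the same scale: one is per unit of expression, the other is High versus Low, so they are not directly comparable.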

Thanks

written 4 months ago by Mike1.3k
Kevin Blighe44k wrote:

When you convert the data to discrete values, you are eliminating information, as I elaborate here in an extreme example: A: Why quantitative design are preferred GWAS approach. In the process, you also make the data more readily interpretable to the human brain. Simply using `Low` and `High` may be too few categories; you could try introducing more.

If your data is on the continuous scale, you need to be aware of the distribution that it follows and whether you have processed it correctly.

Thanks Kevin. The expression data is log2-transformed RSEM, and this is its distribution:

https://ibb.co/pbWBV0g

The median expression value of this gene is 8.73 across 452 samples.

What if you convert that logged data to Z-scores and then trichotomise it based on that?
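A rough sketch of that suggestion, on simulated data. The cut-offs at Z = -1 and Z = +1 are an assumption (the thread does not specify any), and `expr_log2`, `group3` and the simulated survival times are hypothetical names and values used only for illustration.

```r
library(survival)

set.seed(2)
n <- 452
# Simulated stand-in for log2 RSEM values and follow-up; not real TCGA data.
expr_log2   <- rnorm(n, mean = 8.7, sd = 1)
event_time  <- rexp(n, rate = 0.05 * exp(-0.15 * (expr_log2 - 8.7)))
censor_time <- rexp(n, rate = 0.02)
d <- data.frame(
  time      = pmin(event_time, censor_time),
  status    = as.integer(event_time <= censor_time),
  expr_log2 = expr_log2
)

# Z-score: subtract the mean and divide by the standard deviation.
d$z <- as.numeric(scale(d$expr_log2))

# Trichotomise on the Z-scores; the -1 / +1 thresholds are an assumed choice.
d$group3 <- cut(d$z, breaks = c(-Inf, -1, 1, Inf),
                labels = c("Low", "Mid", "High"))

# Three-level model: Mid and High are each compared against Low.
fit_tri <- coxph(Surv(time, status) ~ group3, data = d)

# Continuous models on the raw and the Z-scored scale, for comparison.
fit_raw <- coxph(Surv(time, status) ~ expr_log2, data = d)
fit_z   <- coxph(Surv(time, status) ~ z, data = d)
p_raw <- unname(summary(fit_raw)$sctest["pvalue"])
p_z   <- unname(summary(fit_z)$sctest["pvalue"])
```

Because Z-scoring is a linear rescaling, `p_raw` and `p_z` agree in this sketch; only the units of the continuous hazard ratio change (per standard deviation instead of per log2 unit).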

I get nearly the same results using the Z-score data for the discrete (logrankP = 0.02) and continuous (logrankP = 0.00004) models.


You should check the hazard ratios too, and their confidence intervals. If, in one situation, the hazard ratio is 0.6 but the upper 95% limit passes 1.0, then that is not as reliable as a situation where the upper 95% limit is 0.8. The same is true in reverse, where the hazard ratio may be 2.9: what matters is whether the lower 95% limit drops below 1.0 or stays above it.

That is: check that the hazard ratio limits don't cross the 'barrier' of 1.0. It's just a simple extra check.
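This check can be sketched in a few lines. The example below uses the `lung` dataset that ships with the `survival` package, with `age` as a stand-in covariate; neither is related to the poster's TCGA data, and `crosses_one` is just an illustrative name.

```r
library(survival)

# `lung` is bundled with the survival package; used here only as a stand-in.
fit <- coxph(Surv(time, status) ~ age, data = lung)

# summary() reports exp(coef), i.e. the hazard ratio, with its 95% limits.
ci    <- summary(fit)$conf.int
hr    <- ci[, "exp(coef)"]
lower <- ci[, "lower .95"]
upper <- ci[, "upper .95"]

# TRUE when the interval contains 1.0, i.e. it crosses the 'barrier'.
crosses_one <- lower < 1 & upper > 1

print(round(c(HR = unname(hr), lower = unname(lower), upper = unname(upper)), 3))
print(unname(crosses_one))
```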

Thanks again. Yes, there is a difference in the HRs and their confidence intervals (lower/upper 95%):

```
HR      HRlower   HRupper
0.82    0.770     1.01      (discrete)
0.87    0.75      0.97      (continuous)
```

Looking at that, I'd assume that continuous was more reliable. I think that it's okay to derive the p-value and HRs from the continuous variable and then just plot dichotomised variables in the survival plot. You just have to clearly state what you have done in the methods.
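One way this workflow might look, again using the bundled `lung` dataset with `age` standing in for expression (an assumption for illustration only; `d`, `group` and `km` are hypothetical names): the statistics come from the continuous Cox model, and the dichotomised variable is used only to draw the Kaplan-Meier curves.

```r
library(survival)

d <- lung  # stand-in data bundled with the survival package

# Statistics (HR, CI, p-value) come from the continuous variable.
fit_cont <- coxph(Surv(time, status) ~ age, data = d)
print(summary(fit_cont)$conf.int)

# The median split is used for plotting only.
d$group <- factor(ifelse(d$age > median(d$age), "High", "Low"),
                  levels = c("Low", "High"))
km <- survfit(Surv(time, status) ~ group, data = d)

# Kaplan-Meier curves, one per group
plot(km, col = c("blue", "red"),
     xlab = "Time (days)", ylab = "Survival probability")
legend("topright", legend = levels(d$group),
       col = c("blue", "red"), lty = 1)
```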


Thanks Kevin for your help, I found a relevant article on this issue.

Comparing continuous and discrete analyses of breast cancer survival information

https://www.sciencedirect.com/science/article/pii/S0888754316300684