Question: determining cutoff for Kaplan Meier
4.9 years ago by
XD10 wrote:

I am analyzing gene X expression in the context of overall survival for a TCGA dataset. I want to take a data-driven approach that determines the optimal cutoff for maximum significance between arms (high and low).

Is this approach acceptable, and what kind of biases am I working with? I've seen numerous papers with this type of approach for determining cutoffs for KM survival analysis, but I know that there are other options for determining cutoffs, such as median or quartile extremes, or Cox instead of KM (but I really don't consider my circumstance to be a continuous variable).

Also, if I continue with the optimal cutoff... can I do permutation testing to see if it is real? What would my null be to test against... randomized gene expression values while keeping cohort size the same... randomized gene expression values with new optimal cutoffs determined (and allowing cohort size to change)...?

Thanks in advance!

survival kaplan meier • 6.5k views
modified 3.6 years ago by Tom_L340 • written 4.9 years ago by XD10

I also have a similar question. I downloaded TCGA data from the Xena browser, and when I want to compare high and low expression, it seems difficult to classify the samples.

written 3.6 years ago by zany198310

This is not an answer to the question. I'm moving it to a comment.

written 3.6 years ago by Jean-Karim Heriche24k
3.6 years ago by
SamGG20 wrote:


I'm not sure this will answer your question. As a first approach, I split the samples by gene expression quartiles, which gives 3 groups: samples below the 25th percentile, samples above the 75th percentile, and samples in between. From that grouping I get a KM plot and a p-value. As a second approach, I use the maxstat package, as nicely described at [link]. IMHO, a relevant cut point must lie between the 20th and 80th percentiles if the experimental design is roughly balanced.
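The quartile grouping and the per-group KM estimate could be sketched as follows. This is a minimal pure-Python illustration, not maxstat or any package's API: the names `quantile`, `quartile_groups` and `kaplan_meier` are made up for the example, and a real analysis would more likely use a survival library that also gives confidence intervals.

```python
def quantile(values, q):
    """Quantile by linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def quartile_groups(expr):
    """Label each sample 'low' (< 25th pct), 'high' (> 75th pct) or 'mid'."""
    q25, q75 = quantile(expr, 0.25), quantile(expr, 0.75)
    return ["low" if x < q25 else "high" if x > q75 else "mid" for x in expr]


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is 1 if the death was observed, 0 if the sample was censored.
    """
    surv, curve = 1.0, []
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, e in zip(times, events) if e and ti == t)
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve
```

To get one curve per group, call `kaplan_meier` on the times and events of the samples carrying each label from `quartile_groups`.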


written 3.6 years ago by SamGG20

Which is more suitable for TCGA data where you don't really have control over the number of patients that fall into each quartile/cut-point bin?

written 3.0 years ago by freuv20
3.6 years ago by
Tom_L340 wrote:

Considering gene expression, you should primarily rely on unsupervised approaches such as a mean or median split (commonly used). However, I would not recommend the median split, since you arbitrarily split your cohort in half and I would guess that not exactly 50% of patients survive in your analysis.

Independently of this result, I would recommend performing a differential expression analysis to see how your gene performs compared to others, and asking whether there could be a connection between the top differentially expressed genes and yours (same pathway?). Another alternative would be to investigate all genes by survival analysis. I would also recommend performing a multivariate analysis with your gene versus other clinical variables that have a significant impact on patient survival: tumour grade, size, chemotherapy, radiation therapy, etc.

Depending on your sample size (>200), you can consider generating thousands of random sub-samples (75/66/50% of your total cohort) and performing the same analysis on each. How many random trials achieve a significant survival difference, and how much worse is the P compared to the total cohort (due, in part, to the loss of statistical power associated with sub-sampling)? This will indicate how robust your expression classification is.
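The sub-sampling check could look roughly like this. It is a sketch under assumptions: a mean split recomputed inside each sub-sample, a two-group log-rank test written out by hand, and a 0.05 threshold. The function names `logrank_p` and `subsample_robustness` are invented for the example.

```python
import math
import random
import statistics


def logrank_p(times, events, in_high):
    """Two-group log-rank test; returns the chi-square (1 d.f.) p-value."""
    o_minus_e, var = 0.0, 0.0
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        n = n1 = d = d1 = 0
        for ti, e, g in zip(times, events, in_high):
            if ti >= t:                 # still at risk just before time t
                n += 1
                n1 += bool(g)
                if e and ti == t:       # observed death at time t
                    d += 1
                    d1 += bool(g)
        o_minus_e += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if var == 0:
        return 1.0
    return math.erfc(math.sqrt(o_minus_e ** 2 / var / 2))


def subsample_robustness(times, events, expr, fraction=0.75,
                         n_trials=1000, alpha=0.05, seed=0):
    """Fraction of random sub-samples with p < alpha, plus the median p."""
    rng = random.Random(seed)
    k = int(fraction * len(expr))
    p_values = []
    for _ in range(n_trials):
        idx = rng.sample(range(len(expr)), k)
        sub_expr = [expr[i] for i in idx]
        cut = sum(sub_expr) / k         # assumed rule: mean split per sub-sample
        p_values.append(logrank_p([times[i] for i in idx],
                                  [events[i] for i in idx],
                                  [x > cut for x in sub_expr]))
    significant = sum(p < alpha for p in p_values) / n_trials
    return significant, statistics.median(p_values)
```

Running it once per fraction (0.75, 0.66, 0.50) and comparing the median p with the full-cohort p gives a rough picture of how much of the loss is just reduced power.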

Lastly, you can test multiple (if not all) cut-offs. Is there a significant value and, if so, why this specific value? Can you subset your cohort based on this value and see whether another gene's expression or a clinical variable fits this survival difference?

The approach you describe makes sense, but you will not find the solution with a single test.
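Scanning many cut-offs and keeping the best one inflates significance, so the best statistic needs its own null. One way to sketch this, which is also one possible answer to the permutation question above, is to shuffle the expression values against the survival data and repeat the whole cut-off scan in every shuffle, letting group sizes change. This is an illustration under assumptions rather than what maxstat does internally: the names `logrank_chi2`, `best_cutoff` and `permutation_p` are invented, and the 20th to 80th percentile window is borrowed from the suggestion earlier in the thread.

```python
import random


def logrank_chi2(times, events, in_high):
    """Two-group log-rank chi-square statistic (1 d.f.)."""
    o_minus_e, var = 0.0, 0.0
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        n = n1 = d = d1 = 0
        for ti, e, g in zip(times, events, in_high):
            if ti >= t:                 # still at risk just before time t
                n += 1
                n1 += bool(g)
                if e and ti == t:       # observed death at time t
                    d += 1
                    d1 += bool(g)
        o_minus_e += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return o_minus_e ** 2 / var if var > 0 else 0.0


def best_cutoff(times, events, expr, lo=0.2, hi=0.8):
    """Scan observed values between the lo and hi quantiles.

    Returns (largest chi-square, cutoff); 'high' means expr > cutoff.
    """
    ranked = sorted(expr)
    n = len(ranked)
    best_stat, best_cut = 0.0, None
    for c in sorted(set(ranked[int(lo * n):int(hi * n) + 1])):
        stat = logrank_chi2(times, events, [x > c for x in expr])
        if stat > best_stat:
            best_stat, best_cut = stat, c
    return best_stat, best_cut


def permutation_p(times, events, expr, n_perm=1000, seed=0):
    """Permutation p-value for the maximally selected log-rank statistic."""
    rng = random.Random(seed)
    observed, cutoff = best_cutoff(times, events, expr)
    shuffled = list(expr)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)           # break the expression/survival link
        exceed += best_cutoff(times, events, shuffled)[0] >= observed
    return cutoff, observed, (exceed + 1) / (n_perm + 1)
```

Because the cut-off is re-optimised in every shuffle, the returned p-value accounts for the selection step; comparing the observed statistic against shuffles evaluated at the fixed, already-chosen cut-off would not.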

Hope this helps.


written 3.6 years ago by Tom_L340