Question

determining cutoff for Kaplan Meier


8.2 years ago

XD ▴ 10

I am analyzing gene X expression in the context of overall survival for a TCGA dataset. I want to take a data-driven approach that determines the optimal cutoff for maximum significance between arms (high and low).

Is this approach acceptable, and what kind of biases am I working with? I've seen numerous papers that use this type of approach to determine cutoffs for KM survival analysis, but I know there are other options, such as a median split, quartile extremes, or Cox instead of KM (although I really don't consider the variable in my circumstance to be continuous).

Also, if I continue with the optimal cutoff, can I do permutation testing to see if the result is real? What would my null be to test against: randomized gene expression values while keeping cohort sizes the same, or randomized gene expression values with new optimal cutoffs determined (allowing cohort sizes to change)?

Thanks in advance!

kaplan meier survival • 11k views

ADD COMMENT • link updated 6.9 years ago by Tom_L ▴ 350 • written 8.2 years ago by XD ▴ 10


I have similar questions. I downloaded data from the Xena browser, which hosts TCGA data, and when I want to compare high and low expression, it seems difficult to classify the samples.

ADD REPLY • link 6.9 years ago by zany1983 ▴ 10


This is not an answer to the question. I'm moving it to a comment.

ADD REPLY • link 6.9 years ago by Jean-Karim Heriche 27k

score 0 · Answer 1 · 2017-06-12


6.9 years ago

SamGG ▴ 20

Hi,

I'm not sure this will answer your question. As a first approach, I split the experimental data (gene expression) at the quartiles, which gives 3 groups: samples with levels below the 25th percentile, samples above the 75th percentile, and samples in between. From that grouping I get a KM plot and a p-value. As a second approach, I use the maxstat package, as nicely described at http://r-addict.com/2016/11/21/Optimal-Cutpoint-maxstat.html. IMHO, a relevant cut point must lie between the 20th and 80th percentiles if the experimental design is roughly balanced.
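The first approach (quartile grouping followed by a KM curve per group) can be sketched in plain Python as below. This is only an illustration under stated assumptions: the function names `quartile_groups`, `kaplan_meier` and `km_by_group` are hypothetical, `events` is assumed to be coded 1 for death and 0 for censored, and the sketch stops at the survival curves (it does not compute the p-value or draw the plot).

```python
import statistics

def quartile_groups(expr):
    """Label each sample 'low' (< 25th pct), 'high' (> 75th pct) or 'mid'."""
    q1, _, q3 = statistics.quantiles(expr, n=4)  # three quartile cut points
    return ["low" if x < q1 else "high" if x > q3 else "mid" for x in expr]

def kaplan_meier(times, events):
    """Kaplan-Meier estimate: list of (time, survival) at each event time."""
    rows = sorted(zip(times, events))
    at_risk = len(rows)
    surv = 1.0
    curve = []
    i = 0
    while i < len(rows):
        t = rows[i][0]
        deaths = leaving = 0
        # Collect everyone who dies or is censored at time t.
        while i < len(rows) and rows[i][0] == t:
            deaths += rows[i][1]
            leaving += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving
    return curve

def km_by_group(expr, times, events):
    """One KM curve per quartile-based group (hypothetical helper)."""
    labels = quartile_groups(expr)
    curves = {}
    for name in ("low", "mid", "high"):
        idx = [i for i, lab in enumerate(labels) if lab == name]
        curves[name] = kaplan_meier([times[i] for i in idx],
                                    [events[i] for i in idx])
    return curves
```

Note that the group sizes fall out of the data: with ties in expression, the "low" and "high" groups need not each hold exactly a quarter of the samples.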

HTH

ADD COMMENT • link 6.9 years ago by SamGG ▴ 20


Which is more suitable for TCGA data where you don't really have control over the number of patients that fall into each quartile/cut-point bin?

ADD REPLY • link 6.2 years ago by freuv ▴ 20

score 0 · Answer 2 · 2017-06-12

For gene expression, you should primarily rely on unsupervised approaches such as a mean or median split (both commonly used). However, I would not recommend the median split, since it arbitrarily splits your cohort in half, and I would guess that not exactly 50% of the patients in your analysis survive.

Independently of this result, I would recommend performing a differential expression analysis to see how your gene performs compared to others, and asking whether there could be a connection between the top differentially expressed genes and yours (same pathway?). Another alternative would be to investigate all genes by survival analysis. I would also recommend performing a multivariate analysis of your gene together with other clinical information that has a significant impact on patient survival: tumour grade, size, chemotherapy, radiation therapy, etc.

Depending on your sample size (>200), you can consider generating thousands of random sub-samples (75/66/50% of your total cohort) and performing the same analysis on each. How many random trials achieve a significant survival difference, and how much worse is the P value compared to the total cohort (due, in part, to the loss of statistical power associated with the sub-sampling)? This will indicate how robust your expression classification is.
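A rough sketch of that sub-sampling check in plain Python follows. It rests on several assumptions that are not part of the suggestion itself: the names `logrank_p` and `subsample_robustness` are hypothetical, the log-rank test is hand-rolled for two groups rather than taken from a survival package, the cut-off is held fixed across sub-samples (re-deriving it each time is an equally valid reading), and `events` is coded 1 for death and 0 for censored.

```python
import math
import random

def logrank_p(times, events, in_high):
    """Two-group log-rank test; returns the P value (chi-square, 1 df)."""
    rows = sorted(zip(times, events, in_high))
    at_risk = len(rows)
    at_risk_high = sum(1 for r in rows if r[2])
    obs_minus_exp = 0.0
    variance = 0.0
    i = 0
    while i < len(rows):
        t = rows[i][0]
        deaths = deaths_high = leaving = leaving_high = 0
        while i < len(rows) and rows[i][0] == t:
            _, event, high = rows[i]
            leaving += 1
            leaving_high += 1 if high else 0
            if event:
                deaths += 1
                deaths_high += 1 if high else 0
            i += 1
        if deaths and at_risk > 1:
            frac = at_risk_high / at_risk
            obs_minus_exp += deaths_high - deaths * frac
            variance += (deaths * frac * (1 - frac)
                         * (at_risk - deaths) / (at_risk - 1))
        at_risk -= leaving
        at_risk_high -= leaving_high
    if variance == 0:
        return 1.0  # e.g. one group is empty: no evidence either way
    chi2 = obs_minus_exp ** 2 / variance
    return math.erfc(math.sqrt(chi2 / 2))

def subsample_robustness(expr, times, events, cutoff,
                         fraction=0.75, n_trials=1000, alpha=0.05, seed=0):
    """Fraction of random sub-samples with P < alpha, and the median P."""
    rng = random.Random(seed)
    n = len(expr)
    k = int(fraction * n)
    pvals = []
    for _ in range(n_trials):
        idx = rng.sample(range(n), k)  # sub-sample without replacement
        pvals.append(logrank_p([times[i] for i in idx],
                               [events[i] for i in idx],
                               [expr[i] > cutoff for i in idx]))
    frac_sig = sum(p < alpha for p in pvals) / n_trials
    return frac_sig, sorted(pvals)[n_trials // 2]
```

Calling `subsample_robustness` once for each of `fraction=0.75`, `0.66` and `0.5` gives the three numbers to compare against the full-cohort P value.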

Lastly, you can test multiple (if not all) cut-offs. Is there a significant value and, if so, why this specific value? Can you subset your cohort based on this value and see whether the expression of another gene or some clinical information fits this survival difference?
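A minimal Python sketch of scanning cut-offs, combined with the permutation check raised in the question (the second null: shuffle expression and re-optimise the cut-off on every shuffle). Everything here is an assumption for illustration: the names `logrank_chi2`, `best_cutoff` and `permutation_p` are hypothetical, the two-group log-rank statistic is hand-rolled rather than taken from a survival library, the 20th to 80th percentile scan window is borrowed from the other answer in this thread, and `events` is coded 1 for death and 0 for censored.

```python
import random

def logrank_chi2(times, events, in_high):
    """Two-group log-rank chi-square statistic (1 df)."""
    rows = sorted(zip(times, events, in_high))
    at_risk = len(rows)
    at_risk_high = sum(1 for r in rows if r[2])
    obs_minus_exp = 0.0
    variance = 0.0
    i = 0
    while i < len(rows):
        t = rows[i][0]
        deaths = deaths_high = leaving = leaving_high = 0
        while i < len(rows) and rows[i][0] == t:
            _, event, high = rows[i]
            leaving += 1
            leaving_high += 1 if high else 0
            if event:
                deaths += 1
                deaths_high += 1 if high else 0
            i += 1
        if deaths and at_risk > 1:
            frac = at_risk_high / at_risk
            obs_minus_exp += deaths_high - deaths * frac
            variance += (deaths * frac * (1 - frac)
                         * (at_risk - deaths) / (at_risk - 1))
        at_risk -= leaving
        at_risk_high -= leaving_high
    return obs_minus_exp ** 2 / variance if variance > 0 else 0.0

def best_cutoff(expr, times, events, lo=0.2, hi=0.8):
    """Scan observed values between the lo and hi quantiles.

    Returns (largest chi-square, cut-off that produced it).
    """
    ranked = sorted(expr)
    n = len(ranked)
    candidates = sorted(set(ranked[int(lo * n):int(hi * n)]))
    best = (0.0, None)
    for c in candidates:
        chi2 = logrank_chi2(times, events, [x > c for x in expr])
        if chi2 > best[0]:
            best = (chi2, c)
    return best

def permutation_p(expr, times, events, n_perm=200, seed=0):
    """Permutation P value for the maximally selected statistic."""
    rng = random.Random(seed)
    observed, cutoff = best_cutoff(expr, times, events)
    shuffled = list(expr)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the expression/survival link
        null_stat, _ = best_cutoff(shuffled, times, events)
        if null_stat >= observed:
            exceed += 1
    return cutoff, observed, (exceed + 1) / (n_perm + 1)
```

Comparing the observed maximum against maxima from shuffled data is what accounts for having tried many cut-offs; the nominal log-rank P value at the best cut-off alone does not.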

The approach you describe makes sense, but you will not find the solution with a single test.

Hope this helps.

Cheers.