determining cutoff for Kaplan Meier
8.2 years ago
XD ▴ 10

I am analyzing gene X expression in the context of overall survival for a TCGA dataset. I want to take a data-driven approach that determines the optimal cutoff for maximum significance between arms (high and low).

Is this approach acceptable, and what kinds of bias am I introducing? I've seen numerous papers use this type of approach to determine cutoffs for KM survival analysis, but I know there are other options, such as a median split, quartile extremes, or Cox regression instead of KM (although I don't really consider my variable to be continuous).

Also, if I continue with the optimal cutoff, can I use permutation testing to see whether the result is real? What would my null be: randomized gene expression values with the cohort sizes kept the same, or randomized gene expression values with a new optimal cutoff determined each time (allowing the cohort sizes to change)?

Thanks in advance!

kaplan meier survival • 11k views

I have similar questions. I downloaded data from the Xena browser, which hosts TCGA data, and when I try to compare high and low expression, it is difficult to classify the samples.


This is not an answer to the question. I'm moving it to a comment.

6.9 years ago
SamGG ▴ 20

Hi,

I'm not sure this will answer your question. As a first approach, I split the experimental data (gene expression) according to the quartiles, which gives 3 groups: samples below the 25th percentile, samples above the 75th percentile, and samples in between. From that grouping I get a KM plot and a p-value. As a second approach, I use the maxstat package, as nicely described at http://r-addict.com/2016/11/21/Optimal-Cutpoint-maxstat.html. IMHO, a relevant cut point must lie between the 20th and 80th percentiles if the experimental design is roughly balanced.
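To make the second approach concrete, here is a minimal Python sketch of a cut-point scan restricted to the 20th to 80th percentiles. It is not the maxstat implementation: it only picks the cut point with the largest two-group log-rank statistic and does not compute the adjusted p-value that maxstat reports. The function names (`logrank_chi2`, `scan_cutoffs`) and the synthetic data are assumptions made for illustration.

```python
import random

def logrank_chi2(times, events, in_group):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    n = len(times)                      # patients still at risk, both groups
    n1 = sum(1 for g in in_group if g)  # patients still at risk, "high" group
    obs_minus_exp, var = 0.0, 0.0
    i = 0
    while i < len(order):
        t = times[order[i]]
        d = d1 = leaving = leaving1 = 0
        # Collect everyone whose follow-up ends at time t (event or censored).
        while i < len(order) and times[order[i]] == t:
            k = order[i]
            leaving += 1
            leaving1 += bool(in_group[k])
            if events[k]:
                d += 1
                d1 += bool(in_group[k])
            i += 1
        if d > 0 and n > 1:
            obs_minus_exp += d1 - d * n1 / n
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        n -= leaving
        n1 -= leaving1
    return obs_minus_exp ** 2 / var if var > 0 else 0.0

def scan_cutoffs(expr, times, events, lo=0.2, hi=0.8):
    """Return (cutoff, chi2) with the largest log-rank statistic,
    trying only expression values between the lo and hi quantiles."""
    ranked = sorted(expr)
    n = len(ranked)
    best_cut, best_chi2 = None, -1.0
    for c in sorted(set(ranked[int(lo * n):int(hi * n)])):
        chi2 = logrank_chi2(times, events, [x > c for x in expr])
        if chi2 > best_chi2:
            best_cut, best_chi2 = c, chi2
    return best_cut, best_chi2

# Synthetic data (an assumption, not TCGA): higher expression, higher hazard.
rng = random.Random(1)
expr = [rng.gauss(0, 1) for _ in range(100)]
times = [rng.expovariate(2.0 if x > 0.3 else 1.0) for x in expr]
events = [rng.random() < 0.8 for _ in expr]  # False means censored

best_cut, best_chi2 = scan_cutoffs(expr, times, events)
print(best_cut, best_chi2)
```

Because the cut point is chosen to maximise the statistic, the plain log-rank p-value at `best_cut` is optimistic and should not be reported without a correction.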

HTH


Which is more suitable for TCGA data where you don't really have control over the number of patients that fall into each quartile/cut-point bin?

6.9 years ago
Tom_L ▴ 350

Considering gene expression, you should primarily rely on unsupervised approaches such as a mean or median split (commonly used). However, I would not recommend the median split, since you arbitrarily split your cohort in half, and I guess that not exactly 50% of patients will survive in your analysis.

Independently of this result, I would recommend performing a differential expression analysis to see how your gene performs compared to others, and asking whether there could be a connection between the top differentially expressed genes and yours (same pathway?). Another alternative would be to investigate all genes by survival analysis. I would also recommend some multivariate analysis of your gene against other clinical variables that have a significant impact on patient survival: tumour grade, size, chemotherapy, radiation therapy, etc.

Depending on your sample size (>200), you can consider generating thousands of random sub-samples (75/66/50% of your total cohort) and performing the same analysis on each. How many random trials reach a significant survival difference, and how much worse is the P value compared to the total cohort (due, in part, to the loss of statistical power associated with sub-sampling)? This will indicate how robust your expression classification is.
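A rough Python sketch of this sub-sampling check, under some assumptions: the cut-off is fixed beforehand and reused in every sub-sample, significance means a two-group log-rank p-value below 0.05, and the data are synthetic. The function names are hypothetical, and 200 trials are used here only to keep the example fast.

```python
import math
import random

def logrank_pvalue(times, events, in_group):
    """Two-group log-rank test; returns the chi-square (1 df) p-value."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    n = len(times)
    n1 = sum(1 for g in in_group if g)
    obs_minus_exp, var = 0.0, 0.0
    i = 0
    while i < len(order):
        t = times[order[i]]
        d = d1 = leaving = leaving1 = 0
        while i < len(order) and times[order[i]] == t:
            k = order[i]
            leaving += 1
            leaving1 += bool(in_group[k])
            if events[k]:
                d += 1
                d1 += bool(in_group[k])
            i += 1
        if d > 0 and n > 1:
            obs_minus_exp += d1 - d * n1 / n
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        n -= leaving
        n1 -= leaving1
    chi2 = obs_minus_exp ** 2 / var if var > 0 else 0.0
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))

def subsample_hit_rate(expr, times, events, cutoff, fraction,
                       n_trials=200, alpha=0.05, seed=0):
    """Fraction of random sub-samples (drawn without replacement) in which
    the high/low split at `cutoff` is significant at level alpha."""
    rng = random.Random(seed)
    size = int(round(fraction * len(expr)))
    hits = 0
    for _trial in range(n_trials):
        idx = rng.sample(range(len(expr)), size)
        p = logrank_pvalue([times[i] for i in idx],
                           [events[i] for i in idx],
                           [expr[i] > cutoff for i in idx])
        hits += p < alpha
    return hits / n_trials

# Synthetic cohort (an assumption): high expression doubles the hazard.
rng = random.Random(2)
expr = [rng.gauss(0, 1) for _ in range(240)]
times = [rng.expovariate(2.0 if x > 0 else 1.0) for x in expr]
events = [rng.random() < 0.8 for _ in expr]  # False means censored

cutoff = sorted(expr)[len(expr) // 2]  # median split, chosen for the example
rates = {f: subsample_hit_rate(expr, times, events, cutoff, f)
         for f in (0.75, 0.66, 0.50)}
print(rates)
```

A hit rate that collapses quickly as the sub-sample shrinks suggests the split is fragile, although part of the drop is simply the loss of power mentioned above.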

Lastly, you can test multiple (if not all) cut-offs. Is there a significant value and, if so, why this specific value? Can you subset your cohort based on this value and see whether another gene's expression or a clinical variable fits this survival difference?
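Testing many cut-offs inflates the chance of a spurious hit. One way to account for that is the second permutation scheme raised in the question: shuffle the expression values across patients, re-determine the best cut-off in every shuffle, and compare the observed best statistic with that null distribution. The Python sketch below assumes a 20th to 80th percentile search range, 200 permutations, and pure-noise synthetic data; all function names are hypothetical.

```python
import random

def logrank_chi2(times, events, in_group):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    n = len(times)
    n1 = sum(1 for g in in_group if g)
    obs_minus_exp, var = 0.0, 0.0
    i = 0
    while i < len(order):
        t = times[order[i]]
        d = d1 = leaving = leaving1 = 0
        while i < len(order) and times[order[i]] == t:
            k = order[i]
            leaving += 1
            leaving1 += bool(in_group[k])
            if events[k]:
                d += 1
                d1 += bool(in_group[k])
            i += 1
        if d > 0 and n > 1:
            obs_minus_exp += d1 - d * n1 / n
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        n -= leaving
        n1 -= leaving1
    return obs_minus_exp ** 2 / var if var > 0 else 0.0

def best_cutoff(expr, times, events, lo=0.2, hi=0.8):
    """Cut-off (between the lo and hi quantiles) with the largest statistic."""
    ranked = sorted(expr)
    n = len(ranked)
    best_cut, best_chi2 = None, -1.0
    for c in sorted(set(ranked[int(lo * n):int(hi * n)])):
        chi2 = logrank_chi2(times, events, [x > c for x in expr])
        if chi2 > best_chi2:
            best_cut, best_chi2 = c, chi2
    return best_cut, best_chi2

def permutation_pvalue(expr, times, events, n_perm=200, seed=0):
    """P-value for the best cut-off, re-optimising the cut-off in each
    permutation so that the search itself is part of the null."""
    rng = random.Random(seed)
    observed = best_cutoff(expr, times, events)[1]
    shuffled = list(expr)
    exceed = 0
    for _trial in range(n_perm):
        rng.shuffle(shuffled)  # breaks any link between expression and outcome
        if best_cutoff(shuffled, times, events)[1] >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)

# Synthetic data (an assumption): expression is unrelated to survival.
rng = random.Random(3)
expr = [rng.gauss(0, 1) for _ in range(80)]
times = [rng.expovariate(1.0) for _ in expr]
events = [rng.random() < 0.8 for _ in expr]  # False means censored

cut, chi2 = best_cutoff(expr, times, events)
p_adj = permutation_pvalue(expr, times, events)
print(cut, chi2, p_adj)
```

Keeping the cut-off fixed while shuffling (the first scheme in the question) would test only that one split and would not correct for having searched over cut-offs.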

The approach you describe makes sense, but you will not find the solution with a single test.

Hope this helps.

Cheers.
