question about survival analysis using TCGA dataset
4.3 years ago
tujuchuanli

Hi all, I want to perform survival analysis to predict clinical outcome using the genes in my gene set (which I could call a gene signature) in a TCGA dataset. Since I obtained my gene set by analyzing newly downloaded TCGA gene expression data, I want to run the survival analysis on the matched clinical data myself rather than use the available online tools (they may miss some important samples).

I read some papers that did the same thing, and I found that they sometimes add clinical parameters to the survival analysis. For example, this paper (https://peerj.com/articles/1499/) added age and tumor_grade to the survival analysis.

Should I add some clinical parameters as they did, or just use the expression values?


If you are interested in knowing whether any of the clinical parameters might act as a confounding variable or have some effect on survival, then you can include them. In any case, you can compare the results of two analyses, one with and one without the clinical parameters.
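As a rough illustration of that comparison, here is a minimal sketch in R using the survival package. The data are simulated, and the column names (age, grade, gene, time, status) are hypothetical stand-ins for whatever your TCGA clinical and expression tables actually contain. It fits one Cox model with expression only and one adjusted for clinical covariates, then compares the two.

```r
library(survival)

set.seed(1)
n <- 200

# Simulated stand-ins for TCGA clinical and expression data (hypothetical columns)
dat <- data.frame(
  age   = rnorm(n, mean = 60, sd = 10),
  grade = factor(sample(c("G1", "G2", "G3"), n, replace = TRUE)),
  gene  = rnorm(n)  # e.g. log2 expression of one signature gene
)

# Simulated survival times: hazard increases with expression and age
lp          <- 0.5 * dat$gene + 0.03 * (dat$age - 60)
event_time  <- rexp(n, rate = 0.01 * exp(lp))
censor_time <- rexp(n, rate = 0.005)
dat$time    <- pmin(event_time, censor_time)
dat$status  <- as.integer(event_time <= censor_time)  # 1 = event observed

# Model 1: expression only
fit_expr <- coxph(Surv(time, status) ~ gene, data = dat)

# Model 2: expression adjusted for clinical covariates
fit_adj <- coxph(Surv(time, status) ~ gene + age + grade, data = dat)

# Hazard ratio for the gene with and without adjustment
hr <- c(unadjusted = exp(coef(fit_expr)[["gene"]]),
        adjusted   = exp(coef(fit_adj)[["gene"]]))
print(hr)

# Likelihood ratio test: do the clinical covariates improve the fit?
print(anova(fit_expr, fit_adj))
```

If the gene's hazard ratio changes noticeably after adjustment, that suggests the clinical covariates account for part of its apparent effect.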


Thanks for answering me. It helps me a lot!

4.3 years ago

You should only adjust for age and tumour grade in your survival models if you believe that they are important factors to whatever your hypotheses may be. To quote the authors:

We were interested in the effect a gene has on prognosis independent of factors such as tumor grade and age of a patient.

So, they adjusted for age and tumour grade specifically because that was the focus of their study, i.e., they evidently believed that age and tumour grade would confound the effect that a gene's expression has on prognosis, which makes sense. They also appear to have included gender in each model, which is not relevant for all cancers, of course, although even for breast cancer there are some male patients in the TCGA BRCA cohort.

From what I gather, they built an independent Cox proportional hazards model for each gene, and in each case they included age, gender, and tumour grade, but the included covariates varied for different cancers. They then obtained the p-values for each gene and clustered samples using the top 100 genes (Figure 1). The survival curves that appear in Figure 1 are actually just based on the clusters that they identify in this clustering. From the Cox models, they also obtained the Beta coefficients and did further work with these.
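The per-gene modelling described above could be sketched roughly as follows. This is only an illustration on simulated data, not the paper's actual code: the object names (expr, clin, fit_one_gene) and the column names are hypothetical, and the covariates you include should depend on your cancer type and hypothesis.

```r
library(survival)

set.seed(42)
n_samples <- 150
n_genes   <- 20

# Simulated stand-ins: expression matrix (samples x genes) and clinical table
expr <- matrix(rnorm(n_samples * n_genes), nrow = n_samples,
               dimnames = list(NULL, paste0("gene", seq_len(n_genes))))
clin <- data.frame(
  age    = rnorm(n_samples, mean = 60, sd = 10),
  gender = factor(sample(c("female", "male"), n_samples, replace = TRUE)),
  grade  = factor(sample(c("G1", "G2", "G3"), n_samples, replace = TRUE))
)

# Make gene1 truly prognostic in the simulation
event_time  <- rexp(n_samples, rate = 0.01 * exp(0.8 * expr[, "gene1"]))
censor_time <- rexp(n_samples, rate = 0.005)
clin$time   <- pmin(event_time, censor_time)
clin$status <- as.integer(event_time <= censor_time)  # 1 = event observed

# One Cox model per gene, each adjusted for the same clinical covariates
fit_one_gene <- function(g) {
  d   <- cbind(clin, expression = expr[, g])
  fit <- coxph(Surv(time, status) ~ expression + age + gender + grade, data = d)
  s   <- summary(fit)$coefficients["expression", ]
  data.frame(gene = g,
             beta = s[["coef"]],        # Cox Beta coefficient for the gene
             hr   = s[["exp(coef)"]],   # hazard ratio
             p    = s[["Pr(>|z|)"]])    # Wald test p-value
}
res <- do.call(rbind, lapply(colnames(expr), fit_one_gene))

# Rank genes by p-value; the top-ranked genes could then feed into clustering
res <- res[order(res$p), ]
print(head(res))
```

With real data, expr would hold the expression values of your gene set, with rows matched to the samples in the clinical table.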

To help you, Cox proportional hazards regression is implemented in R via the coxph() function from the survival package. I have already put some code for doing this on some Biostars posts:

Kevin