Survival analysis for a list of genes using TCGA data
1
0
Entering edit mode
2.7 years ago
Mike ★ 1.7k

Hi all,

I have a list of genes (around 300 genes) and I want survival analysis to find only significant genes. I am using TCGA RSEM normalized data in survival package using following command, but I'm not sure how to choose significant genes.

coxph(Surv(OS_MONTHS, Events) ~., data=merged_data [, c(2, 4:303)])


output...

                              coef    exp(coef)           se(coef)    z    p
Gene1                0.07092   1.07349  0.05348 1.33 0.18
Gene2               0.02332   1.02360  0.05813 0.40 0.69
Gene3               0.00175   1.00175  0.06734 0.03 0.98
Gene n.....


Thanks

Survival coxph TCGA Survival analysis • 2.6k views
2
Entering edit mode

Hey Mike, usually, one takes the log rank p-value. The Hazard Ratio is specified by exp(coef). Have you included all of your genes in the model? It may be better to test each independently and then, perhaps, construct a final model with just those genes that are statistically significant. I, of course, have a tutorial: Survival analysis with gene expression

0
Entering edit mode

Thanks Kevin, as always your help and codes are really very helpful. Yes I included all genes. you can see my comand:

coxph(Surv(OS_MONTHS, Events) ~., data=merged_data [, c(2, 4:303)])


Is it correct? Sorry for again.. , what do you mean by test each independently and then construct a final model with selected genes? meantime I am installing your package RegParallel and will follow your protocol.

1
Entering edit mode

Okay. If you use Windows, then select a low number of cores with RegParallel. It runs on Windows but more efficient on Linux / Mac. The survival part is also in the vignette: https://github.com/kevinblighe/RegParallel#survival-analysis-via-cox-proportional-hazards-regression

There is a difference between testing each gene independently and including them all in the same model from the beginning. When your formula is this:

Surv(OS_MONTHS, Events) ~ gene1 + gene2 + gene3)


...then, the model interprets this as an additive effect between the 3 genes. The genes are 'adjusting' for each other's effects (for a better description, could talk to a statistician). The p-values that you get will differ from when you do:

Surv(OS_MONTHS, Events) ~ gene1)
Surv(OS_MONTHS, Events) ~ gene2)
Surv(OS_MONTHS, Events) ~ gene3)


The RegParallel code will test each variable independently for you and then piece the results together into a single table. This is effectively the same as what EdgeR, DESeq2, and limma are doing, too, i.e., they test each gene independently.

Once you have identified some key genes, then you could still just report them as independently statistically significantly associated with survival. Note the difference in the survival curve when you plot an independent model versus the additive model.

0
Entering edit mode

Thank you very much, for RegParallel I need to upgrade my R (I have R version 3.3.3).

0
Entering edit mode

Yes, this is a requirement from Bioconductor. When a package is being developed, it has to use the latest version. RegParallel was only accepted last month, so, it requires R 3.5. If you need to definitively use it, just obtain the development version direct from GitHub.

1
Entering edit mode

I have now removed the R version requirement for this in the development version. You can install it with:

devtools::install_github('kevinblighe/RegParallel')

0
Entering edit mode

It works perfectly, thank you very much!

0
Entering edit mode

In doing this, it would be a good idea to update all of your current packages with:

source("https://bioconductor.org/biocLite.R")
biocLite()


or

BiocManager::install()

0
Entering edit mode
7 weeks ago

Hi! now you can do this task very easily with my recently developed tool named 'geneSA' (https://github.com/huynguyen250896/geneSA). Its output will automatically report genes statistically significant with survival outcome. Give it a try ;)