This is kind of a general question that I have been curious about for a while. Please consider my example:
I have a gene that seems like it should be associated with pro-cancer effects based on the literature.
I go ahead and mine RNA-seq samples from cancer patients and code them as "high expressers" or "low expressers" of my gene.
I do some survival analysis using univariate Cox regression and find that high expression of my gene is associated with significantly reduced overall survival.
My question is: how do I actually know that this gene is an independent prognosticator, and that I have not fallen into the "correlation does not equal causation" trap?
How do I know that, when selecting my population of "high expressers" of my gene of interest, I am not inadvertently over-representing some other gene that is a much more powerful and independent prognosticator?
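
To make the worry concrete, here is a minimal sketch on simulated data (not my real cohort). Everything in it is an assumption for illustration: the two hypothetical genes A and B, the effect sizes, the censoring rate, the median split, and the hand-written `cox_univariate` helper, which is a bare-bones Newton-Raphson fit of a one-covariate Cox model and not a substitute for a proper survival package. In the simulation only gene B changes the hazard; gene A is merely correlated with B.

```python
import math
import random
import statistics


def cox_univariate(times, events, x, iters=25, tol=1e-9):
    """Fit a one-covariate Cox model by Newton-Raphson on the Breslow
    partial likelihood. Returns (beta, standard_error)."""
    # Sort by descending time so the risk set grows as we walk the list.
    order = sorted(range(len(times)), key=lambda i: -times[i])
    beta = 0.0
    info = 0.0
    for _ in range(iters):
        s0 = s1 = s2 = 0.0      # running sums over the current risk set
        score = info = 0.0      # first derivative and observed information
        k = 0
        while k < len(order):
            # Add everyone sharing this time to the risk set first (Breslow ties).
            t = times[order[k]]
            tied = []
            while k < len(order) and times[order[k]] == t:
                i = order[k]
                w = math.exp(beta * x[i])
                s0 += w
                s1 += w * x[i]
                s2 += w * x[i] * x[i]
                tied.append(i)
                k += 1
            for i in tied:
                if events[i]:
                    mean = s1 / s0
                    score += x[i] - mean
                    info += s2 / s0 - mean * mean
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta, 1.0 / math.sqrt(info)


# Simulated cohort (all numbers below are made up for illustration).
rng = random.Random(0)
n = 400
gene_a, gene_b, times, events = [], [], [], []
for _ in range(n):
    b = rng.gauss(0.0, 1.0)
    a = 0.8 * b + 0.6 * rng.gauss(0.0, 1.0)   # A is correlated with B
    t_event = rng.expovariate(0.1 * math.exp(0.7 * b))  # only B drives the hazard
    t_censor = rng.expovariate(0.03)          # independent censoring
    gene_a.append(a)
    gene_b.append(b)
    times.append(min(t_event, t_censor))
    events.append(1 if t_event <= t_censor else 0)

# Code samples as high/low expressers by a median split, as in my real analysis.
cut_a = statistics.median(gene_a)
cut_b = statistics.median(gene_b)
high_a = [1 if a > cut_a else 0 for a in gene_a]
high_b = [1 if b > cut_b else 0 for b in gene_b]

# Univariate Cox on gene A alone, ignoring gene B entirely.
beta_a, se_a = cox_univariate(times, events, high_a)
hr_a = math.exp(beta_a)
z_a = beta_a / se_a

# Fraction of high-A samples that are also high-B.
overlap = sum(ha and hb for ha, hb in zip(high_a, high_b)) / sum(high_a)

print(f"Gene A high vs low: HR = {hr_a:.2f}, z = {z_a:.2f}")
print(f"Share of high-A samples that are also high-B: {overlap:.2f}")
```

In a setup like this I would expect gene A to come out with a hazard ratio above 1 in the univariate model even though it has no effect of its own, simply because the "high A" group is enriched for "high B" samples. That is exactly the situation I am afraid of being in with my real gene.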