Hazard ratio for many genes (gene signature)
1
0
Entering edit mode
15 months ago

I noticed that the Gepia2 web server also accepts more than a single gene as input to calculate log-rank and hazard ratio for TCGA data as if there was only a single gene. A "High" and a "Low" group are internally assigned to the samples by the server. I wondered if the following code/approach is the/a correct way to obtain the combined effect of 2 genes in survival analysis or some other methods are present out there. I use a new "virtual" gene with expresion values equal to the mean of the genes of interest.

library(survival)
set.seed(1)
first.gene= rnorm(300); z_ first.gene= scale(first.gene)
second.gene= rnorm (300);  z_ second.gene= scale(second.gene)
# Averaging genes for each sample.
dummy.gene= apply(cbind(first.gene,second.gene),1,mean)
# or is this better to be done on z-scores? "Making expression values comparable"
#dummy.gene_z= apply(cbind(z_first.gene,z_second.gene),1,mean)
g= dummy.gene> median(dummy.gene)
dat= data.frame(rats, g)
s= Surv( dat$time, dat$status)
coxph(s~g, data=dat)


I think we should also check whether the two genes are not perfectly negatively correlated.

survival analysis R • 732 views
1
Entering edit mode
15 months ago

I'm not really an expert on this, but I'm pretty sure you can do coxph(s~first.gene + second.gene) Note that first.gene + second.gene isn't saying they should be added together, but that they are both predictors of s.

0
Entering edit mode

Dear i.sudbery ,

Thank you for your response. I am cognizant of the fact you mentioned; however, actually I am not sure about the very terminology I used for the question.

1
Entering edit mode

I'm sorry, then I don't understand. Why calculate a mean for the two genes if you can put both genes independently into the model?

0
Entering edit mode

Thank you. My actual problem is when in genomic studies, we obtain _say_ 100 upregulated and 100 down-regulated genes. How can one predict or find an estimate of, the effect of all these genes on survival?

1
Entering edit mode

Test each independently, at first, and then proceed from that point.

0
Entering edit mode

Thank you Dr. Blighe. I have actually tested the genes independently on cancer patients. For genes with significant log-rank test result (after FDR) some down-regulated genes show hazard ratios>1 (and some up-regulatd show HR<1), so their de-regulation is supposed to lengthen the lives of patients. However, hazard ratio p values are not significant for these "beneficially" de-regulated genes. I "feel" that the combination of these 100 genes might have significant HR, at least if they are not highly correlated in patients. I wanted to test the hypothesis, and already had the GEPIA2 webserver do the kind of task for me. I was curious about the underlying approach.

1
Entering edit mode

Would you not do:

   # 100 genes in 100 patients
simulated.genes <-  matrix(rnorm(100*100), nrows=100, ncols=100))
colnames(simulated.genes) <- paste0("gene", 1:100)
rownames(simulated.genes) <- paste0("patient", 1:100)

# divide into high or low expression
gene.status <- apply(simulated.genes, 2, function(x) x > median(x))

s <- Surv(rats$time, rats$status)
dat =<- data.frame(s, gene.status)

coxph(s ~ ., data=dat)


This should fit a model using all 100 genes, combining them in the proportions and ways that leads to the best prediction of survival I would guess this is what GEPIA2 is doing. As Keven points out, there are a million ways to divide the genes into high and low expression groups.

Alternatively you could do some sort of eigengene analysis, where you extract an general pattern across the genes. You might do this by taking the first principle component (or first two etc) of the expression matrix and using that as your predictor.

0
Entering edit mode

How may I summarize the resulting 100 coefs/HR? May I "add" coef of 100 genes and then use exp() (if I do the analysis on z-scores)? Or use glmnet to predict the effect of simultaneous chang of the 100 genes? Eigengene analysis also sounds great. I wondered if it is plausible to use WGCNA package out of box, for censored survival data, as WGCNA clusters genes, and creates an eigengene for each cluster, that should at least help with p-value adjustment.

1
Entering edit mode

You can use Z-scores, binary values separated by median, quartiles, quantiles, the original expression values, or any combination of these. There is no right nor wrong, really, but how you interpret the results may then change.

If your data is skewed, then obviously you need to address that.

You should also check for co-correlation.