Hi,

I am conducting a survival analysis using the gene expression values (converted to z-scores) from a certain gene and event free survival time for 249 patients. I am a newbie to this, so please excuse my lack of expertise (this is also my second question on here, so sorry for any formatting errors, I'm still figuring this out).

Here is my code:

finaldata$SurvObj <- with(finaldata, Surv(efst, status == 2))

finaldata$SurvObj

res.cox1 <- coxph(SurvObj ~ Zscore, data = finaldata)

summary(res.cox1)

And here is the output:

```
Call:
coxph(formula = SurvObj ~ Zscore, data = finaldata)
n= 247, number of events= 140
(2 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
Zscore 0.5878 1.8001 0.2118 2.776 0.0055
---
Signif. codes: 0 0.001 0.01 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
Zscore 1.8 0.5555 1.189 2.726
Concordance= 0.561 (se = 0.025 )
Likelihood ratio test= 7.92 on 1 df, p=0.005
Wald test = 7.71 on 1 df, p=0.006
Score (logrank) test = 7.74 on 1 df, p=0.005
```

efst is the event free survival time for patients, the status is whether they are dead or alive (1 is alive, 2 is dead), and Zscore is the Z-Score value for the expression of the gene in question for each of the patients (to understand whether the expression is relatively high or low).

However, I find it highly improbable that I am receiving such p-values and that leads me to think that I have definitely done something incorrectly since this is my first time trying this out.

Please let me know whether you are able to find any errors I have made/how to fix them, or any help at all with how to better interpret this data. I appreciate it!

Thank you so much for your response, Dr. Blighe! I had a question -- so originally, the data that I had was RMA (robust multiarray average) transformed data that was from an Affymetrix microarray for only the tumor tissue samples for 249 neuroblastoma patients (no normal tissue). I then calculated the Z-Scores for each patient using the expression value of the gene in question (NEK2) for that patient, subtracting the mean (from all the genes from all the patients), and dividing by standard deviation (once again, from all genes of all patients). Is there perhaps a flaw with how I calculated my Z-score? I then used those z-score values, event free survival time, and state in my regression method. This is for a school project, and I am still learning as I go along, so please excuse me for all the questions (I am feeling a bit lost). If the above data is right, then that means my hazard ratio is 1.8. How would I interpret this relative to the expression level of the gene? Thank you for your time!

Calculating

globalZ-scores, like you have done, is no problem. Sometimes they are calculated independently on a per-gene basis, though.With any type of regression, though, you need to be mindful of outliers, which can give misleading results.

How do the first few rows of your input data,

`finaldata`

, actually appear?; How is`state`

encoded?As you are using a continuous scale for the Cox model predictors, the interpretation of a HR=1.8. is not readily intuitive. The beta coefficient, 0.5878, from which the HR is calculated, relates to the difference between your groups / strata based on a unit change [i.e. value of 1] in your input expression level. Put another way, if we increase the Z-score of the expression by 1, how does group

`x`

change with respect to group`y`

. The HR is then the exponent of the beta coefficient.Hi Dr Blighe, That makes some more sense now. I have re-done my regression model using just the RMA transformed data (not z-scores) because RMA data is already skewed and log transformed (https://support.bioconductor.org/p/50480/) so I don't think I need to calculate z-scores (please correct me if I'm wrong), but I was wondering how I would be able to test all the genes for all 249 patients using Cox at once and filter them for p value < 0.05 and HR > 1. I saw your extremely helpful tutorial (https://www.biostars.org/p/344233/), but I am not sure how I would follow this and use your RegParallel package since my data is in 249 separate txt files (like this: ftp://caftpd.nci.nih.gov/pub/OCG-DCC/TARGET/NBL/gene_expression_array/L3/gene/Full/chla.org_NBL.HumanExon.Level-3.BER.full_gene.TARGET-30-PAAPFA-01A-01R.txt) and not a GEO dataset like the one that you used. I would appreciate any tips or steps that I could take! Thanks, Ankita

Fortunately or unfortunately, learning how to get data into the correct format is one of the most common things that you will do as a bioinformatician. In your situation, you should first obtain a vector of all files (

`list.files()`

) and then read in the information via`read.table()`

,`fread()`

, or something else. You can loop through each element of the file listing via a`for`

loop, or do it better / quicker via`lapply()`

, something like:Ok, thank you! I had one last question, sorry -- since I am now using just the RMA transformed data (http://www.molmine.com/magma/loading/rma.htm) instead of z-scores, how would I interpret the HR value and generate Kaplan-Meier curves since I do not know the relative expression levels (what constitutes of high or low expression)? I was thinking that I would have to set a cut-off value or a range based on the median or the mean, but I was not sure exactly how I would go about this, or whether this would be "conventional", and would really appreciate some suggestions. Here is my new output using the RMA transformed values:

I see that my beta coefficient value is 0.393 and the HR is 1.482, but like I said earlier, since I am not using z-scores anymore and I am using quantile normalized data that has been pre-processed, I am not really sure how to interpret this.

Thank you so much!

The idea is still the same for using just the normalised expression levels, i.e., the beta coefficient represents the change between cases / controls based on 1-unit increase in your expression levels.

The interpretation of the output would indeed be easier if you converted your expression levels into, e.g.,

`high`

|`mid`

|`low`

. For example, you could convert the data to Z-scores again, and then convert the data based on:`Z < (1.96 * -1) = low`

)Z = 1.96 is a cut-off for statistical significance at 5% alpha when in a two-tailed distribution

Ok, thank you! I am working with just tumor data (so no control or normal group to compare expression levels to), so I think I will set the ranges/cut-offs based on the median and standard deviation of the whole dataset (hopefully that works). Once again, thanks for all the help, it has been very useful.