**37k**wrote:

For this example, we will load GEO breast cancer gene expression data with recurrence free survival (RFS) from Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis. Specifically, we will encode each gene's expression into `Low`

**|** `Mid`

**|** `High`

based on Z-scores and compare these against RFS in a Cox Proportional Hazards (Cox) survival model.

# 1, read in and prepare data

```
library(Biobase)
library(GEOquery)
# load series and platform data from GEO
gset <- getGEO('GSE2990', GSEMatrix =TRUE, getGPL=FALSE)
x <- exprs(gset[[1]])
# remove Affymetrix control probes
x <- x[-grep('^AFFX', rownames(x)),]
# transform the expression data to Z scores
x <- t(scale(t(x)))
# extract information of interest from the phenotype data (pdata)
idx <- which(colnames(pData(gset[[1]])) %in%
c('age:ch1', 'distant rfs:ch1', 'er:ch1',
'ggi:ch1', 'grade:ch1', 'node:ch1',
'size:ch1', 'time rfs:ch1'))
metadata <- data.frame(pData(gset[[1]])[,idx],
row.names = rownames(pData(gset[[1]])))
# remove samples from the pdata that have any NA value
discard <- apply(metadata, 1, function(x) any( is.na(x) ))
metadata <- metadata[!discard,]
# filter the Z-scores expression data to match the samples in our pdata
x <- x[,which(colnames(x) %in% rownames(metadata))]
# check that sample names match exactly between pdata and Z-scores
all((colnames(x) == rownames(metadata)) == TRUE)
## [1] TRUE
# create a merged pdata and Z-scores object
coxdata <- data.frame(metadata, t(x))
# tidy column names
colnames(coxdata)[1:8] <- c('Age', 'Distant.RFS', 'ER',
'GGI', 'Grade', 'Node',
'Size', 'Time.RFS')
# prepare phenotypes
coxdata$Distant.RFS <- as.numeric(coxdata$Distant.RFS)
coxdata$Time.RFS <- as.numeric(gsub('^KJX|^KJ', '', coxdata$Time.RFS))
coxdata$ER <- factor(coxdata$ER, levels = c(0, 1))
coxdata$Grade <- factor(coxdata$Grade, levels = c(1, 2, 3))
```

# 2, test each gene independently via Cox regression

With the data prepared, we can now apply a Cox survival model independently for each gene (probe) in the dataset against RFS.

Here we will use RegParallel to fit the Cox model independently for each gene. *RegParallel* is accepted to Bioconductor (October 18, 2018), with official release in Bioconductor 3.8. To install dev version: `devtools::install_github('kevinblighe/RegParallel')`

```
library(survival)
library(RegParallel)
res <- RegParallel(
data = coxdata,
formula = 'Surv(Time.RFS, Distant.RFS) ~ [*]',
FUN = function(formula, data)
coxph(formula = formula,
data = data,
ties = 'breslow',
singular.ok = TRUE),
FUNtype = 'coxph',
variables = colnames(coxdata)[9:ncol(coxdata)],
blocksize = 2000,
cores = 2,
nestedParallel = FALSE,
conflevel = 95)
res <- res[!is.na(res$P),]
res
```

# 3, annotate top hits with *biomaRt*

Filter by Log Rank p<0.01

```
res <- res[order(res$LogRank, decreasing = FALSE),]
final <- subset(res, LogRank < 0.01)
probes <- gsub('^X', '', final$Variable)
library(biomaRt)
mart <- useMart('ENSEMBL_MART_ENSEMBL', host='useast.ensembl.org')
mart <- useDataset("hsapiens_gene_ensembl", mart)
annotLookup <- getBM(mart = mart,
attributes = c('affy_hg_u133a',
'ensembl_gene_id',
'gene_biotype',
'external_gene_name'),
filter = 'affy_hg_u133a',
values = probes,
uniqueRows = TRUE)
annotLookup
```

Two of the top hits include *CXCL12* and *MMP10*. High expression of *CXCL12* was associated with ** good** progression free and overall survival in breast cancer in doi: 10.1016/j.cca.2018.05.041, whilst high expression of

*MMP10*was associated with

**prognosis in colon cancer in doi: 10.1186/s12885-016-2515-7.**

*poor*# 4, encode statistically significant genes as `Low`

**|** `Mid`

**|** `High`

and plot survival curves

```
# extract RFS and probe data for downstream analysis
survplotdata <- coxdata[,c('Time.RFS', 'Distant.RFS',
'X203666_at', 'X205680_at')]
colnames(survplotdata) <- c('Time.RFS', 'Distant.RFS',
'CXCL12', 'MMP10')
# set Z-scale cut-offs for high and low expression
highExpr <- 1.0
lowExpr <- 1.0
survplotdata$CXCL12 <- ifelse(survplotdata$CXCL12 >= highExpr, 'High',
ifelse(x <= lowExpr, 'Low', 'Mid'))
survplotdata$MMP10 <- ifelse(survplotdata$MMP10 >= highExpr, 'High',
ifelse(x <= lowExpr, 'Low', 'Mid'))
# relevel the factors to have mid as the ref level
survplotdata$CXCL12 <- factor(survplotdata$CXCL12,
levels = c('Mid', 'Low', 'High'))
survplotdata$MMP10 <- factor(survplotdata$MMP10,
levels = c('Mid', 'Low', 'High'))
library(survminer)
ggsurvplot(survfit(Surv(Time.RFS, Distant.RFS) ~ CXCL12,
data = survplotdata),
data = survplotdata,
risk.table = TRUE,
pval = TRUE,
break.time.by = 500,
ggtheme = theme_minimal(),
risk.table.y.text.col = TRUE,
risk.table.y.text = FALSE)
```

```
ggsurvplot(survfit(Surv(Time.RFS, Distant.RFS) ~ MMP10,
data = survplotdata),
data = survplotdata,
risk.table = TRUE,
pval = TRUE,
break.time.by = 500,
ggtheme = theme_minimal(),
risk.table.y.text.col = TRUE,
risk.table.y.text = FALSE)
```

**10**• written 4 months ago by Kevin Blighe ♦

**37k**

One typo was found: discard <- apply(metadata, 1, function(x) anyis.na(x))) should be discard <- apply(metadata, 1, function(x) anyis.na(x)))

0Thank you for noticing that. My raw code was actually correct - the error (the lack of an extra parenthesis,

`(`

), was introduced in the visual representation of my code by the Biostars rendering system. I have added a space, and it now looks fine.37kGud one Kevin. I have a question. On what basis Z-scale cutoff 1.0 is selected?

BTW In this tutorial [http://r-addict.com/2016/11/21/Optimal-Cutpoint-maxstat.html] they have used

`maxstat`

(Maximally selected rank statistics) for the cutpoint to classify samples into high and low.130That looks like a good tutorial (through the link that you posted).

The selection of absolute Z=1 was just chosen as a very relaxed threshold for highly / lowly expressed. It is difficult to know where the exact cut-offs should be, and of course biology does not intuitively work on cut-off points. The tutorial above is for fomenting new ideas for survival analysis.

37kDear Dr. Blighe

Thank you for this tutorial. I have a question about using

`Scale()`

for transforming expression data to Z scores. basically, why do we need transforming to z scores while our original data(downloaded from GEO) is normal?I appreciate if you share your comment with me.

Mohammad

60Hello Mohammad. The conversion to Z-scores provides for an easier interpretation on the expression range for each gene. For example, on the Z-scale, we know that +3 equates to 3 standard deviations above the mean expression value in the dataset.

In a normal distribution that is not transformed to Z-scale, a value of 10, 20, 30,

et ceteramay mean nothing in the context of the expression range.The relationship between a normal distribution and the Z-scale is emphasised in this beautiful figure:

[source: https://www.mathsisfun.com/data/standard-normal-distribution.html]

37kDear Dr. Blighe

Thanks for your answer. I have another questions about your SA tutorial due to using RNA-seq expression data:

1-Generally, the measure of expression in RNA-seq is count and different from measure of expression in Microarray Technology. So, for using RNA-seq, Should I modify your survival analysis code? special in Standardization step?

2- honestly, I cant understand

`'~ [*]'`

in`formula = 'Surv(Time.RFS, Distant.RFS) ~ [*]'`

3- phenotype of my data set has fours fields: 'OS status','OS days','RFS status','RFS days'. So, based on

`RegParallel()`

, can I compute 'res' using my phenotype fields? if yes, how can I use these fields in`RegParallel()`

?I deeply appreciate if you share your comment with me.

Mohammad

60You should aim to transform your normalised RNA-seq counts via the variance-stabilised or regularised log transformation (if using DESeq2), or produce log CPM counts (if using EdgeR). Then, you can generally use

`glm()`

, as I use above. If you are aiming to use the normalised, un-transformed counts, then you could use the negative binomial regression via`glm.nb()`

- this may be too advanced, though. Check the manual (via`?RegParallel`

) and vignette forRegParallel.Remember that, in RNA-seq, the general process goes:

This is annotation specific to my package,

RegParallel. Each gene will replace the`[*]`

symbol as the package tests each gene in an independent model. Again, please read the manual and vignette.Then you are likely aiming to do a survival analysis. In that case, you can use

`coxph()`

. Your commands would be:...or:

Note, you will likely have to change the value to

`variables`

. Variables is a vector of gene names that you want to test. It can be any number. If usingRegParallel, the idea is that you have hundreds or thousands or millions of genes to test.37kDear Dr. Blighe

Really Thanks for your answer. But about my first question, I would like to explain more about my data set. It belongs to TCGA and I downloaded as UQ-FPKM. In RNA-seq analysis, this type of data set is normal. So, for using that I transformed it to Log2 space.

1- now, for using this data should I

`scale()`

for transformation to z-score?2- based on my explanationabout TCGA data, which functions are better:

`glm()`

or`glm.nb()`

?3- why you didn't use

`coxph()`

for RNA-seq expression data set in RegParallel vignett? can I use this function for my data set?I deeply appreciate if you share your comment with me.

Mohammad

60Sorry, this is not how Biostars functions. I expect you to read my comments and to then spend some time researching the answers to any further questions that you have. That is the best form of learning.

37kDear Dr.Blighe

Here you design Survival plot for 2 genes: 'MMP10' and 'CXCL12'. Suppose that we have a bunch of gene and after clustering we have n cluster. how can we design Surv plot for each cluster separately?

I appreciate if you share your solution with me

60What do you mean by '

n cluster'?37kn is number of cluster. For example 3 cluster(n=3).

60If you encode the gene's expression as a factor / categorical variable, then the survival function will plot a curve for each level

37kHi Kevin, I am curious to ask can we use Beta values for methylation from each probe instead of the read-count from gene expression.

If yes, these values are continuous and range from 0 to 1, would it be recommended to convert these also to Z score.

10Hey, yes, you could use the Beta values from methylation for the purposes of survival analysis. I think that it is okay to leave the values as 0 to 1.

37k