**52k**wrote:

For this example, we will load GEO breast cancer gene expression data with recurrence free survival (RFS) from Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis. Specifically, we will encode each gene's expression into `Low`

**|** `Mid`

**|** `High`

based on Z-scores and compare these against RFS in a Cox Proportional Hazards (Cox) survival model.

# 1, read in and prepare data

```
library(Biobase)
library(GEOquery)
# load series and platform data from GEO
gset <- getGEO('GSE2990', GSEMatrix =TRUE, getGPL=FALSE)
x <- exprs(gset[[1]])
# remove Affymetrix control probes
x <- x[-grep('^AFFX', rownames(x)),]
# transform the expression data to Z scores
x <- t(scale(t(x)))
# extract information of interest from the phenotype data (pdata)
idx <- which(colnames(pData(gset[[1]])) %in%
c('age:ch1', 'distant rfs:ch1', 'er:ch1',
'ggi:ch1', 'grade:ch1', 'node:ch1',
'size:ch1', 'time rfs:ch1'))
metadata <- data.frame(pData(gset[[1]])[,idx],
row.names = rownames(pData(gset[[1]])))
# remove samples from the pdata that have any NA value
discard <- apply(metadata, 1, function(x) any( is.na(x) ))
metadata <- metadata[!discard,]
# filter the Z-scores expression data to match the samples in our pdata
x <- x[,which(colnames(x) %in% rownames(metadata))]
# check that sample names match exactly between pdata and Z-scores
all((colnames(x) == rownames(metadata)) == TRUE)
## [1] TRUE
# create a merged pdata and Z-scores object
coxdata <- data.frame(metadata, t(x))
# tidy column names
colnames(coxdata)[1:8] <- c('Age', 'Distant.RFS', 'ER',
'GGI', 'Grade', 'Node',
'Size', 'Time.RFS')
# prepare phenotypes
coxdata$Distant.RFS <- as.numeric(coxdata$Distant.RFS)
coxdata$Time.RFS <- as.numeric(gsub('^KJX|^KJ', '', coxdata$Time.RFS))
coxdata$ER <- factor(coxdata$ER, levels = c(0, 1))
coxdata$Grade <- factor(coxdata$Grade, levels = c(1, 2, 3))
```

# 2, test each gene independently via Cox regression

With the data prepared, we can now apply a Cox survival model independently for each gene (probe) in the dataset against RFS.

Here we will use RegParallel to fit the Cox model independently for each gene.

```
library(survival)
library(RegParallel)
res <- RegParallel(
data = coxdata,
formula = 'Surv(Time.RFS, Distant.RFS) ~ [*]',
FUN = function(formula, data)
coxph(formula = formula,
data = data,
ties = 'breslow',
singular.ok = TRUE),
FUNtype = 'coxph',
variables = colnames(coxdata)[9:ncol(coxdata)],
blocksize = 2000,
cores = 2,
nestedParallel = FALSE,
conflevel = 95)
res <- res[!is.na(res$P),]
res
```

# 3, annotate top hits with *biomaRt*

Filter by Log Rank p<0.01

```
res <- res[order(res$LogRank, decreasing = FALSE),]
final <- subset(res, LogRank < 0.01)
probes <- gsub('^X', '', final$Variable)
library(biomaRt)
mart <- useMart('ENSEMBL_MART_ENSEMBL', host='useast.ensembl.org')
mart <- useDataset("hsapiens_gene_ensembl", mart)
annotLookup <- getBM(mart = mart,
attributes = c('affy_hg_u133a',
'ensembl_gene_id',
'gene_biotype',
'external_gene_name'),
filter = 'affy_hg_u133a',
values = probes,
uniqueRows = TRUE)
annotLookup
```

Two of the top hits include *CXCL12* and *MMP10*. High expression of *CXCL12* was associated with ** good** progression free and overall survival in breast cancer in doi: 10.1016/j.cca.2018.05.041, whilst high expression of

*MMP10*was associated with

**prognosis in colon cancer in doi: 10.1186/s12885-016-2515-7.**

*poor*# 4, encode statistically significant genes as `Low`

**|** `Mid`

**|** `High`

and plot survival curves

```
# extract RFS and probe data for downstream analysis
survplotdata <- coxdata[,c('Time.RFS', 'Distant.RFS',
'X203666_at', 'X205680_at')]
colnames(survplotdata) <- c('Time.RFS', 'Distant.RFS',
'CXCL12', 'MMP10')
# set Z-scale cut-offs for high and low expression
highExpr <- 1.0
lowExpr <- -1.0
survplotdata$CXCL12 <- ifelse(survplotdata$CXCL12 >= highExpr, 'High',
ifelse(survplotdata$CXCL12 <= lowExpr, 'Low', 'Mid'))
survplotdata$MMP10 <- ifelse(survplotdata$MMP10 >= highExpr, 'High',
ifelse(survplotdata$MMP10 <= lowExpr, 'Low', 'Mid'))
# relevel the factors to have mid as the ref level
survplotdata$CXCL12 <- factor(survplotdata$CXCL12,
levels = c('Mid', 'Low', 'High'))
survplotdata$MMP10 <- factor(survplotdata$MMP10,
levels = c('Mid', 'Low', 'High'))
library(survminer)
ggsurvplot(survfit(Surv(Time.RFS, Distant.RFS) ~ CXCL12,
data = survplotdata),
data = survplotdata,
risk.table = TRUE,
pval = TRUE,
break.time.by = 500,
ggtheme = theme_minimal(),
risk.table.y.text.col = TRUE,
risk.table.y.text = FALSE)
```

```
ggsurvplot(survfit(Surv(Time.RFS, Distant.RFS) ~ MMP10,
data = survplotdata),
data = survplotdata,
risk.table = TRUE,
pval = TRUE,
break.time.by = 500,
ggtheme = theme_minimal(),
risk.table.y.text.col = TRUE,
risk.table.y.text = FALSE)
```

Kevin,

Unless there is a problem on my end, I think something may have gotten deprecated here. Running code as is only gives me mid and high curves for both genes.

Inspection of

shows that no samples meet the -1 zscore low expression cutoff (as far as I can see). I don't really have any questions about this. I just thought I would point it out just in case it is a repeatable error.

140Hey, that is strange - thanks for the alert. I also just re-ran my own code and observe the same 'phenomenon'. Nothing surprises me anymore in bioinformatics, though. Bioinformatics is like the Wild Wild West.

Here are the new survival curves for this tutorial:

CXCL12## -------------------------

MMP1052kKevin,

I actually do have a quick question related to this now that I think about it (if you have time). I see you have your expression factor with three levels:

In theory this was supposed to produce three curves. Lets say I have a similar multi leveled expression factor that produces multiple curves and I want to do a test that makes a pairwise comparison of every single curve. ie low vs mid, mid vs high etc.

I should just be able to run this command at endpoint which as I understand gives a benjamini hochberg adjusted log-rank test p value for every possible comparison of the multiple curves.

Is this reasonable?

Thank you

140Yes, well, in the example above (my example), we could have done it better by dividing the expression range into tertiles to ensure that there would be at least 1 sample per group. I just chose a hard cut-off of Z=1, though. The tutorial is just to foment ideas, though.

I am not familiar with

`pairwise_survdiff()`

but it looks like a useful function. Thanks for mentioning it here.52k@Kevin Blighe:

Many thanks for your community contribution in Biostars, this thread is very informative and helpful to learn RNA-Seq analysis.

80One typo was found: discard <- apply(metadata, 1, function(x) anyis.na(x))) should be discard <- apply(metadata, 1, function(x) anyis.na(x)))

0Thank you for noticing that. My raw code was actually correct - the error (the lack of an extra parenthesis,

`(`

), was introduced in the visual representation of my code by the Biostars rendering system. I have added a space, and it now looks fine.52kGud one Kevin. I have a question. On what basis Z-scale cutoff 1.0 is selected?

BTW In this tutorial [http://r-addict.com/2016/11/21/Optimal-Cutpoint-maxstat.html] they have used

`maxstat`

(Maximally selected rank statistics) for the cutpoint to classify samples into high and low.190That looks like a good tutorial (through the link that you posted).

The selection of absolute Z=1 was just chosen as a very relaxed threshold for highly / lowly expressed. It is difficult to know where the exact cut-offs should be, and of course biology does not intuitively work on cut-off points. The tutorial above is for fomenting new ideas for survival analysis.

52kDear Dr. Blighe

Thank you for this tutorial. I have a question about using

`Scale()`

for transforming expression data to Z scores. basically, why do we need transforming to z scores while our original data(downloaded from GEO) is normal?I appreciate if you share your comment with me.

Mohammad

90Hello Mohammad. The conversion to Z-scores provides for an easier interpretation on the expression range for each gene. For example, on the Z-scale, we know that +3 equates to 3 standard deviations above the mean expression value in the dataset.

In a normal distribution that is not transformed to Z-scale, a value of 10, 20, 30,

et ceteramay mean nothing in the context of the expression range.The relationship between a normal distribution and the Z-scale is emphasised in this beautiful figure:

[source: https://www.mathsisfun.com/data/standard-normal-distribution.html]

52kDear Dr. Blighe

Thanks for your answer. I have another questions about your SA tutorial due to using RNA-seq expression data:

1-Generally, the measure of expression in RNA-seq is count and different from measure of expression in Microarray Technology. So, for using RNA-seq, Should I modify your survival analysis code? special in Standardization step?

2- honestly, I cant understand

`'~ [*]'`

in`formula = 'Surv(Time.RFS, Distant.RFS) ~ [*]'`

3- phenotype of my data set has fours fields: 'OS status','OS days','RFS status','RFS days'. So, based on

`RegParallel()`

, can I compute 'res' using my phenotype fields? if yes, how can I use these fields in`RegParallel()`

?I deeply appreciate if you share your comment with me.

Mohammad

90You should aim to transform your normalised RNA-seq counts via the variance-stabilised or regularised log transformation (if using DESeq2), or produce log CPM counts (if using EdgeR). Then, you can generally use

`glm()`

, as I use above. If you are aiming to use the normalised, un-transformed counts, then you could use the negative binomial regression via`glm.nb()`

- this may be too advanced, though. Check the manual (via`?RegParallel`

) and vignette forRegParallel.Remember that, in RNA-seq, the general process goes:

This is annotation specific to my package,

RegParallel. Each gene will replace the`[*]`

symbol as the package tests each gene in an independent model. Again, please read the manual and vignette.Then you are likely aiming to do a survival analysis. In that case, you can use

`coxph()`

. Your commands would be:...or:

Note, you will likely have to change the value to

`variables`

. Variables is a vector of gene names that you want to test. It can be any number. If usingRegParallel, the idea is that you have hundreds or thousands or millions of genes to test.52k"normalised counts (statistical analyses performed on these) -->" i have this doubt found that you mentioned about it , are you saying about this function "counts(dds, normalized=TRUE)" whose value can be used for any non parametric statistical lest?

As of now i used mostly rlog and vst value for clustering and pca etc . Not sure if i can use any test on the rlog or vst value , as im not sure if its mathematically correct thing to do with the rlog or vst value or not...

If you can clarify it would be really helpful

630No, it is just in the DESeq2 protocol (and EdgeR). The statistical comparisons are conducted on the normalised, un-transformed counts, which follow a negative binomial distribution. DESeq2 derives p-values, generally, as follows:

*there are variations to thisOne can, of course, produce normalised, transformed counts, and perform their own analyses on these.

52k"No, it is just in the DESeq2 protocol (and EdgeR). The statistical comparisons are conducted on the normalised, un-transformed counts, which follow a negative binomial distribution. DESeq2 derives p-values, generally, as follows:

fit negative binomial regression model independently for each gene's normalised counts extract p-value from the model coefficient via the Wald test applied to the model" yes this part im clear as i read the same in the paper

"of course, produce normalised, transformed counts, and perform their own analyses on these." yes for this one as i get certain genes and i want to make comparison between biological sample .So if i do that comparison running some non parametric test then its not a problem , I guess

630I see. Which test are you doing?

52kI tried wilcoxon rank test as of now...

630Seems okay to me. Ask 10 people and you'll get 10 different answers, though. Each answer is based on the respective experience of the individual.

52kone doubt is clear now

630Dear Dr. Blighe

I use TPM(Transaction per million) method for normalizing my RNA-Seq data set. base on your perfect tutorial I ran

RegParallel()for getting survival analysis. I need your comment for 2 below questions:1- I use 'coxph' as FUNtype for the regression model. is it a suitable function for my problem. if no, which function is your suggestion?

2- As you know in literature, we have multivariate Cox regression and univariate Cox regression. based on

RegParallel(), our Survival Analysis is multivariate or univariate?It would be really helpful If you can clarify me.

90Yes,

coxphis the correct function. TPM is not too bad if you are testing each gene independently, i.e., univariate (in my tutorial, above, each gene is tested independently as part of a univariate Cox model);52kOk, Thanks for your comment. logically, doing multivariate Cox Regression for lots of genes(more than 150 genes) is true? if you agree, how can I run it? can you guide me by tutorial such as the above tutorial?

90If you want to run a multivariate Cox model with that many variables, then consider using the Lasso Cox model: https://cran.r-project.org/web/packages/glmnet/vignettes/Coxnet.pdf

52kDear Dr. Blighe

Really Thanks for your answer. But about my first question, I would like to explain more about my data set. It belongs to TCGA and I downloaded as UQ-FPKM. In RNA-seq analysis, this type of data set is normal. So, for using that I transformed it to Log2 space.

1- now, for using this data should I

`scale()`

for transformation to z-score?2- based on my explanationabout TCGA data, which functions are better:

`glm()`

or`glm.nb()`

?3- why you didn't use

`coxph()`

for RNA-seq expression data set in RegParallel vignett? can I use this function for my data set?I deeply appreciate if you share your comment with me.

Mohammad

90Sorry, this is not how Biostars functions. I expect you to read my comments and to then spend some time researching the answers to any further questions that you have. That is the best form of learning.

52kDear Dr.Blighe

Here you design Survival plot for 2 genes: 'MMP10' and 'CXCL12'. Suppose that we have a bunch of gene and after clustering we have n cluster. how can we design Surv plot for each cluster separately?

I appreciate if you share your solution with me

90What do you mean by '

n cluster'?52kn is number of cluster. For example 3 cluster(n=3).

90If you encode the gene's expression as a factor / categorical variable, then the survival function will plot a curve for each level

52kHi Kevin, I am curious to ask can we use Beta values for methylation from each probe instead of the read-count from gene expression.

If yes, these values are continuous and range from 0 to 1, would it be recommended to convert these also to Z score.

50Hey, yes, you could use the Beta values from methylation for the purposes of survival analysis. I think that it is okay to leave the values as 0 to 1.

52kThanks Kevin, I tried your suggestion and was able to identify prognostic CpG sites. I did the same using gene expression data and interestingly found some overlapping genes. But I am not very sure how to integrate these two results as methylation can regulate the expression of genes that are in trans. In order to address that, checking just the overlap would not work. I do not know how should I proceed. I will really appreciate if u can share your thoughts about it.

Thanks B

50Hi Dr. Blighe, My survplotdata is as below:

I used 0 as cut-offs for high and low expression

When I run this code:

I get this Error:

I appreciate if you guide me and share your comment for solving that Error with me.

90Take a look at

`?ifelse`

52kThanks, Dr. Blighe. I solved my problem but in the below code:

I get below Error:

I appreciate if you guide me and share your comment for solving that Error with me.

90Okay, please spend some more time to debug the error on your own. Check the encoding of your variables, and check what

`survfit()`

and`ggsurvplot()`

expect.52kHi Kevin, thanks for creating this package. I'm learning survival analysis, and am finding your tutorial is very helpful.

I was wondering regarding your suggestion to arrange the tests by log rank p value. From my understanding, the log rank test is computed comparing survival time between groups. So in the RegParallel function, is gene expression being dichotomized? If so, how exactly---is it using Z-score +/- 1? Moreover, because gene expression is continuous, would it not make sense to select 'statistically significant' genes based on p value (and adjust those instead of the log rank p value)?

Thanks, Victor

0No, the package just accepts whatever data that you use. It can be continuous or categorical. It is just in this tutorial that I dichotomise the gene expression values before using the RegfParallel package.

Moreover, because gene expression is continuous, would it not make sense to select 'statistically significant' genes based on p value (and adjust those instead of the log rank p value)?

You can do whatever approach seems valid to you. There is no correct or incorrect approach.

52kwithout clinical information this is not possible to do so isn;t it? If i look at the microarray data of liquid tumor they dont give information as such as you have used here. Is there still a way to run survival analysis ?

630Hey, what information do you have, exactly?

52kso far the microarray data for AML have checked are mostly array expression, they dont give the clinical information of the patients which in this case you have for the breast cancer data set.

630