**54k**wrote:

For this example, we will load GEO breast cancer gene expression data with recurrence free survival (RFS) from Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis. Specifically, we will encode each gene's expression into `Low`

**|** `Mid`

**|** `High`

based on Z-scores and compare these against RFS in a Cox Proportional Hazards (Cox) survival model.

# 1, read in and prepare data

```
library(Biobase)
library(GEOquery)
# load series and platform data from GEO
gset <- getGEO('GSE2990', GSEMatrix =TRUE, getGPL=FALSE)
x <- exprs(gset[[1]])
# remove Affymetrix control probes
x <- x[-grep('^AFFX', rownames(x)),]
# transform the expression data to Z scores
x <- t(scale(t(x)))
# extract information of interest from the phenotype data (pdata)
idx <- which(colnames(pData(gset[[1]])) %in%
c('age:ch1', 'distant rfs:ch1', 'er:ch1',
'ggi:ch1', 'grade:ch1', 'node:ch1',
'size:ch1', 'time rfs:ch1'))
metadata <- data.frame(pData(gset[[1]])[,idx],
row.names = rownames(pData(gset[[1]])))
# remove samples from the pdata that have any NA value
discard <- apply(metadata, 1, function(x) any( is.na(x) ))
metadata <- metadata[!discard,]
# filter the Z-scores expression data to match the samples in our pdata
x <- x[,which(colnames(x) %in% rownames(metadata))]
# check that sample names match exactly between pdata and Z-scores
all((colnames(x) == rownames(metadata)) == TRUE)
## [1] TRUE
# create a merged pdata and Z-scores object
coxdata <- data.frame(metadata, t(x))
# tidy column names
colnames(coxdata)[1:8] <- c('Age', 'Distant.RFS', 'ER',
'GGI', 'Grade', 'Node',
'Size', 'Time.RFS')
# prepare phenotypes
coxdata$Distant.RFS <- as.numeric(coxdata$Distant.RFS)
coxdata$Time.RFS <- as.numeric(gsub('^KJX|^KJ', '', coxdata$Time.RFS))
coxdata$ER <- factor(coxdata$ER, levels = c(0, 1))
coxdata$Grade <- factor(coxdata$Grade, levels = c(1, 2, 3))
```

# 2, test each gene independently via Cox regression

With the data prepared, we can now apply a Cox survival model independently for each gene (probe) in the dataset against RFS.

Here we will use RegParallel to fit the Cox model independently for each gene.

```
library(survival)
library(RegParallel)
res <- RegParallel(
data = coxdata,
formula = 'Surv(Time.RFS, Distant.RFS) ~ [*]',
FUN = function(formula, data)
coxph(formula = formula,
data = data,
ties = 'breslow',
singular.ok = TRUE),
FUNtype = 'coxph',
variables = colnames(coxdata)[9:ncol(coxdata)],
blocksize = 2000,
cores = 2,
nestedParallel = FALSE,
conflevel = 95)
res <- res[!is.na(res$P),]
res
```

# 3, annotate top hits with *biomaRt*

Filter by Log Rank p<0.01

```
res <- res[order(res$LogRank, decreasing = FALSE),]
final <- subset(res, LogRank < 0.01)
probes <- gsub('^X', '', final$Variable)
library(biomaRt)
mart <- useMart('ENSEMBL_MART_ENSEMBL', host='useast.ensembl.org')
mart <- useDataset("hsapiens_gene_ensembl", mart)
annotLookup <- getBM(mart = mart,
attributes = c('affy_hg_u133a',
'ensembl_gene_id',
'gene_biotype',
'external_gene_name'),
filter = 'affy_hg_u133a',
values = probes,
uniqueRows = TRUE)
annotLookup
```

Two of the top hits include *CXCL12* and *MMP10*. High expression of *CXCL12* was associated with ** good** progression free and overall survival in breast cancer in doi: 10.1016/j.cca.2018.05.041, whilst high expression of

*MMP10*was associated with

**prognosis in colon cancer in doi: 10.1186/s12885-016-2515-7.**

*poor*# 4, encode statistically significant genes as `Low`

**|** `Mid`

**|** `High`

and plot survival curves

```
# extract RFS and probe data for downstream analysis
survplotdata <- coxdata[,c('Time.RFS', 'Distant.RFS',
'X203666_at', 'X205680_at')]
colnames(survplotdata) <- c('Time.RFS', 'Distant.RFS',
'CXCL12', 'MMP10')
# set Z-scale cut-offs for high and low expression
highExpr <- 1.0
lowExpr <- -1.0
survplotdata$CXCL12 <- ifelse(survplotdata$CXCL12 >= highExpr, 'High',
ifelse(survplotdata$CXCL12 <= lowExpr, 'Low', 'Mid'))
survplotdata$MMP10 <- ifelse(survplotdata$MMP10 >= highExpr, 'High',
ifelse(survplotdata$MMP10 <= lowExpr, 'Low', 'Mid'))
# relevel the factors to have mid as the ref level
survplotdata$CXCL12 <- factor(survplotdata$CXCL12,
levels = c('Mid', 'Low', 'High'))
survplotdata$MMP10 <- factor(survplotdata$MMP10,
levels = c('Mid', 'Low', 'High'))
library(survminer)
ggsurvplot(survfit(Surv(Time.RFS, Distant.RFS) ~ CXCL12,
data = survplotdata),
data = survplotdata,
risk.table = TRUE,
pval = TRUE,
break.time.by = 500,
ggtheme = theme_minimal(),
risk.table.y.text.col = TRUE,
risk.table.y.text = FALSE)
```

```
ggsurvplot(survfit(Surv(Time.RFS, Distant.RFS) ~ MMP10,
data = survplotdata),
data = survplotdata,
risk.table = TRUE,
pval = TRUE,
break.time.by = 500,
ggtheme = theme_minimal(),
risk.table.y.text.col = TRUE,
risk.table.y.text = FALSE)
```

**10**• written 16 months ago by Kevin Blighe ♦

**54k**

Kevin,

Unless there is a problem on my end, I think something may have gotten deprecated here. Running code as is only gives me mid and high curves for both genes.

Inspection of

shows that no samples meet the -1 zscore low expression cutoff (as far as I can see). I don't really have any questions about this. I just thought I would point it out just in case it is a repeatable error.

210Hey, that is strange - thanks for the alert. I also just re-ran my own code and observe the same 'phenomenon'. Nothing surprises me anymore in bioinformatics, though. Bioinformatics is like the Wild Wild West.

Here are the new survival curves for this tutorial:

CXCL12## -------------------------

MMP1054kKevin,

I actually do have a quick question related to this now that I think about it (if you have time). I see you have your expression factor with three levels:

In theory this was supposed to produce three curves. Lets say I have a similar multi leveled expression factor that produces multiple curves and I want to do a test that makes a pairwise comparison of every single curve. ie low vs mid, mid vs high etc.

I should just be able to run this command at endpoint which as I understand gives a benjamini hochberg adjusted log-rank test p value for every possible comparison of the multiple curves.

Is this reasonable?

Thank you

210Yes, well, in the example above (my example), we could have done it better by dividing the expression range into tertiles to ensure that there would be at least 1 sample per group. I just chose a hard cut-off of Z=1, though. The tutorial is just to foment ideas, though.

I am not familiar with

`pairwise_survdiff()`

but it looks like a useful function. Thanks for mentioning it here.54kDear Dr. Kevin,

As in the K-M plot clear, after running

`ggsurvplot`

we plot Kaplan Meyer which we can see a p-value on it. Here for "MMP10", the p-value equals 0.00047 in your example. I ran the same as your code for my target gene and also ran the Cox Proportional-Hazards Model for that.In my case, the p-value resulted from the Cox regression is 0.04 but the p-value resulted

`ggsurvplot`

for the K-M plot is about 0.1. based on Cox's p-value my study is significant but based on the K-M plot p-value isn't(greater than 0.05). is that results logically acceptable? If yes which p-value should be ignored and which one accepted?I appreciate it if you share your comment with me.

110Please show the exact code that you have used in order to clearly show from where you are deriving your p-values.

54kFor cox regression I use below code:

And by runnig that code I got below result:

As you see the P-Value(Pr(>|z|)) equal 0.0393. now in the following I performed K-M plot generating code:

So, in the following link the result of K-M plot is accecible. and you can see P-value in the plot equals 0.25:

https://www.dropbox.com/s/8rn89ithvqfyfqk/Rplot_K-M_MEturquoise_OS_981018.bmp?dl=0

I appreciate it if you share your comment with me

110These are different functions, so, you should not expect that they return the same p-values. Take a look here:

54kDear Dr. Blighe Thanks for your comment. but as I wrote in the last line of

`summary(fit_SARC_turquoise)`

result you can find Score (log rank) test in which the p-value equals 0.04 by 1 df. but this log rank p-value is different from p-value in K-M plot in this link: https://www.dropbox.com/s/8rn89ithvqfyfqk/Rplot_K-M_MEturquoise_OS_981018.bmp?dl=0I appreciate it if you share your comment with me

110Is

`survplotSARCturquoisedata`

thesame asexact`coxSARCdata`

?54kNo, because

`coxSARCdata`

has a few columns and`survplotSARCturquoisedata`

is a subset of`coxSARCdata`

. you mean for that reason they don't have similar P-value. Ok, Dear Dr. Blighe, how can I interpret this unsimilarity of 2 log-rank P-value resulted from the Cox regression and K-M plot? Can I insert P-value resulted from Cox regression in the K-M plot picture instead K-M plot P-value?110Yes, you can add any p-value to the K-M plot - all that you need to do is:

However, you need to be sure that this is the correct thing to do.

54kDear Dr. Blighe, I have 2 more questions:

1- I need to show K-M plots for 7 genes in one picture. How can I do it?

2- I need to resize of Font of labels(Survival probability, time,..) in the K-M plot.

I appreciate it if you guide me that how can I do them via my code.

11054kDoes that help?

54k@Kevin Blighe:

Many thanks for your community contribution in Biostars, this thread is very informative and helpful to learn RNA-Seq analysis.

80One typo was found: discard <- apply(metadata, 1, function(x) anyis.na(x))) should be discard <- apply(metadata, 1, function(x) anyis.na(x)))

0Thank you for noticing that. My raw code was actually correct - the error (the lack of an extra parenthesis,

`(`

), was introduced in the visual representation of my code by the Biostars rendering system. I have added a space, and it now looks fine.54kGud one Kevin. I have a question. On what basis Z-scale cutoff 1.0 is selected?

BTW In this tutorial [http://r-addict.com/2016/11/21/Optimal-Cutpoint-maxstat.html] they have used

`maxstat`

(Maximally selected rank statistics) for the cutpoint to classify samples into high and low.190That looks like a good tutorial (through the link that you posted).

The selection of absolute Z=1 was just chosen as a very relaxed threshold for highly / lowly expressed. It is difficult to know where the exact cut-offs should be, and of course biology does not intuitively work on cut-off points. The tutorial above is for fomenting new ideas for survival analysis.

54kDear Dr. Blighe

Thank you for this tutorial. I have a question about using

`Scale()`

for transforming expression data to Z scores. basically, why do we need transforming to z scores while our original data(downloaded from GEO) is normal?I appreciate if you share your comment with me.

Mohammad

110Hello Mohammad. The conversion to Z-scores provides for an easier interpretation on the expression range for each gene. For example, on the Z-scale, we know that +3 equates to 3 standard deviations above the mean expression value in the dataset.

In a normal distribution that is not transformed to Z-scale, a value of 10, 20, 30,

et ceteramay mean nothing in the context of the expression range.The relationship between a normal distribution and the Z-scale is emphasised in this beautiful figure:

[source: https://www.mathsisfun.com/data/standard-normal-distribution.html]

54kDear Dr. Blighe

Thanks for your answer. I have another questions about your SA tutorial due to using RNA-seq expression data:

1-Generally, the measure of expression in RNA-seq is count and different from measure of expression in Microarray Technology. So, for using RNA-seq, Should I modify your survival analysis code? special in Standardization step?

2- honestly, I cant understand

`'~ [*]'`

in`formula = 'Surv(Time.RFS, Distant.RFS) ~ [*]'`

3- phenotype of my data set has fours fields: 'OS status','OS days','RFS status','RFS days'. So, based on

`RegParallel()`

, can I compute 'res' using my phenotype fields? if yes, how can I use these fields in`RegParallel()`

?I deeply appreciate if you share your comment with me.

Mohammad

110You should aim to transform your normalised RNA-seq counts via the variance-stabilised or regularised log transformation (if using DESeq2), or produce log CPM counts (if using EdgeR). Then, you can generally use

`glm()`

, as I use above. If you are aiming to use the normalised, un-transformed counts, then you could use the negative binomial regression via`glm.nb()`

- this may be too advanced, though. Check the manual (via`?RegParallel`

) and vignette forRegParallel.Remember that, in RNA-seq, the general process goes:

This is annotation specific to my package,

RegParallel. Each gene will replace the`[*]`

symbol as the package tests each gene in an independent model. Again, please read the manual and vignette.Then you are likely aiming to do a survival analysis. In that case, you can use

`coxph()`

. Your commands would be:...or:

Note, you will likely have to change the value to

`variables`

. Variables is a vector of gene names that you want to test. It can be any number. If usingRegParallel, the idea is that you have hundreds or thousands or millions of genes to test.54k"normalised counts (statistical analyses performed on these) -->" i have this doubt found that you mentioned about it , are you saying about this function "counts(dds, normalized=TRUE)" whose value can be used for any non parametric statistical lest?

As of now i used mostly rlog and vst value for clustering and pca etc . Not sure if i can use any test on the rlog or vst value , as im not sure if its mathematically correct thing to do with the rlog or vst value or not...

If you can clarify it would be really helpful

640No, it is just in the DESeq2 protocol (and EdgeR). The statistical comparisons are conducted on the normalised, un-transformed counts, which follow a negative binomial distribution. DESeq2 derives p-values, generally, as follows:

*there are variations to thisOne can, of course, produce normalised, transformed counts, and perform their own analyses on these.

54k"No, it is just in the DESeq2 protocol (and EdgeR). The statistical comparisons are conducted on the normalised, un-transformed counts, which follow a negative binomial distribution. DESeq2 derives p-values, generally, as follows:

fit negative binomial regression model independently for each gene's normalised counts extract p-value from the model coefficient via the Wald test applied to the model" yes this part im clear as i read the same in the paper

"of course, produce normalised, transformed counts, and perform their own analyses on these." yes for this one as i get certain genes and i want to make comparison between biological sample .So if i do that comparison running some non parametric test then its not a problem , I guess

640I see. Which test are you doing?

54kI tried wilcoxon rank test as of now...

640Seems okay to me. Ask 10 people and you'll get 10 different answers, though. Each answer is based on the respective experience of the individual.

54kone doubt is clear now

640Dear Dr. Blighe

I use TPM(Transaction per million) method for normalizing my RNA-Seq data set. base on your perfect tutorial I ran

RegParallel()for getting survival analysis. I need your comment for 2 below questions:1- I use 'coxph' as FUNtype for the regression model. is it a suitable function for my problem. if no, which function is your suggestion?

2- As you know in literature, we have multivariate Cox regression and univariate Cox regression. based on

RegParallel(), our Survival Analysis is multivariate or univariate?It would be really helpful If you can clarify me.

110Yes,

coxphis the correct function. TPM is not too bad if you are testing each gene independently, i.e., univariate (in my tutorial, above, each gene is tested independently as part of a univariate Cox model);54kOk, Thanks for your comment. logically, doing multivariate Cox Regression for lots of genes(more than 150 genes) is true? if you agree, how can I run it? can you guide me by tutorial such as the above tutorial?

110If you want to run a multivariate Cox model with that many variables, then consider using the Lasso Cox model: https://cran.r-project.org/web/packages/glmnet/vignettes/Coxnet.pdf

54kDear Dr. Blighe

Really Thanks for your answer. But about my first question, I would like to explain more about my data set. It belongs to TCGA and I downloaded as UQ-FPKM. In RNA-seq analysis, this type of data set is normal. So, for using that I transformed it to Log2 space.

1- now, for using this data should I

`scale()`

for transformation to z-score?2- based on my explanationabout TCGA data, which functions are better:

`glm()`

or`glm.nb()`

?3- why you didn't use

`coxph()`

for RNA-seq expression data set in RegParallel vignett? can I use this function for my data set?I deeply appreciate if you share your comment with me.

Mohammad

110Sorry, this is not how Biostars functions. I expect you to read my comments and to then spend some time researching the answers to any further questions that you have. That is the best form of learning.

54kDear Dr.Blighe

Here you design Survival plot for 2 genes: 'MMP10' and 'CXCL12'. Suppose that we have a bunch of gene and after clustering we have n cluster. how can we design Surv plot for each cluster separately?

I appreciate if you share your solution with me

110What do you mean by '

n cluster'?54kn is number of cluster. For example 3 cluster(n=3).

110If you encode the gene's expression as a factor / categorical variable, then the survival function will plot a curve for each level

54kHi Kevin, I am curious to ask can we use Beta values for methylation from each probe instead of the read-count from gene expression.

If yes, these values are continuous and range from 0 to 1, would it be recommended to convert these also to Z score.

50Hey, yes, you could use the Beta values from methylation for the purposes of survival analysis. I think that it is okay to leave the values as 0 to 1.

54kThanks Kevin, I tried your suggestion and was able to identify prognostic CpG sites. I did the same using gene expression data and interestingly found some overlapping genes. But I am not very sure how to integrate these two results as methylation can regulate the expression of genes that are in trans. In order to address that, checking just the overlap would not work. I do not know how should I proceed. I will really appreciate if u can share your thoughts about it.

Thanks B

50Hi Dr. Blighe, My survplotdata is as below:

I used 0 as cut-offs for high and low expression

When I run this code:

I get this Error:

I appreciate if you guide me and share your comment for solving that Error with me.

110Take a look at

`?ifelse`

54kThanks, Dr. Blighe. I solved my problem but in the below code:

I get below Error:

I appreciate if you guide me and share your comment for solving that Error with me.

110Okay, please spend some more time to debug the error on your own. Check the encoding of your variables, and check what

`survfit()`

and`ggsurvplot()`

expect.54kHi Kevin, thanks for creating this package. I'm learning survival analysis, and am finding your tutorial is very helpful.

I was wondering regarding your suggestion to arrange the tests by log rank p value. From my understanding, the log rank test is computed comparing survival time between groups. So in the RegParallel function, is gene expression being dichotomized? If so, how exactly---is it using Z-score +/- 1? Moreover, because gene expression is continuous, would it not make sense to select 'statistically significant' genes based on p value (and adjust those instead of the log rank p value)?

Thanks, Victor

0No, the package just accepts whatever data that you use. It can be continuous or categorical. It is just in this tutorial that I dichotomise the gene expression values before using the RegfParallel package.

Moreover, because gene expression is continuous, would it not make sense to select 'statistically significant' genes based on p value (and adjust those instead of the log rank p value)?

You can do whatever approach seems valid to you. There is no correct or incorrect approach.

54kwithout clinical information this is not possible to do so isn;t it? If i look at the microarray data of liquid tumor they dont give information as such as you have used here. Is there still a way to run survival analysis ?

640Hey, what information do you have, exactly?

54kso far the microarray data for AML have checked are mostly array expression, they dont give the clinical information of the patients which in this case you have for the breast cancer data set.

640Hey kelvin, this is a great tutorial. This is my first time for this kinda analysis, can you please tell how to use data obtained from TCGA both count and clinical data for this analysis.

Thanks

10For quick and easy analysis, you can simply use a website like cBioPortal or oncolnc.org

If you want to do it yourself, here's a good tutorial: Survival analysis of TCGA patients integrating gene expression (RNASeq) data

900Hi Kevin,

Thank you very much for this helpful tutorial. I spent some time to figure out how to do this analysis before coming across your post. I'd appreciate if you can comment on my approach and please let me know if you find it inaccurate.

I downloaded TCGA RNAseq and miRNAseq data and used

`voom`

transformation as follows:Then I combined these normalized data with clinical parameters such as

`vital_status`

and`days_to_death`

to perform survival analysis. I'm recycling this code for 30 separate tumors as a general approach, thus I don't have a predetermined design. I am also trying to calculate correlations between protein-coding-gene vs miRNA pairs to find associations. For my purposes do you think`voom`

normalization is appropriate?Best, Atakan

190Hi Atakan, yes, if I was using data deriving from EdgeR, then I would use the 'voom' expression levels. That is, the voom levels would represent the '

coxdata' object in my tutorial.54k