Tutorial: Survival analysis with gene expression
80
gravatar for Kevin Blighe
20 months ago by
Kevin Blighe61k
Kevin Blighe61k wrote:

For this example, we will load GEO breast cancer gene expression data with recurrence free survival (RFS) from Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis. Specifically, we will encode each gene's expression into Low | Mid | High based on Z-scores and compare these against RFS in a Cox Proportional Hazards (Cox) survival model.

1, read in and prepare data

  library(Biobase)
  library(GEOquery)

  # load series and platform data from GEO
  gset <- getGEO('GSE2990', GSEMatrix =TRUE, getGPL=FALSE)
  x <- exprs(gset[[1]])

  # remove Affymetrix control probes
  x <- x[-grep('^AFFX', rownames(x)),]

  # transform the expression data to Z scores
  x <- t(scale(t(x)))

  # extract information of interest from the phenotype data (pdata)
  idx <- which(colnames(pData(gset[[1]])) %in%
    c('age:ch1', 'distant rfs:ch1', 'er:ch1',
      'ggi:ch1', 'grade:ch1', 'node:ch1',
      'size:ch1', 'time rfs:ch1'))

  metadata <- data.frame(pData(gset[[1]])[,idx],
    row.names = rownames(pData(gset[[1]])))

  # remove samples from the pdata that have any NA value
  discard <- apply(metadata, 1, function(x) any( is.na(x) ))
  metadata <- metadata[!discard,]

  # filter the Z-scores expression data to match the samples in our pdata
  x <- x[,which(colnames(x) %in% rownames(metadata))]

  # check that sample names match exactly between pdata and Z-scores 
  all((colnames(x) == rownames(metadata)) == TRUE)
  ## [1] TRUE

  # create a merged pdata and Z-scores object
  coxdata <- data.frame(metadata, t(x))

  # tidy column names
  colnames(coxdata)[1:8] <- c('Age', 'Distant.RFS', 'ER',
    'GGI', 'Grade', 'Node',
    'Size', 'Time.RFS')

  # prepare phenotypes
  coxdata$Distant.RFS <- as.numeric(coxdata$Distant.RFS)
  coxdata$Time.RFS <- as.numeric(gsub('^KJX|^KJ', '', coxdata$Time.RFS))
  coxdata$ER <- factor(coxdata$ER, levels = c(0, 1))
  coxdata$Grade <- factor(coxdata$Grade, levels = c(1, 2, 3))

2, test each gene independently via Cox regression

With the data prepared, we can now apply a Cox survival model independently for each gene (probe) in the dataset against RFS.

Here we will use RegParallel to fit the Cox model independently for each gene.

  library(survival)
  library(RegParallel)

  res <- RegParallel(
    data = coxdata,
    formula = 'Surv(Time.RFS, Distant.RFS) ~ [*]',
    FUN = function(formula, data)
      coxph(formula = formula,
        data = data,
        ties = 'breslow',
        singular.ok = TRUE),
    FUNtype = 'coxph',
    variables = colnames(coxdata)[9:ncol(coxdata)],
    blocksize = 2000,
    cores = 2,
    nestedParallel = FALSE,
    conflevel = 95)
  res <- res[!is.na(res$P),]
  res

3, annotate top hits with biomaRt

Filter by Log Rank p<0.01

  res <- res[order(res$LogRank, decreasing = FALSE),]
  final <- subset(res, LogRank < 0.01)
  probes <- gsub('^X', '', final$Variable)

  library(biomaRt)
  mart <- useMart('ENSEMBL_MART_ENSEMBL', host='useast.ensembl.org')
  mart <- useDataset("hsapiens_gene_ensembl", mart)
  annotLookup <- getBM(mart = mart,
    attributes = c('affy_hg_u133a',
      'ensembl_gene_id',
      'gene_biotype',
      'external_gene_name'),
    filter = 'affy_hg_u133a',
    values = probes,
    uniqueRows = TRUE)

  annotLookup

Two of the top hits include CXCL12 and MMP10. High expression of CXCL12 was associated with good progression free and overall survival in breast cancer in doi: 10.1016/j.cca.2018.05.041, whilst high expression of MMP10 was associated with poor prognosis in colon cancer in doi: 10.1186/s12885-016-2515-7.

4, encode statistically significant genes as Low | Mid | High and plot survival curves

  # extract RFS and probe data for downstream analysis
  survplotdata <- coxdata[,c('Time.RFS', 'Distant.RFS',
    'X203666_at', 'X205680_at')]

  colnames(survplotdata) <- c('Time.RFS', 'Distant.RFS',
    'CXCL12', 'MMP10')

  # set Z-scale cut-offs for high and low expression
  highExpr <- 1.0
  lowExpr <- -1.0
  survplotdata$CXCL12 <- ifelse(survplotdata$CXCL12 >= highExpr, 'High',
    ifelse(survplotdata$CXCL12 <= lowExpr, 'Low', 'Mid'))
  survplotdata$MMP10 <- ifelse(survplotdata$MMP10 >= highExpr, 'High',
    ifelse(survplotdata$MMP10 <= lowExpr, 'Low', 'Mid'))

  # relevel the factors to have mid as the ref level
  survplotdata$CXCL12 <- factor(survplotdata$CXCL12,
    levels = c('Mid', 'Low', 'High'))
  survplotdata$MMP10 <- factor(survplotdata$MMP10,
    levels = c('Mid', 'Low', 'High'))

  library(survminer)

  ggsurvplot(survfit(Surv(Time.RFS, Distant.RFS) ~ CXCL12,
    data = survplotdata),
    data = survplotdata,
    risk.table = TRUE,
    pval = TRUE,
    break.time.by = 500,
    ggtheme = theme_minimal(),
    risk.table.y.text.col = TRUE,
    risk.table.y.text = FALSE)

a

 ggsurvplot(survfit(Surv(Time.RFS, Distant.RFS) ~ MMP10,
    data = survplotdata),
    data = survplotdata,
    risk.table = TRUE,
    pval = TRUE,
    break.time.by = 500,
    ggtheme = theme_minimal(),
    risk.table.y.text.col = TRUE,
    risk.table.y.text = FALSE)

b

ADD COMMENTlink modified 16 days ago by Hamtaro10 • written 20 months ago by Kevin Blighe61k
1

Kevin,

Unless there is a problem on my end, I think something may have gotten deprecated here. Running code as is only gives me mid and high curves for both genes.

Inspection of

survplotdata <- coxdata[,c('Time.RFS', 'Distant.RFS', 'X203666_at', 'X205680_at')]

shows that no samples meet the -1 zscore low expression cutoff (as far as I can see). I don't really have any questions about this. I just thought I would point it out just in case it is a repeatable error.

ADD REPLYlink modified 14 months ago • written 14 months ago by curious320

Hey, that is strange - thanks for the alert. I also just re-ran my own code and observe the same 'phenomenon'. Nothing surprises me anymore in bioinformatics, though. Bioinformatics is like the Wild Wild West.

Here are the new survival curves for this tutorial:

CXCL12

1

-------------------------

MMP10

2

ADD REPLYlink modified 14 months ago • written 14 months ago by Kevin Blighe61k

Kevin,

I actually do have a quick question related to this now that I think about it (if you have time). I see you have your expression factor with three levels:

survplotdata$CXCL12 <- factor(survplotdata$CXCL12,
levels = c('Mid', 'Low', 'High'))

In theory this was supposed to produce three curves. Lets say I have a similar multi leveled expression factor that produces multiple curves and I want to do a test that makes a pairwise comparison of every single curve. ie low vs mid, mid vs high etc.

I should just be able to run this command at endpoint which as I understand gives a benjamini hochberg adjusted log-rank test p value for every possible comparison of the multiple curves.

pairwise_survdiff(Surv(Time.RFS, Distant.RFS) ~ CXCL12, data=survplotdata)

Is this reasonable?

Thank you

ADD REPLYlink modified 14 months ago • written 14 months ago by curious320

Yes, well, in the example above (my example), we could have done it better by dividing the expression range into tertiles to ensure that there would be at least 1 sample per group. I just chose a hard cut-off of Z=1, though. The tutorial is just to foment ideas, though.

I am not familiar with pairwise_survdiff() but it looks like a useful function. Thanks for mentioning it here.

ADD REPLYlink written 14 months ago by Kevin Blighe61k

Hi Kevin. Hope you good. I adapted your code to find the high, low and mid expressions of 14 genes. Can you please help me with a tutorial on how to conduct a pairwise survival plot possibly one that can pair say high level of TPL2 and VEGFA and low level of IGFBP3?

I already tried this but I didnt understand most of it

http://rstudio-pubs-static.s3.amazonaws.com/5896_8f0fed2ccbbd42489276e554a05af87e.html

Thanks.

ADD REPLYlink modified 4 days ago • written 4 days ago by silsie6450

I am unsure what you mean, but you can create a multivariate Cox model of the following form:

 ~ TPL2 + VEGFA + IGFBP3

...or, just create a new variable that contains every possible combinatino of high | low for these genes and then just use that in the Cox model.

ADD REPLYlink written 4 days ago by Kevin Blighe61k

Dear Dr. Kevin,

As in the K-M plot clear, after running ggsurvplot we plot Kaplan Meyer which we can see a p-value on it. Here for "MMP10", the p-value equals 0.00047 in your example. I ran the same as your code for my target gene and also ran the Cox Proportional-Hazards Model for that.

In my case, the p-value resulted from the Cox regression is 0.04 but the p-value resulted ggsurvplot for the K-M plot is about 0.1. based on Cox's p-value my study is significant but based on the K-M plot p-value isn't(greater than 0.05). is that results logically acceptable? If yes which p-value should be ignored and which one accepted?

I appreciate it if you share your comment with me.

ADD REPLYlink modified 5 months ago • written 5 months ago by modarzi120

Please show the exact code that you have used in order to clearly show from where you are deriving your p-values.

ADD REPLYlink written 5 months ago by Kevin Blighe61k

For cox regression I use below code:

     fit_SARC_turquoise <- coxph(Surv(coxSARCdata$OS.days, coxSARCdata$OS.status)~MEturquoise, data=coxSARCdata)
summary(fit_SARC_turquoise)

And by runnig that code I got below result:

    n= 71, number of events= 26 

                coef      exp(coef)     se(coef)      z        Pr(>|z|)  
     MEturquoise -0.4242    0.6543     0.2058        -2.061   0.0393 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

               exp(coef)  exp(-coef)  lower .95   upper .95
MEturquoise    0.6543      1.528       0.4371        0.9794

Concordance= 0.602  (se = 0.07 )
Likelihood ratio test= 4     on 1 df,   p=0.05
Wald test            = 4.25  on 1 df,   p=0.04
Score (logrank) test = 4.34  on 1 df,   p=0.04

As you see the P-Value(Pr(>|z|)) equal 0.0393. now in the following I performed K-M plot generating code:

#peparing subsetting MEturquoise from coxSARCdata
survplotSARCturquoisedata <- coxSARCdata[,c('OS.days', 'OS.status',
                                        'MEturquoise')]

survplotSARCturquoisedata$MEturquoise <- ifelse(survplotSARCturquoisedata$MEturquoise > highExpr, 'High','Low')

# plot based on Kaplan-Meier plot
ggsurvplot(survfit(Surv(OS.days, OS.status)~ MEturquoise, data=survplotSARCturquoisedata),
           data = survplotSARCturquoisedata,
           risk.table = TRUE,
           pval = TRUE,
           break.time.by = 500,
           ggtheme = theme_minimal(),
           risk.table.y.text.col = TRUE,
           risk.table.y.text = FALSE)

So, in the following link the result of K-M plot is accecible. and you can see P-value in the plot equals 0.25:

https://www.dropbox.com/s/8rn89ithvqfyfqk/Rplot_K-M_MEturquoise_OS_981018.bmp?dl=0

I appreciate it if you share your comment with me

ADD REPLYlink modified 5 months ago • written 5 months ago by modarzi120

These are different functions, so, you should not expect that they return the same p-values. Take a look here:

ADD REPLYlink written 5 months ago by Kevin Blighe61k

Dear Dr. Blighe Thanks for your comment. but as I wrote in the last line of summary(fit_SARC_turquoise) result you can find Score (log rank) test in which the p-value equals 0.04 by 1 df. but this log rank p-value is different from p-value in K-M plot in this link: https://www.dropbox.com/s/8rn89ithvqfyfqk/Rplot_K-M_MEturquoise_OS_981018.bmp?dl=0

I appreciate it if you share your comment with me

ADD REPLYlink modified 5 months ago • written 5 months ago by modarzi120

Is survplotSARCturquoisedata the exact same as coxSARCdata?

ADD REPLYlink written 5 months ago by Kevin Blighe61k

No, because coxSARCdata has a few columns and survplotSARCturquoisedata is a subset of coxSARCdata. you mean for that reason they don't have similar P-value. Ok, Dear Dr. Blighe, how can I interpret this unsimilarity of 2 log-rank P-value resulted from the Cox regression and K-M plot? Can I insert P-value resulted from Cox regression in the K-M plot picture instead K-M plot P-value?

ADD REPLYlink written 5 months ago by modarzi120

Yes, you can add any p-value to the K-M plot - all that you need to do is:

...
pval = '0.04',
...

However, you need to be sure that this is the correct thing to do.

ADD REPLYlink written 5 months ago by Kevin Blighe61k

Dear Dr. Blighe, I have 2 more questions:

1- I need to show K-M plots for 7 genes in one picture. How can I do it?

2- I need to resize of Font of labels(Survival probability, time,..) in the K-M plot.

I appreciate it if you guide me that how can I do them via my code.

ADD REPLYlink modified 5 months ago • written 5 months ago by modarzi120

1- I need to show K-M plots for 7 genes in one picture. How can I do it?

2- I need to resize of Font of labels(Survival probability, time,..) in the K-M plot.

ADD REPLYlink written 5 months ago by Kevin Blighe61k

Does that help?

ADD REPLYlink written 5 months ago by Kevin Blighe61k

So I tried to perfom this analysis with my data:

#loading data from GEO gset <- getGEO('GSE17536', GSEMatrix =TRUE, getGPL=FALSE) x<-exprs(gset[[1]])

#removing Affymetrix control probes 
x<-x[-grep('^AFFX', rownames(x)),] 

#transforming expression data to Z scores 
x<-t(scale(t(x)))

# extract information of interest from the phenotype data (pdata)
idx<-which(colnames(pData(gset[[1]])) %in% c('age:ch1', 'ajcc_stage:ch1', 'dfs_event (disease free survival; cancer recurrence):ch1', 'dfs_time:ch1', 'gender:ch1', 'overall survival follow-up time:ch1', 'overall_event (death from any cause):ch1'))
metadata<-data.frame(pData(gset[[1]])[,idx],row.names=rownames(pData(gset[[1]])))

#removing samples with NA values
discard<-apply(metadata,1,function(x) anyis.na(x))) 
metadata<-metadata[!discard,] 

#filter the Z-scores expression data to match the samples in our data 
x<-x[,which(colnames(x) %in% rownames(metadata))] 

#checking that sample names match exactly bewteen pdata and Z-scores 
all((colnames(x)==rownames(metadata))==TRUE) 

#converting data to 0/1
metadata[1][metadata[1]=="recurrence"] <- 1
metadata[1][metadata[1]=="no recurrence"] <- 0

#creating merged data and Z-scores 
coxdata<-data.frame(metadata,t(x)) 

#tidying column names 
colnames(coxdata)[1:7]<-c('Age', 'Ajcc_stage', 'dfs_event', 'dfs_time', 'Gender', 'OverallSurvival_time', 'Overall_event')

#preparing phenotypes 
coxdata$dfs_event<-as.numeric(coxdata$dfs_event)
coxdata$dfs_time<-as.numeric(coxdata$dfs_time)
coxdata$OverallSurvival_time<-as.numeric(coxdata$OverallSurvival_time)
coxdata$Overall_event<-as.numeric(coxdata$Overall_event)

#testing each gene with the data (RFS first)
Res1<-RegParallel(data=coxdata,formula='Surv(dfs_time, dfs_event)~[*]',FUN=function(formula,data) coxph(formula=formula, data=data, ties='breslow',singular.ok=TRUE), FUNtype='coxph', variables=colnames(coxdata)[8:ncol(coxdata)],blocksize = 2000, cores=2,nestedParallel=FALSE, conflevel=95)

But I keep getting

index1: 54001; index2: 54613 Error in { : task 1 failed - "No (non-missing) observations" after the RegParallel command. Can you tell me why please? P. S: the dataset recorded dfs_event as 'recurrence' and 'no recurrence' and Overall_event as 'death' and 'no death'

ADD REPLYlink written 11 days ago by silsie6450

Hey, I think that it means that you have a variable that has no values, i.e., a variable that has only NA or infinite values, Have you screened your input data to ensure that all variables are complete?

ADD REPLYlink modified 11 days ago • written 11 days ago by Kevin Blighe61k

Hi I realised that whenever I executed the commands:

#preparing phenotypes 
coxdata$dfs_event<-as.numeric(coxdata$dfs_event)
coxdata$Overall_event<-as.numeric(coxdata$Overall_event),

the values for these columns would all change to NA. I did this a number of times and got the same result. Please do you know why this keeps happening?

ADD REPLYlink written 10 days ago by silsie6450

Please ignore the comma at the end of the code.

ADD REPLYlink written 10 days ago by silsie6450

That's a change introduced in R 4.0.0. I will have to modify the tutorial code.

You should try:

coxdata$dfs_event <- as.numeric(as.character(coxdata$dfs_event))
coxdata$Overall_event <- as.numeric(as.character(coxdata$Overall_event))
ADD REPLYlink modified 10 days ago • written 10 days ago by Kevin Blighe61k

Hey I tried that as well after seeing on a platform like this but I got the same response. Tried again this morning and got the same NA problem.

ADD REPLYlink written 10 days ago by silsie6450

I also restarted R and re-executed the codes but I keep getting the same response.

ADD REPLYlink written 10 days ago by silsie6450

Hello again. What is the output of:

str(coxdata$dfs_event)
ADD REPLYlink written 9 days ago by Kevin Blighe61k
num [1:177] NA NA NA NA NA NA NA NA NA NA ...
ADD REPLYlink written 9 days ago by silsie6450

I see, but this is not an issue with my tutorial. You need to properly encode your DFS variables. Here is the pData for your dataset:

df[,grep('dfs', colnames(df))]
          dfs_event (disease free survival; cancer recurrence):ch1 dfs_time:ch1
GSM437093                                            no recurrence       142.55
GSM437094                                            no recurrence       122.72
GSM437095                                            no recurrence        28.96
GSM437096                                            no recurrence       119.21
GSM437097                                            no recurrence        59.53
GSM437098                                               recurrence        27.02
GSM437099                                            no recurrence        82.29
GSM437100                                            no recurrence       109.21
ADD REPLYlink written 9 days ago by Kevin Blighe61k

Hello Kevin. Sorry am quite new to R. Please what do you mean when by properly encoding my DFS variables. I also tried to execute the code above and I got this instead:

> df[,grep('dfs',colnames(df))]
Error in df[, grep("dfs", colnames(df))] : 
  object of type 'closure' is not subsettable
ADD REPLYlink written 9 days ago by silsie6450

I see.. trying to adapt this tutorial to your own data will prove difficult for people who are new to R.I recommend that you first go through the entire tutorial as I have presented (above) - in this way, you will be better equipped to later adapt the code to your own data.

ADD REPLYlink written 9 days ago by Kevin Blighe61k

Hi Kevin, I read the as.numeric(as.character(x)) converts my data from factor to character and then to numeric. So I tried this code:

dfs_event<-as.numeric(as.factor(metadata$dfs_event..disease.free.survival..cancer.recurrence..ch1))

hoping that the data will be converted from character to factor to numeric.

And I got this response: dfs_event

[1] 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 2 1 3
 [33] 2 2 3 2 2 3 2 2 3 2 2 2 2 2 3 2 2 3 3 2 3 2 2 2 2 2 2 2 2 2 2 2
 [65] 2 2 2 2 2 2 2 2 1 2 2 2 3 2 2 2 2 3 2 3 3 2 2 3 3 3 3 3 2 2 3 3
 [97] 3 2 2 2 2 2 2 2 3 2 3 3 3 2 2 1 2 2 2 2 2 2 3 2 2 3 2 2 2 2 2 2
[129] 2 3 3 2 2 2 3 2 2 2 1 2 1 1 1 1 1 2 1 1 1 1 3 1 1 1 1 3 2 1 1 1
[161] 1 1 2 1 1 1 1 1 3 1 1 2 2 3 1 1 1

where 1: NA, 2: no recurrence, 3: recurrence. Am wondering if this will this affect my COX analysis?

ADD REPLYlink written 9 days ago by silsie6450

Hello again. The Cox regression function that is used in this tutorial requires data to be:

0, no recurrence / no death
1, recurrence / death

As per my own code (above):

coxdata$Distant.RFS 
  [1] 0 0 1 1 1 0 1 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0
 [38] 0 0 0 0 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0 1 0 1 0 1 1
 [75] 1 0 0 1 1 1 0 1 0 1 1 1 0 1 1 0 1 1 0 1 0 0 1 0 0 1 0 1 0 0 0 1 0 0 1 0 0
[112] 0 1 1 1 0 1 0 0 0 0 0 1 0 1

You will have to encode your variable as 0 and 1.

Take a look at the sub() and gsub() functions

ADD REPLYlink written 9 days ago by Kevin Blighe61k

Hi. So this is what I eventually and it seemed to work:

coxdata$dfs_event<- as.numeric(as.factor(coxdata$dfs_event))
coxdata$dfs_event<-sub(2,1,coxdata$dfs_event)
coxdata$dfs_event<-sub(3,0,coxdata$dfs_event)
coxdata$dfs_event<-sub(1,'NA',coxdata$dfs_event)
coxdata$dfs_event<-as.numeric(coxdata$dfs_event)
coxdata$dfs_time <-as.numeric(coxdata$dfs_time)
coxdata$OverallSurvival_time<-as.numeric(coxdata$OverallSurvival_time)
coxdata$Overall_event<- as.numeric(as.factor(coxdata$Overall_event))
coxdata$Overall_event<-sub(2,0,coxdata$Overall_event)
coxdata$Overall_event<-as.numeric(coxdata$Overall_event)
ADD REPLYlink written 9 days ago by silsie6450

Sure, but, where you use as.numeric(as.factor()) together in this way, you need to be careful about how it converts the factors into numbers - the behaviour may not always be what you expect

ADD REPLYlink written 9 days ago by Kevin Blighe61k

Hi Kevin. This may seem odd but I will like to know how R interprets:

ggsurvplot(survfit(Surv(OverallSurvival_months, Death)~TPL2...

and

ggsurvplot(survfit(Surv(OverallSurvival_months)~TPL2...

This is because when I used the second to plot a that had a p value of 0.0024 making the relation significant (which was expected) but the first plot gave a p value of 0.32. I got the first code from a friend who was helping me out.

ADD REPLYlink modified 8 days ago • written 8 days ago by silsie6450

I would indeed expect different p-values here because the parameters that are passed to Surv() are interpreted differently based on how many are passed. Take a look at ?Surv, or here: https://www.rdocumentation.org/packages/survival/versions/3.2-3/topics/Surv

ADD REPLYlink written 8 days ago by Kevin Blighe61k
1

@Kevin Blighe:

Many thanks for your community contribution in Biostars, this thread is very informative and helpful to learn RNA-Seq analysis.

ADD REPLYlink written 12 months ago by Jurat Shahidin80

One typo was found: discard <- apply(metadata, 1, function(x) anyis.na(x))) should be discard <- apply(metadata, 1, function(x) anyis.na(x)))

ADD REPLYlink written 17 months ago by crazy_ice0

Thank you for noticing that. My raw code was actually correct - the error (the lack of an extra parenthesis, (), was introduced in the visual representation of my code by the Biostars rendering system. I have added a space, and it now looks fine.

ADD REPLYlink written 17 months ago by Kevin Blighe61k
1

Gud one Kevin. I have a question. On what basis Z-scale cutoff 1.0 is selected?

BTW In this tutorial [http://r-addict.com/2016/11/21/Optimal-Cutpoint-maxstat.html] they have used maxstat (Maximally selected rank statistics) for the cutpoint to classify samples into high and low.

ADD REPLYlink written 17 months ago by Biologist190
1

That looks like a good tutorial (through the link that you posted).

The selection of absolute Z=1 was just chosen as a very relaxed threshold for highly / lowly expressed. It is difficult to know where the exact cut-offs should be, and of course biology does not intuitively work on cut-off points. The tutorial above is for fomenting new ideas for survival analysis.

ADD REPLYlink modified 17 months ago • written 17 months ago by Kevin Blighe61k

Dear Dr. Blighe

Thank you for this tutorial. I have a question about using Scale() for transforming expression data to Z scores. basically, why do we need transforming to z scores while our original data(downloaded from GEO) is normal?

I appreciate if you share your comment with me.

Mohammad

ADD REPLYlink modified 17 months ago • written 17 months ago by modarzi120
3

Hello Mohammad. The conversion to Z-scores provides for an easier interpretation on the expression range for each gene. For example, on the Z-scale, we know that +3 equates to 3 standard deviations above the mean expression value in the dataset.

In a normal distribution that is not transformed to Z-scale, a value of 10, 20, 30, et cetera may mean nothing in the context of the expression range.

The relationship between a normal distribution and the Z-scale is emphasised in this beautiful figure:

d

[source: https://www.mathsisfun.com/data/standard-normal-distribution.html]

ADD REPLYlink written 17 months ago by Kevin Blighe61k

Dear Dr. Blighe

Thanks for your answer. I have another questions about your SA tutorial due to using RNA-seq expression data:

1-Generally, the measure of expression in RNA-seq is count and different from measure of expression in Microarray Technology. So, for using RNA-seq, Should I modify your survival analysis code? special in Standardization step?

2- honestly, I cant understand '~ [*]' in formula = 'Surv(Time.RFS, Distant.RFS) ~ [*]'

3- phenotype of my data set has fours fields: 'OS status','OS days','RFS status','RFS days'. So, based on RegParallel(), can I compute 'res' using my phenotype fields? if yes, how can I use these fields in RegParallel()?

I deeply appreciate if you share your comment with me.

Mohammad

ADD REPLYlink modified 17 months ago • written 17 months ago by modarzi120
2

1-Generally, the measure of expression in RNA-seq is count and different from measure of expression in Microarray Technology. So, for using RNA-seq, Should I modify your survival analysis code? special in Standardization step?

You should aim to transform your normalised RNA-seq counts via the variance-stabilised or regularised log transformation (if using DESeq2), or produce log CPM counts (if using EdgeR). Then, you can generally use glm(), as I use above. If you are aiming to use the normalised, un-transformed counts, then you could use the negative binomial regression via glm.nb() - this may be too advanced, though. Check the manual (via ?RegParallel) and vignette for RegParallel.

Remember that, in RNA-seq, the general process goes:

  1. raw counts -->
  2. normalised counts (statistical analyses performed on these) -->
  3. transformed, normalised counts (for downstream analyses, clustering, PCA, etc.)

2- honestly, I cant understand '~ [*]' in formula = 'Surv(Time.RFS, Distant.RFS) ~ [*]'

This is annotation specific to my package, RegParallel. Each gene will replace the [*] symbol as the package tests each gene in an independent model. Again, please read the manual and vignette.

3- phenotype of my data set has fours fields: 'OS status','OS days','RFS status','RFS days'. So, based on RegParallel(), can I compute 'res' using my phenotype fields? if yes, how can I use these fields in RegParallel()?

Then you are likely aiming to do a survival analysis. In that case, you can use coxph(). Your commands would be:

OS <- RegParallel(
    data = coxdata,
    formula = 'Surv(OS.days, OS.status) ~ [*]',
    FUN = function(formula, data)
      coxph(formula = formula,
        data = data,
        ties = 'breslow',
        singular.ok = TRUE),
    FUNtype = 'coxph',
    variables = colnames(coxdata),
    blocksize = 500,
    cores = 2,
    nestedParallel = FALSE,
    conflevel = 95)

...or:

RFS <- RegParallel(
    data = coxdata,
    formula = 'Surv(RFS.days, RFS.status) ~ [*]',
    FUN = function(formula, data)
      coxph(formula = formula,
        data = data,
        ties = 'breslow',
        singular.ok = TRUE),
    FUNtype = 'coxph',
    variables = colnames(coxdata),
    blocksize = 500,
    cores = 2,
    nestedParallel = FALSE,
    conflevel = 95)

Note, you will likely have to change the value to variables. Variables is a vector of gene names that you want to test. It can be any number. If using RegParallel, the idea is that you have hundreds or thousands or millions of genes to test.

ADD REPLYlink written 17 months ago by Kevin Blighe61k

"normalised counts (statistical analyses performed on these) -->" i have this doubt found that you mentioned about it , are you saying about this function "counts(dds, normalized=TRUE)" whose value can be used for any non parametric statistical lest?

As of now i used mostly rlog and vst value for clustering and pca etc . Not sure if i can use any test on the rlog or vst value , as im not sure if its mathematically correct thing to do with the rlog or vst value or not...

If you can clarify it would be really helpful

ADD REPLYlink written 16 months ago by krushnach80810
1

No, it is just in the DESeq2 protocol (and EdgeR). The statistical comparisons are conducted on the normalised, un-transformed counts, which follow a negative binomial distribution. DESeq2 derives p-values, generally, as follows:

  1. fit negative binomial regression model independently for each gene's normalised counts
  2. extract p-value from the model coefficient via the Wald test applied to the model

*there are variations to this

One can, of course, produce normalised, transformed counts, and perform their own analyses on these.

ADD REPLYlink modified 16 months ago • written 16 months ago by Kevin Blighe61k

"No, it is just in the DESeq2 protocol (and EdgeR). The statistical comparisons are conducted on the normalised, un-transformed counts, which follow a negative binomial distribution. DESeq2 derives p-values, generally, as follows:

fit negative binomial regression model independently for each gene's normalised counts extract p-value from the model coefficient via the Wald test applied to the model" yes this part im clear as i read the same in the paper

"of course, produce normalised, transformed counts, and perform their own analyses on these." yes for this one as i get certain genes and i want to make comparison between biological sample .So if i do that comparison running some non parametric test then its not a problem , I guess

ADD REPLYlink written 16 months ago by krushnach80810

I see. Which test are you doing?

ADD REPLYlink written 16 months ago by Kevin Blighe61k

I tried wilcoxon rank test as of now...

ADD REPLYlink written 16 months ago by krushnach80810
1

Seems okay to me. Ask 10 people and you'll get 10 different answers, though. Each answer is based on the respective experience of the individual.

ADD REPLYlink written 16 months ago by Kevin Blighe61k

one doubt is clear now

ADD REPLYlink written 16 months ago by krushnach80810

Dear Dr. Blighe

I use TPM(Transaction per million) method for normalizing my RNA-Seq data set. base on your perfect tutorial I ran RegParallel() for getting survival analysis. I need your comment for 2 below questions:

1- I use 'coxph' as FUNtype for the regression model. is it a suitable function for my problem. if no, which function is your suggestion?

2- As you know in literature, we have multivariate Cox regression and univariate Cox regression. based on RegParallel(), our Survival Analysis is multivariate or univariate?

It would be really helpful If you can clarify me.

ADD REPLYlink modified 12 months ago • written 12 months ago by modarzi120

Yes, coxph is the correct function. TPM is not too bad if you are testing each gene independently, i.e., univariate (in my tutorial, above, each gene is tested independently as part of a univariate Cox model);

ADD REPLYlink written 12 months ago by Kevin Blighe61k

Ok, Thanks for your comment. logically, doing multivariate Cox Regression for lots of genes(more than 150 genes) is true? if you agree, how can I run it? can you guide me by tutorial such as the above tutorial?

ADD REPLYlink written 12 months ago by modarzi120

If you want to run a multivariate Cox model with that many variables, then consider using the Lasso Cox model: https://cran.r-project.org/web/packages/glmnet/vignettes/Coxnet.pdf

ADD REPLYlink written 12 months ago by Kevin Blighe61k

Dear Dr. Blighe

Really Thanks for your answer. But about my first question, I would like to explain more about my data set. It belongs to TCGA and I downloaded as UQ-FPKM. In RNA-seq analysis, this type of data set is normal. So, for using that I transformed it to Log2 space.

1- now, for using this data should I scale() for transformation to z-score?

2- based on my explanationabout TCGA data, which functions are better: glm() or glm.nb()?

3- why you didn't use coxph() for RNA-seq expression data set in RegParallel vignett? can I use this function for my data set?

I deeply appreciate if you share your comment with me.

Mohammad

ADD REPLYlink modified 17 months ago • written 17 months ago by modarzi120

Sorry, this is not how Biostars functions. I expect you to read my comments and to then spend some time researching the answers to any further questions that you have. That is the best form of learning.

ADD REPLYlink written 17 months ago by Kevin Blighe61k

Dear Dr.Blighe

Here you design Survival plot for 2 genes: 'MMP10' and 'CXCL12'. Suppose that we have a bunch of gene and after clustering we have n cluster. how can we design Surv plot for each cluster separately?

I appreciate if you share your solution with me

ADD REPLYlink modified 17 months ago • written 17 months ago by modarzi120

What do you mean by 'n cluster'?

ADD REPLYlink written 17 months ago by Kevin Blighe61k

n is number of cluster. For example 3 cluster(n=3).

ADD REPLYlink written 17 months ago by modarzi120

If you encode the gene's expression as a factor / categorical variable, then the survival function will plot a curve for each level

ADD REPLYlink written 17 months ago by Kevin Blighe61k

Hi Kevin, I am curious to ask can we use Beta values for methylation from each probe instead of the read-count from gene expression.

If yes, these values are continuous and range from 0 to 1, would it be recommended to convert these also to Z score.

ADD REPLYlink written 16 months ago by Researcher50

Hey, yes, you could use the Beta values from methylation for the purposes of survival analysis. I think that it is okay to leave the values as 0 to 1.

ADD REPLYlink written 16 months ago by Kevin Blighe61k

Thanks Kevin, I tried your suggestion and was able to identify prognostic CpG sites. I did the same using gene expression data and interestingly found some overlapping genes. But I am not very sure how to integrate these two results as methylation can regulate the expression of genes that are in trans. In order to address that, checking just the overlap would not work. I do not know how should I proceed. I will really appreciate if u can share your thoughts about it.

Thanks B

ADD REPLYlink modified 16 months ago • written 16 months ago by Researcher50

Hi Dr. Blighe, My survplotdata is as below:

             OS.days     OS.status     MEbrown
TCGA_K1_A3PO    1622         1       -2.33294095
TCGA_IE_A4EI     594         1        0.58743015
TCGA_DX_A6B7    1143         1       -0.54661312
TCGA_HB_A5W3     384         0       -0.54716366
TCGA_WK_A8XS    2579         1        0.78227254
TCGA_X6_A8C5    1547         1       -1.38013989
TCGA_QQ_A5VC    1092         1       -0.35551612
TCGA_DX_A48U    3310         1       -1.10047385
TCGA_DX_A3UA     325         0       -0.31312280
TCGA_DX_A3UC    2085         1        1.06574187
TCGA_DX_A48R     805         1        0.88281156
TCGA_PC_A5DO    2464         0        0.06537203
TCGA_IE_A4EK      42         1       -0.80047920
TCGA_DX_A6Z2    3079         1       -0.16302605
TCGA_K1_A42W    1891         1        0.52228371

I used 0 as cut-offs for high and low expression

 highExpr <- 0
 lowExpr <- 0

When I run this code:

survplotdata$MEbrown <- ifelse(survplotdata$MEbrown > highExpr, 'High',
                                  ifelse(survplotdata < lowExpr, 'Low'))

I get this Error:

Error in ifelse(survplotdata < lowExpr, "Low") : 
  argument "no" is missing, with no default

I appreciate if you guide me and share your comment for solving that Error with me.

ADD REPLYlink written 12 months ago by modarzi120
1

Take a look at ?ifelse

ADD REPLYlink written 12 months ago by Kevin Blighe61k

Thanks, Dr. Blighe. I solved my problem but in the below code:

ggsurvplot(survfit(Surv(OS.days, OS.status) ~ MEbrown,
                   data = survplotdata),
           data = survplotdata,
           risk.table = TRUE,
           pval = TRUE,
           break.time.by = 500,
           ggtheme = theme_minimal(),
           risk.table.y.text.col = TRUE,
           risk.table.y.text = FALSE)

I get below Error:

Error in as.data.frame.default(x[[i]], optional = TRUE) : 
  cannot coerce class ‘"by"’ to a data.frame
In addition: Warning messages:
1: In .pvalue(fit, data = data, method = method, pval = pval, pval.coord = pval.coord,  :
  There are no survival curves to be compared. 
 This is a null model.

I appreciate if you guide me and share your comment for solving that Error with me.

ADD REPLYlink written 12 months ago by modarzi120

Okay, please spend some more time to debug the error on your own. Check the encoding of your variables, and check what survfit() and ggsurvplot() expect.

ADD REPLYlink written 12 months ago by Kevin Blighe61k

Hi Kevin, thanks for creating this package. I'm learning survival analysis, and am finding your tutorial is very helpful.

I was wondering regarding your suggestion to arrange the tests by log rank p value. From my understanding, the log rank test is computed comparing survival time between groups. So in the RegParallel function, is gene expression being dichotomized? If so, how exactly---is it using Z-score +/- 1? Moreover, because gene expression is continuous, would it not make sense to select 'statistically significant' genes based on p value (and adjust those instead of the log rank p value)?

Thanks, Victor

ADD REPLYlink written 12 months ago by victor.2wy0

So in the RegParallel function, is gene expression being dichotomized? If so, how exactly---is it using Z-score +/- 1?

No, the package just accepts whatever data that you use. It can be continuous or categorical. It is just in this tutorial that I dichotomise the gene expression values before using the RegfParallel package.

Moreover, because gene expression is continuous, would it not make sense to select 'statistically significant' genes based on p value (and adjust those instead of the log rank p value)?

You can do whatever approach seems valid to you. There is no correct or incorrect approach.

ADD REPLYlink written 11 months ago by Kevin Blighe61k

without clinical information this is not possible to do so isn;t it? If i look at the microarray data of liquid tumor they dont give information as such as you have used here. Is there still a way to run survival analysis ?

ADD REPLYlink written 11 months ago by krushnach80810

Hey, what information do you have, exactly?

ADD REPLYlink written 11 months ago by Kevin Blighe61k

so far the microarray data for AML have checked are mostly array expression, they dont give the clinical information of the patients which in this case you have for the breast cancer data set.

ADD REPLYlink modified 11 months ago • written 11 months ago by krushnach80810

Hey kelvin, this is a great tutorial. This is my first time for this kinda analysis, can you please tell how to use data obtained from TCGA both count and clinical data for this analysis.

Thanks

ADD REPLYlink written 6 months ago by kartikayprasad10

For quick and easy analysis, you can simply use a website like cBioPortal or oncolnc.org

If you want to do it yourself, here's a good tutorial: Survival analysis of TCGA patients integrating gene expression (RNASeq) data

ADD REPLYlink written 6 months ago by dsull1.4k

Hi Kevin,

Thank you very much for this helpful tutorial. I spent some time to figure out how to do this analysis before coming across your post. I'd appreciate if you can comment on my approach and please let me know if you find it inaccurate.

I downloaded TCGA RNAseq and miRNAseq data and used voom transformation as follows:


dge_rna <- DGEList(counts = rna_assay, samples = rna_sample_meta)  
keep_rna <- filterByExpr(dge_rna)  
dge_rna <- dge_rna[keep,,keep.lib.sizes=FALSE]  
dge_rna <- calcNormFactors(dge_rna)  
v_rna <- voom(dge_rna, plot=F)
norm_rna <- v$E

dge_mir <- DGEList(counts = mirna_assay, samples = mirna_sample_meta)  
keep_mir <- filterByExpr(dge_mir)  
dge_mir <- dge_mir[keep,,keep.lib.sizes=FALSE]  
dge_mir <- calcNormFactors(dge_mir)  
v_mir <- voom(dge_mir, plot=F)
norm_mir <- v_mir$E

Then I combined these normalized data with clinical parameters such as vital_status and days_to_death to perform survival analysis. I'm recycling this code for 30 separate tumors as a general approach, thus I don't have a predetermined design. I am also trying to calculate correlations between protein-coding-gene vs miRNA pairs to find associations. For my purposes do you think voom normalization is appropriate?

Best, Atakan

ADD REPLYlink written 4 months ago by atakanekiz250

Hi Atakan, yes, if I was using data deriving from EdgeR, then I would use the 'voom' expression levels. That is, the voom levels would represent the 'coxdata' object in my tutorial.

ADD REPLYlink written 4 months ago by Kevin Blighe61k

Dear Kevin, excellent and comprehensive tutorial as always !! I have three quick questions regarding the implementation of your tutorial: briefly, based on the TCGA-GDC RNA-Seq dataset of breast cancer, i have identified a very small number of genes (~5) with significant differences in overall survival, based on the stratification of cancer samples as high vs low. My next goal is to search additional datasets-even microarrays-to test the same hypothesis, as also if subtype available to correlate it also with survival. Thus, my quick questions are the following:

1) Regarding the pre-processing of microarray data-you scaled only the data, as you have downloaded an already normalized gene expression matrix correct ?

2) I saw you have performed cox regression on relapse-free survival- checked also from the supplementary material, that some of the patients have not received any type of therapy-thus, from my goal and perspective, I can still perform survival using RFS, even to test if these genes exhibit a correlation with survival associated with therapy, even if it is not overall survival ?

3) Even if i have specific gene targets, I can still perform cox regression to investigate if these genes illustrate a significant outcome associated with survival ?

Best,

Efstathios

ADD REPLYlink written 4 months ago by svlachavas680
1

1) Regarding the pre-processing of microarray data-you scaled only the data, as you have downloaded an already normalized gene expression matrix correct ?

Yes, that is correct, i.e., the data is already normalised (and log [base 2] transformed).

2) I saw you have performed cox regression on relapse-free survival- checked also from the supplementary material, that some of the patients have not received any type of therapy-thus, from my goal and perspective, I can still perform survival using RFS, even to test if these genes exhibit a correlation with survival associated with therapy, even if it is not overall survival ?

Yes, you can perform survival analysis using any metric. It can be 'days to relapse', 'days to death', 'days to first disease occurrence', etc. The term 'survival' was always somewhat misleading.

3) Even if i have specific gene targets, I can still perform cox regression to investigate if these genes illustrate a significant outcome associated with survival ?

Yes, and you can include all genes in the same model, or test each gene independently, i.e., in separate models

ADD REPLYlink written 4 months ago by Kevin Blighe61k

Dear Kevin,

thank you very much for your answer !! do you think that based on the experimental design of this dataset-that is the majority of the patients have undergone initial therapy-RFS would be a more "robust" estimate of survival,as essentially if measuring overall survival, is more related to patients without any therapy ? and then I can assume if a statistically significant RFS survival appears, that any gene related is implicated in survival mechanisms related to therapy ? as a measure of resistance ?

ADD REPLYlink written 4 months ago by svlachavas680
1

I cannot confidently answer these follow up questions

ADD REPLYlink written 4 months ago by Kevin Blighe61k

Hi Kevin,

Great tutorial, thanks so much for taking the time to write and share it. I would like to ask a question just to clarify my understanding. Apologies if this is very simple/obvious, I am coming from a pure biology background with not much statistical training.

Am I correct in thinking your code is performing a univariate analysis on each gene? I can see the model is looping to test each variable separately, and that the variables are defined as each gene in the below line:

variables = colnames(coxdata)[9:ncol(coxdata)]

However I am struggling the understand, whether/where the phenotype data (age, ER status, grade etc) is being used by the model. Is it referenced by assigning the data as the full 'coxdata' dataframe, as below?

res <- RegParallel(
    data = coxdata,
    ...
    data = data,
...

If so, is this different from passing the phenotype data as an explicit variable(s) and performing a multivariate analysis on each gene in conjunction with the phenotype data?

I appreciate any advice or direction to further reading to improve my understanding!

Thanks again,

Sian

ADD REPLYlink written 8 weeks ago by sian.42.goldie0
1

Hey Sian, yes, it performs a univariate test on each gene / variable that is passed to the variables parameter. This is the same as any standard differential expression program.

If you want to adjust for a covariate, say, ER-status, then you would do something like:

res <- RegParallel(
    data = coxdata,
    formula = 'Surv(Time.RFS, Distant.RFS) ~ [*] + ERstatus',
    FUN = function(formula, data)
      coxph(formula = formula,
        data = data,
        ties = 'breslow',
        singular.ok = TRUE),
    FUNtype = 'coxph',
    variables = colnames(coxdata)[9:ncol(coxdata)],
    blocksize = 2000,
    cores = 2,
    nestedParallel = FALSE,
    conflevel = 95)

I'm aware that the syntax of this package's commands is not too easy to interpret but, in certain respects, I wanted it to be that way in order to avoid any mis-use. To use it, one has to have a general understanding of regression modeling, i suppose.

PS - that will output a line for ERstatus for each gene, so, you may want to automatically exclude those model terms via the excludeTerms parameter.

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by Kevin Blighe61k

Hello Dr. Kevin. Thank you very much for these tutorials. You helping thousands of students from all over the world (Here one from Spain).

I've adapted your code to my HTA 2.0 microarray studio. And I've gone from having 350 candidate genes to 35 genes that influence patient survival. My boss told me I might be able to reduce the number of genes using a multivariable model.

My question is whether your code can be used with a penalized COX multivariable model. The study I am doing is with prostate cancer, and I have many clinical factors that may be helpful (PSA, alkaline phosphatase etc.)

Thank you again

ADD REPLYlink modified 16 days ago • written 16 days ago by Hamtaro10

Hola, ¿cómo estás?

The idea of this tutorial is to perform Cox PH independently for each gene, i.e., it is univariate, and this can help to reduce a large number of variables, in your case, 350 to 35.

A penalised Cox regression would be multivariate and take all 350 genes concurrently. The 'final' list of genes would be those whose coefficients are not shrunk (reduced) to 0. You would do this via the glmnet package.

I think that both methods are compatible with each other. After you do the penalised Cox regression, you can still plot the survival curves for some of the genes that make it to the final list.

ADD REPLYlink modified 16 days ago • written 16 days ago by Kevin Blighe61k

Muy bien gracias! :P Thank you for your reply. Yes, I will do that. Do you know of any tutorials for doing the penalized Cox regression? I haven't found anything on the Internet applied to genes and clinical data.

ADD REPLYlink written 15 days ago by Hamtaro10

Yep / Sí, you could try this: https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html#cox

ADD REPLYlink written 15 days ago by Kevin Blighe61k

Thank you for you reply. I got it! But now, one more question. Now that I have the genes identified, I want to validate them with a validation set samples. What method would you use? I have taken my genes that affect patient survival and used them using the clinical data from the validation set patients, and nd I get a 0.9 AUC in ROC. But I think this method is not optimal, right?

ADD REPLYlink written 1 day ago by Hamtaro10

Not optimal in which way? You should derive the confidence intervals around the AUC, too.

ADD REPLYlink written 23 hours ago by Kevin Blighe61k

At first, I used that model with validation patient set to see if the ROC was still high. However, I read that this is not correct, as I am redoing the coefficients, not validating them.

To do a validation, I found this package that allows you to do internal and external validation. https://cran.r-project.org/web/packages/hdnom/vignettes/hdnom.html#2_build_survival_models

ADD REPLYlink written 20 hours ago by Hamtaro10

For that part, which is somewhat outside of my knowledge area, you may want to ask a question on a stats forum, like CrossValidated. I am actually only relatively recently working in internal and external calibration, so, I do not feel it is my place to provide advice right now.

ADD REPLYlink written 19 hours ago by Kevin Blighe61k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 645 users visited in the last hour