Tutorial: Survival analysis with gene expression
45
gravatar for Kevin Blighe
8 months ago by
Kevin Blighe43k
Republic of Ireland
Kevin Blighe43k wrote:

For this example, we will load GEO breast cancer gene expression data with recurrence free survival (RFS) from Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis. Specifically, we will encode each gene's expression into Low | Mid | High based on Z-scores and compare these against RFS in a Cox Proportional Hazards (Cox) survival model.

1, read in and prepare data

  library(Biobase)
  library(GEOquery)

  # load series and platform data from GEO
  gset <- getGEO('GSE2990', GSEMatrix =TRUE, getGPL=FALSE)
  x <- exprs(gset[[1]])

  # remove Affymetrix control probes
  x <- x[-grep('^AFFX', rownames(x)),]

  # transform the expression data to Z scores
  x <- t(scale(t(x)))

  # extract information of interest from the phenotype data (pdata)
  idx <- which(colnames(pData(gset[[1]])) %in%
    c('age:ch1', 'distant rfs:ch1', 'er:ch1',
      'ggi:ch1', 'grade:ch1', 'node:ch1',
      'size:ch1', 'time rfs:ch1'))

  metadata <- data.frame(pData(gset[[1]])[,idx],
    row.names = rownames(pData(gset[[1]])))

  # remove samples from the pdata that have any NA value
  discard <- apply(metadata, 1, function(x) any( is.na(x) ))
  metadata <- metadata[!discard,]

  # filter the Z-scores expression data to match the samples in our pdata
  x <- x[,which(colnames(x) %in% rownames(metadata))]

  # check that sample names match exactly between pdata and Z-scores 
  all((colnames(x) == rownames(metadata)) == TRUE)
  ## [1] TRUE

  # create a merged pdata and Z-scores object
  coxdata <- data.frame(metadata, t(x))

  # tidy column names
  colnames(coxdata)[1:8] <- c('Age', 'Distant.RFS', 'ER',
    'GGI', 'Grade', 'Node',
    'Size', 'Time.RFS')

  # prepare phenotypes
  coxdata$Distant.RFS <- as.numeric(coxdata$Distant.RFS)
  coxdata$Time.RFS <- as.numeric(gsub('^KJX|^KJ', '', coxdata$Time.RFS))
  coxdata$ER <- factor(coxdata$ER, levels = c(0, 1))
  coxdata$Grade <- factor(coxdata$Grade, levels = c(1, 2, 3))

2, test each gene independently via Cox regression

With the data prepared, we can now apply a Cox survival model independently for each gene (probe) in the dataset against RFS.

Here we will use RegParallel to fit the Cox model independently for each gene. RegParallel is accepted to Bioconductor (October 18, 2018), with official release in Bioconductor 3.8. To install dev version: devtools::install_github('kevinblighe/RegParallel')

  library(survival)
  library(RegParallel)

  res <- RegParallel(
    data = coxdata,
    formula = 'Surv(Time.RFS, Distant.RFS) ~ [*]',
    FUN = function(formula, data)
      coxph(formula = formula,
        data = data,
        ties = 'breslow',
        singular.ok = TRUE),
    FUNtype = 'coxph',
    variables = colnames(coxdata)[9:ncol(coxdata)],
    blocksize = 2000,
    cores = 2,
    nestedParallel = FALSE,
    conflevel = 95)
  res <- res[!is.na(res$P),]
  res

3, annotate top hits with biomaRt

Filter by Log Rank p<0.01

  res <- res[order(res$LogRank, decreasing = FALSE),]
  final <- subset(res, LogRank < 0.01)
  probes <- gsub('^X', '', final$Variable)

  library(biomaRt)
  mart <- useMart('ENSEMBL_MART_ENSEMBL', host='useast.ensembl.org')
  mart <- useDataset("hsapiens_gene_ensembl", mart)
  annotLookup <- getBM(mart = mart,
    attributes = c('affy_hg_u133a',
      'ensembl_gene_id',
      'gene_biotype',
      'external_gene_name'),
    filter = 'affy_hg_u133a',
    values = probes,
    uniqueRows = TRUE)

  annotLookup

Two of the top hits include CXCL12 and MMP10. High expression of CXCL12 was associated with good progression free and overall survival in breast cancer in doi: 10.1016/j.cca.2018.05.041, whilst high expression of MMP10 was associated with poor prognosis in colon cancer in doi: 10.1186/s12885-016-2515-7.

4, encode statistically significant genes as Low | Mid | High and plot survival curves

  # extract RFS and probe data for downstream analysis
  survplotdata <- coxdata[,c('Time.RFS', 'Distant.RFS',
    'X203666_at', 'X205680_at')]

  colnames(survplotdata) <- c('Time.RFS', 'Distant.RFS',
    'CXCL12', 'MMP10')

  # set Z-scale cut-offs for high and low expression
  highExpr <- 1.0
  lowExpr <- -1.0
  survplotdata$CXCL12 <- ifelse(survplotdata$CXCL12 >= highExpr, 'High',
    ifelse(survplotdata$CXCL12 <= lowExpr, 'Low', 'Mid'))
  survplotdata$MMP10 <- ifelse(survplotdata$MMP10 >= highExpr, 'High',
    ifelse(survplotdata$MMP10 <= lowExpr, 'Low', 'Mid'))

  # relevel the factors to have mid as the ref level
  survplotdata$CXCL12 <- factor(survplotdata$CXCL12,
    levels = c('Mid', 'Low', 'High'))
  survplotdata$MMP10 <- factor(survplotdata$MMP10,
    levels = c('Mid', 'Low', 'High'))

  library(survminer)

  ggsurvplot(survfit(Surv(Time.RFS, Distant.RFS) ~ CXCL12,
    data = survplotdata),
    data = survplotdata,
    risk.table = TRUE,
    pval = TRUE,
    break.time.by = 500,
    ggtheme = theme_minimal(),
    risk.table.y.text.col = TRUE,
    risk.table.y.text = FALSE)

a

 ggsurvplot(survfit(Surv(Time.RFS, Distant.RFS) ~ MMP10,
    data = survplotdata),
    data = survplotdata,
    risk.table = TRUE,
    pval = TRUE,
    break.time.by = 500,
    ggtheme = theme_minimal(),
    risk.table.y.text.col = TRUE,
    risk.table.y.text = FALSE)

b

ADD COMMENTlink modified 5 weeks ago by brett.vanderwerff120 • written 8 months ago by Kevin Blighe43k
1

Kevin,

Unless there is a problem on my end, I think something may have gotten deprecated here. Running code as is only gives me mid and high curves for both genes.

Inspection of

survplotdata <- coxdata[,c('Time.RFS', 'Distant.RFS', 'X203666_at', 'X205680_at')]

shows that no samples meet the -1 zscore low expression cutoff (as far as I can see). I don't really have any questions about this. I just thought I would point it out just in case it is a repeatable error.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by brett.vanderwerff120

Hey, that is strange - thanks for the alert. I also just re-ran my own code and observe the same 'phenomenon'. Nothing surprises me anymore in bioinformatics, though. Bioinformatics is like the Wild Wild West.

Here are the new survival curves for this tutorial:

CXCL12

1

-------------------------

MMP10

2

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by Kevin Blighe43k

Kevin,

I actually do have a quick question related to this now that I think about it (if you have time). I see you have your expression factor with three levels:

survplotdata$CXCL12 <- factor(survplotdata$CXCL12,
levels = c('Mid', 'Low', 'High'))

In theory this was supposed to produce three curves. Lets say I have a similar multi leveled expression factor that produces multiple curves and I want to do a test that makes a pairwise comparison of every single curve. ie low vs mid, mid vs high etc.

I should just be able to run this command at endpoint which as I understand gives a benjamini hochberg adjusted log-rank test p value for every possible comparison of the multiple curves.

pairwise_survdiff(Surv(Time.RFS, Distant.RFS) ~ CXCL12, data=survplotdata)

Is this reasonable?

Thank you

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by brett.vanderwerff120

Yes, well, in the example above (my example), we could have done it better by dividing the expression range into tertiles to ensure that there would be at least 1 sample per group. I just chose a hard cut-off of Z=1, though. The tutorial is just to foment ideas, though.

I am not familiar with pairwise_survdiff() but it looks like a useful function. Thanks for mentioning it here.

ADD REPLYlink written 5 weeks ago by Kevin Blighe43k

One typo was found: discard <- apply(metadata, 1, function(x) anyis.na(x))) should be discard <- apply(metadata, 1, function(x) anyis.na(x)))

ADD REPLYlink written 4 months ago by crazy_ice0

Thank you for noticing that. My raw code was actually correct - the error (the lack of an extra parenthesis, (), was introduced in the visual representation of my code by the Biostars rendering system. I have added a space, and it now looks fine.

ADD REPLYlink written 4 months ago by Kevin Blighe43k
1

Gud one Kevin. I have a question. On what basis Z-scale cutoff 1.0 is selected?

BTW In this tutorial [http://r-addict.com/2016/11/21/Optimal-Cutpoint-maxstat.html] they have used maxstat (Maximally selected rank statistics) for the cutpoint to classify samples into high and low.

ADD REPLYlink written 4 months ago by Biologist150
1

That looks like a good tutorial (through the link that you posted).

The selection of absolute Z=1 was just chosen as a very relaxed threshold for highly / lowly expressed. It is difficult to know where the exact cut-offs should be, and of course biology does not intuitively work on cut-off points. The tutorial above is for fomenting new ideas for survival analysis.

ADD REPLYlink modified 4 months ago • written 4 months ago by Kevin Blighe43k

Dear Dr. Blighe

Thank you for this tutorial. I have a question about using Scale() for transforming expression data to Z scores. basically, why do we need transforming to z scores while our original data(downloaded from GEO) is normal?

I appreciate if you share your comment with me.

Mohammad

ADD REPLYlink modified 4 months ago • written 4 months ago by modarzi70
2

Hello Mohammad. The conversion to Z-scores provides for an easier interpretation on the expression range for each gene. For example, on the Z-scale, we know that +3 equates to 3 standard deviations above the mean expression value in the dataset.

In a normal distribution that is not transformed to Z-scale, a value of 10, 20, 30, et cetera may mean nothing in the context of the expression range.

The relationship between a normal distribution and the Z-scale is emphasised in this beautiful figure:

d

[source: https://www.mathsisfun.com/data/standard-normal-distribution.html]

ADD REPLYlink written 4 months ago by Kevin Blighe43k

Dear Dr. Blighe

Thanks for your answer. I have another questions about your SA tutorial due to using RNA-seq expression data:

1-Generally, the measure of expression in RNA-seq is count and different from measure of expression in Microarray Technology. So, for using RNA-seq, Should I modify your survival analysis code? special in Standardization step?

2- honestly, I cant understand '~ [*]' in formula = 'Surv(Time.RFS, Distant.RFS) ~ [*]'

3- phenotype of my data set has fours fields: 'OS status','OS days','RFS status','RFS days'. So, based on RegParallel(), can I compute 'res' using my phenotype fields? if yes, how can I use these fields in RegParallel()?

I deeply appreciate if you share your comment with me.

Mohammad

ADD REPLYlink modified 4 months ago • written 4 months ago by modarzi70
2

1-Generally, the measure of expression in RNA-seq is count and different from measure of expression in Microarray Technology. So, for using RNA-seq, Should I modify your survival analysis code? special in Standardization step?

You should aim to transform your normalised RNA-seq counts via the variance-stabilised or regularised log transformation (if using DESeq2), or produce log CPM counts (if using EdgeR). Then, you can generally use glm(), as I use above. If you are aiming to use the normalised, un-transformed counts, then you could use the negative binomial regression via glm.nb() - this may be too advanced, though. Check the manual (via ?RegParallel) and vignette for RegParallel.

Remember that, in RNA-seq, the general process goes:

  1. raw counts -->
  2. normalised counts (statistical analyses performed on these) -->
  3. transformed, normalised counts (for downstream analyses, clustering, PCA, etc.)

2- honestly, I cant understand '~ [*]' in formula = 'Surv(Time.RFS, Distant.RFS) ~ [*]'

This is annotation specific to my package, RegParallel. Each gene will replace the [*] symbol as the package tests each gene in an independent model. Again, please read the manual and vignette.

3- phenotype of my data set has fours fields: 'OS status','OS days','RFS status','RFS days'. So, based on RegParallel(), can I compute 'res' using my phenotype fields? if yes, how can I use these fields in RegParallel()?

Then you are likely aiming to do a survival analysis. In that case, you can use coxph(). Your commands would be:

OS <- RegParallel(
    data = coxdata,
    formula = 'Surv(OS.days, OS.status) ~ [*]',
    FUN = function(formula, data)
      coxph(formula = formula,
        data = data,
        ties = 'breslow',
        singular.ok = TRUE),
    FUNtype = 'coxph',
    variables = colnames(coxdata),
    blocksize = 500,
    cores = 2,
    nestedParallel = FALSE,
    conflevel = 95)

...or:

RFS <- RegParallel(
    data = coxdata,
    formula = 'Surv(RFS.days, RFS.status) ~ [*]',
    FUN = function(formula, data)
      coxph(formula = formula,
        data = data,
        ties = 'breslow',
        singular.ok = TRUE),
    FUNtype = 'coxph',
    variables = colnames(coxdata),
    blocksize = 500,
    cores = 2,
    nestedParallel = FALSE,
    conflevel = 95)

Note, you will likely have to change the value to variables. Variables is a vector of gene names that you want to test. It can be any number. If using RegParallel, the idea is that you have hundreds or thousands or millions of genes to test.

ADD REPLYlink written 4 months ago by Kevin Blighe43k

"normalised counts (statistical analyses performed on these) -->" i have this doubt found that you mentioned about it , are you saying about this function "counts(dds, normalized=TRUE)" whose value can be used for any non parametric statistical lest?

As of now i used mostly rlog and vst value for clustering and pca etc . Not sure if i can use any test on the rlog or vst value , as im not sure if its mathematically correct thing to do with the rlog or vst value or not...

If you can clarify it would be really helpful

ADD REPLYlink written 3 months ago by krushnach80500
1

No, it is just in the DESeq2 protocol (and EdgeR). The statistical comparisons are conducted on the normalised, un-transformed counts, which follow a negative binomial distribution. DESeq2 derives p-values, generally, as follows:

  1. fit negative binomial regression model independently for each gene's normalised counts
  2. extract p-value from the model coefficient via the Wald test applied to the model

*there are variations to this

One can, of course, produce normalised, transformed counts, and perform their own analyses on these.

ADD REPLYlink modified 3 months ago • written 3 months ago by Kevin Blighe43k

"No, it is just in the DESeq2 protocol (and EdgeR). The statistical comparisons are conducted on the normalised, un-transformed counts, which follow a negative binomial distribution. DESeq2 derives p-values, generally, as follows:

fit negative binomial regression model independently for each gene's normalised counts extract p-value from the model coefficient via the Wald test applied to the model" yes this part im clear as i read the same in the paper

"of course, produce normalised, transformed counts, and perform their own analyses on these." yes for this one as i get certain genes and i want to make comparison between biological sample .So if i do that comparison running some non parametric test then its not a problem , I guess

ADD REPLYlink written 3 months ago by krushnach80500

I see. Which test are you doing?

ADD REPLYlink written 3 months ago by Kevin Blighe43k

I tried wilcoxon rank test as of now...

ADD REPLYlink written 3 months ago by krushnach80500
1

Seems okay to me. Ask 10 people and you'll get 10 different answers, though. Each answer is based on the respective experience of the individual.

ADD REPLYlink written 3 months ago by Kevin Blighe43k

one doubt is clear now

ADD REPLYlink written 3 months ago by krushnach80500

Dear Dr. Blighe

I use TPM(Transaction per million) method for normalizing my RNA-Seq data set. base on your perfect tutorial I ran RegParallel() for getting survival analysis. I need your comment for 2 below questions:

1- I use 'coxph' as FUNtype for the regression model. is it a suitable function for my problem. if no, which function is your suggestion?

2- As you know in literature, we have multivariate Cox regression and univariate Cox regression. based on RegParallel(), our Survival Analysis is multivariate or univariate?

It would be really helpful If you can clarify me.

ADD REPLYlink modified 1 day ago • written 1 day ago by modarzi70

Yes, coxph is the correct function. TPM is not too bad if you are testing each gene independently, i.e., univariate (in my tutorial, above, each gene is tested independently as part of a univariate Cox model);

ADD REPLYlink written 1 day ago by Kevin Blighe43k

Ok, Thanks for your comment. logically, doing multivariate Cox Regression for lots of genes(more than 150 genes) is true? if you agree, how can I run it? can you guide me by tutorial such as the above tutorial?

ADD REPLYlink written 1 day ago by modarzi70

If you want to run a multivariate Cox model with that many variables, then consider using the Lasso Cox model: https://cran.r-project.org/web/packages/glmnet/vignettes/Coxnet.pdf

ADD REPLYlink written 1 day ago by Kevin Blighe43k

Dear Dr. Blighe

Really Thanks for your answer. But about my first question, I would like to explain more about my data set. It belongs to TCGA and I downloaded as UQ-FPKM. In RNA-seq analysis, this type of data set is normal. So, for using that I transformed it to Log2 space.

1- now, for using this data should I scale() for transformation to z-score?

2- based on my explanationabout TCGA data, which functions are better: glm() or glm.nb()?

3- why you didn't use coxph() for RNA-seq expression data set in RegParallel vignett? can I use this function for my data set?

I deeply appreciate if you share your comment with me.

Mohammad

ADD REPLYlink modified 4 months ago • written 4 months ago by modarzi70

Sorry, this is not how Biostars functions. I expect you to read my comments and to then spend some time researching the answers to any further questions that you have. That is the best form of learning.

ADD REPLYlink written 4 months ago by Kevin Blighe43k

Dear Dr.Blighe

Here you design Survival plot for 2 genes: 'MMP10' and 'CXCL12'. Suppose that we have a bunch of gene and after clustering we have n cluster. how can we design Surv plot for each cluster separately?

I appreciate if you share your solution with me

ADD REPLYlink modified 4 months ago • written 4 months ago by modarzi70

What do you mean by 'n cluster'?

ADD REPLYlink written 4 months ago by Kevin Blighe43k

n is number of cluster. For example 3 cluster(n=3).

ADD REPLYlink written 4 months ago by modarzi70

If you encode the gene's expression as a factor / categorical variable, then the survival function will plot a curve for each level

ADD REPLYlink written 4 months ago by Kevin Blighe43k

Hi Kevin, I am curious to ask can we use Beta values for methylation from each probe instead of the read-count from gene expression.

If yes, these values are continuous and range from 0 to 1, would it be recommended to convert these also to Z score.

ADD REPLYlink written 4 months ago by Researcher30

Hey, yes, you could use the Beta values from methylation for the purposes of survival analysis. I think that it is okay to leave the values as 0 to 1.

ADD REPLYlink written 4 months ago by Kevin Blighe43k

Thanks Kevin, I tried your suggestion and was able to identify prognostic CpG sites. I did the same using gene expression data and interestingly found some overlapping genes. But I am not very sure how to integrate these two results as methylation can regulate the expression of genes that are in trans. In order to address that, checking just the overlap would not work. I do not know how should I proceed. I will really appreciate if u can share your thoughts about it.

Thanks B

ADD REPLYlink modified 3 months ago • written 3 months ago by Researcher30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1781 users visited in the last hour