Tutorial: Survival analysis with gene expression
32
gravatar for Kevin Blighe
4 months ago by
Kevin Blighe37k
Republic of Ireland
Kevin Blighe37k wrote:

For this example, we will load GEO breast cancer gene expression data with recurrence free survival (RFS) from Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis. Specifically, we will encode each gene's expression into Low | Mid | High based on Z-scores and compare these against RFS in a Cox Proportional Hazards (Cox) survival model.

1, read in and prepare data

  library(Biobase)
  library(GEOquery)

  # load series and platform data from GEO
  gset <- getGEO('GSE2990', GSEMatrix =TRUE, getGPL=FALSE)
  x <- exprs(gset[[1]])

  # remove Affymetrix control probes
  x <- x[-grep('^AFFX', rownames(x)),]

  # transform the expression data to Z scores
  x <- t(scale(t(x)))

  # extract information of interest from the phenotype data (pdata)
  idx <- which(colnames(pData(gset[[1]])) %in%
    c('age:ch1', 'distant rfs:ch1', 'er:ch1',
      'ggi:ch1', 'grade:ch1', 'node:ch1',
      'size:ch1', 'time rfs:ch1'))

  metadata <- data.frame(pData(gset[[1]])[,idx],
    row.names = rownames(pData(gset[[1]])))

  # remove samples from the pdata that have any NA value
  discard <- apply(metadata, 1, function(x) any( is.na(x) ))
  metadata <- metadata[!discard,]

  # filter the Z-scores expression data to match the samples in our pdata
  x <- x[,which(colnames(x) %in% rownames(metadata))]

  # check that sample names match exactly between pdata and Z-scores 
  all((colnames(x) == rownames(metadata)) == TRUE)
  ## [1] TRUE

  # create a merged pdata and Z-scores object
  coxdata <- data.frame(metadata, t(x))

  # tidy column names
  colnames(coxdata)[1:8] <- c('Age', 'Distant.RFS', 'ER',
    'GGI', 'Grade', 'Node',
    'Size', 'Time.RFS')

  # prepare phenotypes
  coxdata$Distant.RFS <- as.numeric(coxdata$Distant.RFS)
  coxdata$Time.RFS <- as.numeric(gsub('^KJX|^KJ', '', coxdata$Time.RFS))
  coxdata$ER <- factor(coxdata$ER, levels = c(0, 1))
  coxdata$Grade <- factor(coxdata$Grade, levels = c(1, 2, 3))

2, test each gene independently via Cox regression

With the data prepared, we can now apply a Cox survival model independently for each gene (probe) in the dataset against RFS.

Here we will use RegParallel to fit the Cox model independently for each gene. RegParallel is accepted to Bioconductor (October 18, 2018), with official release in Bioconductor 3.8. To install dev version: devtools::install_github('kevinblighe/RegParallel')

  library(survival)
  library(RegParallel)

  res <- RegParallel(
    data = coxdata,
    formula = 'Surv(Time.RFS, Distant.RFS) ~ [*]',
    FUN = function(formula, data)
      coxph(formula = formula,
        data = data,
        ties = 'breslow',
        singular.ok = TRUE),
    FUNtype = 'coxph',
    variables = colnames(coxdata)[9:ncol(coxdata)],
    blocksize = 2000,
    cores = 2,
    nestedParallel = FALSE,
    conflevel = 95)
  res <- res[!is.na(res$P),]
  res

3, annotate top hits with biomaRt

Filter by Log Rank p<0.01

  res <- res[order(res$LogRank, decreasing = FALSE),]
  final <- subset(res, LogRank < 0.01)
  probes <- gsub('^X', '', final$Variable)

  library(biomaRt)
  mart <- useMart('ENSEMBL_MART_ENSEMBL', host='useast.ensembl.org')
  mart <- useDataset("hsapiens_gene_ensembl", mart)
  annotLookup <- getBM(mart = mart,
    attributes = c('affy_hg_u133a',
      'ensembl_gene_id',
      'gene_biotype',
      'external_gene_name'),
    filter = 'affy_hg_u133a',
    values = probes,
    uniqueRows = TRUE)

  annotLookup

Two of the top hits include CXCL12 and MMP10. High expression of CXCL12 was associated with good progression free and overall survival in breast cancer in doi: 10.1016/j.cca.2018.05.041, whilst high expression of MMP10 was associated with poor prognosis in colon cancer in doi: 10.1186/s12885-016-2515-7.

4, encode statistically significant genes as Low | Mid | High and plot survival curves

  # extract RFS and probe data for downstream analysis
  survplotdata <- coxdata[,c('Time.RFS', 'Distant.RFS',
    'X203666_at', 'X205680_at')]

  colnames(survplotdata) <- c('Time.RFS', 'Distant.RFS',
    'CXCL12', 'MMP10')

  # set Z-scale cut-offs for high and low expression
  highExpr <- 1.0
  lowExpr <- 1.0
  survplotdata$CXCL12 <- ifelse(survplotdata$CXCL12 >= highExpr, 'High',
    ifelse(x <= lowExpr, 'Low', 'Mid'))
  survplotdata$MMP10 <- ifelse(survplotdata$MMP10 >= highExpr, 'High',
    ifelse(x <= lowExpr, 'Low', 'Mid'))

  # relevel the factors to have mid as the ref level
  survplotdata$CXCL12 <- factor(survplotdata$CXCL12,
    levels = c('Mid', 'Low', 'High'))
  survplotdata$MMP10 <- factor(survplotdata$MMP10,
    levels = c('Mid', 'Low', 'High'))

  library(survminer)

  ggsurvplot(survfit(Surv(Time.RFS, Distant.RFS) ~ CXCL12,
    data = survplotdata),
    data = survplotdata,
    risk.table = TRUE,
    pval = TRUE,
    break.time.by = 500,
    ggtheme = theme_minimal(),
    risk.table.y.text.col = TRUE,
    risk.table.y.text = FALSE)

a

 ggsurvplot(survfit(Surv(Time.RFS, Distant.RFS) ~ MMP10,
    data = survplotdata),
    data = survplotdata,
    risk.table = TRUE,
    pval = TRUE,
    break.time.by = 500,
    ggtheme = theme_minimal(),
    risk.table.y.text.col = TRUE,
    risk.table.y.text = FALSE)

b

ADD COMMENTlink modified 8 days ago by Researcher10 • written 4 months ago by Kevin Blighe37k

One typo was found: discard <- apply(metadata, 1, function(x) anyis.na(x))) should be discard <- apply(metadata, 1, function(x) anyis.na(x)))

ADD REPLYlink written 28 days ago by crazy_ice0

Thank you for noticing that. My raw code was actually correct - the error (the lack of an extra parenthesis, (), was introduced in the visual representation of my code by the Biostars rendering system. I have added a space, and it now looks fine.

ADD REPLYlink written 28 days ago by Kevin Blighe37k

Gud one Kevin. I have a question. On what basis Z-scale cutoff 1.0 is selected?

BTW In this tutorial [http://r-addict.com/2016/11/21/Optimal-Cutpoint-maxstat.html] they have used maxstat (Maximally selected rank statistics) for the cutpoint to classify samples into high and low.

ADD REPLYlink written 28 days ago by Biologist130
1

That looks like a good tutorial (through the link that you posted).

The selection of absolute Z=1 was just chosen as a very relaxed threshold for highly / lowly expressed. It is difficult to know where the exact cut-offs should be, and of course biology does not intuitively work on cut-off points. The tutorial above is for fomenting new ideas for survival analysis.

ADD REPLYlink modified 28 days ago • written 28 days ago by Kevin Blighe37k

Dear Dr. Blighe

Thank you for this tutorial. I have a question about using Scale() for transforming expression data to Z scores. basically, why do we need transforming to z scores while our original data(downloaded from GEO) is normal?

I appreciate if you share your comment with me.

Mohammad

ADD REPLYlink modified 25 days ago • written 25 days ago by modarzi60
1

Hello Mohammad. The conversion to Z-scores provides for an easier interpretation on the expression range for each gene. For example, on the Z-scale, we know that +3 equates to 3 standard deviations above the mean expression value in the dataset.

In a normal distribution that is not transformed to Z-scale, a value of 10, 20, 30, et cetera may mean nothing in the context of the expression range.

The relationship between a normal distribution and the Z-scale is emphasised in this beautiful figure:

d

[source: https://www.mathsisfun.com/data/standard-normal-distribution.html]

ADD REPLYlink written 25 days ago by Kevin Blighe37k

Dear Dr. Blighe

Thanks for your answer. I have another questions about your SA tutorial due to using RNA-seq expression data:

1-Generally, the measure of expression in RNA-seq is count and different from measure of expression in Microarray Technology. So, for using RNA-seq, Should I modify your survival analysis code? special in Standardization step?

2- honestly, I cant understand '~ [*]' in formula = 'Surv(Time.RFS, Distant.RFS) ~ [*]'

3- phenotype of my data set has fours fields: 'OS status','OS days','RFS status','RFS days'. So, based on RegParallel(), can I compute 'res' using my phenotype fields? if yes, how can I use these fields in RegParallel()?

I deeply appreciate if you share your comment with me.

Mohammad

ADD REPLYlink modified 22 days ago • written 22 days ago by modarzi60
1

1-Generally, the measure of expression in RNA-seq is count and different from measure of expression in Microarray Technology. So, for using RNA-seq, Should I modify your survival analysis code? special in Standardization step?

You should aim to transform your normalised RNA-seq counts via the variance-stabilised or regularised log transformation (if using DESeq2), or produce log CPM counts (if using EdgeR). Then, you can generally use glm(), as I use above. If you are aiming to use the normalised, un-transformed counts, then you could use the negative binomial regression via glm.nb() - this may be too advanced, though. Check the manual (via ?RegParallel) and vignette for RegParallel.

Remember that, in RNA-seq, the general process goes:

  1. raw counts -->
  2. normalised counts (statistical analyses performed on these) -->
  3. transformed, normalised counts (for downstream analyses, clustering, PCA, etc.)

2- honestly, I cant understand '~ [*]' in formula = 'Surv(Time.RFS, Distant.RFS) ~ [*]'

This is annotation specific to my package, RegParallel. Each gene will replace the [*] symbol as the package tests each gene in an independent model. Again, please read the manual and vignette.

3- phenotype of my data set has fours fields: 'OS status','OS days','RFS status','RFS days'. So, based on RegParallel(), can I compute 'res' using my phenotype fields? if yes, how can I use these fields in RegParallel()?

Then you are likely aiming to do a survival analysis. In that case, you can use coxph(). Your commands would be:

OS <- RegParallel(
    data = coxdata,
    formula = 'Surv(OS.days, OS.status) ~ [*]',
    FUN = function(formula, data)
      coxph(formula = formula,
        data = data,
        ties = 'breslow',
        singular.ok = TRUE),
    FUNtype = 'coxph',
    variables = colnames(coxdata),
    blocksize = 500,
    cores = 2,
    nestedParallel = FALSE,
    conflevel = 95)

...or:

RFS <- RegParallel(
    data = coxdata,
    formula = 'Surv(RFS.days, RFS.status) ~ [*]',
    FUN = function(formula, data)
      coxph(formula = formula,
        data = data,
        ties = 'breslow',
        singular.ok = TRUE),
    FUNtype = 'coxph',
    variables = colnames(coxdata),
    blocksize = 500,
    cores = 2,
    nestedParallel = FALSE,
    conflevel = 95)

Note, you will likely have to change the value to variables. Variables is a vector of gene names that you want to test. It can be any number. If using RegParallel, the idea is that you have hundreds or thousands or millions of genes to test.

ADD REPLYlink written 22 days ago by Kevin Blighe37k

Dear Dr. Blighe

Really Thanks for your answer. But about my first question, I would like to explain more about my data set. It belongs to TCGA and I downloaded as UQ-FPKM. In RNA-seq analysis, this type of data set is normal. So, for using that I transformed it to Log2 space.

1- now, for using this data should I scale() for transformation to z-score?

2- based on my explanationabout TCGA data, which functions are better: glm() or glm.nb()?

3- why you didn't use coxph() for RNA-seq expression data set in RegParallel vignett? can I use this function for my data set?

I deeply appreciate if you share your comment with me.

Mohammad

ADD REPLYlink modified 22 days ago • written 22 days ago by modarzi60

Sorry, this is not how Biostars functions. I expect you to read my comments and to then spend some time researching the answers to any further questions that you have. That is the best form of learning.

ADD REPLYlink written 22 days ago by Kevin Blighe37k

Dear Dr.Blighe

Here you design Survival plot for 2 genes: 'MMP10' and 'CXCL12'. Suppose that we have a bunch of gene and after clustering we have n cluster. how can we design Surv plot for each cluster separately?

I appreciate if you share your solution with me

ADD REPLYlink modified 16 days ago • written 16 days ago by modarzi60

What do you mean by 'n cluster'?

ADD REPLYlink written 15 days ago by Kevin Blighe37k

n is number of cluster. For example 3 cluster(n=3).

ADD REPLYlink written 15 days ago by modarzi60

If you encode the gene's expression as a factor / categorical variable, then the survival function will plot a curve for each level

ADD REPLYlink written 15 days ago by Kevin Blighe37k

Hi Kevin, I am curious to ask can we use Beta values for methylation from each probe instead of the read-count from gene expression.

If yes, these values are continuous and range from 0 to 1, would it be recommended to convert these also to Z score.

ADD REPLYlink written 8 days ago by Researcher10

Hey, yes, you could use the Beta values from methylation for the purposes of survival analysis. I think that it is okay to leave the values as 0 to 1.

ADD REPLYlink written 8 days ago by Kevin Blighe37k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1760 users visited in the last hour