Question

Cox regression survival analysis - using a panel of genes with TCGA

1

Entering edit mode

5.7 years ago

veronico ▴ 70

I am using the survival package in R and have a panel of genes that I want to use to determine if these group of genes affect survival.

I towards the end and want to produce the graph, but I keep getting one line graph with what appears confidence lines flanking it.

Has anyone used it for a panel of genes and was able to produce the graph?

Here is my code:

plot(survfit(coxph(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin]) ~ event_rna[ind_gene,ind_tum]
  + event_rna[ind_gene4,ind_tum]
  + event_rna[ind_gene5,ind_tum]
  + event_rna[ind_gene6,ind_tum]
  + event_rna[ind_gene7,ind_tum]
  + event_rna[ind_gene10,ind_tum]
  + event_rna[ind_gene11,ind_tum]
  + event_rna[ind_gene13,ind_tum]
  + event_rna[ind_gene14,ind_tum]
  + event_rna[ind_gene15,ind_tum]
  + event_rna[ind_gene16,ind_tum]
  + event_rna[ind_gene17,ind_tum]
  + event_rna[ind_gene18,ind_tum]
  + event_rna[ind_gene19,ind_tum]
  + event_rna[ind_gene20,ind_tum]
  + event_rna[ind_gene21,ind_tum]
  + event_rna[ind_gene22,ind_tum]
  + event_rna[ind_gene23,ind_tum]
  + event_rna[ind_gene24,ind_tum]
  + event_rna[ind_gene25,ind_tum]
  + event_rna[ind_gene26,ind_tum]
  + event_rna[ind_gene27,ind_tum]
  + event_rna[ind_gene28,ind_tum]),
 col=c("Black","Green","Red"), frame=F, lwd=2,main=paste('Breast
 Cancer-TCGA',rownames(z_rna)[ind_gene],sep='\n'), xlim=c(0,10000),
 xlab = "Time (days)", ylab = "Survival Probability"))

My apologies...I am a newbie to the package survival and R.

TCGA Survival Panel • 2.8k views

ADD COMMENT • link updated 5.7 years ago by Kevin Blighe 89k • written 5.7 years ago by veronico ▴ 70

score 1 · Answer 1 · 2019-10-15

1

Entering edit mode

5.7 years ago

Kevin Blighe 89k

Hey veronico, if you are a 'newbie' to this, then I would really recommend tidying your code. Having lines like this can make yourself more prone to making errors (and makes it more difficult to debug any errors that arise):

plot(survfit(coxph(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin]) ~ event_rna[ind_gene,ind_tum] ...

Please explain from where you obtained your expression data, and paste an example of it here (if possible) - when pasting it, highlight the pasted data and then click on the 101 010 button (for correct formatting / display).

Also, if you supply a continuous variable, i.e., a gene's expression values, to a survival model, then all that you will see is a single stratum. You would have to segregate your sample cohort based on the gene's expression in some way if you wanted to compare, for example, patients with high versus low expression.

ADD COMMENT • link 5.7 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin,

Thank you so much for answering!

I will include my whole code. Basically, I got mRNA-seq data from TCGA (Fire Browser). I followed the instructions from this link (https://www.biostars.org/p/153013/). However, this looks at single genes at a time. So, thankfully the survival package includes cox.

Code:

#install.packages("survival")
library(survival)
setwd("~/Desktop/TCGA/")
list.files("~/Desktop/TCGA/")

rna <- read.table("~/Desktop/TCGA/RNA/BRCA.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.data.txt",header=T,row.names=1,sep='\t')
rna <- rna[-1,]

clinical <- read.table('~/Desktop/TCGA/Clinical/BRCA.merged_only_clinical_clin_format.txt', quote="", header=T, row.names=1, sep='\t')
clinical <- as.data.frame(t(clinical))

# first I remove genes whose expression is == 0 in more than 50% of the samples:
rem <- function(x){
  x <- as.matrix(x)
  x <- t(apply(x,1,as.numeric))
  r <- as.numeric(apply(x,1,function(i) sum(i == 0)))
  remove <- which(r > dim(x)[2]*0.5)
  return(remove)
}
remove <- rem(rna)
rna <- rna[-remove,]

# see the values
table(substr(colnames(rna),14,14))

# get the index of the normal/control samples
n_index <- which(substr(colnames(rna),14,14) == '1')
t_index <- which(substr(colnames(rna),14,14) == '0')

# apply voom function from limma package to normalize the data
vm <- function(x){
  cond <- factor(ifelse(seq(1,dim(x)[2],1) %in% t_index, 1,  0))
  d <- model.matrix(~1+cond)
  x <- t(apply(x,1,as.numeric))
  ex <- voom(x,d,plot=F)
  return(ex$E)
}

#BiocManager::install("limma")
library("limma")

rna_vm  <- vm(rna)
colnames(rna_vm) <- gsub('\\.','-',substr(colnames(rna),1,12))


# and check how data look, they should look normally-ish distributed
hist(rna_vm)
# we can remove the old "rna" cause we don't need it anymore
rm(rna)

#Now we can finally scale the data. the reason to do so is because we don't want to use ONLY the fold changes. if we use FC then we average the expression values across all samples, losing the heterogeity that is characteristic of those data. we therefore transform data to z-score so that per each patient for each gene we will have a measure of how many SD away from the mean that is and we will consider those with Z > +/- 1.96 (roughly p=0.05 or 2 SD away) to be differentially expressed

#To obtain z-scores for the RNASeq data, we use following formula:

#z = [(value gene X in tumor Y)-(mean gene X in normal)]/(standard deviation X in normal)

# calculate z-scores
scal <- function(x,y){
  mean_n <- rowMeans(y)  # mean of normal
  sd_n <- apply(y,1,sd)  # SD of normal
  # z score as (value - mean normal)/SD normal
  res <- matrix(nrow=nrow(x), ncol=ncol(x))
  colnames(res) <- colnames(x)
  rownames(res) <- rownames(x)
  for(i in 1:dim(x)[1]){
    for(j in 1:dim(x)[2]){
      res[i,j] <- (x[i,j]-mean_n[i])/sd_n[i]
    }
  }
  return(res)
}
z_rna <- scal(rna_vm[,t_index],rna_vm[,n_index])
# set the rownames keeping only gene name
rownames(z_rna) <- sapply(rownames(z_rna), function(x) unlist(strsplit(x,'\\|'))[[1]])

rm(rna_vm) #we don't need it anymore

# match the patient ID in clinical data with the colnames of z_rna
clinical$IDs <- toupper(clinical$patient.bcr_patient_barcode)
sum(clinical$IDs %in% colnames(z_rna)) # we have 529 patients that we could use

# get the columns that contain data we can use: days to death, new tumor event, last day contact to....
ind_keep <- grep('days_to_new_tumor_event_after_initial_treatment',colnames(clinical))

# this is a bit tedious, since there are numerous follow ups, let's collapse them together and keep the first value (the higher one) if more than one is available
new_tum <- as.matrix(clinical[,ind_keep])
new_tum_collapsed <- c()
for (i in 1:dim(new_tum)[1]){
  if ( sum ( is.na(new_tum[i,])) < dim(new_tum)[2]){
    m <- min(new_tum[i,],na.rm=T)
    new_tum_collapsed <- c(new_tum_collapsed,m)
  } else {
    new_tum_collapsed <- c(new_tum_collapsed,'NA')
  }
}


# do the same to death
ind_keep <- grep('days_to_death',colnames(clinical))
death <- as.matrix(clinical[,ind_keep])
death_collapsed <- c()
for (i in 1:dim(death)[1]){
  if ( sum ( is.na(death[i,])) < dim(death)[2]){
    m <- max(death[i,],na.rm=T)
    death_collapsed <- c(death_collapsed,m)
  } else {
    death_collapsed <- c(death_collapsed,'NA')
  }
}

ADD REPLY • link 5.7 years ago by veronico ▴ 70

0

Entering edit mode

Sorry, needed to put the rest of the code in another comment 'cuz I reached the max of characters.

# and days last follow up here we take the most recent which is the max number
    ind_keep <- grep('days_to_last_followup',colnames(clinical))
    fl <- as.matrix(clinical[,ind_keep])
    fl_collapsed <- c()
    for (i in 1:dim(fl)[1]){
      if ( sum ( is.na(fl[i,])) < dim(fl)[2]){
        m <- max(fl[i,],na.rm=T)
        fl_collapsed <- c(fl_collapsed,m)
      } else {
        fl_collapsed <- c(fl_collapsed,'NA')
      }
    }

    # and put everything together
    all_clin <- data.frame(new_tum_collapsed,death_collapsed,fl_collapsed)
    colnames(all_clin) <- c('new_tumor_days', 'death_days', 'followUp_days')


    #error resulting from this is that NAs by coersion occur when, for example, you try to convert a character to numeric.
    #added suppressWarning
    # create vector with time to new tumor containing data to censor for new_tumor

    all_clin$new_time <- c()
    suppressWarnings(for (i in 1:length(as.numeric(as.character(all_clin$new_tumor_days)))){
      all_clin$new_time[i] <- ifelse ( is.na(as.numeric(as.character(all_clin$new_tumor_days))[i]),
                                       as.numeric(as.character(all_clin$followUp_days))[i],as.numeric(as.character(all_clin$new_tumor_days      ))[i])
    })

    # create vector time to death containing values to censor for death
    all_clin$new_death <- c()
    suppressWarnings(for (i in 1:length(as.numeric(as.character(all_clin$death_days)))){
      all_clin$new_death[i] <- ifelse ( is.na(as.numeric(as.character(all_clin$death_days))[i]),
                                        as.numeric(as.character(all_clin$followUp_days))[i],as.numeric(as.character(all_clin$death_days))[i])
    })

# create vector for death censoring
    table(clinical$patient.vital_status)
    # alive dead
    # 372   161

    #0=censor, 1=dead
    all_clin$death_event <- ifelse(clinical$patient.vital_status == 'alive', 0,1)

    #finally add row.names to clinical
    rownames(all_clin) <- clinical$IDs

    #Now, one more thing to do is to use the z-score values we obtained from the RNASeq data and define which samples are altered or do not change in this case I just look at mRNA deregulation, you can divide the data into up- and down- regulated too if you wish

    event_rna <- t(apply(z_rna, 1, function(x) ifelse(x < -1.96,1,0)))

    # since we need the same number of patients in both clinical and RNASeq data take the indices for the matching samples
    ind_tum <- which(unique(colnames(z_rna)) %in% rownames(all_clin))
    ind_clin <- which(rownames(all_clin) %in% colnames(z_rna))

    #Cox regression for multiple genes
    library(survival)
    #install.packages("survminer")
    library(survminer)
    ind_gene <- which(rownames(z_rna) == 'ALG1L')
    ind_gene2 <- which(rownames(z_rna) == 'EMILIN2')
    ind_gene3 <- which(rownames(z_rna) == 'KHDC1')
    ind_gene4 <- which(rownames(z_rna) == 'LXN')

    res.cox <- coxph(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin])~event_rna[ind_gene,ind_tum] + event_rna[ind_gene2,ind_tum]+ event_rna[ind_gene3,ind_tum]+ event_rna[ind_gene4,ind_tum])

    plot(survfit(coxph(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin])~event_rna[ind_gene,ind_tum] + event_rna[ind_gene2,ind_tum]+ event_rna[ind_gene3,ind_tum]+ event_rna[ind_gene4,ind_tum]), col=c("Black","Green","Red"), frame=F, lwd=2,main=paste('Breast Cancer-TCGA',rownames(z_rna)[ind_gene],sep='\n'), xlim=c(0,10000), xlab = "Time (days)", ylab = "Survival Probability"))

ADD REPLY • link 5.7 years ago by veronico ▴ 70

0

Entering edit mode

the panel of genes came from some analysis from my lab. I included only 4 genes just to simplify it. I get the following graph:

https://flic.kr/p/2hvMBui

ADD REPLY • link 5.7 years ago by veronico ▴ 70

0

Entering edit mode

Thank you for all of the code. This may just be an issue with encoding. For example, what is the output of event_rna[ind_gene,ind_tum] and how is it encoded (numerically, categorically, or text characters)?

Also, I think that this should be numerical:

all_clin$death_event[ind_clin]

I tested my own tutorial, and it should be fine where you specify 2 or more genes in the model: Survival analysis with gene expression

ADD REPLY • link 5.7 years ago by Kevin Blighe 89k