Question: Cox regression survival analysis - using a panel of genes with TCGA
gravatar for veronico
11 months ago by
veronico60 wrote:

I am using the survival package in R and have a panel of genes that I want to use to determine if these group of genes affect survival.

I towards the end and want to produce the graph, but I keep getting one line graph with what appears confidence lines flanking it.

Has anyone used it for a panel of genes and was able to produce the graph?

Here is my code:

plot(survfit(coxph(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin]) ~ event_rna[ind_gene,ind_tum]
  + event_rna[ind_gene4,ind_tum]
  + event_rna[ind_gene5,ind_tum]
  + event_rna[ind_gene6,ind_tum]
  + event_rna[ind_gene7,ind_tum]
  + event_rna[ind_gene10,ind_tum]
  + event_rna[ind_gene11,ind_tum]
  + event_rna[ind_gene13,ind_tum]
  + event_rna[ind_gene14,ind_tum]
  + event_rna[ind_gene15,ind_tum]
  + event_rna[ind_gene16,ind_tum]
  + event_rna[ind_gene17,ind_tum]
  + event_rna[ind_gene18,ind_tum]
  + event_rna[ind_gene19,ind_tum]
  + event_rna[ind_gene20,ind_tum]
  + event_rna[ind_gene21,ind_tum]
  + event_rna[ind_gene22,ind_tum]
  + event_rna[ind_gene23,ind_tum]
  + event_rna[ind_gene24,ind_tum]
  + event_rna[ind_gene25,ind_tum]
  + event_rna[ind_gene26,ind_tum]
  + event_rna[ind_gene27,ind_tum]
  + event_rna[ind_gene28,ind_tum]),
 col=c("Black","Green","Red"), frame=F, lwd=2,main=paste('Breast
 Cancer-TCGA',rownames(z_rna)[ind_gene],sep='\n'), xlim=c(0,10000),
 xlab = "Time (days)", ylab = "Survival Probability"))

My apologies...I am a newbie to the package survival and R.

survival panel tcga • 365 views
ADD COMMENTlink modified 11 months ago by Kevin Blighe65k • written 11 months ago by veronico60
gravatar for Kevin Blighe
11 months ago by
Kevin Blighe65k
Kevin Blighe65k wrote:

Hey veronico, if you are a 'newbie' to this, then I would really recommend tidying your code. Having lines like this can make yourself more prone to making errors (and makes it more difficult to debug any errors that arise):

plot(survfit(coxph(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin]) ~ event_rna[ind_gene,ind_tum] ...

Please explain from where you obtained your expression data, and paste an example of it here (if possible) - when pasting it, highlight the pasted data and then click on the 101 010 button (for correct formatting / display).

Also, if you supply a continuous variable, i.e., a gene's expression values, to a survival model, then all that you will see is a single stratum. You would have to segregate your sample cohort based on the gene's expression in some way if you wanted to compare, for example, patients with high versus low expression.

ADD COMMENTlink written 11 months ago by Kevin Blighe65k

Hi Kevin,

Thank you so much for answering!

I will include my whole code. Basically, I got mRNA-seq data from TCGA (Fire Browser). I followed the instructions from this link ( However, this looks at single genes at a time. So, thankfully the survival package includes cox.



rna <- read.table("~/Desktop/TCGA/RNA/",header=T,row.names=1,sep='\t')
rna <- rna[-1,]

clinical <- read.table('~/Desktop/TCGA/Clinical/BRCA.merged_only_clinical_clin_format.txt', quote="", header=T, row.names=1, sep='\t')
clinical <-

# first I remove genes whose expression is == 0 in more than 50% of the samples:
rem <- function(x){
  x <- as.matrix(x)
  x <- t(apply(x,1,as.numeric))
  r <- as.numeric(apply(x,1,function(i) sum(i == 0)))
  remove <- which(r > dim(x)[2]*0.5)
remove <- rem(rna)
rna <- rna[-remove,]

# see the values

# get the index of the normal/control samples
n_index <- which(substr(colnames(rna),14,14) == '1')
t_index <- which(substr(colnames(rna),14,14) == '0')

# apply voom function from limma package to normalize the data
vm <- function(x){
  cond <- factor(ifelse(seq(1,dim(x)[2],1) %in% t_index, 1,  0))
  d <- model.matrix(~1+cond)
  x <- t(apply(x,1,as.numeric))
  ex <- voom(x,d,plot=F)


rna_vm  <- vm(rna)
colnames(rna_vm) <- gsub('\\.','-',substr(colnames(rna),1,12))

# and check how data look, they should look normally-ish distributed
# we can remove the old "rna" cause we don't need it anymore

#Now we can finally scale the data. the reason to do so is because we don't want to use ONLY the fold changes. if we use FC then we average the expression values across all samples, losing the heterogeity that is characteristic of those data. we therefore transform data to z-score so that per each patient for each gene we will have a measure of how many SD away from the mean that is and we will consider those with Z > +/- 1.96 (roughly p=0.05 or 2 SD away) to be differentially expressed

#To obtain z-scores for the RNASeq data, we use following formula:

#z = [(value gene X in tumor Y)-(mean gene X in normal)]/(standard deviation X in normal)

# calculate z-scores
scal <- function(x,y){
  mean_n <- rowMeans(y)  # mean of normal
  sd_n <- apply(y,1,sd)  # SD of normal
  # z score as (value - mean normal)/SD normal
  res <- matrix(nrow=nrow(x), ncol=ncol(x))
  colnames(res) <- colnames(x)
  rownames(res) <- rownames(x)
  for(i in 1:dim(x)[1]){
    for(j in 1:dim(x)[2]){
      res[i,j] <- (x[i,j]-mean_n[i])/sd_n[i]
z_rna <- scal(rna_vm[,t_index],rna_vm[,n_index])
# set the rownames keeping only gene name
rownames(z_rna) <- sapply(rownames(z_rna), function(x) unlist(strsplit(x,'\\|'))[[1]])

rm(rna_vm) #we don't need it anymore

# match the patient ID in clinical data with the colnames of z_rna
clinical$IDs <- toupper(clinical$patient.bcr_patient_barcode)
sum(clinical$IDs %in% colnames(z_rna)) # we have 529 patients that we could use

# get the columns that contain data we can use: days to death, new tumor event, last day contact to....
ind_keep <- grep('days_to_new_tumor_event_after_initial_treatment',colnames(clinical))

# this is a bit tedious, since there are numerous follow ups, let's collapse them together and keep the first value (the higher one) if more than one is available
new_tum <- as.matrix(clinical[,ind_keep])
new_tum_collapsed <- c()
for (i in 1:dim(new_tum)[1]){
  if ( sum ([i,])) < dim(new_tum)[2]){
    m <- min(new_tum[i,],na.rm=T)
    new_tum_collapsed <- c(new_tum_collapsed,m)
  } else {
    new_tum_collapsed <- c(new_tum_collapsed,'NA')

# do the same to death
ind_keep <- grep('days_to_death',colnames(clinical))
death <- as.matrix(clinical[,ind_keep])
death_collapsed <- c()
for (i in 1:dim(death)[1]){
  if ( sum ([i,])) < dim(death)[2]){
    m <- max(death[i,],na.rm=T)
    death_collapsed <- c(death_collapsed,m)
  } else {
    death_collapsed <- c(death_collapsed,'NA')
ADD REPLYlink written 11 months ago by veronico60

Sorry, needed to put the rest of the code in another comment 'cuz I reached the max of characters.

# and days last follow up here we take the most recent which is the max number
    ind_keep <- grep('days_to_last_followup',colnames(clinical))
    fl <- as.matrix(clinical[,ind_keep])
    fl_collapsed <- c()
    for (i in 1:dim(fl)[1]){
      if ( sum ([i,])) < dim(fl)[2]){
        m <- max(fl[i,],na.rm=T)
        fl_collapsed <- c(fl_collapsed,m)
      } else {
        fl_collapsed <- c(fl_collapsed,'NA')

    # and put everything together
    all_clin <- data.frame(new_tum_collapsed,death_collapsed,fl_collapsed)
    colnames(all_clin) <- c('new_tumor_days', 'death_days', 'followUp_days')

    #error resulting from this is that NAs by coersion occur when, for example, you try to convert a character to numeric.
    #added suppressWarning
    # create vector with time to new tumor containing data to censor for new_tumor

    all_clin$new_time <- c()
    suppressWarnings(for (i in 1:length(as.numeric(as.character(all_clin$new_tumor_days)))){
      all_clin$new_time[i] <- ifelse ($new_tumor_days))[i]),
                                       as.numeric(as.character(all_clin$followUp_days))[i],as.numeric(as.character(all_clin$new_tumor_days      ))[i])

    # create vector time to death containing values to censor for death
    all_clin$new_death <- c()
    suppressWarnings(for (i in 1:length(as.numeric(as.character(all_clin$death_days)))){
      all_clin$new_death[i] <- ifelse ($death_days))[i]),

# create vector for death censoring
    # alive dead
    # 372   161

    #0=censor, 1=dead
    all_clin$death_event <- ifelse(clinical$patient.vital_status == 'alive', 0,1)

    #finally add row.names to clinical
    rownames(all_clin) <- clinical$IDs

    #Now, one more thing to do is to use the z-score values we obtained from the RNASeq data and define which samples are altered or do not change in this case I just look at mRNA deregulation, you can divide the data into up- and down- regulated too if you wish

    event_rna <- t(apply(z_rna, 1, function(x) ifelse(x < -1.96,1,0)))

    # since we need the same number of patients in both clinical and RNASeq data take the indices for the matching samples
    ind_tum <- which(unique(colnames(z_rna)) %in% rownames(all_clin))
    ind_clin <- which(rownames(all_clin) %in% colnames(z_rna))

    #Cox regression for multiple genes
    ind_gene <- which(rownames(z_rna) == 'ALG1L')
    ind_gene2 <- which(rownames(z_rna) == 'EMILIN2')
    ind_gene3 <- which(rownames(z_rna) == 'KHDC1')
    ind_gene4 <- which(rownames(z_rna) == 'LXN')

    res.cox <- coxph(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin])~event_rna[ind_gene,ind_tum] + event_rna[ind_gene2,ind_tum]+ event_rna[ind_gene3,ind_tum]+ event_rna[ind_gene4,ind_tum])

    plot(survfit(coxph(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin])~event_rna[ind_gene,ind_tum] + event_rna[ind_gene2,ind_tum]+ event_rna[ind_gene3,ind_tum]+ event_rna[ind_gene4,ind_tum]), col=c("Black","Green","Red"), frame=F, lwd=2,main=paste('Breast Cancer-TCGA',rownames(z_rna)[ind_gene],sep='\n'), xlim=c(0,10000), xlab = "Time (days)", ylab = "Survival Probability"))
ADD REPLYlink modified 11 months ago • written 11 months ago by veronico60

the panel of genes came from some analysis from my lab. I included only 4 genes just to simplify it. I get the following graph:

ADD REPLYlink modified 11 months ago • written 11 months ago by veronico60

Thank you for all of the code. This may just be an issue with encoding. For example, what is the output of event_rna[ind_gene,ind_tum] and how is it encoded (numerically, categorically, or text characters)?

Also, I think that this should be numerical:


I tested my own tutorial, and it should be fine where you specify 2 or more genes in the model: Survival analysis with gene expression


ADD REPLYlink modified 11 months ago • written 11 months ago by Kevin Blighe65k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1666 users visited in the last hour