Tutorial: Survival analysis of TCGA patients integrating gene expression (RNASeq) data
98
gravatar for TriS
2.4 years ago by
TriS3.2k
United States, Buffalo
TriS3.2k wrote:

I found myself being often confused about how to do this and by various posts and tutorials online I realized that, if I could put together a step-by-step tutorial, that might be (hopefully) useful...so there we go.

First question: Why would you want to do survival analysis based on gene expression data? Well, let's say you have a number of genes that you are interested in and they are differentially expressed between tumor and normal samples, it would be very powerful to show that alteration in gene expression correlates with worse survival or earlier tumor recurrence.

Here's a brief overview of what's in here:

  1. obtain the data (RNASeq and clinical data)
  2. prepare the data for the analysis
  3. run the analysis and interpret the results

In this tutorial I will be using data from kidney cancer (clear cell renal cell carcinoma - KIRC) that you can get as follows:

  1. ​go to FireBrowse ( http://gdac.broadinstitute.org/ ), select "Kidney renal clear cell carcinoma" -> "Browse"
  2. it should open a page with various data sets for KIRC
  3. from "mRNASeq" select "illuminahiseq_rnaseqv2-RSEM_genes_normalized" and save it on your computer
  4. from "Clinical" select "Merge_Clinical" and download it
  5. unzip the files by double clicking on them
  6. rename the folders as "RNA" and "Clinical"

Now you can switch to R, you will need the package "survival" (https://cran.r-project.org/web/packages/survival/index.html)

# set directory to where the RNA and Clinical folders are
setwd("/User/myUser/tutorial")
library(survival)

# read RNA file 
rna <- read.table('RNA/KIRC.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.data.txt',nrows=20533, header=T,row.names=1,sep='\t')
# and take off first row cause we don't need it
rna <- rna[-1,]

# and read the Clinical file, in this case i transposed it to keep the clinical feature title as column name
clinical <- t(read.table('Clinical/KIRC.merged_only_clinical_clin_format.txt',header=T, row.names=1, sep='\t'))

# first I remove genes whose expression is == 0 in more than 50% of the samples:
rem <- function(x){
  x <- as.matrix(x)
  x <- t(apply(x,1,as.numeric))
  r <- as.numeric(apply(x,1,function(i) sum(i == 0)))
  remove <- which(r > dim(x)[2]*0.5)
  return(remove)
}
remove <- rem(rna)
rna <- rna[-remove,]

Now I need to identify normal and tumor samples. this is done using the TCGA barcode (https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode). The two digits at position 14-15 of the barcode will indicate teh sample type, from the link:
"Tumor types range from 01 - 09, normal types from 10 - 19 and control samples from 20 - 29."

# see the values
table(substr(colnames(rna),14,14))
# 0      1 
# 534   72

So we have 534 tumor and 72 normal

# get the index of the normal/control samples
n_index <- which(substr(colnames(rna),14,14) == '1')
t_index <- which(substr(colnames(rna),14,14) == '0')


# apply voom function from limma package to normalize the data
vm <- function(x){
  cond <- factor(ifelse(seq(1,dim(x)[2],1) %in% t_index, 1,  0))
  d <- model.matrix(~1+cond)
  x <- t(apply(x,1,as.numeric))
  ex <- voom(x,d,plot=F)
  return(ex$E)
}

rna_vm  <- vm(rna)
colnames(rna_vm) <- gsub('\\.','-',substr(colnames(rna),1,12))


# and check how data look, they should look normally-ish distributed
hist(rna_vm)
# we can remove the old "rna" cause we don't need it anymor
rm(rna)

Now we can finally scale the data. the reason to do so is because we don't want to use ONLY the fold changes. if we use FC then we average the expression values across all samples, losing the heterogeity that is characteristic of those data. we therefore transform data to z-score so that per each patient for each gene we will have a measure of how many SD away from the mean that is and we will consider those with Z > +/- 1.96 (roughly p=0.05 or 2 SD away) to be differentially expressed

To obtain z-scores for the RNASeq data, we use following formula:

z = [(value gene X in tumor Y)-(mean gene X in normal)]/(standard deviation X in normal)


# calculate z-scores
scal <- function(x,y){
  mean_n <- rowMeans(y)  # mean of normal
  sd_n <- apply(y,1,sd)  # SD of normal
  # z score as (value - mean normal)/SD normal
  res <- matrix(nrow=nrow(x), ncol=ncol(x))
  colnames(res) <- colnames(x)
  rownames(res) <- rownames(x)
  for(i in 1:dim(x)[1]){
    for(j in 1:dim(x)[2]){
      res[i,j] <- (x[i,j]-mean_n[i])/sd_n[i]
    }
  }
  return(res)
}
z_rna <- scal(rna_vm[,t_index],rna_vm[,n_index])
# set the rownames keeping only gene name
rownames(z_rna) <- sapply(rownames(z_rna), function(x) unlist(strsplit(x,'\\|'))[[1]])

rm(rna_vm) #we don't need it anymore

It might take some time, depending from your computer, be patient

Now we can start using the clinical data:

# match the patient ID in clinical data with the colnames of z_rna
clinical$IDs <- toupper(clinical$patient.bcr_patient_barcode)
sum(clinical$IDs %in% colnames(z_rna)) # we have 529 patients that we could use

# get the columns that contain data we can use: days to death, new tumor event, last day contact to....
ind_keep <- grep('days_to_new_tumor_event_after_initial_treatment',colnames(clinical))

# this is a bit tedious, since there are numerous follow ups, let's collapse them together and keep the first value (the higher one) if more than one is available
new_tum <- as.matrix(clinical[,ind_keep])
new_tum_collapsed <- c()
for (i in 1:dim(new_tum)[1]){
  if ( sum ( is.na(new_tum[i,])) < dim(new_tum)[2]){
    m <- min(new_tum[i,],na.rm=T)
    new_tum_collapsed <- c(new_tum_collapsed,m)
  } else {
    new_tum_collapsed <- c(new_tum_collapsed,'NA')
  }
}

# do the same to death
ind_keep <- grep('days_to_death',colnames(clinical))
death <- as.matrix(clinical[,ind_keep])
death_collapsed <- c()
for (i in 1:dim(death)[1]){
  if ( sum ( is.na(death[i,])) < dim(death)[2]){
    m <- max(death[i,],na.rm=T)
    death_collapsed <- c(death_collapsed,m)
  } else {
    death_collapsed <- c(death_collapsed,'NA')
  }
}

# and days last follow up here we take the most recent which is the max number
ind_keep <- grep('days_to_last_followup',colnames(clinical))
fl <- as.matrix(clinical[,ind_keep])
fl_collapsed <- c()
for (i in 1:dim(fl)[1]){
  if ( sum is.na(fl[i,])) < dim(fl)[2]){
    m <- max(fl[i,],na.rm=T)
    fl_collapsed <- c(fl_collapsed,m)
  } else {
    fl_collapsed <- c(fl_collapsed,'NA')
  }
}

# and put everything together
all_clin <- data.frame(new_tum_collapsed,death_collapsed,fl_collapsed)
colnames(all_clin) <- c('new_tumor_days', 'death_days', 'followUp_days')

Now, to do survival analysis we need three main things:

  1. time: this is the time till an event happens
  2. status: this indicates which patients have to be kept for the analysis
  3. event: this tells i.e. which patients have the gene up- or down-regulated or have no changes in expression

Since we want to do censored analysis, we need to have something to censor the data with. For example, if a patient has no death data BUT there is a date to last followup it means that after that day we know nothing about the patient, therefore after that day it cannot be used for calculations/Kaplan Meier plot anymore, therefore we censor it. so now we need to create vectors for both 'time to new tumor' and 'time to death' that contain also the data from censored individuals.

# create vector with time to new tumor containing data to censor for new_tumor
all_clin$new_time <- c()
for (i in 1:length(as.numeric(as.character(all_clin$new_tumor_days)))){
  all_clin$new_time[i] <- ifelse ( is.na(as.numeric(as.character(all_clin$new_tumor_days))[i]),
                    as.numeric(as.character(all_clin$followUp_days))[i],as.numeric(as.character(all_clin$new_tumor_days))[i])
}

# create vector time to death containing values to censor for death
all_clin$new_death <- c()
for (i in 1:length(as.numeric(as.character(all_clin$death_days)))){
  all_clin$new_death[i] <- ifelse ( is.na(as.numeric(as.character(all_clin$death_days))[i]),
                                 as.numeric(as.character(all_clin$followUp_days))[i],as.numeric(as.character(all_clin$death_days))[i])
}

Now we need to create the vector for censoring the data which means telling R which patients are dead or have new tumor. in this case if a patient has a "days_to_death" it will be assigned 1, and used in the corresponding analysis. the reason why we censor with death events even for recurrence is pretty important. a colleague made me notice that this is a competitive risk problem, where, although a patient can recur and then die, if a patient is dead, it will not recur, therefore is more accurate to censor for death events.

# create vector for death censoring
table(clinical$patient.vital_status)
# alive dead
# 372   161

all_clin$death_event <- ifelse(clinical$patient.vital_status == 'alive', 0,1)

#finally add row.names to clinical
rownames(all_clin) <- clinical$IDs

Now, one more thing to do is to use the z-score values we obtained from the RNASeq data and define which samples are altered or do not change in this case I just look at mRNA deregulation, you can divide the data into up- and down- regulated too if you wish

# create event vector for RNASeq data
event_rna <- t(apply(z_rna, 1, function(x) ifelse(abs(x) > 1.96,1,0)))

# since we need the same number of patients in both clinical and RNASeq data take the indices for the matching samples
ind_tum <- which(unique(colnames(z_rna)) %in% rownames(all_clin))
ind_clin <- which(rownames(all_clin) %in% colnames(z_rna))

# pick your gene of interest
ind_gene <- which(rownames(z_rna) == 'TP53')

# check how many altered samples we have
table(event_rna[ind_gene,])

# run survival analysis
s <- survfit(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin])~event_rna[ind_gene,ind_tum])
s1 <- tryCatch(survdiff(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin])~event_rna[ind_gene,ind_tum]), error = function(e) return(NA))

# extraect the p.value
pv <- ifelse ( is.na(s1),next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),3)))[[1]]

# plot the data
plot(survfit(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin])~event_rna[ind_gene,ind_tum]),
     col=c(1:3), frame=F, lwd=2,main=paste('KIRK',rownames(z_rna)[ind_gene],sep='\n'))

# add lines for the median survival
x1 <- ifelse ( is.na(as.numeric(summary(s)$table[,'median'][1])),'NA',as.numeric(summary(s)$table[,'median'][1]))
x2 <- as.numeric(summary(s)$table[,'median'][2])
if(x1 != 'NA' &amp; x2 != 'NA'){
&nbsp; lines(c(0,x1),c(0.5,0.5),col='blue')
&nbsp; lines(c(x1,x1),c(0,0.5),col='black')
&nbsp; lines(c(x2,x2),c(0,0.5),col='red')
}

# add legend
legend(1800,0.995,legend=paste('p.value = ',pv[[1]],sep=''),bty='n',cex=1.4)
legend(max(as.numeric(as.character(all_clin$death_days)[ind_clin]),na.rm = T)*0.7,0.94,
       legend=c(paste('NotAltered=',x1),paste('Altered=',x2)),bty='n',cex=1.3,lwd=3,col=c('black','red'))

Now, the p.value in this case tells you that patients with altered TP53 have a worse survival…sad but true.

The same concept can be applied to time to new tumor (new_time) (disease free survival, or DFS) and the resulting graph is the following:

This approach is very flexible, I leave to the reader the satisfaction to subset the events in up/downregulation and/or add covariates.

I know there are a number of different ways of doing this, hopefully this will be clear enought that the reader can replicate this.

Lastly, thanks for getting till down here, comments, critiques, corrections etc are always welcome :)

UPDATES

  • used voom transformation for RNASeq data

  • created the death event vector based on the vital status from the clinical data

  • updated code and images based on the results

  • corrected the last rna_log variable (leftover from previous versions) to rna_vm

  • updated "min" to "max" for the "days_to_last_followup" based on cying comment

  • updated "min" to "max" for the new_tumor and death
  • changed the ifsumis issue with brackets
survival rna-seq tcga tutorial • 28k views
ADD COMMENTlink modified 7 weeks ago by lenazhou0 • written 2.4 years ago by TriS3.2k
4

I think this is a great tutorial, thanks for posting!  

One thing I would add is to be careful of how you define a recurrence for the DFS data.  While overall survival is rather straightforward, DFS can get complicated and is tumor dependent.  For example, in head and neck cancer a patient is clinically classified as "WITH TUMOR" until the tests can be performed to verify the tumor is gone (which can be months from the original appointment).  This can result in a "WITH TUMOR" classification in TCGA data for the first followup when really the patient was tumor free, they just haven't had the test done yet.  Also, in HNSC tumors patients re-diagnosed with a tumor within 6 months of treatment completion it is considered part of the original tumor, not a new tumor, so it is residual tumor and not a recurrence.  For HNSC at least (according to my expert collaborators) a re-diagnosed patient is considered to have a recurrence of the original tumor if it's between 6 months and 5 years, and 5 years post-treatment it is considered a new primary tumor.  Of course these may be different for each tumor type.  So, if one is looking to publish survival plot for DFS I would urge them to talk to a specialist in the cancer type to get specific guidelines on what disease-free survival means, and then manually clean up the DFS time points and events before plotting using the treatment data as a guide. 

ADD REPLYlink written 2.4 years ago by alolex870
1

@alolex - thanks for the insightful feedback. you definitely make a valid point and I agree with you, the concept of DFS can be molded depending upon annotation accuracy/frequency and tumor type. and also, yes, before publishing any study on clinical data I would warmly suggest to talk to a clinician/statistician to make sure that the results make sense.

as for the KIRC data go, there is a "person_neoplasm_cancer_status" field that indicates whether the patient is "with tumor" or "tumor free". for the purpose of the tutorial I censored based on the presence of a "days_to_new_tumor_event" value but I agree it is more accurate to use the "with tumor" or "tumor free" indicators. I will update the code asap. thaks

ADD REPLYlink written 2.4 years ago by TriS3.2k
1

@TriS thank you for this useful tutorial !

I think it would be better to change the lg2 function to:

lg2 <- function(x){
  x <- as.matrix(x)
  x <- t(apply(x,1,as.numeric))
  res <- log2((x+1))
  return(res)
}
ADD REPLYlink written 2.4 years ago by al3n70rn90

@al3n70rn - thanks for the feedback, I'll change that. I was also looking into the VST transformation from DESeq2 which might even be a better choice. I'll add that asap 

ADD REPLYlink written 2.4 years ago by TriS3.2k

@TriS - Do you have a github account?  It might be a good idea to post this script there so people can easily download and make code change suggestions via Github.

ADD REPLYlink written 2.4 years ago by alolex870

@alolex - that's a good idea ,thanks for the suggestion. I'll update the code this evening and then upload it on github as soon as I understand how that works.

ADD REPLYlink written 2.4 years ago by TriS3.2k
1

Great tutorial, Thanks!

I had a problem when using the TCGA BRCA data, the rnal_log is not being created. So downstream analysis is not performed. It seems the n_index and t_index are formed and also the voom function is fine.

Thanks a lot for your help!

ADD REPLYlink written 2.1 years ago by jyothipadiadpu10

that is actually a typo on my side, my apologies, it's the rna_vm, not rna_log, that's probably leftover from the first version i wrote, without adding the latest edits.

thanks for catching it, i'm gonna fix it now

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by TriS3.2k

 Great ! it worked now. Thanks a lot.

ADD REPLYlink written 2.1 years ago by jyothipadiadpu10
1

link to the Firehouse should be updated

from

http://gdac.broadinstitute.org/)

to

http://gdac.broadinstitute.org/

excluding the closing paranthesis.

ADD REPLYlink written 19 months ago by bounlu100

Thank you so much for a great tutorial.

Sorry if these queries are a bit basic - probably need some more R learning!

Firstly, could I please ask how we would go about limiting this to, for example, five year overall survival?

Finally, how would I go about limiting the results to say just those who had radiotherapy (patient.radiation_therapy = yes)

Thanks again!

ADD REPLYlink written 2.1 years ago by jcfleming.ent0

there are a couple of things you could do.

1) keep the survival/follow up time only up to 5 years -> 5(yrs)*365(days)=1825(days)

2) treatment data should be in the merged survival file, if not they are in one of the other files you can download. you need to create a dummy variable with therapy_yes=1 therapy_no=0 and select out only the patients with a 1..there are a million ways of selecting them out, this is just one of them.

3) last but most important, you might want to look up some basic survival analysis in R and how to extract subsets of data. this sounds basic but it's vital since it's what allows you to do things correctly and what you will  refer to when you want to do a lil trickier analysis like the one you are asking 

ADD REPLYlink written 2.1 years ago by TriS3.2k

TriS Thank you so much for this tutorial. I found it very helpful.

I am interested in comparing the patients with overexpression of a gene and underexpression of that gene. To this end, I want to compare survival of patients with Z-score in the range <-1.96, versus [-1,1] versus > 1.96.

Is it correct to do the following?

event_rna <- t(apply(z_rna, 0, function(x) ifelse(x > 1.96,2,ifelse(x < -1.96,1,0)))) # for up and down regulated
 
ADD REPLYlink modified 22 months ago • written 22 months ago by Les Ander100
1

yes, that should work, however, the apply function you wrote has an error/typo, you need to use "1" to apply the function row by row

event_rna <- t(apply(z_rna, 1, function(x) ifelse(x > 1.96,2,ifelse(x < -1.96,1,0)))) 
ADD REPLYlink written 22 months ago by TriS3.2k

@TriS Thank you so much!

 

ADD REPLYlink written 22 months ago by Les Ander100

Deleted the comment and instead put it as a new comment

ADD REPLYlink modified 22 months ago • written 22 months ago by Les Ander100

Hi thanks for the great tutorial.  Was there a particular reason you used the normalized gene count as opposed to the raw or scaled estimate?  thank you! 

ADD REPLYlink written 22 months ago by ahdee0

yes and no. those data were available from Firehose but you can apply the same pipeline to other measurements i.e. raw counts

ADD REPLYlink written 22 months ago by TriS3.2k

Thank you - reason why I ask is because I'm wondering what the reasoning behind using the the voom transformation?  I would assume since the data is "normalized" wouldn't we be able to bypass this step?   sorry I'm just trying to understand this a bit more so I'm wondering if you can elaborate on this a bit more. thanks

ADD REPLYlink written 22 months ago by ahdee0

no probs, asking is good. you could use other transformations too. voom transformation is very practical and better than i.e. just taking the log2. here and here there are a few info about why voom is a good method + you can also use it with RPKM, which were the data available from TCGA. however, if you had raw counts you could have used DESEq2 to transform the data, for the sake of this tutorial and this analysis I don't think it will make a big difference (note: this should be tested!!). I do prefer to normalize because of the skeweness of the RNASeq/RPKM data which could inflate the actual differences between the different samples/genes.

ADD REPLYlink written 22 months ago by TriS3.2k

you rock! thank you - btw you should put this on github.  

 

ADD REPLYlink written 22 months ago by ahdee0

I am very new to this kind of analysis and try to do my first using your great tutorial. I took a look at this link which I think is the paper introducing voom. It says that this method can be used with RPKM values, but RNAseq data you provided in this tutorial is normalized with RSEM method. Does this make any difference?

ADD REPLYlink written 21 months ago by Vasei10

Hi quick question. Do you think it is better to eliminate the normal samples all together in the z-score calculations?

ADD REPLYlink written 21 months ago by ahdee0

Great post. I have several questions/feedback:

  1. Why did you use KIRC.merged_only_clinical_clin_format.txt instead of KIRC.clin.merged.txt?
  2. There is a Clinical_Pick_Tier1 file when you go to FireBrowse > KIRC > Clinical. Do you know what it is and how it is different than the one you used? I just checked the final numbers for new_death and they correspond to days_to_death (for event dead) and days_to_last_followup (for event alive) combined in the Clinical_Pick_Tier1 file. If you use that file, you might save all the processing that you are doing to get the numbers for new_death.
  3. Although the numbers for survival match, your death_event is different than those given in the file (given as vital_status). I don't know how that file was processed so cannot comment on who is incorrect but wanted to put it out there.
  4. Why are you using min for new_tum_collapsed instead of max when you just want the highest value?
  5. For each of the collapse methods, I had to convert the columns to numeric for e.g. fl <- apply(fl, 2, as.numeric) otherwise I had character and factors which messed up the min/max values.

Thanks!

ADD REPLYlink modified 12 months ago • written 12 months ago by komal.rathi3.0k
  1. you can use either, I just picked one. this tutorial is a guideline, not a rule
  2. I didn't know that was available at the time of this tutorial, thank you for bringing that up, it is very useful
  3. this post is almost a year and a half old, I would expect some differences in the events
  4. because I took the time at which the first tumor re-appeared, not the last one
  5. ok
ADD REPLYlink written 12 months ago by TriS3.2k

Hi, Tris

Thanks for the great tutorial. it is exactly what I want to do on colorectal cancer datasets. I'm an extreme newbie to bioinformatics, being able to run it on my dataset would be something big already. but I seem to have some difficulties in plotting median survival, when I run it on either your original dataset or the colorectal cancer dataset it gives me error message as follow:

> x1 <- ifelseis.na(as.numeric(summary(s)$table[,'median'][1])),"NA",as.numeric(summary(s)$table[,'median'][1]))
> x2 <- as.numeric(summary(s)$table[,'median'][2])
> if(x1 != "NA" & x2 != "NA"){
+   lines(c(0,x1),c(0.5,0.5),col="blue")
+   lines(c(x1,x1),c(0,0.5),col="black")
+   lines(c(x2,x2),c(0,0.5),col="red")
+ }
Error in if (x1 != "NA" & x2 != "NA") { : 
  missing value where TRUE/FALSE needed

Would you mind help me to look into this, if you get a chance? I truly have no idea of how to correct this. Very much appreciated

Cheers,

ADD REPLYlink modified 8 months ago by WouterDeCoster24k • written 11 months ago by tongtong.wang12100

Did you figure out how to do this ? I am getting exactly same problem as well !!

ADD REPLYlink written 8 months ago by always_learning800
1

Check the values of x1 and x2 to figure out if those are as expected, something is wrong in those.

ADD REPLYlink written 8 months ago by WouterDeCoster24k
clinical <- t(read.table("Clinical/BRCA.merged_only_clinical_clin_format.txt", header = T, row.names = 1, sep = "\t"))

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :

line 1141 did not have 1098 elements

What should I do for the above error?

ADD REPLYlink modified 11 months ago by WouterDeCoster24k • written 11 months ago by ragsadhav0
1

I moved your post to a comment (because it is no answer to the original thread).

Your 1141th line doesn't have the same number of elements (columns) as the rest of the file, so you should have a look at that line and see if something is odd.

awk 'NR==1141' yourfile.txt

I'm not sure if the header is counted as a line by R, so perhaps you should also check the next line:

awk 'NR==1142' yourfile.txt

and compare

Or is 1141/1142 the last line in your file containing whitespaces?
Have a look at the number of lines:

wc -l yourfile.txt
ADD REPLYlink modified 11 months ago • written 11 months ago by WouterDeCoster24k

I am not able to open this website, can you please provide me an alternative to download this data set. need it urgently. thanks in advance.

ADD REPLYlink written 8 months ago by dickshachohan20

which website? firehose?

ADD REPLYlink written 8 months ago by TriS3.2k
1

In your post (as remarked C: Survival analysis of TCGA patients integrating gene expression (RNASeq) data ) the closing parenthesis got included in the URL, leading to a 404. I have edited your post (added spaces) to solve this.

ADD REPLYlink written 8 months ago by WouterDeCoster24k

thank you I appreciate it, i didn't realize the parenthesis was included

ADD REPLYlink written 8 months ago by TriS3.2k

As noted here, Could someone with edit priviledges change all the ifsumis.na to ifsumis.na, please

ADD REPLYlink written 7 months ago by russhh2.8k
1

i'll modify it now...I'm not sure why that happens. it seems that some parenthesis are cut off. my apologies for the inconvenient

ADD REPLYlink written 7 months ago by TriS3.2k

in addition there is ifelse here

all_clin$new_time <- c()
for (i in 1:length(as.numeric(as.character(all_clin$new_tumor_days)))){
  all_clin$new_time[i] <- ifelse( is.na(as.numeric(as.character(all_clin$new_tumor_days))[i]),
                    as.numeric(as.character(all_clin$followUp_days))[i],as.numeric(as.character(all_clin$new_tumor_days))[i])
}
ADD REPLYlink modified 7 months ago • written 7 months ago by mms14013050
rna_vm  <- vm(rna)
 Show Traceback

 Rerun with Debug
 Error in voom(x, d, plot = F) : 
  Need at least two genes to fit a mean-variance trend

I am getting above error while running vm function. Can some one help me on this ?

ADD REPLYlink written 4 months ago by always_learning800

this should be posted as a new question since it's not specific for this tutorial. anyhow..1) how does head(rna) look? 2) it seems you don't have a design matrix d 3) how many genes are in rna?

ADD REPLYlink modified 4 months ago • written 4 months ago by TriS3.2k
4
gravatar for gitarrestunden
2.1 years ago by
Austria
gitarrestunden40 wrote:

Hi!

Thank you for this tutorial!! I trid the KIRK dataset and overall, it worked great!

I had to change the following line because of an error:

clinical <- t(read.table('Clinical/KIRC.merged_only_clinical_clin_format.txt',header=T, row.names=1, sep='\t'))

changed to:

clinical <- read.table('Clinical/KIRC.merged_only_clinical_clin_format.txt',header=T, row.names=1, sep='\t')
clinical <- as.data.frame(t(clinical))

Reason:

clinical$IDs <- toupper(clinical$patient.bcr_patient_barcode)
Error in toupper(clinical$patient.bcr_patient_barcode) : 
  error in evaluating the argument 'x' in selecting a method for function 'toupper': Error in clinical$patient.bcr_patient_barcode : 
  $ operator is invalid for atomic vectors

In addition, I get an error when I want to calculate p-values, however, I cant solve the problem:

pv <- ifelseis.na(s1),next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),3)))[[1]]
Error in ifelseis.na(s1), next, (round(1 - pchisq(s1$chisq, length(s1$n) -  : 
  error in evaluating the argument 'yes' in selecting a method for function 'ifelse': Error: no loop for break/next, jumping to top level

Can someone help me to fix this?

Regards, G

ADD COMMENTlink modified 8 months ago by Ram12k • written 2.1 years ago by gitarrestunden40

Ok, got it, this one works:

> pv <- if is.na(s1)) next else(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]

Regards, G

ADD REPLYlink modified 8 months ago by Ram12k • written 2.1 years ago by gitarrestunden40

gitarrestunden - thanks for the feedback. I see a few issues in the code that you wrote, and that includes parenthesis and spaces between words. I'm not sure how the following line works since the parenthesis are off :

pv <- if is.na(s1)) next else(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]

and it should be more like :

pv <- ifelseis.na(s1), next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]
ADD REPLYlink modified 8 months ago by Ram12k • written 2.1 years ago by TriS3.2k

Yes, I know what you mean, but the ifelse statement does not work in my system. I tried with correct parentheses and with or without commas/spaces...

> pv <- ifelseis.na(s1)),next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]
Error: unexpected ',' in 'pv <- ifelseis.na(s1)),'

This is the only working version on my system:

> pv <- if is.na(s1)) next else(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]

Strange..

ADD REPLYlink modified 8 months ago by Ram12k • written 2.1 years ago by gitarrestunden40

And it is also strange that the first parenthesis which should come directly after the "if" always disappears when I click the add comment button!!

ADD REPLYlink written 2.1 years ago by gitarrestunden40

hmm...I wonder if there is a "lost" parenthesis somewhere else in your code

the ifelse function in R requires the commas and the reason why it tells you that there is an unexpected "," is because there is a parenthesis missing. the code you wrote is a one liner that sums up the formal if/else statements

for the parenthesis not showing up in the forum, I really have no idea :)

ADD REPLYlink written 2.1 years ago by TriS3.2k

Hi Tris thank you for the tutorial, it is very useful. I get a similar error in generating p- value. Also some of the corrections made in the command line still shows an error. Can someone help fix this. Thank you in advance.

> pv <- ifelseis.na(s1),next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),3)))[[1]]
**Error** in ifelseis.na(s1), next, (round(1 - pchisq(s1$chisq, length(s1$n) -  :  
    no loop for break/next, jumping to top level

> pv <- if is.na(s1)) next else(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]
**Error**: unexpected symbol in " pv <- if is.na"

> pv <- ifelseis.na(s1), next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]
**Error**: unexpected ',' in "pv <- ifelseis.na(s1),"
ADD REPLYlink modified 8 months ago by Ram12k • written 11 months ago by devi.jaganathan0

it seems that there is a problem on the comment above where some brackets don't show up.

pv <- ifelse ( is.na(s1),next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),3)))[[1]]
ADD REPLYlink modified 11 months ago • written 11 months ago by TriS3.2k

Hi Tris, Thanks a lot for the correction.

ADD REPLYlink written 11 months ago by devi.jaganathan0
pv <- ifelse ( is.na(s1),next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),3)))[[1]]
Error in ifelseis.na(s1), next, (round(1 - pchisq(s1$chisq, length(s1$n) -  : 
  no loop for break/next, jumping to top level

Can you tell me how to correct this error

ADD REPLYlink written 7 months ago by mms14013050

how does your s1 object look?

ADD REPLYlink written 7 months ago by TriS3.2k

s1 has 6 elements

> s1$chisq
[1] 1.528015
> s1$n
groups
event_rna[ind_gene, ind_tum]=0 event_rna[ind_gene, ind_tum]=1 
                           596                            497 
> s1$obs
[1] 50 54
> s1$exp
[1] 56.27811 47.72189
> s1$var
          [,1]      [,2]
[1,]  25.79472 -25.79472
[2,] -25.79472  25.79472
> s1$call
survdiff(formula = Surv(as.numeric(as.character(all_clin$new_death))[ind_clin], 
    all_clin$death_event[ind_clin]) ~ event_rna[ind_gene, ind_tum])
ADD REPLYlink written 7 months ago by mms14013050

hmmm, ok, try to run s1 and summary(s1) you should get something like the following

> s1
Call:
survdiff(formula = f ~ a)

      N Observed Expected (O-E)^2/E (O-E)^2/V
a=1  77       44     45.5    0.0464    0.0794
a=2 106       69     67.5    0.0312    0.0794

 Chisq= 0.1  on 1 degrees of freedom, p= 0.778 
> summary(s1)
      Length Class  Mode   
n     2      table  numeric
obs   2      -none- numeric
exp   2      -none- numeric
var   4      -none- numeric
chisq 1      -none- numeric
call  2      -none- call

this is from a different analysis but I copied and pasted your code and it works fine

> ifelse ( is.na(s1),next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),3)))[[1]]
[1] 0.778
ADD REPLYlink modified 7 months ago • written 7 months ago by TriS3.2k
1
gravatar for vibes1002003
2.1 years ago by
vibes100200330
Germany
vibes100200330 wrote:

I would like to know how can extend this code for survival analysis of multiple genes altogether or cancer subtypes ?  I want to check effect of set of genes/ subtypes on survival probability through KM plot?

ADD COMMENTlink written 2.1 years ago by vibes100200330
1

you should create a dummy variable in the survival analysis where you select patients with the gene signature altered

i.e. for genes A, B and C, you have a matrix m with row=genes, cols=patients:

gene_sig <- c(A,B,C)
rna_data_signature <- rna_data[gene_sig,]
event_rna <- apply(rna_data, 2, function(x) ifelse(sum(abs(x) > 1.96 >= 1),1,0))

not tested but it should work. the idea is to use apply to select, by column, patients that have at least one gene in the signature altered

you can than use the event_rna vector in the survival analysis

ADD REPLYlink modified 8 months ago by Ram12k • written 2.1 years ago by TriS3.2k

How can we make non-altered case since we need two groups -altered and unaltered for generating KM plot ? for instance you take those genes which are altered at least once in all the patients then how can we make unaltered case? 

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by vibes100200330

if you look at the code I put in the answer, the patients with at least one alteration will be assigned a 1, if they have no alterations they will be assigned a zero. it basically checks column by column (= patient by patient) how many of those genes are altered, if the # of altered genes is > 1, then they enter the "altered" group.

ADD REPLYlink written 2.1 years ago by TriS3.2k
1
gravatar for cying
21 months ago by
cying10
China
cying10 wrote:

Hi, TriS. Your tutorial is very helpful.

I have done my own survival analysis using clinical file before merged and cleaned, but actually, the merge clinical file downloaded from the FireBrowse is more concise. Here I list two questions:

  1. "days_to_last_followup", you said that get the minmum because the it's the latest number.But the definition of "days_to_last_followup" is time interval from the date of last followup to the date of initial pathologic diagnosis, which means the the longer days represent latest followups, so the days should take the maximum number.

  2. to plot DFS survival curve, a vector contains censored or completed status should be created. In the merge clinical file, the column "patient.new_tumor_events.new_tumor_event_after_initial_treatment" defines the new tumor event status "yes", "no", or "NA".How to understand "NA" status? Does it mean the patient haven't occur new tumor event at the time of the last followup or means haven't survey the new tumor event status at the time of the last followup? The variable "all_clin$new_time" in your code shows you have regarded the "NA" as "haven't occur new tumor event"(the same as status "no").Is that Right?

Thanks.

ADD COMMENTlink modified 21 months ago • written 21 months ago by cying10

hi cying, you made very good points and thank you for posting them. 1. yes, you are right! I overlooked the definition of "days to last followup" and you are correct, you should use the highest one. I will change the code accordingly 2. the NA in all_clin$new_time were counted as event that didn't happen, in this case no recurrence happened. to be honest I'm not sure about the difference between "no" and "NA" in "new_tumor_event_after_initial_treatement". i.e. there is an xls file for melanoma enrollment that annotates the field as "Indicate whether the patient had a new tumor event (e.g. metastatic, recurrent, or new primary tumor) after the date of initial melanoma diagnosis. If the tumor that yielded the sample submitted for TCGA was a new tumor event (i.e. it was diagnosed after the initial diagnosis), the remaining questions should not be completed. Please Note:If the patient did not have a new tumor event, the remaining questions can be skipped." therefore when they annotate for no/NA they might indicate the same thing, and it actually depends from who annotates the data. hope this helps!

ADD REPLYlink written 21 months ago by TriS3.2k

That's a good point, maybe the the blank space of "new_tumor_event_after_initial_treatement" column is a hint of no new tumor events. Thanks!

ADD REPLYlink written 21 months ago by cying10
1
gravatar for enxxx23
12 months ago by
enxxx23170
Finland
enxxx23170 wrote:

1) would be possible to get also the code which generates the DFS plot also?

2) is there somewhere some cleaned up and up to date version of this code?

3) when running this code (and fixing myself some bugs which still has like for example adding "clinical = data.frame(clinical)" ) the KIRK-OS plot looks slightly different in the sense that I get NotAltered = NA and also I get a lot of warnings when computing all_clin$new_time ,like for example, " In ifelseis.na(as.numeric(as.character(all_clin$new_tumor_days))[i]), ... : NAs introduced by coercion"). Otherwise the plot looks similar. Is this normal?

ADD COMMENTlink modified 12 months ago • written 12 months ago by enxxx23170
1

1) for the DFS plot you can use the DFS data in the clinical file. 2) no, the version here is what I had the time to create 3) you will get some warnings, if your code is correct these warnings come from the data and it's ok, some will be NA

ADD REPLYlink written 12 months ago by TriS3.2k
1
gravatar for JP
10 months ago by
JP0
JP0 wrote:

thanks so much for the step-by-step tutorial, it has been very helpful for a biologist who is new to R and survival analysis.

I noticed that I am getting the following errors that don't allow for survfit() to function properly:

Error #1:

all_clin$new_death <- c() for (i in 1:length(as.numeric(as.character(all_clin$death_days)))){ + all_clin$new_death[i] <- ifelseis.na(as.numeric(as.character(all_clin$death_days))[i]), + as.numeric(as.character(all_clin$followUp_days))[i],as.numeric(as.character(all_clin$death_days))[i]) + }

Which gives me the following error:

In ifelseis.na(as.numeric(as.character(all_clin$death_days))[i]),  ... :

NAs introduced by coercion

I read above that NAs shoudn't be a problem with this step, so I continued. However, when I get to the survfit function I get the second error

Error #2:

> s <- survfit(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin])~event_rna[ind_gene,ind_tum])

Error in Surv(as.numeric(as.character(all_clin$new_death))[ind_clin], : Invalid status value, must be logical or numeric

Any help with this would be greatly appreciate, and I am sorry in advance if I don't understand your response right away, as I said, I am very new to R and survival analysis.

Thanks again for the great tutorial!

JP

ADD COMMENTlink written 10 months ago by JP0
1
gravatar for mforde84
10 months ago by
mforde841.0k
mforde841.0k wrote:

Hi TriS,

Thank you very much for posting this. It was very helpful.

I hope you don't mind, but I decided to fork your code up to github, so that there's an example that's runs out of box. https://github.com/mforde84/RNAseq-Survival-Analysis-TCGA-KIRC

My results are slightly different than yours, so I would very much appreciate if you would review my code and see if anything erroneous sticks out. I simplified a number of things as well (like the input data set and preprocessing steps), just to help improve readability.

I think the differences are due to two things: 1)different versions of the RNAseq dataset in question (mine was accessed on Feb 2, 2017), and 2) specific R functionality has changed in less obvious ways which may break or alter the results of the analysis. For 1), collapsing and calculating the censors using the omics data provided generates slightly different results, but only for a handful of samples. Not much, but there certainly is a difference. For two, e.g., reading the clinical omics table into a R now creates an atomic character vector which doesn't utilize the $ operator, so later in your scripting you'll throw an error:

> clinical <- t(read.table("KIRC.merged_only_clinical_clin_format.txt", header=T, row.names=1, sep="\t"))
> typeof(clinical)
[1] "character"
> clinical$IDs <- toupper(clinical$patient.bcr_patient_barcode)
Error in clinical$patient.bcr_patient_barcode : 
  $ operator is invalid for atomic vectors
> clinical <- data.frame(clinical) #working solution

Not a big deal. Just thought it would be helpful to keep something current for anyone who want to learn.

Thanks again, Martin

ADD COMMENTlink written 10 months ago by mforde841.0k

hi Martin

thank you for putting it on github, that's something I should have done a long time ago. and thank you for the updates as this post is more than one year old, I would expect to see some differences with what I posted at that time. clinical data are/should be updated and that will change, slightly, the results. Seb

ADD REPLYlink written 10 months ago by TriS3.2k

Hello, Thanks for putting the code in github. I have seen the just now. But Could you please tell how to use this code for Disease free survival plot? Thank you

ADD REPLYlink written 6 months ago by Bioinfo90
1

that is pretty simple, use disease free survival data, generally as DFS in the clinical data. then there should be a column saying "tumor free/with tumor".

ADD REPLYlink written 6 months ago by TriS3.2k

Thankyou Tris. I have an other question. How to check the survival for High and Low expression of a gene? I have gene expression data of TP53 and divided the samples into two groups i.e. High expression and low expression. So, with this I want to see the survival b/w High and low. With your above code instead of normal and tumor, should I take High and low expression samples? Am I right?

ADD REPLYlink written 6 months ago by Bioinfo90

you can assign groups to patients that have high/low expression, i.e. high=1, low=2, middle=0, remove the 0s and run survival as above.

as warm suggestion, look a lil more into how to run survival analyses in R, don't just copy and paste the code above without knowing what's doing what. that will be very helpful in the long (and short) run

ADD REPLYlink modified 6 months ago • written 6 months ago by TriS3.2k
1
gravatar for xushutan
6 months ago by
xushutan20
xushutan20 wrote:

A website for Breast cancer survival curve in different subtypes: luminal A, luminal B, Basal, Her2 and Normal-like. http://tumorsurvival.org/

ADD COMMENTlink written 6 months ago by xushutan20
0
gravatar for nashtf
2.1 years ago by
nashtf0
United States
nashtf0 wrote:

I tried running this with the BRCA data and get the following error on reading in the merged clinical data:

clinical <- t(read.table('Clinical/BRCA.merged_only_clinical_clin_format.txt',header=T,row.names=1,sep='\t'))
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
    line 2 did not have 1099 elements
ADD COMMENTlink modified 8 months ago by Ram12k • written 2.1 years ago by nashtf0
3

add 

quote="" 

to the read.table() function, that should fix it

ADD REPLYlink written 2.1 years ago by TriS3.2k
0
gravatar for gitarrestunden
2.1 years ago by
Austria
gitarrestunden40 wrote:

Hi!

One last question: You wrote that 'you can divide the data into up- and down- regulated too if you wish';

I guess i have to change this line, but I'm not sure how to do that correctly:

event_rna <- t(apply(z_rna, 1, function(x) ifelse(abs(x) > 1.96,1,0)))

Thanks for help again!!

ADD COMMENTlink modified 8 months ago by Ram12k • written 2.1 years ago by gitarrestunden40
1
event_rna <- t(apply(z_rna, 1, function(x) ifelse(x > 1.96,1,ifelse(x < -1.96,2,0))))
ADD REPLYlink modified 8 months ago by Ram12k • written 2.1 years ago by TriS3.2k
0
gravatar for sunqiangzai
2.1 years ago by
China
sunqiangzai0 wrote:

Great tutorial!I am curious about the number of the altered and not-altered group,since the number of patients are 529,why the altered group are so large?

ADD COMMENTlink written 2.1 years ago by sunqiangzai0

I suppose you refer to the numbers in the legend. those are not the number of patients but the median survival in days in the altered and not altered groups. to get the months you can do (ndays/365)*12

ADD REPLYlink written 2.1 years ago by TriS3.2k
0
gravatar for Palgrave
23 months ago by
Palgrave20
Singapore
Palgrave20 wrote:

Hi all, great tutorial.

I have two questions:

1) When calculating p-values I get this error, similar as above:

pv <- if is.na(s1)) next else(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]    #error
Error: unexpected symbol in 'pv <- if is.na'

2) Is there a method to find the relevant genes that are significant?

Best wishes,

R

ADD COMMENTlink modified 8 months ago by Ram12k • written 23 months ago by Palgrave20

1) When calculating p-values I get this error, similar as above:

pv <- if is.na(s1)) next else(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]     #error
Error: unexpected symbol in "pv <- if is.na"

your parenthesis are off, try this (not tested):

pv <- ifelse(is.na(s1),next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7)))[[1]]

2) Is there a method to find the relevant genes that are significant?

you mean differentially expressed genes? you might want to read a lil about z-scores and differential gene expression analysis or limma

ADD REPLYlink written 23 months ago by TriS3.2k

Is it possible to run through all my rownames and print all corresponding plots and p-values:

ind_gene <- which(rownames(z_rna) == "TP53")
ADD REPLYlink written 22 months ago by Palgrave20

yes, using for loops and even writing it as a function :)

however, you might not want to print 15k to 30k plots, you might want to find a way to keep those you are interested in

ADD REPLYlink modified 22 months ago • written 22 months ago by TriS3.2k

Would you mind helping, I tried but got errors

ADD REPLYlink written 22 months ago by Palgrave20
1

I'm going to give you a pseudocode that you can apply to your analysis:

for 1 to # of genes
  run survival analysis
  extract p.value
  if p.value is significant & some other conditions
    plot the graph

...learning how to code does come with LOTS of trial and error, it's good that you got errors cause it can show you what you did wrong.

also, your question then becomes more general on the lines of 'how to apply a function to multiple genes', which can be 1) googled 2) posted on the general forum since it's not specific anymore to this tutorial

ADD REPLYlink modified 8 months ago by Ram12k • written 22 months ago by TriS3.2k
0
gravatar for Les Ander
22 months ago by
Les Ander100
United States
Les Ander100 wrote:

When I use the GBM data and the TP53, I get an error in the vm function (at the

d <- model.matrix(~1+cond)

step).

The error message is

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
    contrasts can be applied only to factors with 2 or more levels
Called from: `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]])

Any idea how to fix this?

Thanks

ADD COMMENTlink modified 8 months ago by Ram12k • written 22 months ago by Les Ander100

can you check that the n_index and t_index are not empty? also that cond does contain data?

ADD REPLYlink written 22 months ago by TriS3.2k

This is most likely happening due to there being no normal samples in the dataset, so n_index is empty.

I don't really know enough about model.matrix to know if this is the correct approach, but by not computing the model.matrix and changing the following line:

ex <- voom(x,d,plot=F)

to:

ex <- voom(x,plot=F)

the voom function works. However the voom function states that by leaving out a design, the function "defaults to the unit vector meaning the samples are treated as replicates" and this may not be true.

ADD REPLYlink written 4 months ago by sophiespo0
0
gravatar for Les Ander
22 months ago by
Les Ander100
United States
Les Ander100 wrote:

The histogram of z-scores is not quite symmetric, it tends to have a heavier tail on the lower end (i.e. there are more genes below summit) so it does not look like that > 1.96 is equivalent to p=0.05 (my guess is that it is much lower than that. Similarly, I don't think that the Z < -1.96 is p=0.05 either (my guess is that it is much higher than that). Any suggestions how to deal with such asymmetric distribution? My inclination would be to use percentiles i.e. top 10% of the data versus bottom 10% of the data. However, I am not sure how to modify the code to do that. Any suggestions would be much appreciated.

I think the change has to be something like below but not sure if it is correct:

pr_thresh=quantile(z_rna, c(0.1, 0.9))

LOWTHRESH=pr_thresh[1]
HIGHTHRESH=pr_thresh[2]

event_rna <- t(apply(z_rna, 1, function(x) ifelse(x > HIGHTHRESH,2,ifelse(x <- LOWTHRESH,1,0))))

Any thoughts on the use of percentiles and/or on the implementation?

Thanks

ADD COMMENTlink modified 8 months ago by Ram12k • written 22 months ago by Les Ander100
1

the data won't always be perfectly symmetric, you can cut your data so that you have the same number of patients in each side in order to balance it out when doing survival analysis. however, when we are talking about hundreds of patients, small differences in n do not cause issues. some statisticians here use even z=+/-1.5 as long as it gives them symmetric/even groups.

the 1.96 threshold is not a must, cBioportal uses 2, you can use what you prefer and what fits your needs as long as there is a rationale behind it. also, the z scores here are calculated comparing T to N so you won't perfectly center the data around 0 as if you would scale the data within the same sample group.

for percentile you can use the top 10/20 etc...I'm not aware of a hard threshold, the important thing, as above, is that you have an explanation of why you pick that # (although rarely you read one in papers :)).

lastly, if you are worried about data being skewed, you can also calculate skeweness of the data in R and evaluate in in that way. a value +/- 3 is a no go...but I don't think you will get anywhere close to it with the Z scores from this tutorial

hope this helps!

ADD REPLYlink written 22 months ago by TriS3.2k

@TriS, Thank you for the explanation!

 

ADD REPLYlink written 22 months ago by Les Ander100

hi, I'm wondering if you used 10% threshold above, how to plot survival curve with only those two lines (> HIGHTHRESH vs < -LOWTHRESH). thanks

ADD REPLYlink written 18 months ago by ShirleyDai20
0
gravatar for Les Ander
22 months ago by
Les Ander100
United States
Les Ander100 wrote:

I am interested in looking only upto 5 years (1825 days).

I'm new to survival analysis.

Is it correct to discard all the samples whose max(days_to_last_known_alive, days_to_last_followup) is <=1825

by doing this:

all_clin=all_clin[all_clin$new_death <= 1825,]

Is this correct? (I have a feeling that it is not because I am simply throwing away patient data who have lived beyond 5 years). If the above approach is not correct, can you please tell me what is the right way to do it?

Thanks

ADD COMMENTlink modified 8 months ago by Ram12k • written 22 months ago by Les Ander100

code seems fine to me but I'm a lil confused...you want to look up to 5 years but then you don't want to throw away patients that have lived beyond 5 years...if you decide a threshold that'll be applied to both living and deceased patients and you will get a snapshot of what happened up to that point in time.

as a general suggestion, look up some stuff on survival analysis and data subsetting. although I'm glad to see that this tutorial helps, it gives the basis on how to do it and it can be much more useful when you can grasp the concepts behind censored survival analysis and how/why to do it and tweak it to your needs

ADD REPLYlink modified 22 months ago • written 22 months ago by TriS3.2k
0
gravatar for ShirleyDai
18 months ago by
ShirleyDai20
ShirleyDai20 wrote:

This tutorial is really helpful to me. Can you give some scripts on how to perform comparison like "Up-regulated" vs "Down-regulated"? It would be better to add plot and legend functions. I tried to mask all event_rna==0 as NA (event_rna was coded same as in your script) and then remove them. But I found most genes were removed, including my interested gene.

Thanks

ADD COMMENTlink written 18 months ago by ShirleyDai20
2

You need a bit modification in above script:

library(survival)

z_rna (expression normalized data)

clinical data

then you have to calculate median of your gene

median(z_rna$yourgene)
rna_event <- ifelse(z_rna$yourgene >= (median), 1,0)     ## 1,0 ;  up & downregulated class

then follow similar above code...

ADD REPLYlink modified 8 months ago by Ram12k • written 18 months ago by Mike880
0
gravatar for ShirleyDai
18 months ago by
ShirleyDai20
ShirleyDai20 wrote:

Hi, I have a quick question about z-score you used. "mean gene X in normal" was used as "mean expression in diploid samples" in your script. Then should I only inquire paired tumor and normal samples in the TCGA data portal?

Thanks

ADD COMMENTlink written 18 months ago by ShirleyDai20
0
gravatar for zsucllj
18 months ago by
zsucllj0
zsucllj0 wrote:

Hi, thank you very much for your tutorials I have a quick question. I want to divide the tumor patients into two groups based on their mRNA expression (X gene) and then evaluate whether X gene mRNA expression level has any correlation with survival. Would you kindly tell me how to write the script? Thanks a lot : )

ADD COMMENTlink written 18 months ago by zsucllj0

See my question above. Mike gave the exact answer and related script.

ADD REPLYlink written 18 months ago by ShirleyDai20
0
gravatar for biostata
18 months ago by
biostata0
biostata0 wrote:

The error message is

clinical$IDs <- toupper(clinical$patient.bcr_patient_barcode)

Error in clinical$patient.bcr_patient_barcode : $ operator is invalid for atomic vectors

ADD COMMENTlink written 18 months ago by biostata0

Hey guy, would you please tell me how to solve it, thanks!

ADD REPLYlink written 10 weeks ago by hegelwyc0

check

class(clinical)

if it's a matrix, transform it into a data.frame

ADD REPLYlink written 10 weeks ago by TriS3.2k
0
gravatar for zsucllj
17 months ago by
zsucllj0
zsucllj0 wrote:

Hi, Should I make any changes to the above script? because I am just interested in the cancer patients but not the normal controls? Thanks!

ADD COMMENTlink written 17 months ago by zsucllj0

For this you need to further filter out samples from TCGA barcode , tumor types range from 01 - 09, normal types from 10 - 19 Control samples from 20 - 29, https://wiki.nci.nih.gov/display/TCGA/Working+with+TCGA+Data.

ADD REPLYlink written 17 months ago by Mike880
0
gravatar for zsucllj
17 months ago by
zsucllj0
zsucllj0 wrote:

Hi, Thank you very much for your help.Would you kindly tell me why the tumor types in this script are marked as 14? They should be 01-09. Thanks!

ADD COMMENTlink written 17 months ago by zsucllj0

Hi zsucllj,

in this script 14 is not tumor types/sample, this is the two digits at position 14-15 of the barcode (eg. TCGA-02-0001-01 , including "-"),

plase have a look at figure & table (https://wiki.nci.nih.gov/display/TCGA/Working+with+TCGA+Data)

ADD REPLYlink modified 17 months ago • written 17 months ago by Mike880

if you are replying to the same question please use the "add comment" link :)

ADD REPLYlink written 17 months ago by TriS3.2k

sorry TriS, I dont know whats different b/w "add comment" and "add reply". if my comments are not according to proper format , please delete them.

ADD REPLYlink modified 17 months ago • written 17 months ago by Mike880
0
gravatar for jzhou
17 months ago by
jzhou10
USA, California
jzhou10 wrote:

Hi, thank you very much for an informative tutorial! I apologize if this is an overly naive question, but I was wondering what new things could be learned from conducting your own survival analysis of TCGA data like in this tutorial when on Firehose there are already analyses of nearly every TCGA cancer data set including correlations between mRNAseq data and survival rates in their "Clinical Analysis" pages. Thanks!

ADD COMMENTlink written 17 months ago by jzhou10

When I look at correlations to survival with Firehose I find that they often list 0 genes correlating with survival. In contrast to their analysis, I have found thousands of genes that have significant p-values.

ADD REPLYlink written 16 months ago by Jordan Anaya790
0
gravatar for biobiu
17 months ago by
biobiu0
United States
biobiu0 wrote:

Hi, Great tutorial and thank you very much! I have some further comments and questions:

1) If someone is willing to get another/additional data from other sources, it is very important that the patients in "new_death", "death_event" and in the "rna event" will be in the same order in all of them (in this tutorial its unnecessary since all the data is sorted by the same way in gdac).

2) How can it be that some patients have a value which is not NA in the death_days data, but in patient.vital_status is "alive" (e.g TCGA-A3-3347, or TCGA-A3-3325), in both cases patient.follow_ups.follow_up.vital_status shows "dead" but patient.vital_status shows "alive", however it seems to me as a bug in the updated data uploaded to gdac. what do you think? Do you understand what is the difference between patient.vital_status to patient.follow_ups.follow_up.vital_status? If you will agree that this is kind of a bug, I suggest the following correction:

all_clin$death_event <- ifelse(all_clin$death_days =="NA" , 0,1)

instead of: all_clin$death_event <- ifelse(clinical$patient.vital_status == "alive", 0,1)

3) Just to make sure- 1)the reason that you got NA in the median survival is because that most of them are still alive? 2) Is the figure is up to date? because things are a bit difference (mean survival for altered is now 3554, for not altered is the same, the gdac data was upadated, do you believe that is the case?).

ADD COMMENTlink modified 17 months ago • written 17 months ago by biobiu0

I program mainly in Python so I haven't tried running the code in this thread, but a patient vital status issue was raised recently for OncoLnc: https://github.com/OmnesRes/onco_lnc/issues/1

In this repository you will find Python code for extracting clinical data from files downloaded from the old TCGA data portal which were named "clinical_follow_up" and "clinical_patient".

For OncoLnc I had to parse the clinical data for 21 different cancers, and most of the time in the files when a patient is 'dead' they have '[Not Applicable]' written in the 'last_contact_days_to' column, but in some cancers and for some patients they list a number in that column even when a patient is dead.

Given that this issue was most prevalent for GBM and OV and I believe those are two of the earlier studies performed by TCGA I can't help but wonder if TCGA changed how they report the follow_up times at some point and this resulted in an inconsistent file format.

To answer your point 3 I did notice that I would get 'NA' for median survival when I had more patients listed as 'alive' and once I fixed the code I had much fewer 'NA' values for median survival.

ADD REPLYlink written 16 months ago by Jordan Anaya790
0
gravatar for sharmi.banerji
16 months ago by
sharmi.banerji0 wrote:

Hello,

I have a question on the legend. I used this script in similar analysis except that I have a status column indication "altered" and "not altered". I was wondering if I use

# plot the data
 plot(survfit(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin])~event_rna[ind_gene,ind_tum]),
 col=c(1:3), frame=F, lwd=2,main=paste("KIRK",rownames(z_rna)[ind_gene],sep="\n"))

that means the color of the graph will be black for "altered" and red for "not altered" (as per order of the factors). But doesn't manually adding the legend like this

# add legend
legend=c(paste("NotAltered=",x1),paste("Altered=",x2)),bty="n",cex=1.3,lwd=3,col=c("black","red"))

swaps the legends of the two graphs? Or maybe I am missing something here.

Greatly appreciate help!

Thanks

ADD COMMENTlink written 16 months ago by sharmi.banerji0
0
gravatar for lazymaggie
15 months ago by
lazymaggie0
United States
lazymaggie0 wrote:

I feel something is wrong in your codes:

ind_tum <- which(unique(colnames(z_rna)) %in% rownames(all_clin))

ind_clin <- which(rownames(all_clin) %in% colnames(z_rna))

the patient's barcodes are not matched using the two ids:

rownames(all_clin)[ind_clin]

colnames(event_rna)[ind_tum]

both all_clin and event_rna have duplicated samples, so a solution about the above problem is using the following codes:

all_clin <- all_clin[!duplicated(rownames(all_clin)), ]  # remove duplicated samples in all_clin

event_rna <- event_rna[, !duplicated(colnames(event_rna))] # remove duplicated samples in event_rna

commonId <- intersect(colnames(event_rna), rownames(all_clin))  # obtain the common samples

ind_tum <- which(colnames(event_rna) %in% commonId) 

ind_clin <- which(rownames(all_clin) %in% commonId)
ADD COMMENTlink modified 8 months ago by WouterDeCoster24k • written 15 months ago by lazymaggie0

lazymaggie, thank you for pointing this out. I will check the code and edit where necessary

ADD REPLYlink written 14 months ago by TriS3.2k
0
gravatar for charco
14 months ago by
charco30
charco30 wrote:

Thanks for writing this up.

A question about the input data however. I thought that illuminahiseq_rnaseqv2-RSEM_genes_normalized was not RPKM (or more correctly FPKM for paired end data), but actually a quantile normalised raw count provided by RSEM (see here https://wiki.nci.nih.gov/display/TCGA/RNASeq+Version+2).

It would be important not to confuse the two.

ADD COMMENTlink written 14 months ago by charco30

it doesn't really matter in this case since they'll be transformed to zscores

ADD REPLYlink written 14 months ago by TriS3.2k

So which one is it?

Maybe it doesn't make a difference for your workflow, but it may do for others. For example, using voom on quantile normalised data may not be optimal. Also, if someone was performing downstream analyses that aggregated data across genes (eg gene set enrichment), you wouldn't want them to think they were using RPKM when they were actually using counts.

It is worth noting for the many people (including students) reading this thread who may go on to perform a variety of another analyses. Please consider correcting your description of the input data (if there is an error?).

ADD REPLYlink written 14 months ago by charco30

firebrowse (the file I suggested) provides you with RSEM values, not FPKM, not RPKM. for data normalization there are multiple ways of normalizing them and it depends from what you want to do. for the scope of this tutorial voom works well. if you prefer other methods DESeq2 or edgeR offer different flavors that might better fit your needs. also, bare in mind that what I described is a guide, not the only way of doing it.

ADD REPLYlink written 14 months ago by TriS3.2k

Are you sure? The file you suggested has 'normalized count' written in the header. RSEM gives transcripts per million (TPM) by default, which is not the same as a normalized count

ADD REPLYlink written 14 months ago by charco30

so, the normalized values are RSEM data divided by the 75th percentile and multiplied by 1000, as you correctly pointed out in the fist comment. to get TPM you have to multiply the RSEM estimate for 10^6 (RSEM paper) in this case, those are normalized RSEM values, not TPM, not RPKM, not FPKM. hope this helps

ADD REPLYlink written 14 months ago by TriS3.2k
0
gravatar for jzhou
13 months ago by
jzhou10
USA, California
jzhou10 wrote:

Hi, thanks again for a great tutorial. This might be a very basic question, but I was wondering if anyone could tell me what the numbers for Altered and NotAltered exactly refer to? Thanks!

ADD COMMENTlink written 13 months ago by jzhou10

number of samples with gene expression higher/lower than a specific threshold (i.e. z-score of 2) -> altered opposite -> not altered

ADD REPLYlink written 13 months ago by TriS3.2k
0
gravatar for cying
13 months ago by
cying10
China
cying10 wrote:

Hi, in regard to the "days_to_new_tumor_event_after_initial_treatment", I think the smaller value was more reasonable than the higher one. The DFS was defined as the time from initial treatment until the date of new tumor event. And the first time of new tumor event should be selected instead of the last time. This was opposite to the "days of last follow-up".

ADD COMMENTlink written 13 months ago by cying10

hi cying, thank you for spotting that, you are correct. this code went through several updates, and that's one of those I didn't implement yet.

ADD REPLYlink written 13 months ago by TriS3.2k
0
gravatar for 352182595
11 months ago by
3521825950
3521825950 wrote:

Hi, thanks for your great tutorial. Could you help me a question about survival analysis.In clinical, i want to updata survival function in a dynamic way but lack of multiple gene values to fit such a longitudinal model,so i want to use landmark approach to update the prognosis.However, i am not sure whether a genomic variable is a time-varying effect covariate,which means the effect of the gene might vary with time. Thank you for your patience,and hope that you could share me more detailes.

ADD COMMENTlink written 11 months ago by 3521825950

this question goes beyond what's described in this tutorial, you should start a new question in Biostar for that.

ADD REPLYlink written 11 months ago by TriS3.2k
0
gravatar for JP
10 months ago by
JP0
JP0 wrote:

thanks so much for the step-by-step tutorial, it has been very helpful for a biologist who is new to R and survival analysis.

I noticed that I am getting the following errors that don't allow for survfit() to function properly:

Error #1:

all_clin$new_death <- c() for (i in 1:length(as.numeric(as.character(all_clin$death_days)))){ + all_clin$new_death[i] <- ifelseis.na(as.numeric(as.character(all_clin$death_days))[i]), + as.numeric(as.character(all_clin$followUp_days))[i],as.numeric(as.character(all_clin$death_days))[i]) + }

Which gives me the following error:

In ifelseis.na(as.numeric(as.character(all_clin$death_days))[i]),  ... :

NAs introduced by coercion

I read above that NAs shoudn't be a problem with this step, so I continued. However, when I get to the survfit function I get the second error

Error #2:

> s <- survfit(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin])~event_rna[ind_gene,ind_tum])

Error in Surv(as.numeric(as.character(all_clin$new_death))[ind_clin], : Invalid status value, must be logical or numeric

Any help with this would be greatly appreciate, and I am sorry in advance if I don't understand your response right away, as I said, I am very new to R and survival analysis.

Thanks again for the great tutorial!

JP

PS- I am sorry if this comes up twice, I am new to biostars and still don't know the difference between answer/reply/comment.

ADD COMMENTlink modified 10 months ago • written 10 months ago by JP0
0
gravatar for Laura
8 months ago by
Laura0
Laura0 wrote:

Hi! Thanks for the tutorial! I'm new to the word of Bioinformatics and trying to get my head around everything.

Is there any documentation about what the different files on Firebrowse (http://gdac.broadinstitute.org/ ) contain? The tutorial links to a dead link of a wiki: https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode ). Does anyone know where the information from this wiki went?

I'm looking into potentially using one of those data sets for my masters thesis (I study computer science), but I'm a bit overwhelmed by all the files and little documentation.

Thanks a lot! Laura

ADD COMMENTlink modified 8 months ago by genomax39k • written 8 months ago by Laura0

the link was fixed, try using it now. basically those are information about which samples are normal or tumor

ADD REPLYlink modified 8 months ago • written 8 months ago by TriS3.2k

sorry moved to answers

ADD REPLYlink modified 7 months ago • written 7 months ago by mms14013050
0
gravatar for always_learning
8 months ago by
Doha, Qatar
always_learning800 wrote:
   > vm <- function(x){
   +   cond <- factor(ifelse(seq(1,dim(x)[2],1) %in% t_index, 1,  0))
    +  d <- model.matrix(~1+cond)
    +   x <- t(apply(x,1,as.numeric))
     +   ex <- voom(x,d,plot=F)
     +   return(ex$E)
   + }
  > rna_vm  <- vm(rna)

Error in if (any(allzero)) { : missing value where TRUE/FALSE needed In addition: Warning message: In apply(x, 1, as.numeric) : NAs introduced by coercion

I am getting above error while trying with Breast Cancer data set. Can you please help me in this ?

Thanks

ADD COMMENTlink written 8 months ago by always_learning800

I added the wrong comment

ADD REPLYlink modified 7 months ago • written 7 months ago by mms14013050
0
gravatar for mms140130
7 months ago by
mms14013050
mms14013050 wrote:
> for (i in 1:dim(new_tum)[1]){
+     ifsumis.na(new_tum[i,])) < dim(new_tum)[2]){
Error: unexpected ')' in:
"for (i in 1:dim(new_tum)[1]){
    ifsumis.na(new_tum[i,]))"
>         m <- min(new_tum[i,],na.rm=T)
Error: object 'i' not found
>         new_tum_collapsed <- c(new_tum_collapsed,m)
Error: object 'm' not found
>     } else {
Error: unexpected '}' in "    }"
>         new_tum_collapsed <- c(new_tum_collapsed,'NA')
>     }
Error: unexpected '}' in "    }"
> }
Error: unexpected '}' in "}"
 can you help why?
ADD COMMENTlink written 7 months ago by mms14013050
1
 for (i in 1:dim(new_tum)[1]){
   if ( sum ( is.na(new_tum[i,])) < dim(new_tum)[2]){
     m <- min(new_tum[i,],na.rm=T)
     new_tum_collapsed <- c(new_tum_collapsed,m)
   } else {
     new_tum_collapsed <- c(new_tum_collapsed,'NA')
   }
 }

You should open the parenthesis "(" before sum and is.na

ADD REPLYlink modified 6 months ago • written 6 months ago by Bioinfo90
0
gravatar for oriolebaltimore
6 months ago by
United States
oriolebaltimore30 wrote:

Hi Could you please help understand why you are conditioning on normal samples when applying voom.

Is this step normalizing tumor gene expression based on expression of same gene in normal samples?

cond <- factor(ifelse(seq(1,dim(x)[2],1) %in% t_index, 1, 0))

Thanks Adrian

ADD COMMENTlink written 6 months ago by oriolebaltimore30

the description is in the voom paper "voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Charity W Law, Yunshun Chen, Wei Shi and Gordon K Smyth"

for my understanding voom keeps in consideration the experimental design, i.e. treatments/replicates/etc, to weight variability across samples and conditions as a better way to estimate the mean-variance trend per each gene and therefore get a precise final value per each sample.

ADD REPLYlink written 6 months ago by TriS3.2k
0
gravatar for lenazhou
7 weeks ago by
lenazhou0
lenazhou0 wrote:

Hi Tris, do know whether will have your response so late. I am totally new in this and trying to use your codes to run the TCGA-LIHC dataset. Stuck here, could you kindly suggest the way to solve thanks a lot:

> clinical <- data.frame(clinical)
> clinical$IDs <- toupper(clinical$patient.bcr_patient_barcode)
Error in `$<-.data.frame`(`*tmp*`, IDs, value = character(0)) : 
  replacement has 0 rows, data has 40
ADD COMMENTlink modified 7 weeks ago by WouterDeCoster24k • written 7 weeks ago by lenazhou0

looks like you don't have a columns called

patient.bcr_patient_barcode

or it's empty, therefore you can't assign it to the variable you are creating.

ADD REPLYlink written 7 weeks ago by TriS3.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1489 users visited in the last hour