Tutorial: Survival analysis of TCGA patients integrating gene expression (RNASeq) data
70
gravatar for TriS
20 months ago by
TriS2.5k
United States
TriS2.5k wrote:

I found myself being often confused about how to do this and by various posts and tutorials online I realized that, if I could put together a step-by-step tutorial, that might be (hopefully) useful...so there we go.

first question: why would you want to do survival analysis based on gene expression data? well, let's say you have a number of genes that you are interested in and they are differentially expressed between tumor and normal samples, it would be very powerful to show that alteration in gene expression correlates with worse survival or earlier tumor recurrence.

here’s a brief overview of what’s in here:
1) obtain the data (RNASeq and clinical data)
2) prepare the data for the analysis
3) run the analysis and interpret the results

in this tutorial I will be using data from kidney cancer (clear cell renal cell carcinoma - KIRC) that you can get as follows:
1- ​go to FireBrowse (http://gdac.broadinstitute.org/), select “Kidney renal clear cell carcinoma” -> “Browse”
2- it should open a page with various data sets for KIRC
3- from “mRNASeq” select “illuminahiseq_rnaseqv2-RSEM_genes_normalized” and save it on your computer
4- from “Clinical” select “Merge_Clinical” and download it
5- unzip the files by double clicking on them
6- rename the folders as “RNA” and “Clinical”

now you can switch to R, you will need the package “survival” (https://cran.r-project.org/web/packages/survival/index.html

# set directory to where the RNA and Clinical folders are
setwd(“/User/myUser/tutorial”)
library(survival)

# read RNA file 
rna <- read.table("RNA/KIRC.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.data.txt",nrows=20533, header=T,row.names=1,sep="\t")
# and take off first row cause we don’t need it
rna <- rna[-1,]

# and read the Clinical file, in this case i transposed it to keep the clinical feature title as column name
clinical <- t(read.table("Clinical/KIRC.merged_only_clinical_clin_format.txt",header=T, row.names=1, sep="\t"))

# first I remove genes whose expression is == 0 in more than 50% of the samples:
rem <- function(x){
  x <- as.matrix(x)
  x <- t(apply(x,1,as.numeric))
  r <- as.numeric(apply(x,1,function(i) sum(i == 0)))
  remove <- which(r > dim(x)[2]*0.5)
  return(remove)
}
remove <- rem(rna)
rna <- rna[-remove,]

now I need to identify normal and tumor samples. this is done using the TCGA barcode (https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode). The two digits at position 14-15 of the barcode will indicate teh sample type, from the link:
“Tumor types range from 01 - 09, normal types from 10 - 19 and control samples from 20 - 29.”

# see the values
table(substr(colnames(rna),14,14))
# 0      1 
# 534   72 

so we haev 534 tumor and 72 normal

# get the index of the normal/control samples
n_index <- which(substr(colnames(rna),14,14) == "1")
t_index <- which(substr(colnames(rna),14,14) == "0")


# apply voom function from limma package to normalize the data
vm <- function(x){
  cond <- factor(ifelse(seq(1,dim(x)[2],1) %in% t_index, 1,  0))
  d <- model.matrix(~1+cond)
  x <- t(apply(x,1,as.numeric))
  ex <- voom(x,d,plot=F)
  return(ex$E)
}

rna_vm  <- vm(rna)
colnames(rna_vm) <- gsub("\\.","-",substr(colnames(rna),1,12))


# and check how data look, they should look normally-ish distributed
hist(rna_vm)
# we can remove the old “rna” cause we don’t need it anymor
rm(rna)

 

now we can finally scale the data. the reason to do so is because we don’t want to use ONLY the fold changes. if we use FC then we average the expression values across all samples, losing the heterogeity that is characteristic of those data. we therefore transform data to z-score so that per each patient for each gene we will have a measure of how many SD away from the mean that is and we will consider those with Z > +/- 1.96 (roughly p=0.05 or 2 SD away) to be differentially expressed

to obtain z-scores for the RNASeq data we use following formula:
z = [(value gene X in tumor Y)-(mean gene X in normal)]/(standard deviation X in normal)

# calculate z-scores
scal <- function(x,y){
  mean_n <- rowMeans(y)  # mean of normal
  sd_n <- apply(y,1,sd)  # SD of normal
  # z score as (value - mean normal)/SD normal
  res <- matrix(nrow=nrow(x), ncol=ncol(x))
  colnames(res) <- colnames(x)
  rownames(res) <- rownames(x)
  for(i in 1:dim(x)[1]){
    for(j in 1:dim(x)[2]){
      res[i,j] <- (x[i,j]-mean_n[i])/sd_n[i]
    }
  }
  return(res)
}
z_rna <- scal(rna_vm[,t_index],rna_vm[,n_index])
# set the rownames keeping only gene name
rownames(z_rna) <- sapply(rownames(z_rna), function(x) unlist(strsplit(x,"\\|"))[[1]])

rm(rna_vm) #we don’t need it anymore

it might take some time, depending from your computer, be patient

now we can start using the clinical data

# match the patient ID in clinical data with the colnames of z_rna
clinical$IDs <- toupper(clinical$patient.bcr_patient_barcode)
sum(clinical$IDs %in% colnames(z_rna)) # we have 529 patients that we could use

# get the columns that contain data we can use: days to death, new tumor event, last day contact to....
ind_keep <- grep("days_to_new_tumor_event_after_initial_treatment",colnames(clinical))

# this is a bit tedious, since there are numerous follow ups, let's collapse them together and keep the first value (the higher one) if more than one is available
new_tum <- as.matrix(clinical[,ind_keep])
new_tum_collapsed <- c()
for (i in 1:dim(new_tum)[1]){
  if(sum(is.na(new_tum[i,])) < dim(new_tum)[2]){
    m <- min(new_tum[i,],na.rm=T)
    new_tum_collapsed <- c(new_tum_collapsed,m)
  } else {
    new_tum_collapsed <- c(new_tum_collapsed,"NA")
  }
}

# do the same to death
ind_keep <- grep("days_to_death",colnames(clinical))
death <- as.matrix(clinical[,ind_keep])
death_collapsed <- c()
for (i in 1:dim(death)[1]){
  if(sum(is.na(death[i,])) < dim(death)[2]){
    m <- max(death[i,],na.rm=T)
    death_collapsed <- c(death_collapsed,m)
  } else {
    death_collapsed <- c(death_collapsed,"NA")
  }
}

# and days last follow up here we take the most recent which is the max number
ind_keep <- grep("days_to_last_followup",colnames(clinical))
fl <- as.matrix(clinical[,ind_keep])
fl_collapsed <- c()
for (i in 1:dim(fl)[1]){
  if(sum(is.na(fl[i,])) < dim(fl)[2]){
    m <- max(fl[i,],na.rm=T)
    fl_collapsed <- c(fl_collapsed,m)
  } else {
    fl_collapsed <- c(fl_collapsed,"NA")
  }
}

# and put everything together
all_clin <- data.frame(new_tum_collapsed,death_collapsed,fl_collapsed)
colnames(all_clin) <- c("new_tumor_days", "death_days", "followUp_days")

now, to do survival analysis we need three main things:
1- time: this is the time till an event happens
2- status: this indicates which patients have to be kept for the analysis
3- event: this tells i.e. which patients have the gene up- or down-regulated or have no changes in expression

since we want to do censored analysis we need to have something to censor the data with. for example, if a patient has no death data BUT there is a date to last followup it means that after that day we know nothing about the patient, therefore after that day it cannot be used for calculations/Kaplan Meier plot anymore, therefore we censor it. so now we need to create vectors for both "time to new tumor" and 'time to death" that contain also the data from censored individuals

# create vector with time to new tumor containing data to censor for new_tumor
all_clin$new_time <- c()
for (i in 1:length(as.numeric(as.character(all_clin$new_tumor_days)))){
  all_clin$new_time[i] <- ifelse(is.na(as.numeric(as.character(all_clin$new_tumor_days))[i]),
                    as.numeric(as.character(all_clin$followUp_days))[i],as.numeric(as.character(all_clin$new_tumor_days))[i])
}

# create vector time to death containing values to censor for death
all_clin$new_death <- c()
for (i in 1:length(as.numeric(as.character(all_clin$death_days)))){
  all_clin$new_death[i] <- ifelse(is.na(as.numeric(as.character(all_clin$death_days))[i]),
                                 as.numeric(as.character(all_clin$followUp_days))[i],as.numeric(as.character(all_clin$death_days))[i])
}

now we need to create the vector for censoring the data which means telling R which patients are dead or have new tumor. in this case if a patient has a “days_to_death” it will be assigned 1, and used in the corresponding analysis. the reason why we censor with death events even for recurrence is pretty important. a colleague made me notice that this is a competitive risk problem, where, although a patient can recur and then die, if a patient is dead, it will not recur, therefore is more accurate to censor for death events.

# create vector for death censoring
table(clinical$patient.vital_status)
# alive  dead 
#   372   161 

all_clin$death_event <- ifelse(clinical$patient.vital_status == "alive", 0,1)

#finally add row.names to clinical
rownames(all_clin) <- clinical$IDs


now one more thing to do is to use the z-score values we obtained from the RNASeq data and define which samples are altered or do not change in this case I just look at mRNA deregulation, you can divide the data into up- and down- regulated too if you wish

# create event vector for RNASeq data
event_rna <- t(apply(z_rna, 1, function(x) ifelse(abs(x) > 1.96,1,0)))

# since we need the same number of patients in both clinical and RNASeq data take the indices for the matching samples
ind_tum <- which(unique(colnames(z_rna)) %in% rownames(all_clin))
ind_clin <- which(rownames(all_clin) %in% colnames(z_rna))

# pick your gene of interest
ind_gene <- which(rownames(z_rna) == "TP53")

# check how many altered samples we have
table(event_rna[ind_gene,])

# run survival analysis
s <- survfit(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin])~event_rna[ind_gene,ind_tum])
s1 <- tryCatch(survdiff(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin])~event_rna[ind_gene,ind_tum]), error = function(e) return(NA))

# extraect the p.value
pv <- ifelse(is.na(s1),next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),3)))[[1]]

# plot the data
plot(survfit(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin])~event_rna[ind_gene,ind_tum]),
     col=c(1:3), frame=F, lwd=2,main=paste("KIRK",rownames(z_rna)[ind_gene],sep="\n"))

# add lines for the median survival
x1 <- ifelse(is.na(as.numeric(summary(s)$table[,'median'][1])),"NA",as.numeric(summary(s)$table[,'median'][1]))
x2 <- as.numeric(summary(s)$table[,'median'][2])
if(x1 != "NA" & x2 != "NA"){
  lines(c(0,x1),c(0.5,0.5),col="blue")
  lines(c(x1,x1),c(0,0.5),col="black")
  lines(c(x2,x2),c(0,0.5),col="red")
}

# add legend
legend(1800,0.995,legend=paste("p.value = ",pv[[1]],sep=""),bty="n",cex=1.4)
legend(max(as.numeric(as.character(all_clin$death_days)[ind_clin]),na.rm = T)*0.7,0.94,
       legend=c(paste("NotAltered=",x1),paste("Altered=",x2)),bty="n",cex=1.3,lwd=3,col=c("black","red"))

now, the p.value in this case tells you that patients with altered TP53 have a worse survival…sad but true.
the same concept can be applied to time to new tumor (new_time) (disease free survival, or DFS) and the resulting graph is the following:

this approach is very flexible, I leave to the reader the satisfaction to subset the events in up/downregulation and/or add covariates.

I know there are a number of different ways of doing this, hopefully this will be clear enought that the reader can replicate this.

lastly, thanks for getting till down here, comments, critiques, corrections etc are always welcome :)

#### UPDATES ####

- used voom transformation for RNASeq data

- created the death event vector based on the vital status from the clinical data

- updated code and images based on the results

- corrected the last rna_log variable (leftover from previous versions) to rna_vm

- updated "min" to "max" for the "days_to_last_followup" based on cying comment

- updated "min" to "max" for the new_tumor and death

 

 

 

 

survival rna-seq tcga tutorial • 17k views
ADD COMMENTlink modified 7 weeks ago by mforde84440 • written 20 months ago by TriS2.5k
3

I think this is a great tutorial, thanks for posting!  

One thing I would add is to be careful of how you define a recurrence for the DFS data.  While overall survival is rather straightforward, DFS can get complicated and is tumor dependent.  For example, in head and neck cancer a patient is clinically classified as "WITH TUMOR" until the tests can be performed to verify the tumor is gone (which can be months from the original appointment).  This can result in a "WITH TUMOR" classification in TCGA data for the first followup when really the patient was tumor free, they just haven't had the test done yet.  Also, in HNSC tumors patients re-diagnosed with a tumor within 6 months of treatment completion it is considered part of the original tumor, not a new tumor, so it is residual tumor and not a recurrence.  For HNSC at least (according to my expert collaborators) a re-diagnosed patient is considered to have a recurrence of the original tumor if it's between 6 months and 5 years, and 5 years post-treatment it is considered a new primary tumor.  Of course these may be different for each tumor type.  So, if one is looking to publish survival plot for DFS I would urge them to talk to a specialist in the cancer type to get specific guidelines on what disease-free survival means, and then manually clean up the DFS time points and events before plotting using the treatment data as a guide. 

ADD REPLYlink written 19 months ago by alolex830

@alolex - thanks for the insightful feedback. you definitely make a valid point and I agree with you, the concept of DFS can be molded depending upon annotation accuracy/frequency and tumor type. and also, yes, before publishing any study on clinical data I would warmly suggest to talk to a clinician/statistician to make sure that the results make sense.

as for the KIRC data go, there is a "person_neoplasm_cancer_status" field that indicates whether the patient is "with tumor" or "tumor free". for the purpose of the tutorial I censored based on the presence of a "days_to_new_tumor_event" value but I agree it is more accurate to use the "with tumor" or "tumor free" indicators. I will update the code asap. thaks

ADD REPLYlink written 19 months ago by TriS2.5k
1

@TriS thank you for this useful tutorial !

I think it would be better to change the lg2 function to:

lg2 <- function(x){
  x <- as.matrix(x)
  x <- t(apply(x,1,as.numeric))
  res <- log2((x+1))
  return(res)
}
ADD REPLYlink written 19 months ago by al3n70rn90

@al3n70rn - thanks for the feedback, I'll change that. I was also looking into the VST transformation from DESeq2 which might even be a better choice. I'll add that asap 

ADD REPLYlink written 19 months ago by TriS2.5k

@TriS - Do you have a github account?  It might be a good idea to post this script there so people can easily download and make code change suggestions via Github.

ADD REPLYlink written 19 months ago by alolex830

@alolex - that's a good idea ,thanks for the suggestion. I'll update the code this evening and then upload it on github as soon as I understand how that works.

ADD REPLYlink written 19 months ago by TriS2.5k
1

Great tutorial, Thanks!

I had a problem when using the TCGA BRCA data, the rnal_log is not being created. So downstream analysis is not performed. It seems the n_index and t_index are formed and also the voom function is fine.

Thanks a lot for your help!

ADD REPLYlink written 16 months ago by jyothipadiadpu10

that is actually a typo on my side, my apologies, it's the rna_vm, not rna_log, that's probably leftover from the first version i wrote, without adding the latest edits.

thanks for catching it, i'm gonna fix it now

ADD REPLYlink modified 16 months ago • written 16 months ago by TriS2.5k

 Great ! it worked now. Thanks a lot.

ADD REPLYlink written 16 months ago by jyothipadiadpu10

Thank you so much for a great tutorial.

Sorry if these queries are a bit basic - probably need some more R learning!

Firstly, could I please ask how we would go about limiting this to, for example, five year overall survival?

Finally, how would I go about limiting the results to say just those who had radiotherapy (patient.radiation_therapy = yes)

Thanks again!

ADD REPLYlink written 16 months ago by jcfleming.ent0

there are a couple of things you could do.

1) keep the survival/follow up time only up to 5 years -> 5(yrs)*365(days)=1825(days)

2) treatment data should be in the merged survival file, if not they are in one of the other files you can download. you need to create a dummy variable with therapy_yes=1 therapy_no=0 and select out only the patients with a 1..there are a million ways of selecting them out, this is just one of them.

3) last but most important, you might want to look up some basic survival analysis in R and how to extract subsets of data. this sounds basic but it's vital since it's what allows you to do things correctly and what you will  refer to when you want to do a lil trickier analysis like the one you are asking 

ADD REPLYlink written 16 months ago by TriS2.5k

Please discard

ADD REPLYlink modified 14 months ago • written 14 months ago by Les Ander80

Please discard

ADD REPLYlink modified 14 months ago • written 14 months ago by Les Ander80

TriS Thank you so much for this tutorial. I found it very helpful.

I am interested in comparing the patients with overexpression of a gene and underexpression of that gene. To this end, I want to compare survival of patients with Z-score in the range <-1.96, versus [-1,1] versus > 1.96.

Is it correct to do the following?

event_rna <- t(apply(z_rna, 0, function(x) ifelse(x > 1.96,2,ifelse(x < -1.96,1,0)))) # for up and down regulated
 
ADD REPLYlink modified 14 months ago • written 14 months ago by Les Ander80
1

yes, that should work, however, the apply function you wrote has an error/typo, you need to use "1" to apply the function row by row

event_rna <- t(apply(z_rna, 1, function(x) ifelse(x > 1.96,2,ifelse(x < -1.96,1,0)))) 
ADD REPLYlink written 14 months ago by TriS2.5k

@TriS Thank you so much!

 

ADD REPLYlink written 14 months ago by Les Ander80

Deleted the comment and instead put it as a new comment

ADD REPLYlink modified 14 months ago • written 14 months ago by Les Ander80

Hi thanks for the great tutorial.  Was there a particular reason you used the normalized gene count as opposed to the raw or scaled estimate?  thank you! 

ADD REPLYlink written 13 months ago by ahdee0

yes and no. those data were available from Firehose but you can apply the same pipeline to other measurements i.e. raw counts

ADD REPLYlink written 13 months ago by TriS2.5k

Thank you - reason why I ask is because I'm wondering what the reasoning behind using the the voom transformation?  I would assume since the data is "normalized" wouldn't we be able to bypass this step?   sorry I'm just trying to understand this a bit more so I'm wondering if you can elaborate on this a bit more. thanks

ADD REPLYlink written 13 months ago by ahdee0

no probs, asking is good. you could use other transformations too. voom transformation is very practical and better than i.e. just taking the log2. here and here there are a few info about why voom is a good method + you can also use it with RPKM, which were the data available from TCGA. however, if you had raw counts you could have used DESEq2 to transform the data, for the sake of this tutorial and this analysis I don't think it will make a big difference (note: this should be tested!!). I do prefer to normalize because of the skeweness of the RNASeq/RPKM data which could inflate the actual differences between the different samples/genes.

ADD REPLYlink written 13 months ago by TriS2.5k

you rock! thank you - btw you should put this on github.  

 

ADD REPLYlink written 13 months ago by ahdee0

I am very new to this kind of analysis and try to do my first using your great tutorial. I took a look at this link which I think is the paper introducing voom. It says that this method can be used with RPKM values, but RNAseq data you provided in this tutorial is normalized with RSEM method. Does this make any difference?

ADD REPLYlink written 12 months ago by Vasei10

Hi quick question. Do you think it is better to eliminate the normal samples all together in the z-score calculations?

ADD REPLYlink written 12 months ago by ahdee0

link to the Firehouse should be updated

from

http://gdac.broadinstitute.org/)

to

http://gdac.broadinstitute.org/

excluding the closing paranthesis.

ADD REPLYlink written 10 months ago by bounlu80

Great post. I have several questions/feedback:

  1. Why did you use KIRC.merged_only_clinical_clin_format.txt instead of KIRC.clin.merged.txt?
  2. There is a Clinical_Pick_Tier1 file when you go to FireBrowse > KIRC > Clinical. Do you know what it is and how it is different than the one you used? I just checked the final numbers for new_death and they correspond to days_to_death (for event dead) and days_to_last_followup (for event alive) combined in the Clinical_Pick_Tier1 file. If you use that file, you might save all the processing that you are doing to get the numbers for new_death.
  3. Although the numbers for survival match, your death_event is different than those given in the file (given as vital_status). I don't know how that file was processed so cannot comment on who is incorrect but wanted to put it out there.
  4. Why are you using min for new_tum_collapsed instead of max when you just want the highest value?
  5. For each of the collapse methods, I had to convert the columns to numeric for e.g. fl <- apply(fl, 2, as.numeric) otherwise I had character and factors which messed up the min/max values.

Thanks!

ADD REPLYlink modified 3 months ago • written 3 months ago by komal.rathi2.7k
  1. you can use either, I just picked one. this tutorial is a guideline, not a rule
  2. I didn't know that was available at the time of this tutorial, thank you for bringing that up, it is very useful
  3. this post is almost a year and a half old, I would expect some differences in the events
  4. because I took the time at which the first tumor re-appeared, not the last one
  5. ok
ADD REPLYlink written 3 months ago by TriS2.5k

Hi, Tris

Thanks for the great tutorial. it is exactly what I want to do on colorectal cancer datasets. I'm an extreme newbie to bioinformatics, being able to run it on my dataset would be something big already. but I seem to have some difficulties in plotting median survival, when I run it on either your original dataset or the colorectal cancer dataset it gives me error message as follow:

x1 <- ifelseis.na(as.numeric(summary(s)$table[,'median'][1])),"NA",as.numeric(summary(s)$table[,'median'][1])) x2 <- as.numeric(summary(s)$table[,'median'][2]) if(x1 != "NA" & x2 != "NA"){ + lines(c(0,x1),c(0.5,0.5),col="blue") + lines(c(x1,x1),c(0,0.5),col="black") + lines(c(x2,x2),c(0,0.5),col="red") + } Error in if (x1 != "NA" & x2 != "NA") { : missing value where TRUE/FALSE needed

Would you mind help me to look into this, if you get a chance? I truly have no idea of how to correct this. Very much appreciated

Cheers,

ADD REPLYlink written 10 weeks ago by tongtong.wang12100
clinical <- t(read.table("Clinical/BRCA.merged_only_clinical_clin_format.txt", header = T, row.names = 1, sep = "\t"))

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :

line 1141 did not have 1098 elements

What should I do for the above error?

ADD REPLYlink modified 9 weeks ago by WouterDeCoster14k • written 9 weeks ago by ragsadhav0
1

I moved your post to a comment (because it is no answer to the original thread).

Your 1141th line doesn't have the same number of elements (columns) as the rest of the file, so you should have a look at that line and see if something is odd.

awk 'NR==1141' yourfile.txt

I'm not sure if the header is counted as a line by R, so perhaps you should also check the next line:

awk 'NR==1142' yourfile.txt

and compare

Or is 1141/1142 the last line in your file containing whitespaces?
Have a look at the number of lines:

wc -l yourfile.txt
ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by WouterDeCoster14k

double post. Sorry, new to biostars. re-posted as "answer" below.

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by JP0
4
gravatar for gitarrestunden
17 months ago by
Austria
gitarrestunden40 wrote:

Hi!

Thank you for this tutorial!! I trid the KIRK dataset and overall, it worked great!

I had to change the following line because of an error:

clinical <- t(read.table("Clinical/KIRC.merged_only_clinical_clin_format.txt",header=T, row.names=1, sep="\t"))

changed to:

clinical <- read.table("Clinical/KIRC.merged_only_clinical_clin_format.txt",header=T, row.names=1, sep="\t")
clinical <- as.data.frame(t(clinical))

Reason:

clinical$IDs <- toupper(clinical$patient.bcr_patient_barcode)
Error in toupper(clinical$patient.bcr_patient_barcode) : 
  error in evaluating the argument 'x' in selecting a method for function 'toupper': Error in clinical$patient.bcr_patient_barcode : 
  $ operator is invalid for atomic vectors

In addition, I get an error when I want to calculate p-values, however, I cant solve the problem:

pv <- ifelseis.na(s1),next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),3)))[[1]]
Error in ifelseis.na(s1), next, (round(1 - pchisq(s1$chisq, length(s1$n) -  : 
  error in evaluating the argument 'yes' in selecting a method for function 'ifelse': Error: no loop for break/next, jumping to top level

Can someone help me to fix this?

Regards, G

ADD COMMENTlink written 17 months ago by gitarrestunden40

Ok, got it, this one works:

> pv <- if is.na(s1)) next else(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]

Regards, G

ADD REPLYlink modified 17 months ago • written 17 months ago by gitarrestunden40

@gitarrestunden - thanks for the feedback. I see a few issues in the code that you wrote, and that includes parenthesis and spaces between words. I'm not sure how the following line works since the parenthesis are off :

pv <- if is.na(s1)) next else(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]

and it should be more like :

pv <- ifelseis.na(s1), next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]

 

ADD REPLYlink written 17 months ago by TriS2.5k

Yes, I know what you mean, but the ifelse statement does not work in my system. I tried with correct parentheses and with or without commas/spaces...

> pv <- ifelseis.na(s1)),next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]
Error: unexpected ',' in "pv <- ifelseis.na(s1)),"

This is the only working version on my system:

> pv <- if is.na(s1)) next else(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]

Strange..

 

 

ADD REPLYlink written 17 months ago by gitarrestunden40

And it is also strange that the first parenthesis which should come directly after the "if" always disappears when I click the add comment button!!

ADD REPLYlink written 17 months ago by gitarrestunden40

hmm...I wonder if there is a "lost" parenthesis somewhere else in your code

the ifelse function in R requires the commas and the reason why it tells you that there is an unexpected "," is because there is a parenthesis missing. the code you wrote is a one liner that sums up the formal if/else statements

for the parenthesis not showing up in the forum, I really have no idea :)

ADD REPLYlink written 17 months ago by TriS2.5k

Hi Tris thank you for the tutorial, it is very useful. I get a similar error in generating p- value. Also some of the corrections made in the command line still shows an error. Can someone help fix this. Thank you in advance.

pv <- ifelseis.na(s1),next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),3)))[[1]]

Error in ifelseis.na(s1), next, (round(1 - pchisq(s1$chisq, length(s1$n) - : no loop for break/next, jumping to top level

pv <- if is.na(s1)) next else(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]

Error: unexpected symbol in " pv <- if is.na"

pv <- ifelseis.na(s1), next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]

Error: unexpected ',' in "pv <- ifelseis.na(s1),"

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by devi.jaganathan0

it seems that there is a problem on the comment above where some brackets don't show up.

pv <- ifelse ( is.na(s1),next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),3)))[[1]]
ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by TriS2.5k

Hi Tris, Thanks a lot for the correction.

ADD REPLYlink written 10 weeks ago by devi.jaganathan0
1
gravatar for vibes1002003
16 months ago by
vibes100200320
Germany
vibes100200320 wrote:

I would like to know how can extend this code for survival analysis of multiple genes altogether or cancer subtypes ?  I want to check effect of set of genes/ subtypes on survival probability through KM plot?

ADD COMMENTlink written 16 months ago by vibes100200320
1

you should create a dummy variable in the survival analysis where you select patients with the gene signature altered

i.e. for genes A, B and C, you have a matrix m with row=genes, cols=patients: 

gene_sig <- c(A,B,C)
rna_data_signature <- rna_data[gene_sig,]
event_rna <- apply(rna_data, 2, function(x) ifelse(sum(abs(x) > 1.96 >= 1),1,0))

not tested but it should work. the idea is to use apply to select, by column, patients that have at least one gene in the signature altered

you can than use the event_rna vector in the survival analysis

ADD REPLYlink written 16 months ago by TriS2.5k

How can we make non-altered case since we need two groups -altered and unaltered for generating KM plot ? for instance you take those genes which are altered at least once in all the patients then how can we make unaltered case? 

ADD REPLYlink modified 16 months ago • written 16 months ago by vibes100200320

if you look at the code I put in the answer, the patients with at least one alteration will be assigned a 1, if they have no alterations they will be assigned a zero. it basically checks column by column (= patient by patient) how many of those genes are altered, if the # of altered genes is > 1, then they enter the "altered" group.

ADD REPLYlink written 16 months ago by TriS2.5k
1
gravatar for cying
12 months ago by
cying10
China
cying10 wrote:

Hi, TriS. Your tutorial is very helpful.

I have done my own survival analysis using clinical file before merged and cleaned, but actually, the merge clinical file downloaded from the FireBrowse is more concise. Here I list two questions:

  1. "days_to_last_followup", you said that get the minmum because the it's the latest number.But the definition of "days_to_last_followup" is time interval from the date of last followup to the date of initial pathologic diagnosis, which means the the longer days represent latest followups, so the days should take the maximum number.

  2. to plot DFS survival curve, a vector contains censored or completed status should be created. In the merge clinical file, the column "patient.new_tumor_events.new_tumor_event_after_initial_treatment" defines the new tumor event status "yes", "no", or "NA".How to understand "NA" status? Does it mean the patient haven't occur new tumor event at the time of the last followup or means haven't survey the new tumor event status at the time of the last followup? The variable "all_clin$new_time" in your code shows you have regarded the "NA" as "haven't occur new tumor event"(the same as status "no").Is that Right?

Thanks.

ADD COMMENTlink modified 12 months ago • written 12 months ago by cying10

hi cying, you made very good points and thank you for posting them. 1. yes, you are right! I overlooked the definition of "days to last followup" and you are correct, you should use the highest one. I will change the code accordingly 2. the NA in all_clin$new_time were counted as event that didn't happen, in this case no recurrence happened. to be honest I'm not sure about the difference between "no" and "NA" in "new_tumor_event_after_initial_treatement". i.e. there is an xls file for melanoma enrollment that annotates the field as "Indicate whether the patient had a new tumor event (e.g. metastatic, recurrent, or new primary tumor) after the date of initial melanoma diagnosis. If the tumor that yielded the sample submitted for TCGA was a new tumor event (i.e. it was diagnosed after the initial diagnosis), the remaining questions should not be completed. Please Note:If the patient did not have a new tumor event, the remaining questions can be skipped." therefore when they annotate for no/NA they might indicate the same thing, and it actually depends from who annotates the data. hope this helps!

ADD REPLYlink written 12 months ago by TriS2.5k

That's a good point, maybe the the blank space of "new_tumor_event_after_initial_treatement" column is a hint of no new tumor events. Thanks!

ADD REPLYlink written 12 months ago by cying10
1
gravatar for enxxx23
3 months ago by
enxxx23160
Finland
enxxx23160 wrote:

1) would be possible to get also the code which generates the DFS plot also?

2) is there somewhere some cleaned up and up to date version of this code?

3) when running this code (and fixing myself some bugs which still has like for example adding "clinical = data.frame(clinical)" ) the KIRK-OS plot looks slightly different in the sense that I get NotAltered = NA and also I get a lot of warnings when computing all_clin$new_time ,like for example, " In ifelseis.na(as.numeric(as.character(all_clin$new_tumor_days))[i]), ... : NAs introduced by coercion"). Otherwise the plot looks similar. Is this normal?

ADD COMMENTlink modified 3 months ago • written 3 months ago by enxxx23160
1

1) for the DFS plot you can use the DFS data in the clinical file. 2) no, the version here is what I had the time to create 3) you will get some warnings, if your code is correct these warnings come from the data and it's ok, some will be NA

ADD REPLYlink written 3 months ago by TriS2.5k
1
gravatar for JP
7 weeks ago by
JP0
JP0 wrote:

thanks so much for the step-by-step tutorial, it has been very helpful for a biologist who is new to R and survival analysis.

I noticed that I am getting the following errors that don't allow for survfit() to function properly:

Error #1:

all_clin$new_death <- c() for (i in 1:length(as.numeric(as.character(all_clin$death_days)))){ + all_clin$new_death[i] <- ifelseis.na(as.numeric(as.character(all_clin$death_days))[i]), + as.numeric(as.character(all_clin$followUp_days))[i],as.numeric(as.character(all_clin$death_days))[i]) + }

Which gives me the following error:

In ifelseis.na(as.numeric(as.character(all_clin$death_days))[i]),  ... :

NAs introduced by coercion

I read above that NAs shoudn't be a problem with this step, so I continued. However, when I get to the survfit function I get the second error

Error #2:

> s <- survfit(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin])~event_rna[ind_gene,ind_tum])

Error in Surv(as.numeric(as.character(all_clin$new_death))[ind_clin], : Invalid status value, must be logical or numeric

Any help with this would be greatly appreciate, and I am sorry in advance if I don't understand your response right away, as I said, I am very new to R and survival analysis.

Thanks again for the great tutorial!

JP

ADD COMMENTlink written 7 weeks ago by JP0
1
gravatar for mforde84
7 weeks ago by
mforde84440
mforde84440 wrote:

Hi TriS,

Thank you very much for posting this. It was very helpful.

I hope you don't mind, but I decided to fork your code up to github, so that there's an example that's runs out of box. https://github.com/mforde84/RNAseq-Survival-Analysis-TCGA-KIRC

My results are slightly different than yours, so I would very much appreciate if you would review my code and see if anything erroneous sticks out. I simplified a number of things as well (like the input data set and preprocessing steps), just to help improve readability.

I think the differences are due to two things: 1)different versions of the RNAseq dataset in question (mine was accessed on Feb 2, 2017), and 2) specific R functionality has changed in less obvious ways which may break or alter the results of the analysis. For 1), collapsing and calculating the censors using the omics data provided generates slightly different results, but only for a handful of samples. Not much, but there certainly is a difference. For two, e.g., reading the clinical omics table into a R now creates an atomic character vector which doesn't utilize the $ operator, so later in your scripting you'll throw an error:

> clinical <- t(read.table("KIRC.merged_only_clinical_clin_format.txt", header=T, row.names=1, sep="\t"))
> typeof(clinical)
[1] "character"
> clinical$IDs <- toupper(clinical$patient.bcr_patient_barcode)
Error in clinical$patient.bcr_patient_barcode : 
  $ operator is invalid for atomic vectors
> clinical <- data.frame(clinical) #working solution

Not a big deal. Just thought it would be helpful to keep something current for anyone who want to learn.

Thanks again, Martin

ADD COMMENTlink written 7 weeks ago by mforde84440

hi Martin

thank you for putting it on github, that's something I should have done a long time ago. and thank you for the updates as this post is more than one year old, I would expect to see some differences with what I posted at that time. clinical data are/should be updated and that will change, slightly, the results. Seb

ADD REPLYlink written 7 weeks ago by TriS2.5k
0
gravatar for nashtf
17 months ago by
nashtf0
United States
nashtf0 wrote:

I tried running this with the BRCA data and get the following error on reading in the merged clinical data:

clinical <- t(read.table("Clinical/BRCA.merged_only_clinical_clin_format.txt",header=T,row.names=1,sep="\t"))

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 2 did not have 1099 elements
ADD COMMENTlink written 17 months ago by nashtf0
2

add 

quote="" 

to the read.table() function, that should fix it

ADD REPLYlink written 17 months ago by TriS2.5k
0
gravatar for gitarrestunden
17 months ago by
Austria
gitarrestunden40 wrote:

Hi!

One last question: You wrote that "you can divide the data into up- and down- regulated too if you wish"

I guess i have to change this line, but I´m not sure how to do that correctly:

event_rna <- t(apply(z_rna, 1, function(x) ifelse(abs(x) > 1.96,1,0)))

 

Thanks for help again!!

ADD COMMENTlink written 17 months ago by gitarrestunden40
1
event_rna <- t(apply(z_rna, 1, function(x) ifelse(x > 1.96,1,ifelse(x < -1.96,2,0))))
ADD REPLYlink written 17 months ago by TriS2.5k
0
gravatar for sunqiangzai
16 months ago by
China
sunqiangzai0 wrote:

Great tutorial!I am curious about the number of the altered and not-altered group,since the number of patients are 529,why the altered group are so large?

ADD COMMENTlink written 16 months ago by sunqiangzai0

I suppose you refer to the numbers in the legend. those are not the number of patients but the median survival in days in the altered and not altered groups. to get the months you can do (ndays/365)*12

ADD REPLYlink written 16 months ago by TriS2.5k
0
gravatar for Palgrave
14 months ago by
Palgrave20
Singapore
Palgrave20 wrote:

Hi all, great tutorial.

I have two questions:

1) When calculating p-values I get this error, similar as above:

pv <- if is.na(s1)) next else(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]     #error
Error: unexpected symbol in "pv <- if is.na"

 

2) Is there a method to find the relevant genes that are significant?

 

Best wishes,

R

ADD COMMENTlink modified 14 months ago • written 14 months ago by Palgrave20

1) When calculating p-values I get this error, similar as above:

pv <- if is.na(s1)) next else(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]     #error
Error: unexpected symbol in "pv <- if is.na"

your parenthesis are off, try this (not tested):

pv <- ifelse(is.na(s1),next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7)))[[1]]

2) Is there a method to find the relevant genes that are significant?

you mean differentially expressed genes? you might want to read a lil about z-scores and differential gene expression analysis or limma

ADD REPLYlink written 14 months ago by TriS2.5k

Is it possible to run through all my rownames and print all corresponding plots and p-values:

ind_gene <- which(rownames(z_rna) == "TP53")
ADD REPLYlink written 14 months ago by Palgrave20

yes, using for loops and even writing it as a function :)

however, you might not want to print 15k to 30k plots, you might want to find a way to keep those you are interested in

ADD REPLYlink modified 14 months ago • written 14 months ago by TriS2.5k

Would you mind helping, I tried but got errors

ADD REPLYlink written 14 months ago by Palgrave20
1

I'm going to give you a pseudocode that you can apply to your analysis:

for 1 to # of genes
  run survival analysis
  extract p.value
  if p.value is significant & some other conditions
    plot the graph

...learning how to code does come with LOTS of trial and error, it's good that you got errors cause it can show you what you did wrong.

also, your question then becomes more general on the lines of "how to apply a function to multiple genes", which can be 1) googled 2) posted on the general forum since it's not specific anymore to this tutorial

ADD REPLYlink written 14 months ago by TriS2.5k
0
gravatar for Les Ander
14 months ago by
Les Ander80
United States
Les Ander80 wrote:

When I use the GBM data and the TP53, I get an error in the vm function (at the    

    d <- model.matrix(~1+cond)  

step.

The error message is

   Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
      contrasts can be applied only to factors with 2 or more levels
   Called from: `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]])

 

Any idea how to fix this?

Thanks

ADD COMMENTlink modified 14 months ago • written 14 months ago by Les Ander80

can you check that the n_index and t_index are not empty? also that cond does contain data?

ADD REPLYlink written 14 months ago by TriS2.5k
0
gravatar for Les Ander
14 months ago by
Les Ander80
United States
Les Ander80 wrote:

The histogram of z-scores is not quite symmetric, it tends to have a heavier tail on the lower end (i.e. there are more genes below summit) so it does not look like that > 1.96 is equivalent to p=0.05 (my guess is that it is much lower than that. Similarly, I don't think that the Z < -1.96 is p=0.05 either (my guess is that it is much higher than that). Any suggestions how to deal with such asymmetric distribution? My inclination would be to use percentiles i.e. top 10% of the data versus bottom 10% of the data. However, I am not sure how to modify the code to do that. Any suggestions would be much appreciated.

I think the change has to be something like below but not sure if it is correct:

pr_thresh=quantile(z_rna, c(0.1, 0.9))
LOWTHRESH=pr_thresh[1]
HIGHTHRESH=pr_thresh[2]
event_rna <- t(apply(z_rna, 1, function(x) ifelse(x > HIGHTHRESH,2,ifelse(x < -LOWTHRESH,1,0)))) 

Any thoughts on the use of percentiles and/or on the implementation?

Thanks

ADD COMMENTlink modified 14 months ago • written 14 months ago by Les Ander80
1

the data won't always be perfectly symmetric, you can cut your data so that you have the same number of patients in each side in order to balance it out when doing survival analysis. however, when we are talking about hundreds of patients, small differences in n do not cause issues. some statisticians here use even z=+/-1.5 as long as it gives them symmetric/even groups.

the 1.96 threshold is not a must, cBioportal uses 2, you can use what you prefer and what fits your needs as long as there is a rationale behind it. also, the z scores here are calculated comparing T to N so you won't perfectly center the data around 0 as if you would scale the data within the same sample group.

for percentile you can use the top 10/20 etc...I'm not aware of a hard threshold, the important thing, as above, is that you have an explanation of why you pick that # (although rarely you read one in papers :)).

lastly, if you are worried about data being skewed, you can also calculate skeweness of the data in R and evaluate in in that way. a value +/- 3 is a no go...but I don't think you will get anywhere close to it with the Z scores from this tutorial

hope this helps!

ADD REPLYlink written 14 months ago by TriS2.5k

@TriS, Thank you for the explanation!

 

ADD REPLYlink written 14 months ago by Les Ander80

hi, I'm wondering if you used 10% threshold above, how to plot survival curve with only those two lines (> HIGHTHRESH vs < -LOWTHRESH). thanks

ADD REPLYlink written 9 months ago by ShirleyDai20
0
gravatar for Les Ander
14 months ago by
Les Ander80
United States
Les Ander80 wrote:

I am interested in looking only upto 5 years (1825 days).

I'm new to survival analysis.

Is it correct to discard all the samples whose max(days_to_last_known_alive, days_to_last_followup) is <=1825

by doing this:

    all_clin=all_clin[all_clin$new_death <= 1825,] 

Is this correct? (I have a feeling that it is not because I am simply throwing away patient data who have lived beyond 5 years). If the above approach is not correct, can you please tell me what is the right way to do it?

Thanks

 

ADD COMMENTlink written 14 months ago by Les Ander80

code seems fine to me but I'm a lil confused...you want to look up to 5 years but then you don't want to throw away patients that have lived beyond 5 years...if you decide a threshold that'll be applied to both living and deceased patients and you will get a snapshot of what happened up to that point in time.

as a general suggestion, look up some stuff on survival analysis and data subsetting. although I'm glad to see that this tutorial helps, it gives the basis on how to do it and it can be much more useful when you can grasp the concepts behind censored survival analysis and how/why to do it and tweak it to your needs

ADD REPLYlink modified 14 months ago • written 14 months ago by TriS2.5k
0
gravatar for ShirleyDai
9 months ago by
ShirleyDai20
ShirleyDai20 wrote:

This tutorial is really helpful to me. Can you give some scripts on how to perform comparison like "Up-regulated" vs "Down-regulated"? It would be better to add plot and legend functions. I tried to mask all event_rna==0 as NA (event_rna was coded same as in your script) and then remove them. But I found most genes were removed, including my interested gene.

Thanks

ADD COMMENTlink written 9 months ago by ShirleyDai20
2

You need a bit modification in above script:

library(survival)

z_rna (expression normalized data)

clinical data

then you have to calculate median of your gene

median(z_rna$yourgene)

rna_event <- ifelse(z_rna$yourgene >= (median), 1,0) ## 1,0 ; up & downregulated class

then follow similar above code...

ADD REPLYlink modified 9 months ago • written 9 months ago by Mike500
0
gravatar for ShirleyDai
9 months ago by
ShirleyDai20
ShirleyDai20 wrote:

Hi, I have a quick question about z-score you used. "mean gene X in normal" was used as "mean expression in diploid samples" in your script. Then should I only inquire paired tumor and normal samples in the TCGA data portal?

Thanks

ADD COMMENTlink written 9 months ago by ShirleyDai20
0
gravatar for zsucllj
9 months ago by
zsucllj0
zsucllj0 wrote:

Hi, thank you very much for your tutorials I have a quick question. I want to divide the tumor patients into two groups based on their mRNA expression (X gene) and then evaluate whether X gene mRNA expression level has any correlation with survival. Would you kindly tell me how to write the script? Thanks a lot : )

ADD COMMENTlink written 9 months ago by zsucllj0

See my question above. Mike gave the exact answer and related script.

ADD REPLYlink written 9 months ago by ShirleyDai20
0
gravatar for biostata
9 months ago by
biostata0
biostata0 wrote:

The error message is

clinical$IDs <- toupper(clinical$patient.bcr_patient_barcode)

Error in clinical$patient.bcr_patient_barcode : $ operator is invalid for atomic vectors

ADD COMMENTlink written 9 months ago by biostata0
0
gravatar for zsucllj
8 months ago by
zsucllj0
zsucllj0 wrote:

Hi, Should I make any changes to the above script? because I am just interested in the cancer patients but not the normal controls? Thanks!

ADD COMMENTlink written 8 months ago by zsucllj0

For this you need to further filter out samples from TCGA barcode , tumor types range from 01 - 09, normal types from 10 - 19 Control samples from 20 - 29, https://wiki.nci.nih.gov/display/TCGA/Working+with+TCGA+Data.

ADD REPLYlink written 8 months ago by Mike500
0
gravatar for zsucllj
8 months ago by
zsucllj0
zsucllj0 wrote:

Hi, Thank you very much for your help.Would you kindly tell me why the tumor types in this script are marked as 14? They should be 01-09. Thanks!

ADD COMMENTlink written 8 months ago by zsucllj0

Hi zsucllj,

in this script 14 is not tumor types/sample, this is the two digits at position 14-15 of the barcode (eg. TCGA-02-0001-01 , including "-"),

plase have a look at figure & table (https://wiki.nci.nih.gov/display/TCGA/Working+with+TCGA+Data)

ADD REPLYlink modified 8 months ago • written 8 months ago by Mike500

if you are replying to the same question please use the "add comment" link :)

ADD REPLYlink written 8 months ago by TriS2.5k

sorry TriS, I dont know whats different b/w "add comment" and "add reply". if my comments are not according to proper format , please delete them.

ADD REPLYlink modified 8 months ago • written 8 months ago by Mike500
0
gravatar for jzhou
8 months ago by
jzhou10
USA, California
jzhou10 wrote:

Hi, thank you very much for an informative tutorial! I apologize if this is an overly naive question, but I was wondering what new things could be learned from conducting your own survival analysis of TCGA data like in this tutorial when on Firehose there are already analyses of nearly every TCGA cancer data set including correlations between mRNAseq data and survival rates in their "Clinical Analysis" pages. Thanks!

ADD COMMENTlink written 8 months ago by jzhou10

When I look at correlations to survival with Firehose I find that they often list 0 genes correlating with survival. In contrast to their analysis, I have found thousands of genes that have significant p-values.

ADD REPLYlink written 7 months ago by Jordan Anaya660
0
gravatar for biobiu
8 months ago by
biobiu0
United States
biobiu0 wrote:

Hi, Great tutorial and thank you very much! I have some further comments and questions:

1) If someone is willing to get another/additional data from other sources, it is very important that the patients in "new_death", "death_event" and in the "rna event" will be in the same order in all of them (in this tutorial its unnecessary since all the data is sorted by the same way in gdac).

2) How can it be that some patients have a value which is not NA in the death_days data, but in patient.vital_status is "alive" (e.g TCGA-A3-3347, or TCGA-A3-3325), in both cases patient.follow_ups.follow_up.vital_status shows "dead" but patient.vital_status shows "alive", however it seems to me as a bug in the updated data uploaded to gdac. what do you think? Do you understand what is the difference between patient.vital_status to patient.follow_ups.follow_up.vital_status? If you will agree that this is kind of a bug, I suggest the following correction:

all_clin$death_event <- ifelse(all_clin$death_days =="NA" , 0,1)

instead of: all_clin$death_event <- ifelse(clinical$patient.vital_status == "alive", 0,1)

3) Just to make sure- 1)the reason that you got NA in the median survival is because that most of them are still alive? 2) Is the figure is up to date? because things are a bit difference (mean survival for altered is now 3554, for not altered is the same, the gdac data was upadated, do you believe that is the case?).

ADD COMMENTlink modified 8 months ago • written 8 months ago by biobiu0

I program mainly in Python so I haven't tried running the code in this thread, but a patient vital status issue was raised recently for OncoLnc: https://github.com/OmnesRes/onco_lnc/issues/1

In this repository you will find Python code for extracting clinical data from files downloaded from the old TCGA data portal which were named "clinical_follow_up" and "clinical_patient".

For OncoLnc I had to parse the clinical data for 21 different cancers, and most of the time in the files when a patient is 'dead' they have '[Not Applicable]' written in the 'last_contact_days_to' column, but in some cancers and for some patients they list a number in that column even when a patient is dead.

Given that this issue was most prevalent for GBM and OV and I believe those are two of the earlier studies performed by TCGA I can't help but wonder if TCGA changed how they report the follow_up times at some point and this resulted in an inconsistent file format.

To answer your point 3 I did notice that I would get 'NA' for median survival when I had more patients listed as 'alive' and once I fixed the code I had much fewer 'NA' values for median survival.

ADD REPLYlink written 7 months ago by Jordan Anaya660
0
gravatar for sharmi.banerji
7 months ago by
sharmi.banerji0 wrote:

Hello,

I have a question on the legend. I used this script in similar analysis except that I have a status column indication "altered" and "not altered". I was wondering if I use

# plot the data
 plot(survfit(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin])~event_rna[ind_gene,ind_tum]),
 col=c(1:3), frame=F, lwd=2,main=paste("KIRK",rownames(z_rna)[ind_gene],sep="\n"))

that means the color of the graph will be black for "altered" and red for "not altered" (as per order of the factors). But doesn't manually adding the legend like this

# add legend
legend=c(paste("NotAltered=",x1),paste("Altered=",x2)),bty="n",cex=1.3,lwd=3,col=c("black","red"))

swaps the legends of the two graphs? Or maybe I am missing something here.

Greatly appreciate help!

Thanks

ADD COMMENTlink written 7 months ago by sharmi.banerji0
0
gravatar for lazymaggie
6 months ago by
lazymaggie0
United States
lazymaggie0 wrote:

I feel something is wrong in your codes:

ind_tum <- which(unique(colnames(z_rna)) %in% rownames(all_clin))

ind_clin <- which(rownames(all_clin) %in% colnames(z_rna))

the patient's barcodes are not matched using the two ids:

rownames(all_clin)[ind_clin]

colnames(event_rna)[ind_tum]

both all_clin and event_rna have duplicated samples, so a solution about the above problem is using the following codes:

all_clin <- all_clin[!duplicated(rownames(all_clin)), ] # remove duplicated samples in all_clin

event_rna <- event_rna[, !duplicated(colnames(event_rna))] # remove duplicated samples in event_rna

commonId <- intersect(colnames(event_rna), rownames(all_clin)) # obtain the common samples

ind_tum <- which(colnames(event_rna) %in% commonId)

ind_clin <- which(rownames(all_clin) %in% commonId)

ADD COMMENTlink written 6 months ago by lazymaggie0

lazymaggie, thank you for pointing this out. I will check the code and edit where necessary

ADD REPLYlink written 5 months ago by TriS2.5k
0
gravatar for charco
5 months ago by
charco30
charco30 wrote:

Thanks for writing this up.

A question about the input data however. I thought that illuminahiseq_rnaseqv2-RSEM_genes_normalized was not RPKM (or more correctly FPKM for paired end data), but actually a quantile normalised raw count provided by RSEM (see here https://wiki.nci.nih.gov/display/TCGA/RNASeq+Version+2).

It would be important not to confuse the two.

ADD COMMENTlink written 5 months ago by charco30

it doesn't really matter in this case since they'll be transformed to zscores

ADD REPLYlink written 5 months ago by TriS2.5k

So which one is it?

Maybe it doesn't make a difference for your workflow, but it may do for others. For example, using voom on quantile normalised data may not be optimal. Also, if someone was performing downstream analyses that aggregated data across genes (eg gene set enrichment), you wouldn't want them to think they were using RPKM when they were actually using counts.

It is worth noting for the many people (including students) reading this thread who may go on to perform a variety of another analyses. Please consider correcting your description of the input data (if there is an error?).

ADD REPLYlink written 5 months ago by charco30

firebrowse (the file I suggested) provides you with RSEM values, not FPKM, not RPKM. for data normalization there are multiple ways of normalizing them and it depends from what you want to do. for the scope of this tutorial voom works well. if you prefer other methods DESeq2 or edgeR offer different flavors that might better fit your needs. also, bare in mind that what I described is a guide, not the only way of doing it.

ADD REPLYlink written 5 months ago by TriS2.5k

Are you sure? The file you suggested has 'normalized count' written in the header. RSEM gives transcripts per million (TPM) by default, which is not the same as a normalized count

ADD REPLYlink written 5 months ago by charco30

so, the normalized values are RSEM data divided by the 75th percentile and multiplied by 1000, as you correctly pointed out in the fist comment. to get TPM you have to multiply the RSEM estimate for 10^6 (RSEM paper) in this case, those are normalized RSEM values, not TPM, not RPKM, not FPKM. hope this helps

ADD REPLYlink written 5 months ago by TriS2.5k
0
gravatar for jzhou
5 months ago by
jzhou10
USA, California
jzhou10 wrote:

Hi, thanks again for a great tutorial. This might be a very basic question, but I was wondering if anyone could tell me what the numbers for Altered and NotAltered exactly refer to? Thanks!

ADD COMMENTlink written 5 months ago by jzhou10

number of samples with gene expression higher/lower than a specific threshold (i.e. z-score of 2) -> altered opposite -> not altered

ADD REPLYlink written 5 months ago by TriS2.5k
0
gravatar for cying
4 months ago by
cying10
China
cying10 wrote:

Hi, in regard to the "days_to_new_tumor_event_after_initial_treatment", I think the smaller value was more reasonable than the higher one. The DFS was defined as the time from initial treatment until the date of new tumor event. And the first time of new tumor event should be selected instead of the last time. This was opposite to the "days of last follow-up".

ADD COMMENTlink written 4 months ago by cying10

hi cying, thank you for spotting that, you are correct. this code went through several updates, and that's one of those I didn't implement yet.

ADD REPLYlink written 4 months ago by TriS2.5k
0
gravatar for 352182595
10 weeks ago by
3521825950
3521825950 wrote:

Hi, thanks for your great tutorial. Could you help me a question about survival analysis.In clinical, i want to updata survival function in a dynamic way but lack of multiple gene values to fit such a longitudinal model,so i want to use landmark approach to update the prognosis.However, i am not sure whether a genomic variable is a time-varying effect covariate,which means the effect of the gene might vary with time. Thank you for your patience,and hope that you could share me more detailes.

ADD COMMENTlink written 10 weeks ago by 3521825950

this question goes beyond what's described in this tutorial, you should start a new question in Biostar for that.

ADD REPLYlink written 10 weeks ago by TriS2.5k
0
gravatar for JP
7 weeks ago by
JP0
JP0 wrote:

thanks so much for the step-by-step tutorial, it has been very helpful for a biologist who is new to R and survival analysis.

I noticed that I am getting the following errors that don't allow for survfit() to function properly:

Error #1:

all_clin$new_death <- c() for (i in 1:length(as.numeric(as.character(all_clin$death_days)))){ + all_clin$new_death[i] <- ifelseis.na(as.numeric(as.character(all_clin$death_days))[i]), + as.numeric(as.character(all_clin$followUp_days))[i],as.numeric(as.character(all_clin$death_days))[i]) + }

Which gives me the following error:

In ifelseis.na(as.numeric(as.character(all_clin$death_days))[i]),  ... :

NAs introduced by coercion

I read above that NAs shoudn't be a problem with this step, so I continued. However, when I get to the survfit function I get the second error

Error #2:

> s <- survfit(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin])~event_rna[ind_gene,ind_tum])

Error in Surv(as.numeric(as.character(all_clin$new_death))[ind_clin], : Invalid status value, must be logical or numeric

Any help with this would be greatly appreciate, and I am sorry in advance if I don't understand your response right away, as I said, I am very new to R and survival analysis.

Thanks again for the great tutorial!

JP

PS- I am sorry if this comes up twice, I am new to biostars and still don't know the difference between answer/reply/comment.

ADD COMMENTlink modified 7 weeks ago • written 7 weeks ago by JP0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 469 users visited in the last hour