I'm using TCGA gene expression data. At some point in my work I need to do survival analysis. Is there any way to get the information needed for survival analysis from TCGA, for the samples whose gene expression data I already have?
I'm currently in the middle of something similar - the TCGA Bioinformatics team very kindly helped me out.
If you want to get the raw data yourself, it is in the "Clinical" data. These can be downloaded as text or XML - I've mostly looked at the XML files. I believe there is normally one file for the patient, and one file for every sample taken. (Normally there's just one sample, obtained at time of surgery.)
The problem is that dates in the clinical data, such as date of death, have been redacted to preserve patient privacy. I think that all dates have been replaced with values giving the number of days since original diagnosis.
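If you just want to do a survival curve, you are looking for the number under the XML tag "days_to_death". Patients who were still alive at last contact have no value there, so you also need their last-contact time in order to treat them as censored.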
The day the particular sample was taken is under "days_to_sample_procurement" (i.e. number of days between diagnosis and sample procurement). I think you could find other useful numbers by just doing a find for "days_to".
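In case it is useful, here is a minimal Python sketch of that "find for days_to" idea, using only the standard library. The XML snippet, its namespace URI and the barcode are made up for illustration and are far simpler than a real clinical file; the only thing taken from the real data is that the tags of interest start with "days_to".

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a clinical XML file.
# The namespace URI and the barcode are invented for this example.
SAMPLE_XML = """
<tcga_bcr xmlns:shared="http://example.org/shared">
  <patient>
    <shared:bcr_patient_barcode>TCGA-XX-0001</shared:bcr_patient_barcode>
    <shared:days_to_death>412</shared:days_to_death>
    <shared:days_to_last_contact></shared:days_to_last_contact>
    <shared:days_to_sample_procurement>35</shared:days_to_sample_procurement>
  </patient>
</tcga_bcr>
"""

def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]

def days_to_fields(xml_text):
    """Return {tag: int or None} for every element whose name starts with 'days_to'.

    Empty elements map to None. If a tag occurs more than once,
    the last occurrence wins.
    """
    root = ET.fromstring(xml_text)
    found = {}
    for elem in root.iter():
        name = local_name(elem.tag)
        if name.startswith("days_to"):
            text = (elem.text or "").strip()
            found[name] = int(text) if text else None
    return found

fields = days_to_fields(SAMPLE_XML)
print(fields)
```

For a file on disk you would use `ET.parse(path).getroot()` instead of `ET.fromstring`; the rest stays the same.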
Hope this helps,
It's easy to fetch those data with R.
TCGA-Assembler is a very good tool for you to get those data easily.
This assumes that you are familiar with R.
First, download the tool and unpack it.
Third, execute the following command.
DownloadClinicalData(
  traverseResultFile = "./DirectoryTraverseResult_Jul-08-2014.rda",
  saveFolderName = "./UserManualExampleData/RawData.TCGA-Assembler",  # set the output directory
  cancerType = "BLCA",  # choose the cancer type
  clinicalDataType = c("patient", "drug", "follow_up", "radiation")  # choose the types of clinical data you want to download
)
If you just want the data for survival analysis, you can choose only "follow_up", then use the "days_to_death" and "days_to_last_follow_up" columns in the file as the death and censoring times for survival analysis.
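To make the death/censored idea concrete, here is a rough sketch in plain Python (not part of TCGA-Assembler) that turns those two columns into (time, event) pairs and computes a Kaplan-Meier curve by hand. The rows are invented, and `None` stands in for an empty cell; in practice you would more likely hand the same pairs to an existing survival library.

```python
def to_time_event(days_to_death, days_to_last_follow_up):
    """Map the two columns to (time, event); event 1 = death, 0 = censored."""
    if days_to_death is not None:
        return days_to_death, 1
    if days_to_last_follow_up is not None:
        return days_to_last_follow_up, 0
    return None  # no usable follow-up information for this patient

def kaplan_meier(observations):
    """Return [(time, survival probability)] at each distinct death time."""
    obs = sorted(observations)
    at_risk = len(obs)
    surv = 1.0
    curve = []
    i = 0
    while i < len(obs):
        t = obs[i][0]
        deaths = sum(1 for time, event in obs if time == t and event == 1)
        leaving = sum(1 for time, _ in obs if time == t)
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving  # deaths and censored patients both leave the risk set
        i += leaving
    return curve

# Invented rows: (days_to_death, days_to_last_follow_up)
rows = [(100, None), (None, 200), (300, None), (None, 400), (None, None)]
pairs = [p for p in (to_time_event(d, f) for d, f in rows) if p is not None]
curve = kaplan_meier(pairs)
print(curve)
```

The patient with neither value is dropped before the estimate, which is why only four of the five rows contribute.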
Or you can just get the clinical data from this web link.
I have a strong opinion against using TCGA data for survival analysis; please correct me if I am wrong.
If you check days_to_death or days_to_last_contact, you will find values as far back as 2000 days, way before TCGA even started. My suspicion is that these were patients from other programs who were diagnosed before the TCGA project. If I am correct on this, there is a huge bias here: only patients who were still alive were later recruited to TCGA, while those from these legacy programs who had already died are hidden and never show up in TCGA. I guess the majority of people who have used TCGA data for this kind of analysis never thought about this.
So these times need to be adjusted to the TCGA dates, by subtracting either days_to_collection or days_to_procuration of the samples. The new problem is that the second is almost always empty, while the first is about 80% empty. This means that, starting with a 500-patient project, you get about 400 with either days_to_death or days_to_last_contact available, and are down to fewer than 100 with days_to_collection. That number is not enough for any kind of survival comparison by, say, biomarker or clinical category.
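As a rough illustration of that adjustment (one possible reading of it, not an established TCGA procedure), the Python sketch below restarts the clock at sample collection and drops every patient for whom that is impossible, which also shows how quickly the usable sample shrinks. The patient records are invented; only the field names come from the text above.

```python
def rebase_on_collection(patients):
    """Shift follow-up so that time zero is sample collection.

    Patients are dropped if they have no collection day, no follow-up
    time, or a follow-up that ends before collection.
    """
    rebased = []
    for p in patients:
        died = p["days_to_death"] is not None
        end = p["days_to_death"] if died else p["days_to_last_contact"]
        start = p["days_to_collection"]
        if end is None or start is None or end < start:
            continue
        rebased.append({"time": end - start, "event": int(died)})
    return rebased

# Invented records for illustration only.
patients = [
    {"days_to_death": 900, "days_to_last_contact": None, "days_to_collection": 700},
    {"days_to_death": None, "days_to_last_contact": 1500, "days_to_collection": 30},
    {"days_to_death": 400, "days_to_last_contact": None, "days_to_collection": None},
    {"days_to_death": None, "days_to_last_contact": 2000, "days_to_collection": None},
]
rebased = rebase_on_collection(patients)
print(len(patients), "->", len(rebased), rebased)
```

Running the same count on a real cohort before any modelling is a quick way to check whether enough patients survive the filter.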
Try the Synapse platform (you need to register, but you can sign in with a Google account).
For example, here you can find survival data for Lung Squamous Cell Carcinoma.
Even if a little late... you can analyze survival by using the example here.
That's the main part about overall survival (in ovarian cancer), but it also has links on how to build the dataset and build your own analysis for your preferred tumor type.
This should be the easiest way. You can select from the datasets in PROGgene, or you can upload your own datasets. FYI: it also has datasets from TCGA.
You can also check previous posts explaining how to download Clinical data from TCGA.