Question: Survival Analysis Using TCGA Data
jack wrote (6.8 years ago):

I'm using TCGA gene expression data. As part of my work I need to do survival analysis. Is there any way to get information from TCGA for survival analysis of the samples for which I have gene expression data?

tcga bioinformatician • 30k views
dirigible2012 (European Union) wrote (6.8 years ago):

I'm currently in the middle of something similar - the TCGA Bioinformatics team very kindly helped me out.

If you want to get the raw data yourself, it is in the "Clinical" data. These can be downloaded as text or XML - I've mostly looked at the XML files. I believe there is normally one file for the patient, and one file for every sample taken. (Normally there's just one sample, obtained at time of surgery.)

The problem is that dates in the clinical data, such as date of death, have been redacted to preserve patient privacy. I think that all dates have been replaced with values giving the number of days since original diagnosis.

If you just want to do a survival curve, you are looking for the number under the XML tag "days_to_death".

The day the particular sample was taken is under "days_to_sample_procurement" (i.e. number of days between diagnosis and sample procurement). I think you could find other useful numbers by just doing a find for "days_to".
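The "find for days_to" idea can also be done programmatically. Below is a minimal sketch in Python using the standard library XML parser. The XML string is a hypothetical, heavily shortened stand-in for a real clinical file (the namespace URL and the `find_days_to` helper are made up for illustration), so treat it as a starting point rather than a tested TCGA parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened stand-in for a TCGA clinical XML file.
# The namespace URL is a placeholder, not the real TCGA schema location.
SAMPLE_XML = """<patient xmlns:shared="http://example.org/shared">
  <shared:vital_status>Alive</shared:vital_status>
  <shared:days_to_last_followup precision="day">389</shared:days_to_last_followup>
  <shared:days_to_death precision="day" procurement_status="Not Applicable"/>
</patient>"""


def find_days_to(xml_text):
    """Map each element whose local name starts with 'days_to' to its text.

    Elements without a value are mapped to None.
    """
    found = {}
    for elem in ET.fromstring(xml_text).iter():
        # ElementTree stores namespaced tags as "{namespace}name".
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name.startswith("days_to"):
            text = (elem.text or "").strip()
            found[local_name] = text or None
    return found


days = find_days_to(SAMPLE_XML)
print(days)
```

On a real file you would read the text from disk first and pass it to the same function.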

Hope this helps,



Thanks, but which value should I extract? I looked at the XML file and found the lines with the days_to_death tag. They look like this:

<shared:days_to_death precision="day" xsd_ver="1.12" tier="1" cde="3165475" owner="TSS" procurement_status="Not Applicable"/>
<shared:days_to_last_followup precision="day" xsd_ver="1.12" tier="1" cde="3008273" owner="TSS"
Reply written 6.8 years ago by jack (modified 12 months ago by _r_am32k)

That's interesting. I presume the XML file works like an HTML file, so you want the value in between the two tags. (I've replaced the angle brackets with square because Biostar is interpreting them as HTML.)

e.g. (tags shortened a bit)

[shared: days_to_death] VALUE [/shared: days_to_death]

I've had a look at an example file, and it looks to me like a missing value is written as a single self-closing tag (one ending in "/") rather than a start tag and an end tag around a value. In this case, you are missing the days_to_death, which suggests the patient is still alive.

If you look at the example below, the days_to_death value is also missing, but the vital status is "Alive" and there is a value for days to last followup.

[shared:vital_status xsd_ver="2.6" restricted="false" procurement_status="Completed" owner="TSS" cde="5" display_order="25" preferred_name="vital_status" tier="2" source_system_identifier="492461"] Alive[/shared:vital_status]
[shared:days_to_last_followup xsd_ver="1.12" procurement_status="Completed" owner="TSS" cde="3008273" tier="1" precision="day"] 389[/shared:days_to_last_followup]
[shared:days_to_last_known_alive xsd_ver="2.1" procurement_status="Not Available" owner="TSS" cde="" tier="2" precision="day"/]
[shared:days_to_death xsd_ver="1.12" procurement_status="Not Applicable" owner="TSS" cde="3165475" tier="1" precision="day"/]
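To make the reading of these tags concrete, here is a small Python sketch that turns one patient record into a (time, event) pair: days_to_death with event 1 if a death value is present, otherwise days_to_last_followup with event 0 (censored). The record is a trimmed version of the example above with real angle brackets. The namespace URL and the helper names `field` and `survival_record` are assumptions for illustration, not part of any TCGA tooling.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the example record; the namespace URL is a placeholder.
RECORD = """<patient xmlns:shared="http://example.org/shared">
  <shared:vital_status> Alive</shared:vital_status>
  <shared:days_to_last_followup precision="day"> 389</shared:days_to_last_followup>
  <shared:days_to_last_known_alive precision="day"/>
  <shared:days_to_death precision="day"/>
</patient>"""


def field(root, name):
    """Return the stripped text of the first element with this local name, or None."""
    for elem in root.iter():
        if elem.tag.rsplit("}", 1)[-1] == name:
            text = (elem.text or "").strip()
            return text or None
    return None


def survival_record(xml_text):
    """Return (days, event): event is 1 for death and 0 for censoring.

    Returns None when neither value is present.
    """
    root = ET.fromstring(xml_text)
    death = field(root, "days_to_death")
    followup = field(root, "days_to_last_followup")
    if death is not None:
        return int(death), 1
    if followup is not None:
        return int(followup), 0
    return None


print(survival_record(RECORD))
```

A self-closing days_to_death tag has no text, so the function falls back to the follow-up value and marks the patient as censored.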
Reply written 6.8 years ago by dirigible2012 (modified 12 months ago by _r_am32k)

Thanks, but what is xsd_ver="1.12"?

Reply written 6.8 years ago by jack (modified 12 months ago by _r_am32k)

Hey dirigible2012 & Stephanie, is there a file that explains the XML tags for the clinical data? I am also doing survival analysis and looking at the XML files. They seem really large and convoluted, and it's taking time to understand them. I was wondering if there is some guide to the XML tag descriptions, so that I can parse out the necessary information. I might need other clinical data as well in future.

Thanks so much

Reply written 6.7 years ago by srividyanathan20060 (modified 12 months ago by _r_am32k)

We (at SolveBio) have actually gone through the individual clinical patient information files for each TCGA cancer type and parsed out some of this information. See for more information about the data and this IPython notebook for an example of how to access the data (SolveBio is free for academic/noncommercial use, so sign up and try it out). It was kind of a mess but I think we've done a decent job. ICGC is quite a bit easier to work with and includes a lot of TCGA.

Reply written 5.7 years ago by dandan350
Miao Yu wrote (6.0 years ago):

It's easy to fetch those data with R.

TCGA-Assembler is a very good tool for you to get those data easily.

This assumes that you are familiar with R.

First, download the tool and unpack it.



Third, execute the following command.

DownloadClinicalData(
  traverseResultFile = "./DirectoryTraverseResult_Jul-08-2014.rda",
  saveFolderName = "./UserManualExampleData/RawData.TCGA-Assembler",  # set the output directory
  cancerType = "BLCA",  # choose the cancer type
  clinicalDataType = c("patient", "drug", "follow_up", "radiation")  # choose the types of clinical data to download
)

If you just want the data for survival analysis, you can choose only follow_up, and then use the days_to_death and days_to_last_follow_up columns in the file as the death and censored times for the survival analysis.
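As a rough illustration of how those two columns feed a survival curve, here is a minimal Kaplan-Meier sketch in plain Python. The five rows are invented values, not TCGA data, and `to_time_event` and `kaplan_meier` are hypothetical helper names. For real work a dedicated survival package would be the safer choice.

```python
from collections import Counter


def to_time_event(days_to_death, days_to_last_follow_up):
    """Death time with event=1 if known, otherwise follow-up time with event=0."""
    if days_to_death is not None:
        return days_to_death, 1
    return days_to_last_follow_up, 0


def kaplan_meier(pairs):
    """Return [(time, survival probability)] at each distinct death time."""
    deaths = Counter(t for t, e in pairs if e == 1)
    leaving = Counter(t for t, _ in pairs)  # deaths plus censorings per time
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Patients censored at time t still count as at risk at t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Invented rows: (days_to_death, days_to_last_follow_up)
rows = [(100, None), (None, 180), (250, None), (None, 300), (400, None)]
pairs = [to_time_event(death, followup) for death, followup in rows]
curve = kaplan_meier(pairs)
for t, s in curve:
    print(f"day {t}: S(t) = {s:.3f}")
```

Censored patients never lower the curve themselves. They only shrink the number at risk for later death times.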

Or you can simply get the clinical data from this weblink,

good luck~

Zhenyu Zhang (United States) wrote (5.8 years ago):

I have a strong opinion against using TCGA data for survival analysis; please correct me if I am wrong.

If you check days_to_death or days_to_last_contact, you will find intervals going back 2000 days or more, to well before TCGA even started. My suspicion is that these were patients from other programs who were diagnosed before the TCGA project. If I am correct on this, there is a huge bias here: only living patients were later recruited to TCGA, while the patients from these legacy programs who had already died were hidden and never show up in TCGA. I guess the majority of people who have used TCGA data for analysis never thought about this.

So these dates need to be adjusted to the TCGA dates, by subtracting either days_to_collection or days_to_procuration of the samples. The new problem here is that the second is almost always empty, while the first is about 80% empty. This means that, starting with a 500-patient project, you get about 400 patients with either days_to_death or days_to_last_contact available, and end up with fewer than 100 who also have days_to_collection. This number is not enough for any kind of survival comparison by, say, biomarker or clinical category.
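One possible reading of this adjustment, sketched in Python: measure follow-up from the day of sample collection rather than from diagnosis, and drop patients without a collection day. The patient records are invented, and the dictionary keys and the `rebase_to_collection` helper are assumptions for illustration. Dropping patients whose follow-up ended before collection is also my assumption, not something stated above.

```python
def rebase_to_collection(patients):
    """Measure follow-up from sample collection instead of diagnosis.

    Each patient is a dict with 'id', 'time' (days from diagnosis to death or
    last contact), 'event' (1 = death, 0 = censored) and 'days_to_collection'
    (None when missing). Patients without a collection day are dropped.
    """
    rebased = []
    for p in patients:
        start = p["days_to_collection"]
        if start is None:
            continue  # cannot place this patient on the TCGA timeline
        time = p["time"] - start
        if time < 0:
            continue  # assumption: follow-up ended before collection, so drop
        rebased.append({"id": p["id"], "time": time, "event": p["event"]})
    return rebased


# Invented records, not TCGA data.
patients = [
    {"id": "P1", "time": 2100, "event": 1, "days_to_collection": 1800},
    {"id": "P2", "time": 900, "event": 0, "days_to_collection": None},
    {"id": "P3", "time": 400, "event": 0, "days_to_collection": 30},
]
rebased = rebase_to_collection(patients)
print(f"{len(rebased)} of {len(patients)} patients kept")
for p in rebased:
    print(p)
```

Counting how many patients survive this filter on a real cohort is a quick way to check the sample-size concern raised above.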

Chip wrote (6.8 years ago):

Try the Synapse platform (you need to register, but you can sign in with a Google account): Synapse:syn300013

For example, here you can find survival data for Lung Squamous Cell Carcinoma: Synapse:syn1446127/version/3

TriS (United States, Buffalo) wrote (6.2 years ago):

even if a lil can analyze survival by using the example here

That's the main part about overall survival (in ovarian cancer), but it also has links on how to build the dataset and build your own analysis for your preferred tumor type.

EagleEye wrote (6.0 years ago):

This should be the easiest way. You can also select datasets from PROGgene, or you can upload your own datasets. FYI: it also has datasets from TCGA.


You can also check previous posts explaining how to download Clinical data from TCGA.

A: Clinical Survival data of TCGA


JP wrote (4.0 years ago):

accidentally posted in wrong comment section, sorry!

xushutan wrote (3.6 years ago):

A website for breast cancer survival curves in different subtypes: luminal A, luminal B, basal, HER2 and normal-like.
