Survival Analysis Using TCGA Data
8
3
8.7 years ago
jack ▴ 490

I'm using TCGA gene expression data. As part of my work I need to do survival analysis. Is there any way to get survival information from TCGA for the samples whose gene expression data I have?

tcga bioinformatician • 35k views
11
8.7 years ago
dirigible2012 ▴ 320

I'm currently in the middle of something similar - the TCGA Bioinformatics team very kindly helped me out.

If you want to get the raw data yourself, it is in the "Clinical" data. These can be downloaded as text or XML - I've mostly looked at the XML files. I believe there is normally one file for the patient, and one file for every sample taken. (Normally there's just one sample, obtained at time of surgery.)

The problem is that dates in the clinical data, such as date of death, have been redacted to preserve patient privacy. I think that all dates have been replaced with values giving the number of days since original diagnosis.

If you just want to do a survival curve, you are looking for the number under the XML tag "days_to_death".

The day the particular sample was taken is under "days_to_sample_procurement" (i.e. number of days between diagnosis and sample procurement). I think you could find other useful numbers by just doing a find for "days_to".
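The "find for days_to" idea can also be done programmatically. Here is a minimal Python sketch using only the standard library. The XML string is a made-up, heavily shortened stand-in for a clinical file, and the namespace URI and the helper names are assumptions for illustration, not taken from the real files.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened stand-in for a TCGA clinical XML file.
# The namespace URI is a placeholder, not the real one.
SAMPLE_XML = """<patient xmlns:shared="http://example.org/shared">
  <shared:days_to_death>412</shared:days_to_death>
  <shared:days_to_last_followup/>
  <shared:vital_status>Dead</shared:vital_status>
</patient>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def find_days_to_fields(xml_text):
    """Return {tag name: text or None} for every element whose name starts with 'days_to'."""
    root = ET.fromstring(xml_text)
    fields = {}
    for elem in root.iter():
        name = local_name(elem.tag)
        if name.startswith("days_to"):
            text = (elem.text or "").strip()
            fields[name] = text or None  # empty elements become None
    return fields


days_to_fields = find_days_to_fields(SAMPLE_XML)
print(days_to_fields)
```

For a real file you would read the text from disk (or call ET.parse on the path) instead of using the inline string.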

Hope this helps,

Stephanie

0

Thanks, but which value should I take out? I've looked at the XML file and found the line with the tag days_to_death. It looks like this:

<shared:days_to_death precision="day" xsd_ver="1.12" tier="1" cde="3165475" owner="TSS" procurement_status="Not Applicable"/>
<shared:days_to_last_followup precision="day" xsd_ver="1.12" tier="1" cde="3008273" owner="TSS"

0

That's interesting. I presume the XML file works like an HTML file, so you want the value in between the two tags. (I've replaced the angle brackets with square because Biostar is interpreting them as HTML.)

e.g. (tags shortened a bit)

[shared: days_to_death] VALUE [/shared: days_to_death]


I've had a look at an example file, and it looks to me like a missing value is written as a single self-closing tag (ending in "/"), with no separate end tag and nothing in between. In this case, you are missing the days_to_death, which suggests the patient is still alive.

If you look at the example below, the days_to_death value is also missing, but the vital status is "Alive" and there is a value for days to last followup.

[shared:vital_status xsd_ver="2.6" restricted="false" procurement_status="Completed" owner="TSS" cde="5" display_order="25" preferred_name="vital_status" tier="2" source_system_identifier="492461"] Alive[/shared:vital_status]
[shared:days_to_last_followup xsd_ver="1.12" procurement_status="Completed" owner="TSS" cde="3008273" tier="1" precision="day"] 389[/shared:days_to_last_followup]
[shared:days_to_last_known_alive xsd_ver="2.1" procurement_status="Not Available" owner="TSS" cde="" tier="2" precision="day"/]
[shared:days_to_death xsd_ver="1.12" procurement_status="Not Applicable" owner="TSS" cde="3165475" tier="1" precision="day"/]
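To turn a record like the one above into something usable for a survival curve, one possible approach is sketched below in Python. It assumes that a filled days_to_death means an observed death, and that otherwise days_to_last_followup is the censoring time. The XML is a shortened copy of the example with most attributes removed; the namespace URI and the function names are placeholders of my own.

```python
import xml.etree.ElementTree as ET

# Shortened version of the example record; the namespace URI is a placeholder.
RECORD = """<patient xmlns:shared="http://example.org/shared">
  <shared:vital_status procurement_status="Completed"> Alive</shared:vital_status>
  <shared:days_to_last_followup procurement_status="Completed"> 389</shared:days_to_last_followup>
  <shared:days_to_last_known_alive procurement_status="Not Available"/>
  <shared:days_to_death procurement_status="Not Applicable"/>
</patient>"""


def read_field(root, name):
    """Return the stripped text of the first element called `name`, or None if empty or absent."""
    for elem in root.iter():
        if elem.tag.rsplit("}", 1)[-1] == name:
            text = (elem.text or "").strip()
            return text or None
    return None


def survival_record(xml_text):
    """Return (time in days, event) with event 1 = death, 0 = censored; None if unusable."""
    root = ET.fromstring(xml_text)
    death = read_field(root, "days_to_death")
    followup = read_field(root, "days_to_last_followup")
    if death is not None:
        return int(death), 1
    if followup is not None:
        return int(followup), 0
    return None


time_days, event = survival_record(RECORD)
print(time_days, event)
```

A self-closing tag has no text, so it comes back as None here, which is how the sketch tells a missing value from a real one.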

0

Thanks, but what is xsd_ver="1.12"?

0

Hey dirigible2012 & Stephanie, is there a file that explains the XML tags for the clinical data? I am also doing survival analysis and looking at the XML files, and they seem to be really large and convoluted. It's taking time to understand them. If there is a guide describing the XML tags, I could parse out the necessary information. I might need other clinical data as well in the future.

thanks so much

0

We (at SolveBio) have actually gone through the individual clinical patient information files for each TCGA cancer type and parsed out some of this information. See https://www.solvebio.com/library/TCGA/1.2.0-2015-02-11/PatientInformation for more information about the data and this ipython notebook for an example of how to access the data (SolveBio is free for academic/noncommercial use, so sign up and try it out). It was kind of a mess but I think we've done a decent job. ICGC is quite a bit easier to work with and includes a lot of TCGA.

6
7.8 years ago
Miao Yu ▴ 80

It's easy to fetch those data with R.

TCGA-Assembler is a very good tool for you to get those data easily.

This assumes that you are familiar with R.

Second,

source("Model_A.R")


Third, execute the following command.

DownloadClinicalData(
  traverseResultFile = "./DirectoryTraverseResult_Jul-08-2014.rda",
  saveFolderName = "./UserManualExampleData/RawData.TCGA-Assembler",  # set the output directory
  cancerType = "BLCA",  # choose the cancer type
  clinicalDataType = c("patient", "drug", "follow_up", "radiation")  # choose the types of clinical data to download
)


If you just want the data for survival analysis, you can choose only follow_up, then use the days_to_death and days_to_last_follow_up columns in that file as the death and censoring times for survival analysis.
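Once you have those two columns, a survival curve needs one (time, event) pair per patient. The Python sketch below shows one way to build the pairs and compute a plain Kaplan-Meier estimate with no extra packages. The rows are invented, and the placeholder strings for missing values ("[Not Available]", "[Not Applicable]") are an assumption about the file format; the code simply treats anything that is not an integer as missing.

```python
def to_int(value):
    """Parse a day count; any non-numeric placeholder counts as missing."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def time_and_event(row):
    """Return (days, event): event 1 = death observed, 0 = censored at last follow-up."""
    death = to_int(row.get("days_to_death"))
    followup = to_int(row.get("days_to_last_follow_up"))
    if death is not None:
        return death, 1
    if followup is not None:
        return followup, 0
    return None  # neither value available, patient cannot be used


def kaplan_meier(pairs):
    """Return [(time, survival probability)] at each time with at least one death."""
    pairs = sorted(pairs)
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        leaving = 0
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients both leave the risk set
    return curve


# Invented rows for illustration only.
rows = [
    {"days_to_death": "100", "days_to_last_follow_up": "[Not Available]"},
    {"days_to_death": "[Not Applicable]", "days_to_last_follow_up": "200"},
    {"days_to_death": "200", "days_to_last_follow_up": "150"},
    {"days_to_death": "[Not Applicable]", "days_to_last_follow_up": "300"},
    {"days_to_death": "400", "days_to_last_follow_up": "[Not Available]"},
    {"days_to_death": "[Not Applicable]", "days_to_last_follow_up": "[Not Available]"},
]
pairs = [p for p in map(time_and_event, rows) if p is not None]
curve = kaplan_meier(pairs)
print(curve)
```

For real analyses an established package (for example the survival package in R) is the safer choice; this is only meant to show what the two columns are used for.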

Or you can just get the clinical data from this weblink

good luck~

6
7.7 years ago
Zhenyu Zhang ▴ 690

I have a strong opinion against using TCGA data for survival analysis; please correct me if I am wrong.

If you check days_to_death or days_to_last_contact, you will find values going back as far as 2000 days, way before TCGA even started. My suspicion is that these were patients from other programs who were diagnosed before the TCGA project. If I am correct on this, there is a huge bias here: only living patients were later recruited to TCGA, while the patients from these legacy programs who had already died were hidden and never show up in TCGA. I guess the majority of people who have used TCGA data for survival analysis never thought about this.

So these dates need to be adjusted to the TCGA dates by subtracting either days_to_collection or days_to_procuration of the samples. The new problem is that the second is almost entirely empty, while the first is about 80% empty. This means that, starting with a 500-patient project, you get about 400 patients with either days_to_death or days_to_last_contact available, and run down to fewer than 100 with days_to_collection. That is not enough for any kind of survival comparison by, say, biomarker or clinical category.
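The adjustment described above can be sketched in Python as follows. The records are invented, the field names follow the ones used in this post, and dropping every patient without a usable days_to_collection is exactly the step that shrinks the sample size.

```python
def adjust_to_collection(records):
    """Re-base follow-up time on sample collection instead of diagnosis.

    Returns a list of (days since collection, event), with event 1 = death
    and 0 = censored. Patients without a usable collection day are dropped.
    """
    adjusted = []
    for rec in records:
        start = rec.get("days_to_collection")
        end = rec.get("days_to_death")
        event = 1
        if end is None:
            end = rec.get("days_to_last_contact")
            event = 0
        if start is None or end is None or end < start:
            continue  # cannot place this patient on the new time axis
        adjusted.append((end - start, event))
    return adjusted


# Invented records for illustration only (all values in days since diagnosis).
records = [
    {"days_to_collection": 1500, "days_to_death": 2000, "days_to_last_contact": None},
    {"days_to_collection": 30, "days_to_death": None, "days_to_last_contact": 400},
    {"days_to_collection": None, "days_to_death": 900, "days_to_last_contact": None},
    {"days_to_collection": 800, "days_to_death": None, "days_to_last_contact": 700},
]
adjusted = adjust_to_collection(records)
print(adjusted)
```

In this toy example the first patient's 2000 days since diagnosis shrinks to 500 days since collection, and two of the four patients are lost.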

1
8.7 years ago
Chip ▴ 130

Try the Synapse platform (you need to register, but you can log in with a Google account).

https://www.synapse.org/#!Synapse:syn300013

For example, here you can find survival data for Lung Squamous Cell Carcinoma.

https://www.synapse.org/#!Synapse:syn1446127/version/3

0
8.0 years ago
TriS ★ 4.6k

Even if a little late...you can analyze survival by using the example here

http://bioinformatics.mdanderson.org/Supplements/ResidualDisease/Reports/osCurves.html

That's the main part on overall survival (in ovarian cancer), but it also has links on how to build the dataset and run your own analysis for your preferred tumor type.

0
7.8 years ago
EagleEye 7.4k

This should be the easiest way: you can select datasets from PROGgene or upload your own. FYI: it also has datasets from TCGA.

http://watson.compbio.iupui.edu/chirayu/proggene/database/?url=proggene

You can also check previous posts explaining how to download Clinical data from TCGA.

Clinical Survival data of TCGA

0
5.8 years ago
JP • 0

accidentally posted in wrong comment section, sorry!

0
5.5 years ago
xushutan ▴ 40

A website for breast cancer survival curves in different subtypes: luminal A, luminal B, basal, HER2 and normal-like. http://tumorsurvival.org/