Guide to TCGA data
Hi, I've been working with TCGA data for a couple years now and I am the creator of OncoLnc. In this guide I'll help you navigate this complicated resource.
Do you need to download data?
It's possible that an online tool will be sufficient for what you are looking for.
cBioPortal is by far the most comprehensive interactive tool for analyzing TCGA data.
Some useful features of cBioPortal:
- Getting a quick, easy to view summary of each cancer study. This includes overall survival of the patients, demographics, overall mutation and CNA counts
- Identifying which genes are most heavily mutated in each cancer or have undergone copy number alterations
- Identifying genes most highly co-expressed with your gene and the co-occurrence of mutations and CNAs
- Exploring protein and phosphoprotein levels
- Survival analysis (disease free and time to death) with either mutations, CNAs, or expression (microarray or RNA-SEQ)
Some limitations of cBioPortal:
- Expression is listed as z-scores instead of the raw values
- The Onco Query Language only allows for comparison of the altered group versus unaltered group (this prevents you from comparing highest expressing patients to lowest expressing patients)
- The miRNA data in cBioPortal suffers from the fact that the TCGA Tier 3 annotations are out of date: expression is for stem-loop sequence instead of mature miRNAs
- Only contains data for tumors, cannot perform a comparison of expression in normal tissue versus tumor
A tool focusing on DNA methylation: MEXPRESS
This tool allows for:
- Excellent visualization with a UCSC genome browser type of layout that includes multiple clinical features along with RNA-SEQ expression and methylation probe values
- Single click statistical correlations between features such as:
  - Sample type (normal or tumor) and expression
  - Pathologic stage and lymphocyte infiltration
  - BRCA PAM50 subtype and expression
MEXPRESS can be a bit of a data overload at first, but once you see that the data gets sorted by whatever clinical feature you click on, it is very fun to use. Unfortunately, MEXPRESS does not allow for survival analysis.
A tool focusing on survival analysis: OncoLnc
Some facts about OncoLnc:
- Only contains expression (RNA-SEQ) and survival data (time to death or last follow-up)
- Displays the results for up to 21 survival analyses at a time
- Allows for interactive Kaplan-Meier plot generation and download of the exact clinical and expression data used for the plot
- Contains updated miRNA definitions and includes MiTranscriptome beta lncRNAs
- OncoLnc does not contain any data for normal tissue
In addition to these tools for interactive analysis of Tier 3 TCGA data, some recent efforts have been made to reanalyze the TCGA data with a focus on lncRNAs.
MiTranscriptome: the authors used an in-house assembly method to identify transcripts, and have made some of their data available for browsing and download.
TANRIC: contains read counts for Ensembl-defined lncRNAs, but also allows users to define their own lncRNA by inputting genomic coordinates. TANRIC also includes various analyses, including survival analyses, and allows for download of its data.
There isn't an available tool for what you need >:(
Okay, so you've looked through the data portals above but they just aren't cutting it, you need to get your hands on some juicy data.
The next question you need to ask yourself is: will Tier 3 data suffice (available to anyone), or do you need Tier 1 data (requires permission, some serious computing resources, and technical know-how)?
Tier 1 data includes BAM or unaligned files for:
- whole genome sequencing
- RNA sequencing
- small RNA sequencing (only BAM)
- bisulfite sequencing
- ChIP sequencing
Tier 3 data includes clinical data and processed files for:
- small RNA-SEQ
- protein (RPPA)
- SNPs and mutations
Let's say you think Tier 3 data will suffice. There are several methods for downloading it.
You can go straight to the source: https://tcga-data.nci.nih.gov/tcga/
The benefits of this approach are that you know the data will be as up-to-date as possible, and you don't have to worry about a third party introducing any errors.
Downsides to this approach are that you are put in a queue to download the data, and when downloading expression data you have to download ALL the files, which results in you downloading much more than you probably wanted or your computer can handle. If you want to download, say, all RNA-SEQ rsem.normalized files, you will also have to download the unnormalized files. And if you want to do this for every cancer, you can easily spend a whole day just queuing and downloading.
The primary alternative for downloading Tier 3 data is Firehose: http://gdac.broadinstitute.org/
Firehose contains minimally processed files. Basically, they take the files from https://tcga-data.nci.nih.gov/tcga/ and merge them into one file with an R-friendly data structure. For example, all 1000 BRCA rsem.normalized files would be merged into one file.
There is no queue for downloading, and because you are not downloading data you don't want, the download is much faster. Firehose is typically where I go for MAF files. Interesting note: there doesn't seem to be a pan-cancer MAF file anywhere and a lot of people want one, so someone with free time should get on this.
I do not use Firehose for clinical data because the clinical data requires a complex merge of several files from https://tcga-data.nci.nih.gov/tcga/.
There are a lot of different clinical files, but the ones you are likely interested in are the ones that contain survival data. Yes, you read that correctly, ones as in plural.
There are two types of files that contain survival information, the "clinical_follow_up" files, and the "clinical_patient" file. And yes, I used the plural again.
The "clinical_follow_up" file with the largest version number typically contains the most survival data and the most recent data. If you are looking for information such as grade or smoking status, you want the "clinical_patient" file, but this file also contains survival data, which may or may not be the same as the data in the "clinical_follow_up" file or files.
For example, BRCA contains these four files with nonredundant clinical data:
To make matters worse, within a "clinical_follow_up" file a patient can be listed multiple times. Luckily the files appear to be sorted by patient identifier, and the most recent data for each patient is the last one listed.
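The "last one listed wins" rule above can be sketched in a few lines of Python. This is a minimal sketch, not my actual parsing code: the patient identifiers and field names are made up for illustration, and it assumes you have already read the follow-up file into (patient, record) pairs in file order.

```python
# Hypothetical rows in file order: (patient barcode, record).
# The barcodes and field names are illustrative, not real column headers.
rows = [
    ("TCGA-XX-0001", {"vital_status": "Alive", "days": 120}),
    ("TCGA-XX-0001", {"vital_status": "Dead", "days": 400}),
    ("TCGA-XX-0002", {"vital_status": "Alive", "days": 900}),
]

def latest_per_patient(rows):
    """Return {patient: record}, keeping the last record listed for each patient."""
    latest = {}
    for patient, record in rows:
        latest[patient] = record  # a later row replaces any earlier one
    return latest

latest = latest_per_patient(rows)
print(latest["TCGA-XX-0001"])
```

This relies on the most recent entry really being the last one in the file, so it is worth spot checking a few patients by hand before trusting it.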
Given this complexity, I do not know if Firehose is merging these files perfectly. I'm not saying the data in Firehose is incorrect, it just may not be as complete as it could be (I have not gone through to check the accuracy of their parsing).
But I am pretty sure I'm extracting all the data possible with my code available at https://github.com/OmnesRes/onco_lnc
Other download methods
Firehose and cBioPortal both offer APIs for downloading data. This may be the best method to get the expression level of your gene for every patient in every cancer study or get all observed mutations across cancers.
OncoLnc allows you to download expression data coupled to survival data one cancer, one gene at a time. Note: patients can have multiple expression files, and these are averaged in OncoLnc.
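If you are doing this yourself, collapsing multiple expression files per patient can look something like the sketch below. The patient identifiers and values are hypothetical, and using a plain arithmetic mean is my assumption for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (patient, expression value) pairs, one per expression file.
values = [
    ("TCGA-XX-0001", 10.0),
    ("TCGA-XX-0001", 14.0),
    ("TCGA-XX-0002", 3.5),
]

# Group the values by patient, then average each group.
grouped = defaultdict(list)
for patient, value in values:
    grouped[patient].append(value)

averaged = {patient: mean(vals) for patient, vals in grouped.items()}
print(averaged)
```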
You downloaded your data, now what the hell are these names >:/
Here is an example of a patient barcode: TCGA-C5-A2LY-01A-31R-A18M-07
If you want to know what every part of the barcode is you can check out https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode
But you most likely will only be interested in the start of the barcode: TCGA-C5-A2LY-01A
The TCGA-C5-A2LY will identify the patient and this is what will be in the clinical file. The "01" tells you what type of sample you are dealing with. You can find a list of codes here: https://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm?codeTable=Sample%20type
You will likely be focused on samples that have "01" or "11". It is important to note that LAML samples will be designated "03" since that is a blood-derived cancer, and SKCM has a lot of metastatic samples, so you will be dealing with a lot of "06" in that case.
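Pulling the patient identifier and sample type out of a barcode is just string splitting. Here is a minimal sketch using the example barcode from above. The function name is made up, and the small lookup table only covers the four codes mentioned in this guide (see the code table link for the full list).

```python
# Labels for the sample type codes discussed in this guide (not the full table).
SAMPLE_TYPES = {
    "01": "Primary Solid Tumor",
    "03": "Primary Blood Derived Cancer - Peripheral Blood",
    "06": "Metastatic",
    "11": "Solid Tissue Normal",
}

def parse_barcode(barcode):
    """Split a TCGA barcode into (patient identifier, two-digit sample type code)."""
    parts = barcode.split("-")
    patient = "-".join(parts[:3])  # e.g. TCGA-C5-A2LY
    sample_type = parts[3][:2]     # "01A" -> "01"; the letter is the vial
    return patient, sample_type

patient, sample_type = parse_barcode("TCGA-C5-A2LY-01A-31R-A18M-07")
print(patient, sample_type, SAMPLE_TYPES.get(sample_type, "other"))
```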
But my files are named unc.edu.0046fe...!?!?!?!?!?!?
Ah yes, that is quite annoying, but nothing a Python dictionary won't solve ;)
When you download data from https://tcga-data.nci.nih.gov/tcga/ you also get a FILE_SAMPLE_MAP file which maps the patient barcodes to the files you downloaded. So a single patient in your clinical file might have an expression file for their normal tissue sample, one or more tumor samples, and maybe even a recurrent tumor or metastatic tumor.
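Here is roughly what the dictionary approach looks like. Treat it as a sketch under assumptions: I'm assuming the map is tab-delimited with a header row, the file name in the first column and the barcode in the second, and the file names in the example are placeholders rather than real TCGA file names.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for the contents of a FILE_SAMPLE_MAP file.
example_map = (
    "filename\tbarcode(s)\n"
    "unc.edu.placeholder-1.rsem.genes.normalized_results\tTCGA-C5-A2LY-01A-31R-A18M-07\n"
    "unc.edu.placeholder-2.rsem.genes.normalized_results\tTCGA-C5-A2LY-11A-31R-A18M-07\n"
)

def read_file_sample_map(handle):
    """Return {filename: barcode} from a tab-delimited map with one header row."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    return {row[0]: row[1] for row in reader if len(row) >= 2}

file_to_barcode = read_file_sample_map(io.StringIO(example_map))

# Group files by patient so they can be joined to the clinical data.
files_by_patient = defaultdict(list)
for filename, barcode in file_to_barcode.items():
    patient = barcode[:12]        # TCGA-C5-A2LY
    sample_type = barcode[13:15]  # "01", "11", ...
    files_by_patient[patient].append((sample_type, filename))

print(dict(files_by_patient))
```

With a real download you would pass an open file handle instead of the io.StringIO object.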
Umm, are these gene names correct????
This is one of the main problems with Tier 3 data. The same pipeline with the same gene annotations is used in every cancer study, including newer cancer studies. And the first cancer study was like a long time ago, making these gene annotations really old.
As a result, if you are studying a lncRNA it probably won't be in TCGA Tier 3 data, and you need to check out MiTranscriptome or TANRIC.
Enough with this weak sauce, I need that Tier 1 data >:)
Okay, no problem, if you are associated with a university you can ask your advisor to apply for access.
Instructions are at https://cghub.ucsc.edu/access/get_access.html
Once you have the key file that you need, download GeneTorrent. This program has multiple dependencies, but I've seen it easily installed on a Mac and an Ubuntu server.
Once you've got that installed (preferably on a server with a good internet connection and a ton of memory and compute power), head over to https://browser.cghub.ucsc.edu.
Select the samples you want and add them to your cart (that's right, we're going shopping, and everything is free!). You will want to download the manifest file and the tsv file.
With your manifest file, run this command on your server (if you have a bunch of cores you can increase max-children):
gtdownload -v --max-children 1 -d manifest.xml -c cghub.key
Download speed is fast, but for each file it takes some time to connect. So even if you are downloading a bunch of small files it might have to run overnight (at the checkout, make sure you see how much data you are downloading so you don't fill your disk!).
With your TSV file you can map the analysis ids of the files you just downloaded to the patient barcodes.
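The mapping itself is another dictionary job. In this sketch the column names "analysis_id" and "barcode" are assumptions about the TSV header, and the analysis id is a made-up placeholder, so check both against your own file.

```python
import csv
import io

# Hypothetical stand-in for the TSV file from the cart; column names are assumed.
example_tsv = (
    "barcode\tanalysis_id\n"
    "TCGA-C5-A2LY-01A-31R-A18M-07\t00000000-0000-0000-0000-000000000001\n"
)

def read_analysis_map(handle, id_column="analysis_id", barcode_column="barcode"):
    """Return {analysis id: barcode} from a tab-delimited file with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_column]: row[barcode_column] for row in reader}

analysis_to_barcode = read_analysis_map(io.StringIO(example_tsv))

for analysis_id, barcode in analysis_to_barcode.items():
    patient = barcode[:12]  # the part that matches the clinical file
    print(analysis_id, barcode, patient)
```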
I changed my mind, I'm not as hardcore as I thought :'(
Yeah...working with Tier 1 data is a huge pain. Luckily people have realized this and there are some pilot programs for analyzing TCGA data in the cloud that don't require downloading the raw files (but you will still need to save your processed files, which could be just as large or larger than the raw data). I haven't used any of these services yet so I can't recommend one or provide advice.
And that should get you started analyzing TCGA data. Anything important I missed?