Yes, the TCGA can be difficult to navigate.
The TCGA clinical data has no dedicated field for triple-negative status, so you have to infer it from the receptor-status fields in the patient clinical data. The best place to download the clinical data in flat-file format is the GDC Legacy Archive.
1, Determine samples that are TNBC (triple-negative breast cancer)
- Under Cases -> Primary Site, select 'Breast'
- Under Cases -> Project, select 'TCGA-BRCA'
- Then make other selections based on your interest (for example, male or female breast cancer, or breast cancer in specific ethnic groups)
- Under Files -> Data Category, select 'Clinical'
- Under Files -> Data Format, select 'Biotab'
- Download the file called
- In this file, look for the columns 'er_status_by_ihc', 'pr_status_by_ihc', and 'her2_status_by_ihc'. A case is triple-negative when all three are reported as negative, so these columns allow you to identify the sample UUIDs and TCGA barcodes that relate to triple-negative breast cancer (TNBC)
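
As a rough sketch of that filtering step, the snippet below keeps the rows where all three columns read 'Negative'. The three status column names come from the file described above; the 'bcr_patient_barcode' column, the value spellings, and the barcodes in the demo table are assumptions made up for illustration, so check them against your own download. Real Biotab files may also carry extra header rows that need dropping first.

```python
import csv
import io

STATUS_COLUMNS = ("er_status_by_ihc", "pr_status_by_ihc", "her2_status_by_ihc")


def select_tnbc(rows):
    """Return the rows whose ER, PR and HER2 IHC statuses are all 'Negative'."""
    return [
        row for row in rows
        if all((row.get(col) or "").strip().lower() == "negative"
               for col in STATUS_COLUMNS)
    ]


# Made-up demo table standing in for the downloaded clinical file.
header = ["bcr_patient_barcode", *STATUS_COLUMNS]
demo_lines = [
    header,
    ["TCGA-XX-0001", "Negative", "Negative", "Negative"],
    ["TCGA-XX-0002", "Positive", "Negative", "Negative"],
    ["TCGA-XX-0003", "Negative", "Negative", "Equivocal"],
]
demo_text = "\n".join("\t".join(fields) for fields in demo_lines)

rows = list(csv.DictReader(io.StringIO(demo_text), delimiter="\t"))
tnbc_barcodes = [row["bcr_patient_barcode"] for row in select_tnbc(rows)]
print(tnbc_barcodes)
```

Anything that is not explicitly negative (equivocal, indeterminate, not evaluated) is excluded here, which is the conservative choice; you may want to review those cases by hand.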
2, Obtain sample manifest
To download the actual RNA-seq data, stay on the GDC Legacy Archive under the Files tab and make further selections to choose the type of data you want. RNA-seq is usually available in multiple formats, including RSEM or HTSeq raw / estimated counts. Once you have ticked the checkboxes for the samples that you want, click 'Download Manifest'.
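
Before downloading, it can be worth checking what the manifest actually lists. The sketch below counts the entries and adds up their sizes. It assumes the manifest is tab-delimited with the columns 'id', 'filename', 'md5', 'size' and 'state'; the demo content is made up, so confirm the header of your own file first.

```python
import csv
import io


def summarise_manifest(handle):
    """Count the files listed in a manifest and add up their sizes in bytes."""
    reader = csv.DictReader(handle, delimiter="\t")
    n_files = 0
    total_bytes = 0
    for row in reader:
        n_files += 1
        total_bytes += int(row["size"])
    return n_files, total_bytes


# Made-up manifest content for illustration only.
demo_manifest = "\n".join([
    "\t".join(["id", "filename", "md5", "size", "state"]),
    "\t".join(["hypothetical-uuid-1", "sample1.genes.results", "md5-placeholder-1", "1500", "live"]),
    "\t".join(["hypothetical-uuid-2", "sample2.genes.results", "md5-placeholder-2", "2500", "live"]),
])

n_files, total_bytes = summarise_manifest(io.StringIO(demo_manifest))
print(n_files, total_bytes)
```

For a real file, pass an open handle instead, for example `open("Manifest.txt", newline="")`.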
3, Download data using GDC Data Transfer Tool and sample manifest
Download the GDC Data Transfer Tool and execute it with your sample manifest. This will download your data to the directory from which you ran the executable.
gdc-client download -m Manifest.txt
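
If you would rather drive the download from a script, the same call can be wrapped in Python. This is only a sketch: it assumes the executable is named 'gdc-client' and is on your PATH, and the helper names are made up. It runs the tool with the working directory set to a folder of your choice, since the data lands in the directory the tool is run from.

```python
import shutil
import subprocess
from pathlib import Path


def build_download_command(manifest):
    """Build the same gdc-client call as the command line shown above."""
    return ["gdc-client", "download", "-m", str(manifest)]


def download_from_manifest(manifest, workdir="."):
    """Run gdc-client from inside workdir so that the data is written there."""
    if shutil.which("gdc-client") is None:
        raise FileNotFoundError("gdc-client was not found on PATH")
    # Resolve first, because a relative path would otherwise be read relative to workdir.
    manifest = Path(manifest).resolve()
    subprocess.run(build_download_command(manifest), cwd=workdir, check=True)


command = build_download_command("Manifest.txt")
print(" ".join(command))
```

Calling `download_from_manifest("Manifest.txt", workdir="rnaseq")` would then start the actual transfer, provided the 'rnaseq' folder already exists.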
4, Integrate the clinical data with the expression data
All I'll say here is to be very meticulous. You have to work with multiple ID types and it may take a while to get your head around it.
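
To give an idea of the ID juggling involved, here is a small sketch that works on TCGA barcodes only. It relies on the standard barcode layout, in which the first three hyphen-separated fields identify the patient and the first two digits of the fourth field give the sample type (01 to 09 tumour, 10 to 19 normal). The barcodes and helper names below are made up for illustration.

```python
def patient_barcode(barcode):
    """Return the patient-level part of a TCGA barcode (the first three fields)."""
    return "-".join(barcode.split("-")[:3])


def sample_class(barcode):
    """Classify a sample-level barcode by its two-digit sample type code."""
    fields = barcode.split("-")
    if len(fields) < 4:
        raise ValueError(f"no sample field in barcode: {barcode}")
    code = int(fields[3][:2])
    if 1 <= code <= 9:
        return "tumour"
    if 10 <= code <= 19:
        return "normal"
    return "other"  # for example, control samples


# Hypothetical inputs: TNBC patients from the clinical data, barcodes from the expression data.
tnbc_patients = {"TCGA-XX-0001"}
expression_barcodes = [
    "TCGA-XX-0001-01A-11R-0000-07",  # tumour sample from a TNBC patient
    "TCGA-XX-0001-11A-11R-0000-07",  # matched normal from the same patient
    "TCGA-XX-0002-01A-11R-0000-07",  # tumour sample from a non-TNBC patient
]

tnbc_tumours = [
    barcode for barcode in expression_barcodes
    if patient_barcode(barcode) in tnbc_patients and sample_class(barcode) == "tumour"
]
print(tnbc_tumours)
```

The point of keeping the two helpers separate is that the clinical data is per patient while the expression data is per sample or aliquot, so a single patient can contribute both tumour and normal files.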
Edit (12th May 2018):
For efficient mapping of UUIDs to TCGA barcodes for the purposes of distinguishing tumour from normal, see here: C: Sample names for TCGA data from GDC-legacy archive