Question: How to download triple-negative breast cancer RNA-seq FPKM data from GDC
24 months ago, mtkk94 wrote:

Hello, this might be a silly question, but I couldn't find a tag to select triple-negative breast cancer samples in GDC. I've tried downloading the metadata of the RNA-seq files and also the clinical metadata, but couldn't find a tag. I've also downloaded a single clinical file and searched it for a tag, but no luck. I've even searched for posts about this problem but couldn't find any. Is there any way I can download triple-negative data?

Normally, I download data using the GDC tool. Is there any other tool that might help me download TCGA data specific to triple-negative breast cancer?

Please help me with this. Thanks in advance!

24 months ago, Kevin Blighe wrote:

Yes, the TCGA can be difficult to navigate.

In the clinical data, the TCGA doesn't specifically have a field for triple negative status, so, you have to infer it from the patient clinical data. The best place to download the clinical data in flat-file format is actually on the GDC Legacy Archive.

1, Determine samples that are TNBC (triple-negative breast cancer)

  • Under Cases -> Primary Site, select 'Breast'
  • Under Cases -> Project, select 'TCGA-BRCA'
  • Then make other selections based on your interest (for example, male or female breast cancer?; breast cancer in specific ethnic groups?)
  • Under Files -> Data Category, select 'Clinical'
  • Under Files -> Data Format, select 'Biotab'
  • Download the file called nationwidechildrens.org_clinical_patient_brca.txt
  • In this file, look at the columns 'er_status_by_ihc', 'pr_status_by_ihc', and 'her2_status_by_ihc'. Patients recorded as negative in all three are triple-negative (TNBC); note the UUIDs and TCGA barcodes of those records
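The filtering in the last bullet can be sketched in Python. This is a minimal sketch under assumptions, not a definitive implementation: it assumes the Biotab file is tab-separated, that it has a 'bcr_patient_barcode' column, and that a negative call is recorded as the string 'Negative'. The barcodes and rows in the example are made up. Matching on all three columns also skips any extra descriptive header rows, because those never read 'Negative'.

```python
import csv
import io

# Column names as quoted in the answer above.
STATUS_COLUMNS = ("er_status_by_ihc", "pr_status_by_ihc", "her2_status_by_ihc")


def select_tnbc(handle, id_column="bcr_patient_barcode"):
    """Return IDs of records whose ER, PR and HER2 IHC calls are all 'Negative'.

    `id_column` is an assumption about the file layout; change it if your
    file names the patient barcode column differently.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    selected = []
    for row in reader:
        calls = [(row.get(col) or "").strip().lower() for col in STATUS_COLUMNS]
        if all(call == "negative" for call in calls):
            selected.append(row[id_column])
    return selected


# Made-up example data, including one extra header-like row.
example = (
    "bcr_patient_barcode\ter_status_by_ihc\tpr_status_by_ihc\ther2_status_by_ihc\n"
    "CDE_ID:\tCDE_ID:\tCDE_ID:\tCDE_ID:\n"
    "TCGA-XX-0001\tNegative\tNegative\tNegative\n"
    "TCGA-XX-0002\tPositive\tNegative\tNegative\n"
    "TCGA-XX-0003\tNegative\tNegative\tEquivocal\n"
)
print(select_tnbc(io.StringIO(example)))  # ['TCGA-XX-0001']

# With the real file, something like:
# with open("nationwidechildrens.org_clinical_patient_brca.txt", newline="") as fh:
#     tnbc_patients = select_tnbc(fh)
```

This strict rule drops records with an equivocal, indeterminate, or missing call in any of the three columns, so you may want to review those separately rather than discard them silently.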

2, Obtain sample manifest

To then download actual RNA-seq data, you can stay on the GDC Legacy Archive under the Files tab, and then make further selections to choose the type of data you want. RNA-seq is usually available in multiple formats, including RSEM or HTSeq raw / estimated counts. Once you have selected samples that you want from the checkboxes, click 'Download Manifest'.

3, Download data using GDC Data Transfer Tool and sample manifest

Download the GDC Data Transfer Tool and run it with your sample manifest. The data will be downloaded to the directory from which you ran the executable. The command is: gdc-client download -m Manifest.txt
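If you prefer to drive the download from a script, the same command can be wrapped in Python. This is only a sketch: it assumes gdc-client is on your PATH, and the function names and the 'out_dir' argument are hypothetical helpers, not part of the tool. It relies on the behaviour described above, that files land in the directory the client is run from, by setting the working directory.

```python
import subprocess
from pathlib import Path


def build_download_command(manifest, client="gdc-client"):
    """Build the same command line as given above: gdc-client download -m <manifest>."""
    return [client, "download", "-m", str(manifest)]


def download(manifest, out_dir="."):
    """Run the client from `out_dir` so the downloaded files end up there."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Resolve the manifest path first, because the working directory changes.
    cmd = build_download_command(Path(manifest).resolve())
    subprocess.run(cmd, cwd=out, check=True)


print(build_download_command("Manifest.txt"))
# ['gdc-client', 'download', '-m', 'Manifest.txt']
```

Calling download("Manifest.txt", "tnbc_rnaseq") would then start the transfer; 'tnbc_rnaseq' is just an example directory name.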

4, Integrate the clinical data with the expression data

All I'll say here is to be very meticulous. You have to work with multiple ID types and it may take a while to get your head around it.
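As one illustration of the ID handling, the patient-level barcode is the first three hyphen-separated fields of a longer TCGA sample or aliquot barcode, so expression samples can be grouped under the patients selected from the clinical file. The sketch below assumes you already have TCGA barcodes for your expression files (not just UUIDs); the barcodes shown are made up and the function names are hypothetical.

```python
from collections import defaultdict


def patient_barcode(barcode):
    """Return the patient part of a TCGA barcode, e.g. 'TCGA-XX-0001'."""
    return "-".join(barcode.split("-")[:3])


def samples_for_patients(sample_barcodes, patient_barcodes):
    """Group sample barcodes by patient, keeping only the requested patients."""
    wanted = set(patient_barcodes)
    grouped = defaultdict(list)
    for sample in sample_barcodes:
        patient = patient_barcode(sample)
        if patient in wanted:
            grouped[patient].append(sample)
    return dict(grouped)


# Made-up barcodes for illustration only.
expression_samples = [
    "TCGA-XX-0001-01A-11R-0000-07",
    "TCGA-XX-0001-11A-11R-0000-07",
    "TCGA-XX-0002-01A-11R-0000-07",
]
tnbc_patients = ["TCGA-XX-0001"]
print(samples_for_patients(expression_samples, tnbc_patients))
```

A patient can map to more than one expression sample, which is why the result is a list per patient rather than a single value.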


Edit (12th May 2018):

For efficient mapping of UUIDs to TCGA barcodes for the purposes of distinguishing tumour from normal, see here: C: Sample names for TCGA data from GDC-legacy archive


Thank you for the solution. I've downloaded the clinical file, and it was difficult to map the IDs in the clinical data to those in the expression matrix. Fortunately, I know how to code and managed to map them. Also, the clinical file I obtained doesn't have any information regarding control/normal samples.

written 19 months ago by mtkk94

Yes, determining tumour versus normal can be difficult. You should search for a particular file name, UUID, or something else using the Data Portal and then infer tumour or normal via the TCGA barcode.

In the clinical data that you've downloaded, most likely each record (row) relates to a patient, which itself relates to multiple tumour biopsies and [most likely] a normal.
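The barcode-based inference can be sketched as follows. In a TCGA barcode, the fourth hyphen-separated field starts with a two-digit sample type code: 01 to 09 are tumour types, 10 to 19 are normal types, and 20 to 29 are control samples. The function name and the example barcodes are made up for illustration.

```python
def sample_class(barcode):
    """Classify a TCGA barcode as 'tumour', 'normal', 'control' or 'unknown'."""
    fields = barcode.split("-")
    if len(fields) < 4 or not fields[3][:2].isdigit():
        # Patient-level barcodes carry no sample type code.
        raise ValueError(f"no sample type code in barcode: {barcode!r}")
    code = int(fields[3][:2])
    if 1 <= code <= 9:
        return "tumour"
    if 10 <= code <= 19:
        return "normal"
    if 20 <= code <= 29:
        return "control"
    return "unknown"


# Made-up barcodes for illustration only.
print(sample_class("TCGA-XX-0001-01A-11R-0000-07"))  # tumour
print(sample_class("TCGA-XX-0001-11A-11R-0000-07"))  # normal
```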

It is cumbersome working with this data, I know.

written 19 months ago by Kevin Blighe

Not sure if you are still watching this thread, mtkk94, but I have edited the final part of my answer with new information on a very rapid way to map UUIDs to TCGA Barcodes, which can then be used to distinguish tumor from normal.

written 17 months ago by Kevin Blighe