Tutorial: Protocol To Download TCGA Data From GDC
4.3 years ago, Shicheng Guo wrote:

Now that TCGA has moved under the Genomic Data Commons (GDC), almost all previous users are struggling to retrieve the same information. This tutorial tries to show how to download TCGA data from GDC.

Step 1. Obtaining a Manifest File for Data Download (the manifest specifies which data files to download)


Step 2. Install the download software: GDC Data Transfer Tool (Linux, Windows, macOS)


Step 3.1 Downloading Data Using a Manifest File (gdc_manifest.lungCancer.txt)

gdc-client download -m gdc_manifest.lungCancer.txt

Step 3.2 Downloading a Single File Using a UUID (UUIDs are listed in the manifest file)

gdc-client download 22a29915-6712-4f7a-8dba-985ae9a1f005
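To pick UUIDs out of a manifest for single-file downloads, something like the following minimal sketch may help. It assumes the manifest is a tab-separated file whose header includes an `id` column (the usual GDC manifest layout: id, filename, md5, size, state); the function name and the example row are made up for illustration.

```python
import csv
import io


def read_manifest_ids(handle):
    """Return the UUIDs found in the 'id' column of a GDC manifest."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["id"] for row in reader if row.get("id")]


# Hypothetical one-row manifest, used only to illustrate the layout.
example = io.StringIO(
    "id\tfilename\tmd5\tsize\tstate\n"
    "22a29915-6712-4f7a-8dba-985ae9a1f005\texample.txt.gz\t"
    "d41d8cd98f00b204e9800998ecf8427e\t0\treleased\n"
)
uuids = read_manifest_ids(example)
print(uuids)
```

With a real manifest, open the file and pass the handle instead of the in-memory example, then feed any of the returned UUIDs to gdc-client download.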

Step 3.3 Downloading Controlled Data (a user authentication token file is required; its path is given after -t)

gdc-client download -m gdc_manifest_controled.txt -t


1, ./gdc-client: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /tmp/_MEI5oSpPi/libz.so.1)

Answer: glibc 2.12 is the latest version available for CentOS 6, which means the prebuilt gdc-client binary cannot be used on CentOS 6 to download the data (e.g., UCSD TSCC).

2, How to download controlled data from GDC

3, Eventually, I asked TSCC manager to help me install fastq-dump in TSCC

4, Downloads sometimes fail because of network problems, but don't worry, just try again
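The "just try again" advice can be automated. The sketch below is one possible approach, not part of gdc-client itself: it reruns the manifest download a few times and assumes that gdc-client exits with a non-zero status when a download fails. The function name, the attempt count, and the wait time are illustrative choices.

```python
import subprocess
import time


def download_with_retries(manifest_path, max_attempts=3, wait_seconds=30,
                          run=subprocess.call, sleep=time.sleep):
    """Run `gdc-client download -m <manifest>` until it exits with code 0.

    Returns the attempt number that succeeded, or None if every attempt failed.
    Assumption: a failed download makes gdc-client return a non-zero exit code.
    """
    command = ["gdc-client", "download", "-m", manifest_path]
    for attempt in range(1, max_attempts + 1):
        if run(command) == 0:
            return attempt
        if attempt < max_attempts:
            sleep(wait_seconds)  # give a flaky connection time to recover
    return None
```

Calling download_with_retries("gdc_manifest.lungCancer.txt") would then try up to three times. The run and sleep parameters exist only so the loop can be exercised without the real tool.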

modified 2.1 years ago by ATpoint • written 4.3 years ago by Shicheng Guo

Thanks for sharing. Could you please give some more detail about:

  1. How to extract different data types (expression, methylation, clinical, etc.). Is the manifest file the same for all data types?
  2. Is it possible to download an expression matrix (all samples in a single file) with TCGA tumor IDs instead of UUIDs?
modified 4.3 years ago • written 4.3 years ago by Mike

I downloaded expression data for TCGA-ESCA. There are 164 cases for this cancer, but there are 519 files with the extensions *.FPKM.txt.gz, *.FPKM-UQ.txt.gz, and *.htseq.counts.gz. How do I map them to tumor ID / aliquot ID and build an expression matrix (all tumor samples by all genes)?

written 4.3 years ago by Mike

I had a similar issue with mapping all file_ids to one case_id.

The GDC API Getting_Started page indicated that I can "expand" a section for the "cases" endpoint, and voilà, I got the case_id <===> file_id mapping:

Example (find all files available for case_id = 31bd8589-378c-40e5-8b7f-3b4c81f304be) :

curl -s 'https://gdc-api.nci.nih.gov/cases/31bd8589-378c-40e5-8b7f-3b4c81f304be?pretty=true&expand=files' | grep -E 'file_id|file_name' | paste -d " "  - -

        "file_name": "323800b5-c319-4fd8-ac96-87193afb93e4.FPKM.txt.gz",          "file_id": "e400f345-b273-4cfc-9a1e-d1fff79f5eee",
        "file_name": "3b600545-75cb-42df-ad6d-3b5c977ff7d5.vep.reheader.vcf.gz",          "file_id": "3b600545-75cb-42df-ad6d-3b5c977ff7d5",
        "file_name": "e5b0c8fa-2b7e-4140-87d9-a5046490a08b.snp.Somatic.hc.vcf.gz",          "file_id": "e5b0c8fa-2b7e-4140-87d9-a5046490a08b",
        "file_name": "60c334bb-d579-4cf3-9fd0-e450c3e652d8.vep.reheader.vcf.gz",          "file_id": "60c334bb-d579-4cf3-9fd0-e450c3e652d8",
        "file_name": "c6b1fb77-8102-42bb-bdc0-a48270b7be9f.vcf.gz",          "file_id": "c6b1fb77-8102-42bb-bdc0-a48270b7be9f",
        "file_name": "mirnas.quantification.txt",          "file_id": "440b3abb-63e1-4a67-9708-31ee19081ec7",
        "file_name": "TCGA.READ.mutect.c49c62e7-dec8-4b77-9ba5-88d196c8ae94.protected.maf.gz",          "file_id": "c49c62e7-dec8-4b77-9ba5-88d196c8ae94",
        "file_name": "nationwidechildrens.org_clinical.TCGA-AG-A026.xml",          "file_id": "1fda4f40-ad4e-4b91-9379-c61b611769ee",
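The reverse lookup (from file UUIDs to TCGA barcodes) can be sketched in Python as well. This is a minimal sketch rather than the method used in the post above: it assumes the files endpoint at https://api.gdc.cancer.gov/files (the current API host, as far as I know), and that `cases.submitter_id` holds the TCGA case barcode. The function names are made up for illustration.

```python
import json
import urllib.request

# Assumption: current GDC API host; the thread itself used gdc-api.nci.nih.gov.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_query(file_ids):
    """Build the JSON body asking for the case barcode of each file UUID."""
    file_ids = list(file_ids)
    return {
        "filters": {
            "op": "in",
            "content": {"field": "file_id", "value": file_ids},
        },
        "fields": "file_id,file_name,cases.submitter_id",
        "format": "JSON",
        "size": str(len(file_ids)),
    }


def parse_hits(response):
    """Turn a decoded API response into {file_id: [case barcodes]}."""
    mapping = {}
    for hit in response["data"]["hits"]:
        cases = hit.get("cases", [])
        mapping[hit["file_id"]] = [case["submitter_id"] for case in cases]
    return mapping


def fetch_mapping(file_ids):
    """POST the query to the GDC API and return the parsed mapping."""
    body = json.dumps(build_query(file_ids)).encode("utf-8")
    request = urllib.request.Request(
        GDC_FILES_ENDPOINT, data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return parse_hits(json.load(reply))
```

fetch_mapping takes the UUIDs from the first column of a manifest; build_query and parse_hits are kept separate so they can be checked without network access.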
modified 3.9 years ago • written 3.9 years ago by indera

1, The manifest is built from whatever you selected for download in the first stage (adding files to the cart on the GDC website selects them), so it is not the same for all data types.

2, No, you cannot download the data matrix directly. You need to download all the files and then merge them with Perl, R, Python, or C
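As one way to do the merge in Python, here is a minimal sketch. It assumes each downloaded file is a gzipped, two-column, tab-separated table (gene ID, value), as the *.htseq.counts.gz files are, and that you supply your own sample labels; the function names and the "NA" placeholder are illustrative choices. HTSeq summary rows, whose names start with "__", are skipped.

```python
import csv
import gzip


def read_counts(path):
    """Read one gzipped two-column file into {gene_id: value}."""
    counts = {}
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2 or row[0].startswith("__"):
                continue  # skip malformed lines and HTSeq summary rows
            counts[row[0]] = row[1]
    return counts


def merge_counts(paths_by_sample):
    """Merge per-sample files into rows of [gene_id, value per sample]."""
    samples = sorted(paths_by_sample)
    per_sample = {s: read_counts(paths_by_sample[s]) for s in samples}
    genes = sorted(set().union(*[set(c) for c in per_sample.values()]))
    rows = [[g] + [per_sample[s].get(g, "NA") for s in samples] for g in genes]
    return samples, rows


def write_matrix(samples, rows, out_path):
    """Write the merged matrix as a tab-separated file with a header row."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["gene_id"] + samples)
        writer.writerows(rows)
```

paths_by_sample is a dictionary from a sample label (for example a TCGA barcode) to the path of that sample's file. Merge one file type at a time so that counts, FPKM, and FPKM-UQ values do not end up in the same matrix.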

modified 4.3 years ago • written 4.3 years ago by Shicheng Guo

Thanks @Shicheng Guo.

Why are there three different types of files (*.FPKM.txt.gz, *.FPKM-UQ.txt.gz, and *.htseq.counts.gz)? I have 519 directories for 164 cases, whereas I expected 164 * 3 = 492, so how do I merge them? And how do I match UUIDs to TCGA sample IDs?

modified 4.3 years ago • written 4.3 years ago by Mike

These are three different files, one for each stage of the same pipeline:

Fragment Count (HT-Seq) -> Gene Count -> Count Normalization -> FPKM -> Upper Quartile Normalization -> FPKM-UQ
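To make the relationship between the three values concrete, here is a small sketch of the two normalizations. The formulas are written as I recall them from the GDC mRNA pipeline documentation and should be treated as an assumption to verify there. The function and parameter names are made up for illustration.

```python
def fpkm(read_count, gene_length, protein_coding_reads):
    """FPKM = read_count * 1e9 / (protein_coding_reads * gene_length).

    read_count: reads mapped to the gene (the HTSeq count)
    gene_length: gene length in base pairs
    protein_coding_reads: reads mapped to all protein-coding genes in the sample
    """
    return read_count * 1e9 / (protein_coding_reads * gene_length)


def fpkm_uq(read_count, gene_length, upper_quartile_count):
    """FPKM-UQ replaces the total with the sample's 75th-percentile gene count."""
    return read_count * 1e9 / (upper_quartile_count * gene_length)


# Made-up numbers, only to show the arithmetic.
example_fpkm = fpkm(1000, 2000, 50_000_000)
example_fpkm_uq = fpkm_uq(1000, 2000, 4000)
print(example_fpkm, example_fpkm_uq)
```

Both values start from the same raw count, which is why every sample comes with a counts file, an FPKM file, and an FPKM-UQ file.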


But how do I map UUIDs to TCGA patient barcode IDs?

modified 4.3 years ago • written 4.3 years ago by Mike

Hi! I had the same UUID-to-TCGA-ID problem. I solved it by using R to write a JSON query that is then used on the command line.

Here I wrote a post about it

Hope it is useful!

written 4.1 years ago by martinguerrerog89300

I really think something needs to be done about the gdc-client tool. I cannot install it on Mac OS X Sierra. I downloaded the tool more than five times, unpacked it, and double-clicked it, and all I get is the same error, given below:

Musalulas-MacBook-Pro:~ sinkala$ /Users/sinkala/Downloads/gdc-client ; exit;
usage: gdc-client [-h] [--version] {download,upload,interactive} ...
gdc-client: error: too few arguments
logout
Saving session...
...copying shared history...
...saving history...truncating history files...
...completed.

[Process completed]

I have also tried the alternative ways of installing it, without success. I have tried to download the data directly from the data portal; even that does not work for a file smaller than 400 MB (the server does not respond, or something like that). :( :(

written 3.3 years ago by smsinks

The gdc-client is a command-line tool; you cannot just double-click on it. See https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Data_Download_and_Upload/. If you have other problems, feel free to contact the GDC support staff: support@nci-gdc.datacommons.io.

written 3.0 years ago by Sean Davis

Hi, I am new to BioStars, so apologies for any syntax errors.

After trying to start the gdc-client.exe application, it shows the following error (and then disappears):

usage: gdc-client [-h] [--version] {download,upload,interactive} 
gdc-client: error: too few arguments

How to solve:

You need to run the program from the command line, the hideous interface summoned by typing 'cmd' into the Start menu. You first need to add the folder that contains the unzipped "gdc-client.exe" file to your Path.

Here is a guide: https://www.wikihow.com/Run-a-Program-on-Command-Prompt

After doing this you can start using commands for the gdc-client program. For example, type in the "gdc-client download" command followed by "-m" for manifest, then the file location:

gdc-client download -m  /Users/JohnDoe/Downloads/gdc_manifest_6746fe840d924cf623b4634b5ec6c630bd4c06b5.txt

If you don't know how to make a manifest go here: http://www.andrewjanowczyk.com/download-tcga-digital-pathology-images-ffpe/


If you start getting error messages about there being 'no such file or directory' try dragging the manifest file into the same folder your gdc-client.exe application is in, then simply type in the command followed by the manifest file name:

gdc-client download -m gdc_manifest_20181207_182951.txt

Now the manifest does not need its location specified. This works when the command prompt's current directory is that folder, because a bare file name is looked up in the current directory rather than on the Path. It will then start downloading (hopefully)!


modified 23 months ago • written 23 months ago by b.simpson

Install download software: GDC Data Transfer Tool (Linux)

Could someone please help me? I'm having trouble installing gdc-client on Ubuntu 14. I downloaded the gdc zip file and extracted it. After that, I typed ./ gdc-client in the shell, but nothing happened.

written 4.3 years ago by reimco

You can always contact support@nci-gdc.datacommons.io for support.

written 3.0 years ago by Sean Davis

Try using chmod to make the file executable before running it (for example, chmod +x gdc-client).

written 2.6 years ago by priyankamaripuri

I'm using gdc-client v1.2.0. I specifically sorted my manifest file by patient ID so that I could download tumor-normal pair BAMs one after another. In reality, the BAMs were downloaded in a seemingly random order rather than from the top to the bottom of my sorted manifest file. Do other people have the same problem? Is there a way to fix it?

written 3.8 years ago by CHANG

Hi, I have downloaded the GDC client tool to download files from GDC. As I understand it, the download folder should contain the data (or zipped data) and a logs folder. My files downloaded successfully, but I see only a few logs folders. For example, the manifest file for bladder urothelial carcinoma (BLCA) includes 433 UUIDs, but only 53 logs folders were found. Could you let me know whether the downloaded files are accurate?

Thank you.

written 3.0 years ago by sahu.divya7860
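One way to check a finished download independently of the logs folders is to compare each file's MD5 checksum with the one in the manifest. The sketch below is an illustration under two assumptions: the manifest has id, filename, and md5 columns, and gdc-client stored each file as <download_dir>/<id>/<filename>. The function names and status strings are made up.

```python
import csv
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(manifest_path, download_dir):
    """Return {uuid: 'ok' | 'missing' | 'md5 mismatch'} for every manifest row."""
    status = {}
    with open(manifest_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            # Assumed layout: one sub-folder per UUID holding the data file.
            path = os.path.join(download_dir, row["id"], row["filename"])
            if not os.path.isfile(path):
                status[row["id"]] = "missing"
            elif md5_of(path) != row["md5"]:
                status[row["id"]] = "md5 mismatch"
            else:
                status[row["id"]] = "ok"
    return status
```

Any UUID that is not reported as "ok" can be downloaded again on its own with gdc-client download followed by the UUID.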

Does someone know how to fix this error?

 92% [##########################ERROR: Max retries exceeded.:02:27  16.23 MB/s 
ERROR: Max retries exceeded.
ERROR: Max retries exceeded.
ERROR: Max retries exceeded.
written 2.1 years ago by Shixiang70

A network problem? If it fails after several tries, I guess it is a network problem. Otherwise, check the hard-disk quota.

written 2.1 years ago by Shicheng Guo
3.0 years ago, Sean Davis (National Institutes of Health, Bethesda, MD) wrote:

If you are looking for a flexible programmatic approach, you might take a look at the GenomicDataCommons Bioconductor package: https://bioconductor.org/packages/GenomicDataCommons

Find data

The following code builds a manifest that can be used to guide the download of raw data. Here, filtering finds gene expression files quantified as raw counts using HTSeq from ovarian cancer patients.

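library(GenomicDataCommons)
library(magrittr)  # provides %>%

ge_manifest = files() %>%
    filter( ~ cases.project.project_id == 'TCGA-OV' &
                type == 'gene_expression' &
                analysis.workflow_type == 'HTSeq - Counts') %>%
    manifest()  # assumed final step: the source was cut off after the pipe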

Download data

The next code block downloads the 379 gene expression files specified in the query above. Using multiple processes to do the download very significantly speeds up the transfer in many cases. On a standard 1Gb connection, the following completes in about 30 seconds.

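destdir = tempdir()
# The remaining arguments were cut off in the source; the call is closed
# here with gdcdata's defaults, so destdir is not passed on.
fnames = lapply(ge_manifest$id, gdcdata)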

If the download had included controlled-access data, the download above would have needed to include a token.

written 3.0 years ago by Sean Davis

Sean, for recent requests of access to the data, it seems that users are forwarded here: https://dcc.icgc.org/

From there, approved users can obtain an access token but it seems to not cover all data. Most importantly, it doesn't cover the mirror where TCGA data is hosted (GDC Chicago). How does one actually obtain a GDC access token? A lot of the programs and services appear to have been shut.

written 2.6 years ago by Kevin Blighe

The ICGC is not the right place to get access to TCGA controlled-access data, as you point out. To gain access to controlled-access TCGA data, one needs to apply through dbGaP. The process is documented here:


After approval for controlled-access data, you can log in to the GDC data portal to get your access token (the download link will be under your username after logging in).

written 2.6 years ago by Sean Davis

Thanks Sean - that's what I expected. We are currently awaiting dbGaP approval.

modified 21 months ago • written 2.6 years ago by Kevin Blighe
3.9 years ago, Chun-Jie Liu (US, Houston) wrote:

For CentOS, you need to download the gdc-client source code and compile it yourself.

A gdc-client GitHub issue reports this problem: glibc 2.12 is the latest version available for CentOS 6.

If your system is CentOS release 6.6, I think you should download the gdc-client source code and compile it yourself. gdc-client is based on Python 2.

  1. git clone https://github.com/NCI-GDC/gdc-client
  2. python setup.py install

You may run into these errors:

The 'lxml==3.5.0b1' distribution was not found and is required by gdc-client


ImportError: /usr/lib64/libxml2.so.2: version `LIBXML2_2.9.0' not found (required by lxml/etree.so)

To fix them, install libxslt and libxml2 under your home directory and add xml2-config and xslt-config to your PATH:

export PATH="/prog_path/libxslt-1.1.29/bin:/prog_path/libxml2-2.9.4/bin:$PATH"

Then reinstall lxml:

  1. pip uninstall lxml
  2. pip install lxml==3.5.0b1 --install-option="--auto-rpath"

Finally, compile the gdc-client source code.

  1. python setup.py install

It worked.

modified 3.9 years ago • written 3.9 years ago by Chun-Jie Liu
4.3 years ago, Shicheng Guo wrote:

Take bladder cancer as an example:

1, Go to the following link (legacy archive at GDC):


2, Add all 440 files to the cart and download the manifest file

3, The first and second columns of the manifest file are the UUID and the sample ID


modified 4.3 years ago • written 4.3 years ago by Shicheng Guo

Hi Shicheng

Thanks for the detailed view. One more clarification: when I try to download the WGS (whole-genome sequencing) data for, say, breast cancer (TCGA-BRCA) from GDC Legacy, the second column of the manifest file contains some IDs that are not sample IDs. What are those? For example:

01aa8d222c93eac50081544889046aeb.bam 01e2ea9ed2554ea6df56ed963414b511.bam

If these are also samples, how do I retrieve their corresponding TCGA IDs?

Thanks in advance.

written 4.0 years ago by aanchalsharma8330

GDC provides an API, and you can get this information by querying it. I wrote a simple script, available on my GitHub, that maps file_id to TCGA barcode (submitter_id in GDC). The TCGA barcode encodes sample information, so the script extracts both the sample type and the TCGA barcode.

The input is the manifest file you downloaded from GDC. The output is the mapping file, whose title is generated automatically by the GDC API. Hope it is useful.

modified 3.9 years ago • written 4.0 years ago by Chun-Jie Liu