Tutorial: Protocol to Download TCGA Data from GDC
Written 2.6 years ago by Shicheng Guo (7.4k):

Now that TCGA has moved under the Genomic Data Commons (GDC), almost all previous users are struggling to retrieve the same information. This tutorial shows how to download TCGA data from GDC.

Step 1. Obtaining a Manifest File for Data Download (the manifest is used to specify which data to download)

https://gdc-portal.nci.nih.gov/legacy-archive/search/f

Step 2. Install the download software: GDC Data Transfer Tool (Linux, Windows, macOS)

https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

Step 3.1 Downloading Data Using a Manifest File (gdc_manifest.lungCancer.txt)

gdc-client download -m gdc_manifest.lungCancer.txt

Step 3.2 Downloading a Single File Using a UUID (the UUID can be found in the manifest file)

gdc-client download 22a29915-6712-4f7a-8dba-985ae9a1f005
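
The UUIDs for single-file downloads can be pulled straight out of the manifest. A minimal sketch in Python, assuming the manifest is a tab-separated file with a header row whose first two columns are `id` and `filename` (the layout of manifests exported from the GDC portal, as far as I have seen); `read_manifest` and `manifest_uuids` are illustrative names, not part of gdc-client:

```python
import csv


def read_manifest(path):
    """Return the rows of a GDC manifest as a list of dicts keyed by column name."""
    with open(path, newline="") as handle:
        # Assumption: tab-separated with a header row (id, filename, md5, size, state).
        return list(csv.DictReader(handle, delimiter="\t"))


def manifest_uuids(path):
    """Return the UUIDs (the `id` column) in the order they appear in the manifest."""
    return [row["id"] for row in read_manifest(path)]
```

Each returned UUID can then be passed to gdc-client download as in Step 3.2.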

Step 3.3 Downloading Controlled Data (user authentication token is required)

gdc-client download -m gdc_manifest_controled.txt -t gdc-user-passwdcode.txt
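
When the downloads are driven from a script, the same command line can be assembled programmatically. This is only a sketch: it uses the `-m` and `-t` options shown above and assumes `gdc-client` is on the PATH; the helper names are hypothetical.

```python
import subprocess


def build_download_command(manifest, token_file=None, client="gdc-client"):
    """Assemble the gdc-client argument list for a manifest download."""
    command = [client, "download", "-m", manifest]
    if token_file is not None:
        # -t points at the file that holds the user authentication token.
        command += ["-t", token_file]
    return command


def run_download(manifest, token_file=None):
    """Run the download; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_download_command(manifest, token_file), check=True)
```

Leaving `token_file` unset gives the open-access command from Step 3.1.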

FAQ:

1, ./gdc-client: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /tmp/_MEI5oSpPi/libz.so.1)

Answer: glibc 2.12 is the latest version available for CentOS 6, which means the prebuilt gdc-client binary cannot be used to download the data on CentOS 6 systems (for example, TSCC at UCSD).

2, How do I download controlled data from GDC?

3, Eventually, I asked the TSCC manager to help me install fastq-dump on TSCC.

4, Downloads sometimes fail because of network problems, but don't worry, just try again.
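
"Just try again" can be automated with a small retry wrapper. A minimal sketch, not tied to gdc-client: `retry` is a hypothetical helper that calls any function until it succeeds or the attempts run out.

```python
import time


def retry(action, attempts=3, delay_seconds=0.0):
    """Call action() until it succeeds; re-raise the last error after `attempts` tries."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last failure
            time.sleep(delay_seconds)  # wait before the next try
```

For example, `retry(lambda: subprocess.run(["gdc-client", "download", "-m", "gdc_manifest.lungCancer.txt"], check=True), attempts=5, delay_seconds=30)` would rerun the manifest download up to five times.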

Written 2.6 years ago by Shicheng Guo (7.4k), modified 3 months ago by ATpoint (13k)

Thanks for sharing. Could you please give some more detail about:

  1. How to extract different data types (expression, methylation, clinical, etc.). Is the manifest file the same for all data types?
  2. Is it possible to download an expression matrix (all samples in a single file) with TCGA tumor IDs instead of UUIDs?
Reply written 2.6 years ago by Mike (1.1k), modified 2.6 years ago

I downloaded expression data for TCGA-ESCA. There are 164 cases for this cancer, but there are 519 files with the extensions *.FPKM.txt.gz, *.FPKM-UQ.txt.gz and *.htseq.counts.gz. How do I map them to the tumor ID / aliquot ID and build an expression matrix (all tumor samples by all genes)?

Reply written 2.6 years ago by Mike (1.1k)

I had a similar issue with mapping all file_ids to one case_id.

The GDC API Getting_Started page indicates that I can "expand" a section of the "cases" endpoint, and voilà, I got the case_id <===> file_id mapping:

Example (find all files available for case_id = 31bd8589-378c-40e5-8b7f-3b4c81f304be):

curl -s 'https://gdc-api.nci.nih.gov/cases/31bd8589-378c-40e5-8b7f-3b4c81f304be?pretty=true&expand=files' | grep -E 'file_id|file_name' | paste -d " "  - -

        "file_name": "323800b5-c319-4fd8-ac96-87193afb93e4.FPKM.txt.gz",          "file_id": "e400f345-b273-4cfc-9a1e-d1fff79f5eee",
        "file_name": "3b600545-75cb-42df-ad6d-3b5c977ff7d5.vep.reheader.vcf.gz",          "file_id": "3b600545-75cb-42df-ad6d-3b5c977ff7d5",
        "file_name": "e5b0c8fa-2b7e-4140-87d9-a5046490a08b.snp.Somatic.hc.vcf.gz",          "file_id": "e5b0c8fa-2b7e-4140-87d9-a5046490a08b",
        "file_name": "60c334bb-d579-4cf3-9fd0-e450c3e652d8.vep.reheader.vcf.gz",          "file_id": "60c334bb-d579-4cf3-9fd0-e450c3e652d8",
        "file_name": "c6b1fb77-8102-42bb-bdc0-a48270b7be9f.vcf.gz",          "file_id": "c6b1fb77-8102-42bb-bdc0-a48270b7be9f",
        "file_name": "mirnas.quantification.txt",          "file_id": "440b3abb-63e1-4a67-9708-31ee19081ec7",
        "file_name": "TCGA.READ.mutect.c49c62e7-dec8-4b77-9ba5-88d196c8ae94.protected.maf.gz",          "file_id": "c49c62e7-dec8-4b77-9ba5-88d196c8ae94",
        "file_name": "nationwidechildrens.org_clinical.TCGA-AG-A026.xml",          "file_id": "1fda4f40-ad4e-4b91-9379-c61b611769ee",
Reply written 2.1 years ago by indera (0), modified 2.1 years ago
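
Instead of grepping the pretty-printed output, the JSON returned by the cases endpoint can be parsed directly. A minimal sketch, assuming the response for `/cases/<case_id>?expand=files` nests the file records under `data` and then `files`, each with `file_id` and `file_name` keys (which matches the output above but is otherwise an assumption); `file_names_by_id` is a hypothetical helper:

```python
import json


def file_names_by_id(response_text):
    """Map file_id -> file_name from a /cases/<case_id>?expand=files response body."""
    payload = json.loads(response_text)
    # Assumption: file records sit under data -> files.
    files = payload.get("data", {}).get("files", [])
    return {entry["file_id"]: entry["file_name"] for entry in files}
```

The function takes the response body as text, so it works with the output of the curl command above once `?pretty=true` formatting is left as is.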

1, The manifest is built from what you choose to download: it contains whatever you selected in the first stage (adding a file to the cart on the GDC website means it was selected), so it is not the same for all data types.

2, No, you cannot download the data matrix. You need to download all the files and then merge them with Perl, R, Python or C.
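
The merge step could look roughly like this in Python. It is a sketch under stated assumptions: each per-sample file is a gzipped, tab-separated, two-column table (gene ID, count) like the *.htseq.counts.gz files, rows whose ID starts with `__` are HTSeq summary rows, and genes missing from a sample are filled with 0. The function names are illustrative.

```python
import csv
import gzip


def read_counts(path):
    """Read one gzipped two-column (gene_id, count) file into a dict."""
    counts = {}
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) != 2 or row[0].startswith("__"):
                continue  # skip blank lines and summary rows such as __no_feature
            counts[row[0]] = int(row[1])
    return counts


def merge_counts(paths_by_sample):
    """Return (samples, matrix); matrix maps gene_id -> counts in sample order."""
    samples = sorted(paths_by_sample)
    per_sample = [read_counts(paths_by_sample[name]) for name in samples]
    genes = sorted(set().union(*per_sample))
    # Design choice: a gene absent from a sample gets a count of 0.
    matrix = {gene: [counts.get(gene, 0) for counts in per_sample] for gene in genes}
    return samples, matrix
```

`paths_by_sample` maps whatever sample label you want as the column name (for example a TCGA barcode) to the path of that sample's file.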

Reply written 2.6 years ago by Shicheng Guo (7.4k), modified 2.6 years ago

Thanks @Shicheng Guo;

Why are there three different types of files (*.FPKM.txt.gz, *.FPKM-UQ.txt.gz and *.htseq.counts.gz)? I have 519 directories for 164 cases, so how do I merge them? There should be 164 * 3 = 492. And how do I match UUIDs to TCGA sample IDs?

Reply written 2.6 years ago by Mike (1.1k), modified 2.6 years ago

These are three different files:

Fragment Count (HT-Seq) -> Gene Count -> Count Normalization -> FPKM -> Upper Quartile Normalization -> FPKM-UQ

https://gdc.nci.nih.gov/about-data/data-harmonization-and-generation/genomic-data-harmonization/high-level-data-generation/rna-seq-quantification
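
The relationship between the three file types can be written out as arithmetic. The formulas below are my reading of the GDC RNA-Seq documentation linked above and should be checked against it: FPKM divides a gene's read count by the gene length and by the total count over protein-coding genes, and FPKM-UQ replaces that total with the 75th-percentile (upper quartile) gene count of the sample. Function and parameter names are illustrative.

```python
def fpkm(gene_count, gene_length_bp, total_protein_coding_count):
    """FPKM = count * 1e9 / (total protein-coding count * gene length in bp)."""
    return gene_count * 1e9 / (total_protein_coding_count * gene_length_bp)


def fpkm_uq(gene_count, gene_length_bp, upper_quartile_count):
    """FPKM-UQ: same as FPKM, with the sample's 75th-percentile gene count as denominator."""
    return gene_count * 1e9 / (upper_quartile_count * gene_length_bp)
```

For instance, a gene with 10 reads and a length of 1000 bp in a sample with 1,000,000 protein-coding reads gets an FPKM of 10 under this formula.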

But how do I map UUIDs to TCGA patient barcodes?

Reply written 2.6 years ago by Mike (1.1k), modified 2.6 years ago

Hi! I had the same problem mapping UUIDs to TCGA IDs. I solved it by using R to write a JSON query that is then used on the command line.

I wrote a post about it here.

Hope it is useful!
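
The same idea in Python, as a hedged sketch rather than the method from that post: build a JSON request body for the GDC `files` endpoint that filters on file UUIDs and asks for the sample barcodes, then read the mapping out of the response. The field names (`file_id`, `cases.samples.submitter_id`, `cases.samples.sample_type`) and the `data` -> `hits` response layout are assumptions to verify against the GDC API documentation; the helper names are hypothetical.

```python
import json


def build_files_query(file_ids):
    """Build a request body for the GDC files endpoint asking for sample barcodes."""
    ids = list(file_ids)
    return {
        "filters": {"op": "in", "content": {"field": "file_id", "value": ids}},
        "fields": "file_id,cases.samples.submitter_id,cases.samples.sample_type",
        "format": "JSON",
        "size": len(ids),  # one hit per requested file
    }


def barcodes_by_file(response):
    """Map file_id -> list of (sample barcode, sample type) from a parsed response."""
    mapping = {}
    for hit in response["data"]["hits"]:
        mapping[hit["file_id"]] = [
            (sample.get("submitter_id"), sample.get("sample_type"))
            for case in hit.get("cases", [])
            for sample in case.get("samples", [])
        ]
    return mapping
```

`json.dumps(build_files_query(ids))` gives the body to POST (with a JSON content type) to the files endpoint; the parsed reply goes into `barcodes_by_file`.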

Reply written 2.3 years ago by martinguerrerog89260

I really think something needs to be done about the gdc-client tool. I cannot install it on Mac OS X Sierra. I downloaded the tool more than five times, unpacked it and double-clicked it, and all I get is the same error, given below:

Musalulas-MacBook-Pro:~ sinkala$ /Users/sinkala/Downloads/gdc-client ; exit;
usage: gdc-client [-h] [--version] {download,upload,interactive} ...
gdc-client: error: too few arguments
logout
Saving session...
...copying shared history...
...saving history...truncating history files...
...completed.

[Process completed]

I have also tried the alternative ways of installing it, but without success. I have tried to download the data directly from the data portal; even that does not work for a file smaller than 400 MB (the server does not respond, or something like that). :( :(

Reply written 18 months ago by smsinks (10)

The gdc-client is a command-line tool. You cannot just double-click on it. See https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Data_Download_and_Upload/. If you have other problems, feel free to contact the gdc support staff: support@nci-gdc.datacommons.io.

Reply written 14 months ago by Sean Davis (25k)

Hi, I am new to BioStars so apologies for any syntax errors.

So after trying to start the gdc-client.exe application it presents with the following error (then disappears):

usage: gdc-client [-h] [--version] {download,upload,interactive} 
gdc-client: error: too few arguments

How to solve:

You need to run the program from the command line (the hideous interface summoned by typing 'cmd' into the Start menu). You first need to set a path to the folder that contains the unzipped "gdc-client.exe" file.

Here is a guide: https://www.wikihow.com/Run-a-Program-on-Command-Prompt

After doing this you can start using commands for the gdc-client program. For example, type the "gdc-client download" command followed by "-m" for the manifest, then the file location:

gdc-client download -m  /Users/JohnDoe/Downloads/gdc_manifest_6746fe840d924cf623b4634b5ec6c630bd4c06b5.txt

If you don't know how to make a manifest go here: http://www.andrewjanowczyk.com/download-tcga-digital-pathology-images-ffpe/

Finally

If you start getting error messages about there being 'no such file or directory', try dragging the manifest file into the same folder as your gdc-client.exe application, then simply type the command followed by the manifest file name:

gdc-client download -m gdc_manifest_20181207_182951.txt

Now that the file is in the folder your path is set to, it doesn't need the location specified. It will then start downloading (hopefully)!

Enjoy

Reply written 8 weeks ago by b.simpson (0), modified 8 weeks ago

Install download software: GDC Data Transfer Tool (Linux)

Could someone please help me? I'm having trouble installing gdc-client on Ubuntu 14. I downloaded the gdc zip file and extracted it. After that, in the shell, I typed ./ gdc-client, but nothing happened.

Reply written 2.5 years ago by reimco (20)

You can always contact support@nci-gdc.datacommons.io for support.

Reply written 14 months ago by Sean Davis (25k)

Try using chmod to change the permissions before executing the file.
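
The effect of `chmod u+x` can also be applied from a script. A small sketch using only the Python standard library; `make_executable` is a hypothetical helper and the path in the usage line is just an example:

```python
import os
import stat


def make_executable(path):
    """Add the owner-execute bit to a file, like `chmod u+x path`."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR)
```

After `make_executable("./gdc-client")`, the binary can be started from the shell as `./gdc-client` (with no space after `./`).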

Reply written 10 months ago by priyankamaripuri (40)

I'm using gdc-client v1.2.0. I specifically sort my manifest file by patient id so I may download tumor-normal pair BAMs one after another. But in reality, BAMs were downloaded in some random order, which is not from the top to bottom of my sorted manifest file. Do other people have the same problem? Is there a way to fix it?

Reply written 24 months ago by CHANG (40)

Hi, I have downloaded the GDC client tool to download files from GDC. As I understand it, each download folder should contain the data (or zipped data) and a logs folder. My files downloaded successfully, but I see only a few logs folders. For example, the manifest file for bladder urothelial carcinoma (BLCA) includes 433 UUIDs, but only 53 logs folders were found. Could you let me know whether the downloaded files are accurate?

Thank you.

Reply written 15 months ago by sahu.divya7860
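
One way to check a finished download without relying on the logs folders is to compare it against the manifest. A minimal sketch, assuming gdc-client stores each file as `<download_dir>/<UUID>/<filename>` and that the manifest's `md5` column holds the expected checksum; `check_downloads` and `md5_of` are illustrative names:

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(rows, download_dir):
    """Return {id: 'ok' | 'missing' | 'md5 mismatch'} for manifest rows (dicts)."""
    status = {}
    for row in rows:
        # Assumption: one sub-directory per UUID, holding the named file.
        path = os.path.join(download_dir, row["id"], row["filename"])
        if not os.path.isfile(path):
            status[row["id"]] = "missing"
        elif md5_of(path) != row["md5"]:
            status[row["id"]] = "md5 mismatch"
        else:
            status[row["id"]] = "ok"
    return status
```

`rows` can be any list of dicts with `id`, `filename` and `md5` keys, for example the manifest read with `csv.DictReader(handle, delimiter="\t")`.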

Does someone know how to fix this error?

 92% [##########################ERROR: Max retries exceeded.:02:27  16.23 MB/s 
ERROR: Max retries exceeded.
ERROR: Max retries exceeded.
ERROR: Max retries exceeded.
Reply written 3 months ago by Shixiang (30)

A network problem? If you tried several times and it still failed, I guess it is a network problem. Or check your hard-disk quota.

Reply written 3 months ago by Shicheng Guo (7.4k)

Written 15 months ago by Sean Davis (25k), National Institutes of Health, Bethesda, MD:

If you are looking for a flexible programmatic approach, you might take a look at the GenomicDataCommons Bioconductor package: https://bioconductor.org/packages/GenomicDataCommons

Find data

The following code builds a manifest that can be used to guide the download of raw data. Here, filtering finds gene expression files quantified as raw counts using HTSeq from ovarian cancer patients.

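library(GenomicDataCommons)
library(magrittr)  # provides the %>% pipe
# Query the files endpoint, keep the TCGA-OV gene expression files
# quantified as HTSeq raw counts, and build a manifest from the result.
ge_manifest = files() %>%
    filter( ~ cases.project.project_id == 'TCGA-OV' &
                type == 'gene_expression' &
                analysis.workflow_type == 'HTSeq - Counts') %>%
    manifest()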
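
For readers not using R, the same selection can be expressed as the JSON filter that the GDC API and portal use: an "and" of per-field clauses. A minimal sketch in Python; `build_filter` is a hypothetical helper, it writes every condition as an `in` clause (equivalent to `==` for a single value), and the field names are taken from the R query above.

```python
import json


def build_filter(conditions):
    """Turn {field: value or [values]} into a GDC-style 'and' of 'in' clauses."""
    clauses = []
    for field, value in conditions.items():
        values = value if isinstance(value, list) else [value]
        clauses.append({"op": "in", "content": {"field": field, "value": values}})
    return {"op": "and", "content": clauses}


ovarian_counts = build_filter({
    "cases.project.project_id": "TCGA-OV",
    "type": "gene_expression",
    "analysis.workflow_type": "HTSeq - Counts",
})
print(json.dumps(ovarian_counts, indent=2))
```

The printed JSON is what would go into the `filters` parameter of a request to the files endpoint.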

Download data

The next code block downloads the 379 gene expression files specified in the query above. Using multiple processes to do the download very significantly speeds up the transfer in many cases. On a standard 1Gb connection, the following completes in about 30 seconds.

destdir = tempdir()
fnames = lapply(ge_manifest$id,gdcdata,
                destination_dir=destdir,overwrite=TRUE,
                progress=FALSE)

If the download had included controlled-access data, the download above would have needed to include a token.

Written 15 months ago by Sean Davis (25k)

Sean, for recent requests of access to the data, it seems that users are forwarded here: https://dcc.icgc.org/

From there, approved users can obtain an access token but it seems to not cover all data. Most importantly, it doesn't cover the mirror where TCGA data is hosted (GDC Chicago). How does one actually obtain a GDC access token? A lot of the programs and services appear to have been shut.

Reply written 10 months ago by Kevin Blighe (37k)

The ICGC is not the right place to get access to TCGA controlled-access data, as you point out. To gain access to controlled-access TCGA data, one needs to apply through dbGaP. The process is documented here:

https://gdc.cancer.gov/access-data/obtaining-access-controlled-data

After approval for controlled-access data, you can log in to the GDC data portal to get your access token (the download link will be under your username after logging in).

Reply written 10 months ago by Sean Davis (25k)

Thanks Sean - that's what I expected. We are currently awaiting dbGaP approval.

Reply written 10 months ago by Kevin Blighe (37k), modified 11 days ago

Written 2.2 years ago by Chun-Jie Liu (260), US, Houston:

On CentOS, you need to download the gdc-client source code and compile it yourself.

An issue on the gdc-client GitHub repository covers this problem: glibc 2.12 is the latest version available for CentOS 6.

If your system is CentOS release 6.6, I think you should download the gdc-client source code and compile it yourself. gdc-client is based on Python 2.

  1. git clone https://github.com/NCI-GDC/gdc-client
  2. cd gdc-client
  3. python setup.py install

You may run into one of these problems:

The 'lxml==3.5.0b1' distribution was not found and is required by gdc-client

or

ImportError: /usr/lib64/libxml2.so.2: version `LIBXML2_2.9.0' not found (required by lxml/etree.so)

You need to install libxslt and libxml2 under your home path, and add xml2-config and xslt-config to your PATH:

export PATH="/prog_path/libxslt-1.1.29/bin:/prog_path/libxml2-2.9.4/bin:$PATH"

Then

  1. pip uninstall lxml
  2. pip install lxml==3.5.0b1 --install-option="--auto-rpath"

Finally, compile the gdc-client source code.

  1. python setup.py install

It worked.

Written 2.2 years ago by Chun-Jie Liu (260), modified 2.2 years ago

Written 2.6 years ago by Shicheng Guo (7.4k):

Take bladder cancer as an example:

1, Go to the following link (legacy archive at GDC):

https://gdc-portal.nci.nih.gov/legacy-archive/search/f?filters=%7B%22op%22:%22and%22,%22content%22:%5B%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22cases.project.program.name%22,%22value%22:%5B%22TCGA%22%5D%7D%7D,%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22cases.project.project_id%22,%22value%22:%5B%22TCGA-BLCA%22%5D%7D%7D,%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22files.platform%22,%22value%22:%5B%22Illumina%20Human%20Methylation%20450%22%5D%7D%7D,%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22files.data_format%22,%22value%22:%5B%22TXT%22%5D%7D%7D,%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22files.data_category%22,%22value%22:%5B%22DNA%20methylation%22%5D%7D%7D%5D%7D

2, Add all 440 files to the cart and download the manifest file

3, You will see that the first and second columns of the manifest file are the UUID and the file name (which here carries the sample ID)
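
The long link in step 1 is just a percent-encoded JSON filter, and decoding it shows exactly which project, platform, format and data category were selected. A minimal sketch using the Python standard library, assuming the filter sits in the URL's `filters` query parameter and is an "and" of clauses as in that link; the function names are illustrative.

```python
import json
from urllib.parse import parse_qs, urlsplit


def decode_portal_filters(url):
    """Return the `filters` query parameter of a GDC portal URL as a Python dict."""
    query = parse_qs(urlsplit(url).query)
    return json.loads(query["filters"][0])


def filter_summary(filters):
    """List (field, values) pairs from an 'and' filter made of per-field clauses."""
    return [
        (clause["content"]["field"], clause["content"]["value"])
        for clause in filters["content"]
    ]
```

Running `filter_summary(decode_portal_filters(url))` on the link above lists one (field, values) pair per selection, which makes it easy to adapt the query to another project.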


Written 2.6 years ago by Shicheng Guo (7.4k), modified 2.6 years ago

Hi Shicheng

Thanks for the detailed view. One more clarification: when I try to download the WGS (whole genome sequencing) data for, say, breast cancer (TCGA-BRCA) from the GDC legacy archive, the second column of the manifest file contains some IDs that are not sample IDs. What are those? E.g.

01aa8d222c93eac50081544889046aeb.bam 01e2ea9ed2554ea6df56ed963414b511.bam

etc. If these are also samples, how do I retrieve their corresponding TCGA IDs?

Thanks in advance.

Reply written 2.2 years ago by aanchalsharma833 (0)

@aanchalsharma833
GDC provides an API, and you can retrieve this information through it. I wrote a simple script, available on my GitHub, to map file_id to TCGA barcode (submitter_id in GDC). The TCGA barcode encodes sample information, so the script extracts both the sample type and the TCGA barcode.

The input is the manifest file you downloaded from GDC. The output is the mapped file, whose title is generated automatically by the GDC API. Hope it is useful.
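
To illustrate how sample information is read off a barcode (this is not the script mentioned above, just a sketch): the fourth dash-separated field of a TCGA barcode starts with a two-digit sample type code, where 01 to 09 are tumor samples, 10 to 19 normal samples and 20 to 29 controls. `parse_barcode` is a hypothetical helper and the barcodes in the usage note are made up for the example.

```python
def parse_barcode(barcode):
    """Split a TCGA barcode into patient ID, sample type code and a coarse group."""
    parts = barcode.split("-")
    if len(parts) < 4 or parts[0] != "TCGA":
        raise ValueError("not a sample-level TCGA barcode: %r" % barcode)
    code = parts[3][:2]  # e.g. '01' from '01A'
    number = int(code)
    if 1 <= number <= 9:
        group = "tumor"
    elif 10 <= number <= 19:
        group = "normal"
    elif 20 <= number <= 29:
        group = "control"
    else:
        group = "unknown"
    return {"patient": "-".join(parts[:3]), "sample_code": code, "group": group}
```

For example, `parse_barcode("TCGA-AG-A026-01A")` is classified as a tumor sample and `parse_barcode("TCGA-AG-A026-11A")` as a normal sample from the same patient.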

Reply written 2.2 years ago by Chun-Jie Liu (260), modified 2.2 years ago