Question: GDC: Retrieving RNA-Seq data for Tumor vs. Matched normal tissue
4.6 years ago, dsull wrote:

TCGA has recently migrated to the Genomic Data Commons (GDC). Following this migration, many convenient tools for retrieving TCGA data, such as TCGA-Assembler, no longer work. So, with the new GDC, I'd like to download RNA-Seq data (in bulk) for tumor samples as well as normal control samples. How might I accomplish this?

I know how to download data from the GDC, but I need to know whether a specific RNA-seq data file comes from a tumor, and how to obtain its "matched normal tissue" RNA-seq data. I'd like to do this in bulk for, say, the HCC project.

I know there's a "legacy portal" for the old TCGA data on the GDC website, but I want to use the newest GDC portal.

Thanks in advance.


On the search page you can use the "Add cases filter" link to add sample_type as a filter, limit it to "blood derived normal" or "solid tissue normal", and then select transcriptome profiling on the files tab. However, there are far fewer normal samples than tumour samples (e.g. only ~100 breast cancer samples are from normals, as opposed to ~1000 from tumours), and none of these seem to have the raw sequencing data associated with them. Perhaps they haven't finished all the processing yet?

written 4.6 years ago by i.sudbery

Any solution? I'm also interested in retrieving the associated normal tissue.

written 4.6 years ago by Nicolas Rosewick

The legacy portal contains the raw FASTQ files, which may be handy and may be what you are after.

written 4.5 years ago by nwon40
4.5 years ago, dsull wrote:

Hi everyone,

After looking around, it seems that a few people are having similar problems. So I contacted the GDC, and they responded very quickly and helped me find a solution, which I'm posting here in the hope that it will help someone in the future :)

Here's how you do it:

1) Select all the files you want to download, and get the manifest file

2) Use the GDC API (see the GDC API documentation) as follows: Use the "filters" parameter to get only the files whose files.file_id matches the UUIDs of the files you want to download (those UUIDs are the first column of the manifest file). Set the "fields" parameter to "file_id,file_name,cases.submitter_id,cases.samples.sample_type". This gets you the file name, the patient (i.e. cases.submitter_id), and the type of sample (tumor, normal, etc.) for each file.
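
For the first part of step 2, pulling the UUIDs out of the manifest can be sketched like this. It is only a rough helper, not something from the GDC: the function name `uuids_from_manifest` is made up, and it assumes the manifest is tab-separated with the UUID in the first column and a header row whose first cell is "id".

```python
import csv


def uuids_from_manifest(lines):
    """Return the first column (the file UUIDs) of a GDC manifest.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Assumption: tab-separated columns, with an optional header row whose
    first cell is "id". Blank lines are skipped.
    """
    reader = csv.reader(lines, delimiter="\t")
    return [row[0] for row in reader if row and row[0] != "id"]


# Example use with a manifest saved from the portal (the file name is hypothetical):
#
#     with open("gdc_manifest.txt", newline="") as handle:
#         uuids = uuids_from_manifest(handle)
```

Taking an iterable of lines rather than a path keeps the helper easy to try on a few pasted lines before pointing it at a real manifest.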

Here's some quick code (which the person at the GDC help center was kind enough to provide):
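
The snippet itself is missing from this copy of the post. As a stand-in, here is a minimal Python sketch of the query described in step 2. It is not the code the help desk sent. The endpoint URL, the "in" filter operator, the "size" parameter, and the response layout (`data` -> `hits`) reflect my understanding of the public GDC API and should be checked against its documentation; the function names are made up.

```python
import json
import urllib.request

# Assumption: the public GDC files endpoint.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"

FIELDS = "file_id,file_name,cases.submitter_id,cases.samples.sample_type"


def build_payload(uuids):
    """Build the JSON body: filter on files.file_id, request the four fields."""
    uuids = list(uuids)
    return {
        "filters": {
            "op": "in",
            "content": {"field": "files.file_id", "value": uuids},
        },
        "format": "JSON",
        "fields": FIELDS,
        # Ask for one result per UUID so nothing is cut off by the default page size.
        "size": str(len(uuids)),
    }


def fetch_hits(uuids):
    """POST the query to the GDC and return the list of file records ("hits")."""
    request = urllib.request.Request(
        GDC_FILES_ENDPOINT,
        data=json.dumps(build_payload(uuids)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["data"]["hits"]


def flatten_hits(hits):
    """Turn nested hits into (file_id, file_name, patient, sample_type) tuples."""
    rows = []
    for hit in hits:
        for case in hit.get("cases", []):
            for sample in case.get("samples", []):
                rows.append(
                    (
                        hit["file_id"],
                        hit["file_name"],
                        case["submitter_id"],
                        sample["sample_type"],
                    )
                )
    return rows


# Example use (needs network access):
#
#     UUIDs = ["...", "..."]  # first column of your manifest
#     rows = flatten_hits(fetch_hits(UUIDs))
```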


Replace UUIDs with the UUIDs that you're interested in. With some basic shell and JSON, it should be pretty straightforward :)
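
To get from there to actual tumor/matched-normal pairs, you group the files by patient and keep the patients that have both kinds of sample. A rough sketch, assuming you have already turned the API response into (file_id, file_name, patient, sample_type) tuples. The function name is made up, and the rule that a sample type ending in "Normal" is a normal while one containing "Tumor" is a tumor is an assumption based on labels like "Solid Tissue Normal" and "Blood Derived Normal"; adjust it to the sample types you actually see.

```python
from collections import defaultdict


def matched_pairs(records):
    """Group files by patient and keep patients with both tumor and normal files.

    `records` is an iterable of (file_id, file_name, patient, sample_type).
    Assumption: sample types ending in "normal" are normals and those
    containing "tumor" are tumors; anything else is ignored.
    """
    by_patient = defaultdict(lambda: {"tumor": [], "normal": []})
    for _file_id, file_name, patient, sample_type in records:
        label = sample_type.lower()
        if label.endswith("normal"):
            by_patient[patient]["normal"].append(file_name)
        elif "tumor" in label:
            by_patient[patient]["tumor"].append(file_name)
    return {
        patient: groups
        for patient, groups in by_patient.items()
        if groups["tumor"] and groups["normal"]
    }


# Made-up records, only to show the shape of the input:
example = [
    ("id-1", "tumor_a.txt.gz", "PATIENT-A", "Primary Tumor"),
    ("id-2", "normal_a.txt.gz", "PATIENT-A", "Solid Tissue Normal"),
    ("id-3", "tumor_b.txt.gz", "PATIENT-B", "Primary Tumor"),
]
pairs = matched_pairs(example)  # only PATIENT-A has both a tumor and a normal
```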
