GDC: Retrieving RNA-Seq data for Tumor vs. Matched normal tissue
5.9 years ago
dsull ★ 3.1k

TCGA has recently migrated to the Genomic Data Commons (GDC). Following this migration, many tools convenient for retrieving TCGA data, such as TCGA-Assembler, no longer apply. So, with the new GDC, I'd like to download RNA-Seq data (in bulk) for tumor samples as well as normal control samples. How might I accomplish this?

I know how to download data from the GDC, but I need to know whether a specific RNA-Seq data file comes from a tumor and, if so, how to obtain the RNA-Seq data for its matched normal tissue. I'd like to do this in bulk for, say, the HCC project.

I know there's a "legacy portal" for the old TCGA data on the GDC website, but I want to use the newest GDC portal.

gdc tcga • 8.4k views

On the search page you can use the "Add cases filter" link to add sample_type as a filter, limit it to "blood derived normal" or "solid tissue normal", and then select transcriptome profiling on the files tab. However, there are far fewer normal samples than tumour samples (e.g. only ~100 breast cancer samples are from normals, as opposed to ~1000 from tumours), and none of these seem to have the raw sequencing data associated with them. Perhaps they haven't finished all the processing yet?
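
For anyone who would rather script this than click through the portal, the same selection can probably be expressed as a "filters" object for the GDC API. This is only a sketch: `normal_rnaseq_filter` is a made-up helper name, the field names `cases.project.project_id` and `files.data_category` and the exact value spellings are my assumptions about the GDC data model, and `TCGA-BRCA` is just an example project.

```python
import json


def normal_rnaseq_filter(project_id):
    """Build a GDC-style filter for transcriptome files from normal samples.

    Field names and value spellings are assumptions; check them against
    the GDC documentation before relying on them.
    """
    def is_in(field, values):
        # One "in" clause in the GDC filter syntax.
        return {"op": "in", "content": {"field": field, "value": values}}

    return {
        "op": "and",
        "content": [
            is_in("cases.project.project_id", [project_id]),
            is_in("cases.samples.sample_type",
                  ["Solid Tissue Normal", "Blood Derived Normal"]),
            is_in("files.data_category", ["Transcriptome Profiling"]),
        ],
    }


# Serialise the filter so it can be pasted into an API request body.
filter_json = json.dumps(normal_rnaseq_filter("TCGA-BRCA"), indent=2)
```

The resulting JSON would go in the "filters" parameter of a request to the files endpoint.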


Any solution? I'm also interested in retrieving the associated normal tissue data.


The legacy portal contains the raw FASTQ files, which may be handy and may be what you are after.

5.9 years ago
dsull ★ 3.1k

Hi everyone,

After looking around, it seems that a few people are having similar problems. So I contacted the GDC, and they responded very quickly and helped me find a solution, which I'm posting here in the hope that it will help someone in the future :)

Here's how you do it:

1) Select all the files you want to download, and get the manifest file

2) Use the GDC API (documented here: https://gdc-docs.nci.nih.gov/) as follows. Use the "filters" parameter to get only the files whose files.file_id matches the UUIDs of the files you want to download (those UUIDs are the first column of the manifest file). Then set the "fields" parameter to "file_id,file_name,cases.submitter_id,cases.samples.sample_type". This gets you the file name, the patient (i.e. cases.submitter_id), and the type of sample (tumor, normal, etc.) for each file.

Here's some quick code (which the person at the GDC help center was kind enough to provide):

    {
        "filters":{
            "op":"in",
            "content":{
                "field":"files.file_id",
                "value":[
                    "9c5e4668-5ed8-4d7e-972a-6e985b0030df",
                    "629fb40f-e97d-4bb8-8f59-0a71256211b1"
                ]
            }
        },
        "format":"TSV",
        "fields":"file_id,file_name,cases.submitter_id,cases.samples.sample_type",
        "size":"100000"
    }
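
If you'd rather not paste UUIDs by hand, here is a rough Python sketch that builds the same payload from a manifest and posts it. The helper names are my own, the endpoint `https://api.gdc.cancer.gov/files` is what I believe the files endpoint to be (check the docs), and I'm assuming the manifest is tab-separated with a header row whose first column is `id`.

```python
import csv
import io
import json
import urllib.request

# Assumed files endpoint; confirm against the GDC API documentation.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"
FIELDS = "file_id,file_name,cases.submitter_id,cases.samples.sample_type"


def uuids_from_manifest(manifest_text):
    """Return the first-column UUIDs of a tab-separated manifest."""
    rows = [r for r in csv.reader(io.StringIO(manifest_text), delimiter="\t") if r]
    # Assumption: a header row, if present, starts with the literal "id".
    if rows and rows[0][0].strip().lower() == "id":
        rows = rows[1:]
    return [r[0].strip() for r in rows]


def build_payload(uuids):
    """Build the request body shown above for a list of file UUIDs."""
    return {
        "filters": {
            "op": "in",
            "content": {"field": "files.file_id", "value": list(uuids)},
        },
        "format": "TSV",
        "fields": FIELDS,
        "size": "100000",
    }


def fetch_sample_table(uuids):
    """POST the payload and return the TSV response as text (needs network)."""
    request = urllib.request.Request(
        GDC_FILES_ENDPOINT,
        data=json.dumps(build_payload(uuids)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Something like `fetch_sample_table(uuids_from_manifest(open("manifest.txt").read()))` should then return the table, where `manifest.txt` is whatever you named the manifest you downloaded.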


Replace the example UUIDs with the UUIDs you're interested in. With some basic shell scripting and JSON handling, it should be pretty straightforward :)
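
Once you have the TSV back, pairing tumor and normal files per patient is a matter of grouping on the submitter ID. A minimal sketch, with some assumptions: the function names are mine, I locate columns by suffix because I'm not sure exactly how the API names the flattened columns (something like `cases.0.submitter_id`), any sample type containing "normal" is treated as normal and everything else as tumor, and only the first case/sample column of each file is considered.

```python
import csv
import io
from collections import defaultdict


def find_column(fieldnames, suffix):
    """Return the first column whose name ends with the given suffix."""
    for name in fieldnames or []:
        if name.endswith(suffix):
            return name
    raise KeyError(f"no column ending in {suffix!r}")


def group_by_patient(tsv_text):
    """Map each patient ID to its tumor and normal file names."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    case_col = find_column(reader.fieldnames, "submitter_id")
    type_col = find_column(reader.fieldnames, "sample_type")
    groups = defaultdict(lambda: {"tumor": [], "normal": []})
    for row in reader:
        # Assumption: "Solid Tissue Normal", "Blood Derived Normal", etc.
        # all contain the word "normal"; everything else counts as tumor.
        kind = "normal" if "normal" in row[type_col].lower() else "tumor"
        groups[row[case_col]][kind].append(row["file_name"])
    return dict(groups)


def matched_pairs(groups):
    """Keep only patients that have both tumor and normal files."""
    return {case: files for case, files in groups.items()
            if files["tumor"] and files["normal"]}
```

`matched_pairs(group_by_patient(tsv_text))` then gives you only the patients for whom a matched normal exists, which, judging by the comments above, will be a minority of cases.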