Question: GDC: Retrieving RNA-Seq data for Tumor vs. Matched normal tissue
6
2.0 years ago by
dsull120
dsull120 wrote:

TCGA has recently migrated to the Genomic Data Commons (GDC). Following this migration, many tools convenient for retrieving TCGA data, such as TCGA-Assembler, no longer apply. So, with the new GDC, I'd like to download RNA-Seq data (in bulk) for tumor samples as well as normal control samples. How might I accomplish this?

I know how to download data from the GDC, but I need to know whether a specific RNA-seq data file comes from a tumor and, if so, how to obtain the RNA-seq data for its "matched normal tissue". I'd like to do this in bulk, for, say, the HCC project.

I know there's a "legacy portal" for the old TCGA data on the GDC website, but I want to use the newest GDC portal.

Thanks in advance.

gdc tcga • 4.0k views
ADD COMMENTlink modified 2.0 years ago • written 2.0 years ago by dsull120
2

On the search page you can use the "Add cases filter" link to add sample_type as a filter, limit it to "blood derived normal" or "solid tissue normal", and then select transcriptome profiling on the files tab. However, there are far fewer normal samples than tumour samples (e.g. only ~100 breast cancer samples are from normals, as opposed to ~1000 from tumor), and none of these seem to have the raw sequencing data associated with them. Perhaps they haven't finished all the processing yet?
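For anyone who would rather script this than click through the portal, here is a rough sketch of the same selection written as a GDC API "filters" object. The field names (cases.project.project_id, cases.samples.sample_type, files.data_category), the project ID "TCGA-LIHC" for HCC, and the helper name build_sample_type_filter are my assumptions, not something confirmed in this thread, so check them against the GDC docs before relying on them.

```python
import json


def build_sample_type_filter(project_id, sample_types):
    """Build a nested GDC-style filter: project AND sample type AND data category.

    Field names are assumptions based on the portal's facet names.
    """
    def clause(field, values):
        # One "in" clause per field, mirroring the portal's facet filters.
        return {"op": "in", "content": {"field": field, "value": list(values)}}

    return {
        "op": "and",
        "content": [
            clause("cases.project.project_id", [project_id]),
            clause("cases.samples.sample_type", sample_types),
            clause("files.data_category", ["Transcriptome Profiling"]),
        ],
    }


# Example: normal-tissue transcriptome files for the (assumed) HCC project ID.
normal_filter = build_sample_type_filter("TCGA-LIHC", ["Solid Tissue Normal"])
print(json.dumps(normal_filter, indent=2))
```

The printed JSON is only the "filters" part of a request; it would still need to be sent to the files endpoint together with "fields", "format" and "size".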

ADD REPLYlink written 2.0 years ago by i.sudbery2.4k

Any solution? I'm also interested in retrieving the associated normal tissue.

ADD REPLYlink written 2.0 years ago by Nicolas Rosewick6.5k

The legacy portal contains the raw FASTQ files, which may be handy and may be what you are after.

ADD REPLYlink written 23 months ago by nwon20
7
2.0 years ago by
dsull120
dsull120 wrote:

Hi everyone,

After looking around, it seems that a few people are having similar problems. So I contacted the GDC, and they responded very quickly and helped me find a solution, which I'm posting here in the hope that it will help someone in the future :)

Here's how you do it:

1) Select all the files you want to download, and get the manifest file.

2) Use the GDC API (documented here: https://gdc-docs.nci.nih.gov/) as follows: use the "filters" parameter to get only the files whose files.file_id matches the UUIDs of the files you want to download (those UUIDs are the first column of the manifest file). Then set the "fields" parameter to "file_id,file_name,cases.submitter_id,cases.samples.sample_type"; this gets you the file name, the patient (i.e. cases.submitter_id), and the type of sample (tumor, normal, etc.).

Here's some quick code (which the person at the GDC help center was kind enough to provide):

{
    "filters":{
        "op":"in",
        "content":{
            "field":"files.file_id",
            "value":[
                "9c5e4668-5ed8-4d7e-972a-6e985b0030df",
                "629fb40f-e97d-4bb8-8f59-0a71256211b1"
            ]
        }
    },
    "format":"TSV",
    "fields":"file_id,file_name,cases.submitter_id,cases.samples.sample_type",
    "size":"100000"
}

Replace UUIDs with the UUIDs that you're interested in. With some basic shell and JSON, it should be pretty straightforward :)
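If you'd rather do the whole thing in Python than in shell, the two steps above can be sketched roughly as follows. This is a minimal sketch, not the GDC's own code: the endpoint URL https://api.gdc.cancer.gov/files, the sample type labels "Primary Tumor" and "Solid Tissue Normal", and the way I locate TSV columns by suffix (the API may name them something like cases.0.submitter_id) are all assumptions on my part, and every function name here is made up for illustration.

```python
import csv
import io
import json
import urllib.request
from collections import defaultdict

# Assumed endpoint; check the GDC API docs for the current URL.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def read_manifest_uuids(manifest_text):
    """Return the first column of a tab-separated manifest, skipping the header."""
    rows = list(csv.reader(io.StringIO(manifest_text), delimiter="\t"))
    return [row[0] for row in rows[1:] if row and row[0]]


def build_payload(uuids):
    """Build the same request body as the JSON shown above."""
    return {
        "filters": {
            "op": "in",
            "content": {"field": "files.file_id", "value": list(uuids)},
        },
        "format": "TSV",
        "fields": "file_id,file_name,cases.submitter_id,cases.samples.sample_type",
        "size": "100000",
    }


def query_gdc(payload):
    """POST the payload to the files endpoint and return the TSV response text."""
    request = urllib.request.Request(
        GDC_FILES_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def group_by_case(tsv_text):
    """Map case -> sample type -> list of file names.

    Columns are found by suffix because the exact TSV header names are an
    assumption here.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    columns = reader.fieldnames or []

    def find(suffix):
        return next(c for c in columns if c.endswith(suffix))

    case_col = find("submitter_id")
    type_col = find("sample_type")
    name_col = find("file_name")
    grouped = defaultdict(lambda: defaultdict(list))
    for row in reader:
        grouped[row[case_col]][row[type_col]].append(row[name_col])
    return grouped


def matched_pairs(grouped, tumor="Primary Tumor", normal="Solid Tissue Normal"):
    """Keep only cases that have both a tumor file and a normal file."""
    return {
        case: (types[tumor], types[normal])
        for case, types in grouped.items()
        if tumor in types and normal in types
    }
```

To use it, read your manifest file into a string, pass it to read_manifest_uuids, feed the result through build_payload and query_gdc, and then call matched_pairs(group_by_case(...)) on the returned TSV. Only query_gdc touches the network, so the other functions can be tried out on a small hand-made TSV first.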

ADD COMMENTlink written 2.0 years ago by dsull120