Is it possible to download cancer, normal, and matched tissue data from GDC? From same patient samples? How?
7.5 years ago
fr ▴ 210

I'm trying to download gene expression data from GDC. Ideally, I'd like to get normal, cancer, and matched data for the same patients.

How can I tell whether a given gene expression file comes from a tumor sample or from its matched normal? (I am aware that TCGA barcodes used to encode this information.)

Is it easy to do this in Matlab / Perl, or is the GDC Data Transfer Tool the best way?

Thanks in advance

7.5 years ago
Bill Wysocki ▴ 100

Hello,

The best way to match tumor and normal samples to a specific patient is through the GDC API. The API can be used to output a tab-delimited (TSV) file that associates file name, file id (UUID), sample type (tumor / normal), and patient ID (TCGA barcode).

The general query structure will look like this:

curl "https://gdc-api.nci.nih.gov/files?size=100000&format=tsv&filters=XXXX&fields=YYYY"


filters= :

The XXXX will be replaced with a URL-encoded JSON query that filters the list of files. You can write the JSON query yourself and then URL-encode it, or you can build the filter on the GDC Data Portal (https://gdc-portal.nci.nih.gov/search/s?facetTab=files) and copy the string that appears where the asterisks are shown in the following URL:

https://gdc-portal.nci.nih.gov/search/f?filters=**********&facetTab=cases

(Note: sometimes the URL-encoded query goes all the way to the end of the URL, so copy up until the & or the end)
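If you would rather write the filter yourself, here is a minimal Python sketch of the "generate the JSON, then URL-encode it" route. The nested "op" / "content" layout follows the filter format described in the API Users Guide. The field names (cases.project.project_id, data_type), the values ("TCGA-BRCA", "Gene Expression Quantification"), and the helper names build_filter and encode_filter are my own illustrative assumptions, so check them against the field list in the docs before relying on them.

```python
import json
from urllib.parse import quote, unquote


def build_filter(project_id, data_type="Gene Expression Quantification"):
    """Return a filter dict: files from one project AND of one data type.

    Field names and values here are assumptions for illustration.
    """
    return {
        "op": "and",
        "content": [
            {"op": "in",
             "content": {"field": "cases.project.project_id",
                         "value": [project_id]}},
            {"op": "in",
             "content": {"field": "data_type",
                         "value": [data_type]}},
        ],
    }


def encode_filter(filter_dict):
    """Serialize to compact JSON and percent-encode it for filters=XXXX."""
    compact = json.dumps(filter_dict, separators=(",", ":"))
    return quote(compact, safe="")


encoded = encode_filter(build_filter("TCGA-BRCA"))
print(encoded)

# Decoding gives back the original filter, which is a handy sanity check.
assert json.loads(unquote(encoded)) == build_filter("TCGA-BRCA")
```

Calling build_filter with a different project ID each time gives one filter per cancer type, and the printed string is what replaces XXXX in the query.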


fields= :

The YYYY will be replaced with a comma-separated list of fields that will be generated as columns in your TSV. To get the aforementioned fields you'll need:

file_name : the name of the file

file_id : the UUID of the file

cases.samples.sample_type : the tumor/ normal status of the sample

cases.submitter_id : the patient TCGA barcode

So it will be organized like this: file_name,file_id,cases.samples.sample_type,cases.submitter_id
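Putting the pieces together, the full query URL can be assembled along these lines. This is only a sketch: it uses the host quoted earlier in this post (which may have changed since, so use whichever host the current GDC documentation lists), the example filter is an assumption, and build_query_url is a hypothetical helper name. Note that urlencode does the URL encoding itself, so the filter is passed in as plain JSON here.

```python
import json
from urllib.parse import urlencode

# Host as written in this post; it may differ today.
API_FILES_ENDPOINT = "https://gdc-api.nci.nih.gov/files"

# The four columns described above.
FIELDS = [
    "file_name",
    "file_id",
    "cases.samples.sample_type",
    "cases.submitter_id",
]


def build_query_url(filter_dict, size=100000):
    """Assemble the files query; urlencode percent-encodes every value."""
    params = {
        "size": size,
        "format": "tsv",
        "filters": json.dumps(filter_dict, separators=(",", ":")),
        "fields": ",".join(FIELDS),
    }
    return API_FILES_ENDPOINT + "?" + urlencode(params)


# Illustrative filter only: restrict to one (assumed) project ID.
example_filter = {
    "op": "in",
    "content": {"field": "cases.project.project_id", "value": ["TCGA-BRCA"]},
}

url = build_query_url(example_filter)
print(url)
```

The printed URL can be passed to curl in quotes, as in the general query structure above.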

More fields can be added to this and are listed here: https://gdc-docs.nci.nih.gov/API/Users_Guide/Appendix_A_Available_Fields/#file-fields

For more information about the API, see the API Users Guide here: https://gdc-docs.nci.nih.gov/API/Users_Guide/Getting_Started/


Data Transfer Tool

Once you get the TSV file containing the files that you are looking for and their metadata, you can download them using the GDC Data Transfer Tool (DTT). The UUID column in the TSV can be used as a manifest by extracting it into a plain text file with the header "id" at the top. Then run the DTT with the command "gdc-client download -m <manifest_file_name>".
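As a rough sketch of that extraction step, the Python below keeps only patients that have both a tumor and a normal row and writes their UUIDs under an "id" header. The column names are passed in as arguments because the exact headers in the API's TSV output are not shown in this post, and the "Tumor" / "Normal" substring test on sample type is an assumption about how those values are spelled. The sample rows, UUIDs, and barcodes are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def matched_file_ids(tsv_text, case_col, type_col, id_col="file_id"):
    """Return file UUIDs for patients with both tumor and normal rows."""
    by_patient = defaultdict(list)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        by_patient[row[case_col]].append((row[type_col], row[id_col]))

    ids = []
    for entries in by_patient.values():
        types = {sample_type for sample_type, _ in entries}
        has_tumor = any("Tumor" in t for t in types)    # assumed spelling
        has_normal = any("Normal" in t for t in types)  # assumed spelling
        if has_tumor and has_normal:
            ids.extend(file_id for _, file_id in entries)
    return ids


def manifest_text(file_ids):
    """Plain-text manifest: an 'id' header followed by one UUID per line."""
    return "id\n" + "\n".join(file_ids) + "\n"


# Made-up rows; real column headers may differ from these.
sample_tsv = (
    "file_name\tfile_id\tsample_type\tcase\n"
    "a.txt\tuuid-1\tPrimary Tumor\tTCGA-AA-0001\n"
    "b.txt\tuuid-2\tSolid Tissue Normal\tTCGA-AA-0001\n"
    "c.txt\tuuid-3\tPrimary Tumor\tTCGA-AA-0002\n"
)

ids = matched_file_ids(sample_tsv, case_col="case", type_col="sample_type")
manifest = manifest_text(ids)
print(manifest)
```

Saving the printed text to a file gives the manifest to pass to gdc-client download -m. The second patient is dropped because it has no normal sample.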

The DTT can be downloaded here: https://gdc.cancer.gov/access-data/gdc-data-transfer-tool

and you can read about its functionalities here: https://gdc-docs.nci.nih.gov/Data_Transfer_Tool/Users_Guide/Getting_Started/


I hope this helps. Please follow up on this thread if you (or anyone else) have any questions about using the GDC API or the DTT.

Best,

-Bill - GDC User Services Team

Thank you so much. I was already able to generate the sort of metadata file that I needed and to download many files of interest. I would like to ask one more thing, though: could you say a bit more about how to generate the filters as JSON? For instance, if I want to retrieve gene expression data (FPKMs) for different cancer types, one by one, how would I do it?

I am aware that one way to do it is to select all these cancer types from the GDC Data Portal, and then batch download them the same way as described.

Thanks again for your previous input and thanks in advance!
