Question

How to retrieve metadata from the Genomic Data Commons (GDC) using the UUIDs in a manifest file

4.8 years ago

a.james ▴ 240

Hello All,

I have downloaded exon datasets (aligned BAM files). I now need the metadata for the same samples. How can I download or extract it, given the manifest information for all samples?

I have read through the GDC API documentation; however, it is not clear to me how to get the metadata as a TSV file.

I saw the following shell script using curl.

curl --request POST --header "Content-Type: application/json" --data @Payload.txt 'https://api.gdc.cancer.gov/files' > File_metadata.txt

However, I don't understand where to supply the list of UUIDs from the manifest file.

My questions:

I have the UUIDs in my manifest files for all the samples or datasets I have downloaded; now I need a metadata file in TSV format.
How can I download it? Is there a shell or Python script for this?

Any help/suggestions are appreciated !

tcga exon mutations alignment next-gen • 2.7k views

ADD COMMENT • link 4.8 years ago by a.james ▴ 240

What information in the metadata exactly are you looking for? I hope the following code gives you some ideas (make sure you have jq installed):

# Bash array with one UUID per element
UUIDs=(d853e541-f16a-4345-9f00-88e03c2dc0bc 74e522c6-0aad-4b9e-8d65-fe7b6da10046)

# Query the files endpoint once per UUID and print selected fields as one tab-separated row
for UUID in "${UUIDs[@]}"; do
  curl -s "https://api.gdc.cancer.gov/files/${UUID}" \
    | jq -r '.data | "\(.data_type)\t\(.file_name)\t\(.data_format)\t\(.data_category)\t\(.experimental_strategy)"'
done

Which returns:

Aligned Reads   0017ba4c33a07ba807b29140b0662cb1_gdc_realn.bam  BAM Sequencing Reads    WXS
Gene Expression Quantification  2d9744c1-0b8e-48e2-a4a5-0bbc7a637bbf.FPKM.txt.gz    TXT Transcriptome Profiling RNA-Seq

Other metadata fields are also available (they vary from UUID to UUID), for example:

{
  "data": {
    "data_release": "12.0 - 18.0",
    "data_type": "Aligned Reads",
    "updated_datetime": "2019-05-17T23:21:18.237724+00:00",
    "created_datetime": "2016-05-26T17:06:40.003624-05:00",
    "file_name": "0017ba4c33a07ba807b29140b0662cb1_gdc_realn.bam",
    "md5sum": "a08304b120c5df76b6532da0e9a35ced",
    "data_format": "BAM",
    "acl": [
      "phs000178"
    ],
    "access": "controlled",
    "platform": "Illumina",
    "state": "released",
    "version": "1",
    "file_id": "d853e541-f16a-4345-9f00-88e03c2dc0bc",
    "data_category": "Sequencing Reads",
    "file_size": 23650901931,
    "submitter_id": "c30188d7-be1a-4b43-9a17-e19ccd71792e",
    "type": "aligned_reads",
    "experimental_strategy": "WXS"
  },
  "warnings": {}
}
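
To get metadata for every UUID in a manifest with a single request, and as TSV rather than JSON, the UUIDs can go into the filters part of the JSON payload that is POSTed to the files endpoint (the role played by Payload.txt in the curl command from the question). Below is a minimal Python sketch, not a definitive implementation. It assumes the manifest is tab-separated with an `id` column holding the UUIDs, and that `files.file_id` is an accepted filter field. The function names and the selection in `EXAMPLE_FIELDS` are hypothetical and should be checked against the GDC API's list of available fields.

```python
import csv
import io
import json
import urllib.request

FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"

# Illustrative selection only (an assumption): verify the names against the GDC field list.
EXAMPLE_FIELDS = [
    "file_id",
    "file_name",
    "cases.submitter_id",
    "cases.project.project_id",
    "analysis.analysis_id",
]


def read_manifest_uuids(manifest_text):
    """Return the UUIDs from the 'id' column of a tab-separated manifest."""
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    return [row["id"] for row in reader if row.get("id")]


def build_payload(uuids, fields):
    """Build the JSON body: an 'in' filter on the file UUIDs, TSV output."""
    uuids = list(uuids)
    return {
        "filters": {
            "op": "in",
            "content": {"field": "files.file_id", "value": uuids},
        },
        "format": "TSV",
        "fields": ",".join(fields),
        # Ask for as many rows as there are UUIDs, since responses are paginated.
        "size": str(len(uuids)),
    }


def fetch_metadata_tsv(payload):
    """POST the payload to the files endpoint and return the response text.

    Requires network access; it is defined here but not called.
    """
    request = urllib.request.Request(
        FILES_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Offline demonstration with a two-row manifest (file names and checksums are placeholders).
demo_manifest = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "d853e541-f16a-4345-9f00-88e03c2dc0bc\tfirst.bam\tplaceholder\t1\treleased\n"
    "74e522c6-0aad-4b9e-8d65-fe7b6da10046\tsecond.txt.gz\tplaceholder\t1\treleased\n"
)
demo_payload = build_payload(read_manifest_uuids(demo_manifest), EXAMPLE_FIELDS)
print(json.dumps(demo_payload, indent=2))
```

To use it on real data, read your manifest file into a string, pass it to `read_manifest_uuids`, build the payload, and write the text returned by `fetch_metadata_tsv` to a file such as File_metadata.txt. Saving the printed JSON as Payload.txt and running the curl command from the question should be equivalent.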

ADD REPLY • link 4.8 years ago by AK ★ 2.2k

Thank you. I need the following information in the metadata file:

study   tcga_id     analysis_id  refassembly     mark_duplicates    exome_bed

I can also get these from the BAM header, but I was looking for a programmatic way.

ADD REPLY • link 4.8 years ago by a.james ▴ 240

This thread continued here: C: Sample names for TCGA data from GDC-legacy archive