Question

Converting UUID downloaded from GDC to TCGA names

2

5.9 years ago

emiliano.traini ▴ 20

Hi!

I have downloaded breast cancer transcriptome profiling files from the GDC with the following Python code:

import requests
import json
import re
url = "https://api.gdc.cancer.gov/files"
filters = {
    "op": "and",
    "content":[
        {
        "op": "in",
        "content":{
            "field": "cases.primary_site",
            "value": ["Breast"]
            }
        },
        {
        "op": "in",
        "content":{
            "field": "files.analysis.workflow_type",
            "value": ["HTSeq - FPKM-UQ"]
            }
        },
        {
        "op": "in",
        "content":{
            "field": "files.data_category",
            "value": ["Transcriptome Profiling"]
            }
        }
    ]
}

params = {
    "filters" : json.dumps(filters), # serializes the filters object to a JSON string
    "fields" : "file_id",
    "format" : "JSON",
    "size" : "2000"
    }

r = requests.get(url, params = params)
file_uuid_list = []
for file_entry in json.loads(r.content.decode("utf-8"))["data"]["hits"]:
    file_uuid_list.append(file_entry["file_id"])

url_data = "https://api.gdc.cancer.gov/data"

params = {"ids": file_uuid_list}

response = requests.post(url_data, data = json.dumps(params), headers = {"Content-Type": "application/json"})

response_head_cd = response.headers["Content-Disposition"]

file_name = re.findall("filename=(.+)", response_head_cd)[0]

with open(file_name, "wb") as output_file:
    output_file.write(response.content)

I can't find the TCGA names of the downloaded files. I tried changing the request parameters to the following:

params = {
    "filters" : json.dumps(filters), # serializes the filters object to a JSON string
    "fields" : "file_id, cases.submitter_id, cases.case_id",
    "format" : "JSON",
    "size" : "2000"
    }

but it doesn't work, maybe because the URL is wrong (I get a connection error in requests.get()).
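
One way to get TCGA identifiers for each file UUID is to ask the `files` endpoint for case and sample submitter IDs alongside `file_id`. The following is a minimal sketch, assuming that the fields `cases.submitter_id` and `cases.samples.submitter_id` are available for these files and hold the TCGA barcodes; `build_params` and `uuid_to_barcodes` are hypothetical helper names, not part of any GDC library. The `fields` string is written without spaces after the commas; whether the spaces caused the error above is not confirmed here.

```python
import json


def build_params(filters, size=2000):
    """Build query parameters for the GDC /files endpoint."""
    return {
        "filters": json.dumps(filters),
        # Comma-separated with no spaces; field names are assumptions
        # based on the GDC data model, not taken from the question.
        "fields": "file_id,file_name,cases.submitter_id,cases.samples.submitter_id",
        "format": "JSON",
        "size": str(size),
    }


def uuid_to_barcodes(hits):
    """Map each file UUID to the case and sample submitter IDs in its hit."""
    mapping = {}
    for hit in hits:
        cases = hit.get("cases", [])
        mapping[hit["file_id"]] = {
            "file_name": hit.get("file_name"),
            "case_ids": [c.get("submitter_id") for c in cases],
            "sample_ids": [
                s.get("submitter_id")
                for c in cases
                for s in c.get("samples", [])
            ],
        }
    return mapping
```

With the request code from the question, this could be used as `r = requests.get(url, params=build_params(filters))` followed by `uuid_to_barcodes(r.json()["data"]["hits"])`.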

gdc python • 3.7k views

ADD COMMENT • link updated 11 months ago by Ram 43k • written 5.9 years ago by emiliano.traini ▴ 20

0

Try this: C: Sample names for TCGA data from GDC-legacy archive

ADD REPLY • link 5.9 years ago by Kevin Blighe 87k

0

Thanks, but I would like help in Python. The link you gave me is for R.

ADD REPLY • link updated 11 months ago by Ram 43k • written 5.9 years ago by emiliano.traini ▴ 20

0

I guess you can either do the R->py algorithm conversion yourself, or branch out to R as an intermediate step. I don't think it's fair to expect a solution in your language of choice without good reason why an existing solution is not usable.

ADD REPLY • link 5.9 years ago by Ram 43k

score 1 · Answer 1 · 2022-01-18

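# Two inputs are needed:
# 1) gene expression files downloaded with the GDC Data Transfer Tool, unzipped into the directory given by the path variable
# 2) the manifest file from the GDC Data Transfer Tool download, loaded beforehand into the DataFrame manifestdf (columns 'id' and 'filename')
# gene1 and gene2 must also be defined beforehand; they are matched against the first column of each expression file.

import json
import os
import pandas as pd
import requests

columnnames = ["File", str(gene1), str(gene2), "Dtype"]
df = pd.DataFrame(columns = columnnames)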

cwd = os.getcwd()
path = os.path.join(cwd, "input")  # portable, and avoids the invalid "\i" escape sequence

i = 1
for tempname in os.listdir(path):
    with open(os.path.join(path, tempname), 'r') as f: # open in readonly mode
        tempdf = pd.read_csv(f, sep="\t", engine='python', header=None)

        filename = tempname + ".gz"
        # take the value from the second column of the first matching row
        row1 = tempdf.loc[tempdf[0] == gene1]
        value1 = float(row1.iloc[0, 1])
        row2 = tempdf.loc[tempdf[0] == gene2]
        value2 = float(row2.iloc[0, 1])

        fields = ["disease_type"]
        params = {"fields": fields}
        files_endpt = 'https://api.gdc.cancer.gov/files/'

        filenamelist = list(manifestdf['filename'])
        fileIDindex = filenamelist.index(filename)
        fileIDlist = manifestdf['id']
        fileID = fileIDlist[fileIDindex]

        response = requests.get((files_endpt + fileID + '?expand=cases'), params = params)
        json_data = json.loads(response.text)
        tempdtype = json_data['data']

        casedatalist = tempdtype['cases']
        casedata = casedatalist[0]            
        dtype = casedata['disease_type']

        newrow = [filename, value1, value2, dtype]
        df.loc[len(df.index)] = newrow
        print(i, filename, dtype)
        i = i + 1
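
The loop above rebuilds the file name list and searches it on every iteration. A possible alternative, sketched here with only the standard library, builds a file name to UUID dictionary once. It assumes the manifest is tab-separated with `id` and `filename` columns (the two columns the answer reads from `manifestdf`); `manifest_lookup` is a hypothetical helper name.

```python
import csv
import io


def manifest_lookup(manifest_text):
    """Return a dict mapping each manifest file name to its file UUID."""
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    return {row["filename"]: row["id"] for row in reader}
```

Inside the loop, `fileID = lookup[filename]` would then replace the list search, where `lookup` is the result of calling `manifest_lookup` on the manifest file's contents.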