Converting UUID downloaded from GDC to TCGA names
5.9 years ago

Hi!

I have downloaded breast cancer transcriptome profiling files (HTSeq - FPKM-UQ) from the GDC with the following Python code:

import requests
import json
import re
url = "https://api.gdc.cancer.gov/files"
filters = {
    "op": "and",
    "content":[
        {
        "op": "in",
        "content":{
            "field": "cases.primary_site",
            "value": ["Breast"]
            }
        },
        {
        "op": "in",
        "content":{
            "field": "files.analysis.workflow_type",
            "value": ["HTSeq - FPKM-UQ"]
            }
        },
        {
        "op": "in",
        "content":{
            "field": "files.data_category",
            "value": ["Transcriptome Profiling"]
            }
        }
    ]
}

params = {
    "filters" : json.dumps(filters), # takes an object (filters) and returns a JSON string
    "fields" : "file_id",
    "format" : "JSON",
    "size" : "2000"
    }

r = requests.get(url, params = params)
file_uuid_list = []
for file_entry in json.loads(r.content.decode("utf-8"))["data"]["hits"]:
    file_uuid_list.append(file_entry["file_id"])

url_data = "https://api.gdc.cancer.gov/data"

params = {"ids": file_uuid_list}

response = requests.post(url_data, data = json.dumps(params), headers = {"Content-Type": "application/json"})

response_head_cd = response.headers["Content-Disposition"]

file_name = re.findall("filename=(.+)", response_head_cd)[0]

with open(file_name, "wb") as output_file:
    output_file.write(response.content)

I can't work out how to find the TCGA names of the downloaded files. I tried changing the request parameters as follows:

params = {
    "filters" : json.dumps(filters), # takes an object (filters) and returns a JSON string
    "fields" : "file_id, cases.submitter_id, cases.case_id",
    "format" : "JSON",
    "size" : "2000"
    }

but it doesn't work, maybe because the URL is wrong (I get a connection error in requests.get()).
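For reference, here is a minimal sketch of one way to request the case submitter IDs (the TCGA-style names) together with each file UUID. It assumes that the files endpoint accepts a POST with a JSON body in which "fields" is a comma-separated string without spaces, and that each hit carries a "cases" list with a "submitter_id" entry. The helper names build_payload, hits_to_barcodes and fetch_barcodes are made up for illustration, and the standard library's urllib is used only so the sketch runs without extra packages; requests would work just as well.

```python
import json
import urllib.request

FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_payload(filters, fields, size=2000):
    """Build the JSON body for a POST to the files endpoint.

    Field names are joined with commas and stripped of stray spaces.
    """
    return {
        "filters": filters,
        "fields": ",".join(name.strip() for name in fields),
        "format": "JSON",
        "size": size,
    }


def hits_to_barcodes(hits):
    """Map each file UUID to the submitter IDs of its associated cases."""
    mapping = {}
    for hit in hits:
        cases = hit.get("cases", [])
        mapping[hit["file_id"]] = [
            case["submitter_id"] for case in cases if "submitter_id" in case
        ]
    return mapping


def fetch_barcodes(filters):
    """Query the files endpoint and return {file UUID: [case submitter IDs]}.

    Assumption: the response has the same data -> hits layout as in the
    original script.
    """
    payload = build_payload(
        filters, ["file_id", "cases.submitter_id", "cases.case_id"]
    )
    request = urllib.request.Request(
        FILES_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return hits_to_barcodes(body["data"]["hits"])
```

Calling fetch_barcodes(filters) with the filters dictionary from the script above should then give a dictionary that can be used to rename or annotate the downloaded files.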

gdc python • 3.7k views

Thanks, but I would like help in Python. The URL that you gave me is for R.


I guess you can either do the R->py algorithm conversion yourself, or branch out to R as an intermediate step. I don't think it's fair to expect a solution in your language of choice without good reason why an existing solution is not usable.

2.3 years ago
Laura ▴ 30
# Two inputs are needed:
# 1) gene expression files downloaded with the GDC Data Transfer Tool,
#    unzipped into the directory given by the path variable
# 2) the manifest file used for that download
# Assumptions: gene1, gene2 and "manifest.txt" are placeholders to replace.

import os

import pandas as pd
import requests

gene1 = "GENE_ID_1"  # placeholder: first gene ID of interest
gene2 = "GENE_ID_2"  # placeholder: second gene ID of interest
manifestdf = pd.read_csv("manifest.txt", sep="\t")  # placeholder manifest path

columnnames = ["File", str(gene1), str(gene2), "Dtype"]
df = pd.DataFrame(columns=columnnames)

# Build the input path portably instead of hard-coding a backslash
path = os.path.join(os.getcwd(), "input")

files_endpt = "https://api.gdc.cancer.gov/files/"
params = {"fields": "disease_type", "expand": "cases"}

# Map each file name in the manifest to its file UUID
file_ids = dict(zip(manifestdf["filename"], manifestdf["id"]))

for i, tempname in enumerate(os.listdir(path), start=1):
    tempdf = pd.read_csv(os.path.join(path, tempname), sep="\t", engine="python", header=None)

    # The manifest lists the compressed file name
    filename = tempname + ".gz"

    # Expression values of the two genes (column 0: gene ID, column 1: value)
    value1 = float(tempdf.loc[tempdf[0] == gene1, 1].iloc[0])
    value2 = float(tempdf.loc[tempdf[0] == gene2, 1].iloc[0])

    # Look up the case metadata for this file by its UUID
    response = requests.get(files_endpt + file_ids[filename], params=params)
    response.raise_for_status()
    casedata = response.json()["data"]["cases"][0]
    dtype = casedata["disease_type"]

    df.loc[len(df.index)] = [filename, value1, value2, dtype]
    print(i, filename, dtype)
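The loop above sends one request per file. As a hedged sketch of an alternative, the manifest can be turned into a file name to UUID lookup and all UUIDs placed in a single "in" filter, which could then be sent to the files endpoint in one query. The helper names manifest_lookup and file_id_filter are hypothetical, the manifest is assumed to be tab-separated with "id" and "filename" columns (as used above), and the field name "files.file_id" is an assumption that follows the "files." prefix used in the question's filters.

```python
import csv
import io


def manifest_lookup(manifest_text):
    """Parse tab-separated manifest text into {filename: file UUID}."""
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    return {row["filename"]: row["id"] for row in reader}


def file_id_filter(file_ids):
    """Build an "in" filter that matches any of the given file UUIDs.

    Assumption: "files.file_id" is the field name for the file UUID.
    """
    return {
        "op": "in",
        "content": {"field": "files.file_id", "value": list(file_ids)},
    }


# Usage with a tiny made-up manifest (the IDs and names are illustrative only)
example_manifest = "id\tfilename\tmd5\tsize\tstate\nuuid-1\ta.txt.gz\tx\t1\tvalidated\n"
lookup = manifest_lookup(example_manifest)
filters = file_id_filter(lookup.values())
```

The resulting filters dictionary has the same shape as the filter blocks in the question, so it can be passed as the "filters" value of a files query.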