Question

Accessing UNIPROT using REST API

6.4 years ago

Natasha ▴ 40

Hello everyone, I would like to programmatically access the entries (UniProt ID, entry name, protein name, gene name, kinetics) for a given EC number and organism of interest, using Python.

import urllib,urllib2

url = 'http://www.uniprot.org/uploadlists/'

params = {
'from':'ACC',
'to':'P_REFSEQ_AC',
'format':'tab',
'query':'P13368 P20806 Q9UM73 P97793 Q17192'
}

data = urllib.urlencode(params)
request = urllib2.Request(url, data)
contact = "" # Please set your email address here to help us debug in case of problems.
request.add_header('User-Agent', 'Python %s' % contact)
response = urllib2.urlopen(request)
page = response.read(200000)

I had a look at the above Python code given here. However, I couldn't really understand how the code should be modified to download the search result (here) in XML format.

In my case the query is 'query':'3.1.3.9 2.7.1.2' and the format is "format":'xml'. How do we add the organism filter ("Organism":'Homo sapiens') to the code and download the XML file of the search result?

Many thanks,

Deepa

programmatic access python REST API UNIPROT • 9.7k views

ADD COMMENT • link updated 20 months ago by Wayne ★ 2.0k • written 6.4 years ago by Natasha ▴ 40

The UniProt IDmapping doesn't actually support EC numbers. For performance reasons, databases where the mapping relationship to UniProtKB identifiers is one-to-many, e.g. GO, InterPro or PubMed, are not supported. There is a note about this in the help page http://www.uniprot.org/help/uploadlists.

You can however build RESTful queries of the form

http://www.uniprot.org/uniprot/?query=(ec%3A+3.1.3.9+or+ec%3A2.7.1.2)+organism%3A9606&format=xml

You could also use the tab-delimited format:

http://www.uniprot.org/uniprot/?query=(ec%3A+3.1.3.9+or+ec%3A2.7.1.2)+organism%3A9606&format=tab&columns=id,entry_name,protein_names,genes,comment(KINETICS)

ADD REPLY • link 6.4 years ago by Elisabeth Gasteiger ★ 2.4k

This solution no longer works as noted by @roder.thomas.

Elisabeth Gasteiger - Is there an update that can be posted instead? Otherwise this answer should be moved to a comment for historical reference.

Note: A new answer has been added, so this originally accepted answer has been moved to a comment for reference. It is no longer valid.

ADD REPLY • link 20 months ago by GenoMax 141k

score 3 · Answer 1 · 2022-09-01

As of summer 2022, there is a Python package for querying UniProt's new REST API, called Unipressed, by Michael Milton (multimeric).

Announcement:

I've developed a #python package for querying UniProt's new REST API! Maybe the first to fully support the new format. Check it out at https://t.co/tUtK0XI7vv.
In particular I've tried hard to integrate with Python tooling, giving you great code completion:#bioinformatics https://t.co/eVHEKgV4F1 pic.twitter.com/HMVaKEjPvR
— Michael Milton (@multimeric) August 3, 2022

Unipressed Github repo.
Unipressed documentation.

Demonstration Code Using Unipressed (consistent with examples in earlier posts):

from unipressed import UniprotkbClient

for record in UniprotkbClient.search(
    query={
        "or_": [
        {"ec": "3.1.3.9"},
        {"ec": "2.7.1.2"},
        ],
        "and_": [
        {"organism_id": "9606"},
        ]
    },
    #fields=["length", "gene_names"]
).each_record():
    print(record)  # in a Jupyter notebook, display(record) also works

The Unipressed documentation, presently under 'Advantages', says it supports the json, tsv, list, and xml formats.

Here is an example that chooses the tsv format:

from unipressed import UniprotkbClient

for record in UniprotkbClient.search(
    query={
        "or_": [
        {"ec": "3.1.3.9"},
        {"ec": "2.7.1.2"},
        ],
        "and_": [
        {"organism_id": "9606"},
        ]
    },
    format="tsv",
    fields=["accession","gene_names", "length"]
).each_record():
    print(record)  # in a Jupyter notebook, display(record) also works

That results in:

{'Entry': 'Q9NQR9', 'Gene Names': 'G6PC2 IGRP', 'Length': '355'}
{'Entry': 'P35575', 'Gene Names': 'G6PC1 G6PC G6PT', 'Length': '357'}
{'Entry': 'Q9BUM1', 'Gene Names': 'G6PC3 UGRP', 'Length': '346'}
{'Entry': 'P35575-2', 'Gene Names': 'G6PC1 G6PC G6PT', 'Length': '176'}
{'Entry': 'Q9NQR9-2', 'Gene Names': 'G6PC2 IGRP', 'Length': '102'}
{'Entry': 'Q9NQR9-3', 'Gene Names': 'G6PC2 IGRP', 'Length': '154'}
{'Entry': 'A0A024R1U9', 'Gene Names': 'G6PC hCG_16953', 'Length': '359'}

(I went with a very simple form of the output there to show human readable results here. To actually save data as the TSV-formatted text, you can adapt the approach used at the end of Michael Milton's (multimeric) reply to this post below, as I do with the above example code here.)
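As one possible way to do that saving step, here is a minimal sketch that uses only the standard library's csv module. It assumes each record is a plain dict with the same keys, as in the output shown above. The helper name `records_to_tsv` is made up for illustration and is not part of Unipressed.

```python
import csv
import io

# Two sample records copied from the output shown above.
records = [
    {"Entry": "Q9NQR9", "Gene Names": "G6PC2 IGRP", "Length": "355"},
    {"Entry": "P35575", "Gene Names": "G6PC1 G6PC G6PT", "Length": "357"},
]


def records_to_tsv(records, handle):
    """Write a list of same-keyed dicts to `handle` as tab-separated text."""
    if not records:
        return
    writer = csv.DictWriter(
        handle,
        fieldnames=list(records[0].keys()),
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)


# Write to an in-memory buffer here; pass an open file handle to save to disk.
buffer = io.StringIO()
records_to_tsv(records, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

To write to disk instead, something like `with open("hits.tsv", "w", newline="") as fh: records_to_tsv(records, fh)` should work; the file name is only an example.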

This gives seven hits, as opposed to the four shown in the direct results at the site in the August 31, 2022 post by @roder.thomas. This is because the results of this query list isoforms alongside the primary accessions of the hits. So in addition to the four shown in the August 31, 2022 post by @roder.thomas:

Q9NQR9
P35575
Q9BUM1
A0A024R1U9

You also see listed:

P35575-2
Q9NQR9-2
Q9NQR9-3

Those isoforms are listed under the section 'Sequence & Isoforms' in the entry pages accessible from the screen in the August 31, 2022 post by @roder.thomas.

You can filter out those isoforms, to get the four seen in the direct access, by dropping any record with a dash in its accession, like so:

from unipressed import UniprotkbClient

collected=[]
for record in UniprotkbClient.search(
    query={
        "or_": [
        {"ec": "3.1.3.9"},
        {"ec": "2.7.1.2"},
        ],
        "and_": [
        {"organism_id": "9606"},
        ]
    },
    fields=["length", "gene_names"]
).each_record():
    collected.append(record)
collected = [x for x in collected if "-" not in x["primaryAccession"]]
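The same dash-based filter can be applied to the tsv-style records, where the accession sits under the 'Entry' key rather than 'primaryAccession'. The sketch below runs offline on the seven accessions listed above. The helper name `drop_isoforms` is hypothetical, and the filter rests on the assumption that only isoform accessions contain a dash.

```python
def drop_isoforms(records, key="Entry"):
    """Keep only records whose accession has no isoform suffix such as '-2'."""
    return [record for record in records if "-" not in record[key]]


# The seven accessions returned by the tsv query shown earlier.
accessions = [
    "Q9NQR9", "P35575", "Q9BUM1", "P35575-2",
    "Q9NQR9-2", "Q9NQR9-3", "A0A024R1U9",
]
records = [{"Entry": accession} for accession in accessions]

canonical = drop_isoforms(records)
print([record["Entry"] for record in canonical])
```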

XML Format Example:

The original post asked in particular about downloading the results in XML format, and Unipressed has that built in already. Here, some of the data stored in each XML record object is accessed and printed to show something human readable:

from unipressed import UniprotkbClient

for record in UniprotkbClient.search(
    query={
        "or_": [
        {"ec": "3.1.3.9"},
        {"ec": "2.7.1.2"},
        ],
        "and_": [
        {"organism_id": "9606"},
        ]
    },
    format="xml",
).each_record():
    #Show the XML object as a string by uncommenting the next two lines & deleting everything after those lines
    #from xml.etree import ElementTree # from https://stackoverflow.com/a/48671499/8508004
    #print(ElementTree.tostring(record, encoding='unicode'))
    #Below based on [Processing XML in Python — ElementTree:A Beginner’s Guide](https://towardsdatascience.com/processing-xml-in-python-elementtree-c8992941efd2)
    # slice `[28:]` added to remove `{http://uniprot.org/uniprot}` from the front of tags
    #[print(elem.tag[28:]) for elem in record.iter()]
    #[print(child.tag, child.attrib) for child in record]
    [print(elem.tag[28:], elem.attrib, elem.text) for elem in record.iter('{http://uniprot.org/uniprot}fullName')]
    [print(elem.tag[28:], elem.attrib, elem.text) for elem in record.iter('{http://uniprot.org/uniprot}ecNumber')]
    [print(elem.tag[28:], elem.attrib) for elem in record.iter('{http://uniprot.org/uniprot}proteinExistence')]
    print("*"*60)

Results in:

fullName {} Glucose-6-phosphatase 2
fullName {} Islet-specific glucose-6-phosphatase catalytic subunit-related protein
ecNumber {} 3.1.3.9
proteinExistence {'type': 'evidence at protein level'}
************************************************************
fullName {'evidence': '36'} Glucose-6-phosphatase catalytic subunit 1
fullName {} Glucose-6-phosphatase
fullName {} Glucose-6-phosphatase alpha
ecNumber {'evidence': '9 12 16 25'} 3.1.3.9
proteinExistence {'type': 'evidence at protein level'}
************************************************************
fullName {} Glucose-6-phosphatase 3
fullName {} Glucose-6-phosphatase beta
fullName {} Ubiquitous glucose-6-phosphatase catalytic subunit-related protein
ecNumber {} 3.1.3.9
proteinExistence {'type': 'evidence at protein level'}
************************************************************
fullName {'evidence': '5'} Isoform 2 of Glucose-6-phosphatase catalytic subunit 1
fullName {} Glucose-6-phosphatase
fullName {} Glucose-6-phosphatase alpha
ecNumber {'evidence': '1 2 3 4'} 3.1.3.9
proteinExistence {'type': 'evidence at protein level'}
************************************************************
fullName {} Isoform 2 of Glucose-6-phosphatase 2
fullName {} Islet-specific glucose-6-phosphatase catalytic subunit-related protein
ecNumber {} 3.1.3.9
proteinExistence {'type': 'evidence at protein level'}
************************************************************
fullName {} Isoform 3 of Glucose-6-phosphatase 2
fullName {} Islet-specific glucose-6-phosphatase catalytic subunit-related protein
ecNumber {} 3.1.3.9
proteinExistence {'type': 'evidence at protein level'}
************************************************************
fullName {'evidence': '4'} Glucose-6-phosphatase
ecNumber {'evidence': '4'} 3.1.3.9
proteinExistence {'type': 'inferred from homology'}
************************************************************
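As an alternative to slicing tags with `[28:]`, ElementTree accepts a namespace mapping, and the namespace can be stripped by splitting on the closing brace. The sketch below shows this on a hypothetical, heavily trimmed entry modelled on the output above; real records carry far more elements. The prefix `up` and the helper `local_name` are names chosen here for illustration.

```python
import xml.etree.ElementTree as ET

# Prefix "up" is arbitrary; the URI is the namespace seen in the tags above.
NS = {"up": "http://uniprot.org/uniprot"}

# Hypothetical minimal entry for illustration only, not a real download.
SAMPLE = """<entry xmlns="http://uniprot.org/uniprot">
  <protein>
    <recommendedName>
      <fullName>Glucose-6-phosphatase 2</fullName>
      <ecNumber>3.1.3.9</ecNumber>
    </recommendedName>
  </protein>
  <proteinExistence type="evidence at protein level"/>
</entry>"""


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


record = ET.fromstring(SAMPLE)
names = [elem.text for elem in record.findall(".//up:fullName", NS)]
ec_numbers = [elem.text for elem in record.findall(".//up:ecNumber", NS)]
existence = record.find("up:proteinExistence", NS).get("type")

print(local_name(record.tag), names, ec_numbers, existence)
```

Splitting on the brace avoids hard-coding the length of the namespace string, so it keeps working if the namespace URI ever changes.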

score 1 · Answer 2 · 2022-09-01

Unfortunately the REST API underwent considerable modifications along with the recent website redesign.

In the new query syntax, the query would be

((ec:3.1.3.9) OR (ec:2.7.1.2)) AND (organism_id:9606)

https://www.uniprot.org/uniprotkb?query=%28%28ec%3A3.1.3.9%29%20OR%20%28ec%3A2.7.1.2%29%29%20AND%20%28organism_id%3A9606%29
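The encoded URL above is just the query string percent-encoded, so it can be built in Python rather than by hand. This is a small sketch using only the standard library; it makes no network request.

```python
from urllib.parse import quote

# Query in the new UniProt syntax, exactly as written above.
query = "((ec:3.1.3.9) OR (ec:2.7.1.2)) AND (organism_id:9606)"

# safe="" makes quote() encode every reserved character, including ':' and '('.
website_url = "https://www.uniprot.org/uniprotkb?query=" + quote(query, safe="")
print(website_url)
```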

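If you want the results corresponding to this query in XML format, you can indeed use the "Generate URL for API" link mentioned in the August 31, 2022 post by @roder.thomas, which will show the following (the URLs below were generated with FASTA selected; set format=xml to get XML instead):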

API URL using the streaming endpoint

This endpoint is resource-heavy but will return all requested results.

https://rest.uniprot.org/uniprotkb/stream?compressed=true&format=fasta&query=%28%28%28ec%3A3.1.3.9%29%20OR%20%28ec%3A2.7.1.2%29%29%20AND%20%28organism_id%3A9606%29%29

API URL using the search endpoint

This endpoint is lighter and returns chunks of 500 at a time and requires pagination.

https://rest.uniprot.org/uniprotkb/search?compressed=true&format=fasta&query=%28%28%28ec%3A3.1.3.9%29%20OR%20%28ec%3A2.7.1.2%29%29%20AND%20%28organism_id%3A9606%29%29&size=500
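For scripting, the same URLs can be assembled in Python with the format switched to xml, which is what the original question asked for. The sketch below only builds URLs and parses a header, so it makes no network request. It assumes the search endpoint signals the next page through a `Link` response header marked `rel="next"`; check the current UniProt API documentation before relying on that. The helper names `build_url` and `next_page_url` are made up for illustration.

```python
import re
from urllib.parse import quote, urlencode

BASE = "https://rest.uniprot.org/uniprotkb"


def build_url(endpoint, query, fmt="xml", size=None):
    """Build a 'search' or 'stream' URL for the given query and format."""
    params = {"format": fmt, "query": query}
    if size is not None:
        params["size"] = size
    # quote_via=quote encodes spaces as %20, matching the generated URLs above.
    return f"{BASE}/{endpoint}?" + urlencode(params, quote_via=quote)


def next_page_url(link_header):
    """Return the URL marked rel="next" in a Link header, or None."""
    if not link_header:
        return None
    match = re.search(r'<([^>]+)>;\s*rel="next"', link_header)
    return match.group(1) if match else None


query = "((ec:3.1.3.9) OR (ec:2.7.1.2)) AND (organism_id:9606)"
search_url = build_url("search", query, fmt="xml", size=500)
stream_url = build_url("stream", query, fmt="xml")
print(search_url)
print(stream_url)
```

With a library such as requests, a paging loop would fetch `search_url`, process the body, then follow `next_page_url(response.headers.get("Link"))` until it returns None.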

score 0 · Answer 3 · 2022-08-31

20 months ago

roder.thomas ▴ 20

These pages no longer work, but UniProt has added an API query generator to the website!

how to generate API query
