Question

UniProt ID mapping API call

2

7.0 years ago

Bioaln ▴ 360

Hello. I've recently been trying to programmatically convert a batch of UniProt IDs to gene names. I found the UniProt API, which should do the job, with something along the lines of:

# Python 3: urllib2 was split into urllib.request and urllib.parse
import urllib.parse
import urllib.request

url = 'http://www.uniprot.org/uploadlists/'

params = {
    'from': 'ACC',
    'to': 'P_REFSEQ_AC',
    'format': 'tab',
    'query': 'P13368 P20806 Q9UM73 P97793 Q17192'
}

# POST bodies must be bytes in Python 3
data = urllib.parse.urlencode(params).encode('utf-8')
request = urllib.request.Request(url, data)
contact = ""  # Please set your email address here to help us debug in case of problems.
request.add_header('User-Agent', 'Python %s' % contact)
response = urllib.request.urlopen(request)
page = response.read(200000)

The problem is that this returns a whole web page. Is it possible to obtain only, e.g., JSON containing the list of mappings and the corresponding information (e.g. species as well)?

Thank you.

protein uniprot api python • 9.6k views

ADD COMMENT • link updated 7.0 years ago by Elisabeth Gasteiger ★ 2.4k • written 7.0 years ago by Bioaln ▴ 360

0

Have you looked at the flat files? E.g. http://www.uniprot.org/uniprot/Q9UM73.txt.
They are especially easy to parse. This doesn't plug right into your script there, but you could set it up in a loop.
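
A minimal sketch of what parsing one flat-file entry could look like, assuming the usual UniProtKB flat-file layout (a two-letter line code, with data starting at column 6). The `SAMPLE_ENTRY` text is a hand-written, abridged illustration rather than a copy of the real entry, and the function names are hypothetical. `fetch_entry` uses the URL pattern from this reply, which may have changed since.

```python
import re
import urllib.request

# Hand-written, abridged sample in flat-file layout (illustrative only).
SAMPLE_ENTRY = """\
AC   Q9UM73;
GN   Name=ALK;
OS   Homo sapiens (Human).
//
"""

def parse_flat_file(text):
    """Return accession, primary gene names and organism for one entry."""
    accession = None
    gene_names = []
    organism_parts = []
    for line in text.splitlines():
        code, value = line[:2], line[5:].strip()
        if code == "AC" and accession is None:
            # The first accession on the first AC line is the primary one.
            accession = value.split(";")[0].strip()
        elif code == "GN":
            # Capture the value of Name=...; stop at ';' or an evidence tag '{'.
            gene_names += re.findall(r"\bName=([^;{]+)", value)
        elif code == "OS":
            organism_parts.append(value)
    return {
        "accession": accession,
        "gene_names": [name.strip() for name in gene_names],
        "organism": " ".join(organism_parts).rstrip("."),
    }

def fetch_entry(accession):
    """Download one entry as text (URL pattern taken from this reply)."""
    url = "http://www.uniprot.org/uniprot/%s.txt" % accession
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

print(parse_flat_file(SAMPLE_ENTRY))
```

Looping `parse_flat_file(fetch_entry(acc))` over a list of accessions would give one record per ID, at the cost of one HTTP request per entry.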

ADD REPLY • link 7.0 years ago by Jake Warner ▴ 840

0

Yes, I am aware of the raw files. So you are saying the only way is to parse the whole of UniProt, instead of calling the API for a single entry? This doesn't seem right: the API seems to be capable of returning e.g. JSON, which should work. Your example does not work for e.g. gene IDs, does it?

ADD REPLY • link 7.0 years ago by Bioaln ▴ 360

1

7.0 years ago

Elisabeth Gasteiger ★ 2.4k

UniProt ID mapping documentation for programmatic access is available here: http://www.uniprot.org/help/api_idmapping

There is also a list of column names for programmatic access: http://www.uniprot.org/help/uniprotkb_column_names . In particular, for gene names, you can choose between the following:

Gene names (primary): genes(PREFERRED)
Gene names (synonym): genes(ALTERNATIVE)
Gene names (ordered locus): genes(OLN)
Gene names (ORF): genes(ORF)

ADD COMMENT • link 7.0 years ago by Elisabeth Gasteiger ★ 2.4k

0

Please refer to the accepted answer as to why this is not the optimal solution (my question actually uses the Python code from the proposed link).

ADD REPLY • link 7.0 years ago by Bioaln ▴ 360

0

Just wanted to complement my colleague "me"'s reply (also for future readers of this thread). Glad you found your solution!

ADD REPLY • link 7.0 years ago by Elisabeth Gasteiger ★ 2.4k

score 2 · Accepted Answer · 2017-11-16

2

7.0 years ago

me ▴ 760

You can use requests like

http://www.uniprot.org/uniprot/?query=accession:P13368&format=tab&columns=genes

to access only the gene names. However, the delimiting is a bit odd to parse in this case: some entries are linked to more than one gene, genes often have more than one name, and it's hard to figure out what's what in this output.
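
A rough sketch of building such a request URL for several accessions and reading the tab-separated reply. It assumes that accessions can be combined with `OR` in the query and that `id` is a valid column name for the accession; neither is stated in this thread. The `genes(PREFERRED)` column comes from the column-name list quoted in the other answer. The function names are hypothetical, and nothing here contacts the server.

```python
import csv
import io
from urllib.parse import urlencode

BASE_URL = "http://www.uniprot.org/uniprot/"  # endpoint as quoted in this answer

def build_query_url(accessions, columns=("id", "genes(PREFERRED)")):
    """Build a query URL asking for the given columns in tab format."""
    # Assumption: several accession: clauses can be joined with OR.
    query = " OR ".join("accession:%s" % acc for acc in accessions)
    params = {"query": query, "format": "tab", "columns": ",".join(columns)}
    return BASE_URL + "?" + urlencode(params)

def parse_tab(text):
    """Turn tab-separated output (header row first) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

print(build_query_url(["P13368", "P20806"]))
```

Asking only for the preferred name sidesteps part of the delimiting problem, but entries linked to more than one gene still need care when splitting the field.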

You can either parse this out of the different file formats or use our SPARQL endpoint to ask for the preferred gene names directly.

BASE <http://purl.uniprot.org/uniprot/> 
PREFIX skos:<http://www.w3.org/2004/02/skos/core#> 
PREFIX up:<http://purl.uniprot.org/core/> 
PREFIX taxon:<http://purl.uniprot.org/taxonomy/> 
SELECT ?protein ?preferredGeneName 
WHERE
{
    VALUES ?protein {<P13368> <P20806> <Q9UM73> <P97793> <Q17192>}
    ?protein a up:Protein ; 
             up:encodedBy/skos:prefLabel ?preferredGeneName .
}

You can use the download links for this to get the information back as JSON, XML or CSV as you wish, and by editing the UniProt accessions in the query you can retrieve all the entries you want.
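
For scripted use, a sketch along these lines might work. It assumes the endpoint lives at `https://sparql.uniprot.org/sparql` (not stated in this answer) and that it follows the standard SPARQL protocol, i.e. accepts a form-encoded `query` parameter and returns the W3C SPARQL JSON results format when asked via the `Accept` header. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SPARQL_ENDPOINT = "https://sparql.uniprot.org/sparql"  # assumed endpoint URL

QUERY_TEMPLATE = """\
BASE <http://purl.uniprot.org/uniprot/>
PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
PREFIX up:<http://purl.uniprot.org/core/>
SELECT ?protein ?preferredGeneName
WHERE
{
    VALUES ?protein {%s}
    ?protein a up:Protein ;
             up:encodedBy/skos:prefLabel ?preferredGeneName .
}
"""

def build_query(accessions):
    """Fill the VALUES clause with relative IRIs such as <P13368>."""
    return QUERY_TEMPLATE % " ".join("<%s>" % acc for acc in accessions)

def parse_bindings(result):
    """Map accession -> list of preferred gene names (SPARQL JSON results)."""
    mapping = {}
    for row in result["results"]["bindings"]:
        # The protein IRI ends with the accession.
        accession = row["protein"]["value"].rsplit("/", 1)[-1]
        mapping.setdefault(accession, []).append(row["preferredGeneName"]["value"])
    return mapping

def fetch_gene_names(accessions):
    """POST the query and parse the JSON reply (performs a network call)."""
    data = urlencode({"query": build_query(accessions)}).encode("utf-8")
    request = Request(SPARQL_ENDPOINT, data,
                      headers={"Accept": "application/sparql-results+json"})
    with urlopen(request) as response:
        return parse_bindings(json.load(response))
```

Calling `fetch_gene_names(["P13368", "Q9UM73"])` would then return a dictionary keyed by accession; a list is used per accession because an entry can be linked to more than one gene.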

ADD COMMENT • link 7.0 years ago by me ▴ 760

1

Thanks, this is a nifty workaround, yet I do not understand why they do not offer simple API calls for this. Time to build a conversion API webserver?

ADD REPLY • link 7.0 years ago by Bioaln ▴ 360

0

Well, "they" would in significant part be me ;) You can use the upload list facility on www.uniprot.org as well. Then you can use whatever columns you want. The real difficulty is actually with gene names and how they map to/from UniProt entries. The solutions to that are to ask for exactly what you want (i.e. SPARQL) or to parse out exactly what you want from the TXT/XML/RDF/JSON options.

ADD REPLY • link 7.0 years ago by me ▴ 760

0

Thanks for the explanation! Keep up the good work!

ADD REPLY • link 7.0 years ago by Bioaln ▴ 360

0

If you want the species/NCBI taxid, just add a line "?protein up:organism ?taxon ." at the end of the WHERE clause and "?taxon" to the SELECT line.
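
As a sketch, that edit can be applied to the query text programmatically. The helper names are hypothetical, and `taxid_from_iri` assumes the returned `?taxon` values are IRIs in the `http://purl.uniprot.org/taxonomy/` namespace, as the `taxon:` prefix in the query suggests.

```python
QUERY = """\
BASE <http://purl.uniprot.org/uniprot/>
PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
SELECT ?protein ?preferredGeneName
WHERE
{
    VALUES ?protein {<P13368> <P20806> <Q9UM73> <P97793> <Q17192>}
    ?protein a up:Protein ;
             up:encodedBy/skos:prefLabel ?preferredGeneName .
}
"""

def add_taxon(query):
    """Select ?taxon and bind it via up:organism at the end of the WHERE clause."""
    query = query.replace("SELECT ?protein ?preferredGeneName",
                          "SELECT ?protein ?preferredGeneName ?taxon", 1)
    # The last '}' closes the WHERE clause; insert the new pattern before it.
    head, brace, tail = query.rpartition("}")
    return head + "    ?protein up:organism ?taxon .\n" + brace + tail

def taxid_from_iri(iri):
    """Extract the numeric NCBI taxid from a taxonomy IRI."""
    return int(iri.rstrip("/").rsplit("/", 1)[-1])

print(add_taxon(QUERY))
```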

ADD REPLY • link 7.0 years ago by me ▴ 760