Hey everyone. So I am trying to design a function in Python that uses information from UniProt about the features of a given protein. The features I am interested in accessing are regions, domains, and secondary structures.
I can access the API already and get the amino acid sequence of a protein of interest using simple code such as:
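import json
import urllib.request  # "import urllib" alone does not load the request submodule

UNIPROT_API_URL = "https://rest.uniprot.org/uniprotkb"
protein_name = "A0A075B716"  # accession of the protein of interest
url = '{}/{}.json'.format(UNIPROT_API_URL, protein_name)
# Fetch the entry and parse the JSON response into a dict
uniprot_results = json.load(urllib.request.urlopen(url))
print(uniprot_results['sequence']['value'])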
However, while this approach gets me some of the information I need, it does not get me everything. Besides the amino acid sequence of the protein, I also need the features of the protein (e.g. domains, regions, and secondary structures), as well as the start and end positions of those features within the amino acid sequence. My efforts to locate and retrieve this information from UniProt have so far been unsuccessful. I know this information is present on UniProt, as can be seen for this particular entry for A0A075B716 (https://www.uniprot.org/uniprotkb/P08708/entry#ptm_processing).
Furthermore, I wanted to distinguish between proteins using their taxonomy (e.g. only getting A0A075B716 from humans). I know this is possible with a URL such as UNIPROT_API_URL = "https://rest.uniprot.org/uniprotkb/search?query=(reviewed:true)%20AND%20(organism_id:9606)", but I am still having difficulty figuring out how to set up the URL, the query, and other relevant parameters. It seems like this information can be accessed through https://www.ebi.ac.uk/proteins/api/doc/#/, but I'm not sure how the API request should be set up, other than requestURL = "https://www.ebi.ac.uk/proteins/api/proteins?offset=0&size=100&accession=A0A075B716", which also gives me no information about the features that I need. Also, given what I have read here (https://groups.google.com/g/ebi-proteins-api/c/4Puf0txfeI8), it seems like the EBI API is deprecated and works directly from UniProt now, although how this is possible remains a mystery to me.
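For reference, here is a minimal sketch of pulling feature positions out of the same JSON entry the code above already downloads. It assumes the entry JSON carries a top-level "features" list whose items have "type", "description", and "location" (with "start" and "end" values), and that the type labels are "Domain", "Region", "Helix", "Beta strand", and "Turn"; the helper names and the example dict are made up for illustration and should be checked against a real response.

```python
import json
import urllib.request

UNIPROT_API_URL = "https://rest.uniprot.org/uniprotkb"
# Assumed feature type labels as they appear in the entry JSON.
WANTED_TYPES = {"Domain", "Region", "Helix", "Beta strand", "Turn"}


def fetch_entry(accession):
    """Download one UniProtKB entry as a dict (needs network access)."""
    url = "{}/{}.json".format(UNIPROT_API_URL, accession)
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def extract_features(entry, wanted_types=WANTED_TYPES):
    """Return (type, start, end, description) tuples for the wanted types."""
    rows = []
    # An entry without annotated features is treated as an empty list.
    for feature in entry.get("features", []):
        if feature.get("type") not in wanted_types:
            continue
        location = feature.get("location", {})
        start = location.get("start", {}).get("value")
        end = location.get("end", {}).get("value")
        rows.append((feature["type"], start, end, feature.get("description", "")))
    return rows


# Offline example: a hand-made dict shaped like the assumed JSON layout.
example_entry = {
    "sequence": {"value": "MGRVRTKTVKK"},
    "features": [
        {"type": "Region", "description": "Example region",
         "location": {"start": {"value": 2}, "end": {"value": 5}}},
        {"type": "Chain", "description": "Ignored type",
         "location": {"start": {"value": 1}, "end": {"value": 11}}},
    ],
}
print(extract_features(example_entry))
```

With network access, extract_features(fetch_entry("P08708")) would be the equivalent call on a live entry; an entry with no annotated features should simply give an empty list.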
I find myself really hitting a wall right now in terms of how I should approach this problem. The code I am basing my work on utilized a .tsv file that accumulated all their relevant UniProt annotations into a dataframe, which is basically the kind of dataframe I am also trying to generate, but per protein_id. If there is another way to access the UniProt API that I am not using right now, it would be great to find out.
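One way to get a similar per-protein table might be to ask the REST API for TSV output with selected columns. The sketch below is written under assumptions: the return-field names in FIELDS (ft_domain, ft_region, ft_helix, ft_strand, ft_turn, and so on) should be verified against UniProt's list of return fields, and the helper names are hypothetical.

```python
import csv
import io
import urllib.parse
import urllib.request

SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"
# Assumed return-field names; verify against UniProt's return-fields list.
FIELDS = ["accession", "organism_id", "sequence",
          "ft_domain", "ft_region", "ft_helix", "ft_strand", "ft_turn"]


def build_tsv_url(accession):
    """Build a search URL that asks for one accession as a TSV table."""
    params = {
        "query": "accession:{}".format(accession),
        "format": "tsv",
        "fields": ",".join(FIELDS),
    }
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)


def parse_tsv(text):
    """Turn TSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def fetch_annotation_rows(accession):
    """Download and parse the table for one accession (needs network access)."""
    with urllib.request.urlopen(build_tsv_url(accession)) as response:
        return parse_tsv(response.read().decode("utf-8"))


print(build_tsv_url("A0A075B716"))
```

The resulting list of dicts can be passed to pandas.DataFrame if a dataframe is needed; the csv module is used here only to keep the sketch dependency-free.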
UPDATE
Ok, so it seems like I was looking at feature information from this protein (https://www.uniprot.org/uniprotkb/P08708/entry), which is an isoform of the protein of interest (https://www.uniprot.org/uniprotkb/A0A075B716/entry). While the former has features (structural information), the latter does not. So I can see why this is a problem. That said, I am still curious as to why my source has feature information for the protein A0A075B716, despite it ostensibly not being present on UniProt.
You can obtain the proteome data for supported organisms from the UniProt FTP site. The .dat files in each proteome folder contain the data in UniProt format. You can parse out anything you need (features are located in FT lines). The UniProt flat file manual is available here: https://web.expasy.org/docs/userman.html
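As a rough illustration of parsing FT lines, the sketch below assumes the current flat-file layout, where a feature starts on a line like "FT   DOMAIN          3..12" and its qualifiers (such as /note="...") follow on indented FT lines. It only handles simple numeric locations and the first line of a note; the function name and the sample text are hypothetical.

```python
import re

# Matches simple locations such as "7", "3..12", or "<1..>50".
LOCATION_RE = re.compile(r"^[<>?]?(\d+)(?:\.\.[<>?]?(\d+))?$")
WANTED_KEYS = ("DOMAIN", "REGION", "HELIX", "STRAND", "TURN")


def parse_ft_lines(flat_text, wanted_keys=WANTED_KEYS):
    """Collect wanted features (key, start, end, note) from flat-file text."""
    features = []
    current = None  # feature that following qualifier lines belong to
    for line in flat_text.splitlines():
        if not line.startswith("FT   "):
            current = None
            continue
        if line[5:6].strip():
            # A non-blank character in column 6 starts a new feature.
            parts = line[5:].split()
            key = parts[0]
            location = parts[1] if len(parts) > 1 else ""
            match = LOCATION_RE.match(location)
            current = None
            if key in wanted_keys and match:
                start = int(match.group(1))
                end = int(match.group(2) or match.group(1))
                current = {"key": key, "start": start, "end": end, "note": ""}
                features.append(current)
        elif current is not None:
            # Qualifier line of the feature currently being read.
            qualifier = line[5:].strip()
            if qualifier.startswith('/note="'):
                current["note"] = qualifier[len('/note="'):].rstrip('"')
    return features


# Hand-made sample in the assumed layout, not a real entry.
sample = "\n".join([
    "ID   EXAMPLE_HUMAN",
    "FT   DOMAIN          3..12",
    'FT                   /note="Example domain"',
    "FT   MOD_RES         7",
    'FT                   /note="Skipped key"',
    "FT   HELIX           5..9",
    "SQ   SEQUENCE",
])
print(parse_ft_lines(sample))
```

For real files, a maintained parser such as Biopython's SwissProt module is probably a safer choice than a hand-rolled one like this.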
Thanks GenoMax, but I am afraid that, while the UniProt FTP site does indeed have the information I need, it is quite opaque as to how I can access the feature information there. Firstly, the site is organized into Eukaryota, Archaea, Bacteria, and Viruses. Then, it is organized into sections labeled UP000000226, UP000000227, etc., which, given what I've researched, are different proteomes. But these proteome IDs are different from taxonomic IDs (https://www.uniprot.org/help/proteome_id), and it seems like there can be multiple proteomes per species. So I'm not sure how I can access the protein from the right species, or even how I can access the protein's data within these proteomes using the API.
What do you have in hand that you are trying to search with? Organism names/taxID? This README file contains a complete list of reference proteomes available so you could simply parse the UP* accessions you need and then get the relevant files. UniProt support (tagging Elisabeth Gasteiger) stops by periodically and may have a different answer.
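Parsing the UP* accessions for one taxID could look roughly like the sketch below. It assumes that each proteome row in the README starts with the proteome ID followed by the taxonomy ID, separated by whitespace; that column layout is an assumption and should be checked against the actual file. The function name and the sample rows are hypothetical.

```python
import re

# Assumed row layout: "<proteome ID> <tax ID> <other columns...>".
ROW_RE = re.compile(r"^(UP\d+)\s+(\d+)\s")


def proteomes_for_taxid(readme_text, tax_id):
    """Return the proteome IDs whose row lists the given taxonomy ID."""
    wanted = str(tax_id)
    ids = []
    for line in readme_text.splitlines():
        match = ROW_RE.match(line)
        if match and match.group(2) == wanted:
            ids.append(match.group(1))
    return ids


# Hand-made sample in the assumed layout; the second row is a placeholder.
sample_readme = "\n".join([
    "Proteome_ID\tTax_ID\tOSCODE",
    "UP000005640\t9606\tHUMAN",
    "UP000000001\t12345\tXXXXX",
])
print(proteomes_for_taxid(sample_readme, 9606))
```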
Right now I am just trying to search for proteins from humans (tax ID: 9606). I have had some luck modifying my API URL so that it gets the right protein by accession number and species (for example, looking for the protein A0A075B716 in humans would require: https://rest.uniprot.org/uniprotkb/search?&query=organism_id:9606&accession:A0A075B716). However, this gives me 25 entries, which is much more than what I need. Perusing these entries shows that they have entirely different accession numbers than A0A075B716; they are completely different proteins from the one I requested. Interestingly, when I use the URL https://rest.uniprot.org/uniprotkb/search?&query=accession:A0A075B716&organism_id=9606, I get the right protein entry, although it has no feature information whatsoever.
I could use your recommended Uniprot FTP site, which has the human proteome of UP000005640, but all I want is the ability to make an API request that uses the protein's accession number and species of interest to retrieve the protein's sequence, some description information, and features. Downloading the entire proteome definitely seems like it would not be the most efficient approach, and as far as I can tell, I can't access the information in the proteome using the Uniprot API. Also, some of the proteins I have been trying to analyze (namely A0A075B716) are not even present in the UP000005640.dat file.
Where did you get this query syntax from? It seems that the use of "&" is not correct here, and causes the second clause to be ignored. This explains why the first query would return all human entries (the first 25 of them), while the second one returns A0A075B716 and ignores the organism constraint.
BTW as accession numbers are unique, it is not necessary to include an additional organism constraint.
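To illustrate the point about "&": the clauses belong inside the single query parameter, joined with AND, while "&" only separates URL parameters such as query and format. A small sketch that builds such a URL with the standard library follows; the helper name is made up, and the organism clause is included only to show the syntax.

```python
import urllib.parse

SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(clauses, result_format="json"):
    """Join (field, value) clauses with AND inside one 'query' parameter."""
    query = " AND ".join("({}:{})".format(field, value) for field, value in clauses)
    params = {"query": query, "format": result_format}
    # urlencode percent-encodes the query so it stays a single parameter.
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)


url = build_search_url([("accession", "A0A075B716"), ("organism_id", "9606")])
print(url)
```

Letting urlencode do the escaping avoids hand-written "%20" sequences and makes it harder to split the query by accident.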
In order to make sure that an API query returns what you expect, I would recommend that you start with an interactive query on the UniProt website, and once you are sure the results correspond to what you need, click on "Share", then on "Generate URL for API", select your format, click on "Generate URL for API" again and then submit. This will return the URL you can use in your program:
API URL using the streaming endpoint. This endpoint is resource-heavy but will return all requested results.
API URL using the search endpoint. This endpoint is lighter and returns chunks of 500 at a time and requires pagination.
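For the search endpoint, pagination can be sketched as follows, assuming each response carries a Link header of the form <URL>; rel="next" while more pages remain, and that the JSON body keeps its entries under a "results" key. The function names are hypothetical.

```python
import json
import re
import urllib.request

# Pulls the URL out of a header like: <https://...>; rel="next"
NEXT_LINK_RE = re.compile(r'<([^>]+)>;\s*rel="next"')


def next_page_url(link_header):
    """Return the next-page URL from a Link header, or None on the last page."""
    if not link_header:
        return None
    match = NEXT_LINK_RE.search(link_header)
    return match.group(1) if match else None


def iter_search_results(first_url):
    """Yield entries from every page of a search (needs network access)."""
    url = first_url
    while url:
        with urllib.request.urlopen(url) as response:
            payload = json.load(response)
            link_header = response.headers.get("Link")
        for entry in payload.get("results", []):
            yield entry
        url = next_page_url(link_header)
```

For a single accession one page is enough, so this only matters for broader queries such as all reviewed human entries.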