UniProt SPARQL: retrieving proteins of a complete proteome (Escherichia coli K12)
3.7 years ago
amorgat ▴ 10

My question: how can I retrieve all entries of the Escherichia coli K-12 proteome using the UniProt SPARQL endpoint?

endpoint: http://sparql.uniprot.org/sparql

Context: I want to get the entries of the UniProt ECOLI (Escherichia coli K-12) complete proteome. I expect to find only UniProtKB/Swiss-Prot proteins (reviewed entries).

Here is what I did:

First, I retrieved the proteins with the keyword 'complete proteome' (keywords:181) and taxon:83333 (Escherichia coli K-12):

PREFIX keywords:<http://purl.uniprot.org/keywords/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>

SELECT ?reviewed (count(distinct ?protein) as ?proteinCount)
WHERE
{
    ?protein a up:Protein ;
             up:organism ?organism ;
             up:organism taxon:83333 ;
             up:classifiedWith|(up:classifiedWith/rdfs:subClassOf) ?kw ;
             up:reviewed ?reviewed .
    ?kw a up:Concept .
    VALUES (?kw) { (keywords:181) }
}
GROUP BY ?reviewed

result:

| reviewed | proteinCount |
| "true"^^xsd:boolean | "4313"^^xsd:int |
| "false"^^xsd:boolean | "2"^^xsd:int |

This result surprised me: 2 of the entries are TrEMBL entries (up:reviewed false).

In fact, an organism may have several proteomes. Well, with E. coli, I should have anticipated that... anyway! The definition of proteomes is well documented on the UniProt website (reference_proteome). And indeed, there are 2 proteomes for Escherichia coli K-12.

Instead of keywords:181 (complete proteome), I should have used keywords:1185 (reference proteome, KW-1185):

PREFIX keywords:<http://purl.uniprot.org/keywords/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>

SELECT ?reviewed (count(distinct ?protein) as ?proteinCount)
WHERE
{
    ?protein a up:Protein ;
             up:organism ?organism ;
             up:organism taxon:83333 ;
             up:classifiedWith|(up:classifiedWith/rdfs:subClassOf) ?kw ;
             up:reviewed ?reviewed .
    ?kw a up:Concept .
    VALUES (?kw) { (keywords:1185) }
}
GROUP BY ?reviewed

result:

| reviewed | proteinCount |
| "true"^^xsd:boolean | "4313"^^xsd:int |

bingo!

Let's display the proteome data for taxon:83333:

PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX proteome:<http://purl.uniprot.org/proteome/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>

SELECT ?proteome ?reviewed (count(distinct ?protein) as ?proteinCount)
WHERE
{
    ?protein a up:Protein ;
             up:organism ?organism ;
             up:organism taxon:83333 ;
             up:proteome ?proteome ;
             up:reviewed ?reviewed .
}
GROUP BY ?reviewed ?proteome

| proteome | reviewed | proteinCount |
| http://purl.uniprot.org/proteomes/UP000000318#Chromosome | "true"^^xsd:boolean | "4255"^^xsd:int |
| http://purl.uniprot.org/proteomes/UP000000318#Chromosome | "false"^^xsd:boolean | "2"^^xsd:int |
| http://purl.uniprot.org/proteomes/UP000000625#Chromosome | "true"^^xsd:boolean | "4313"^^xsd:int |
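
To make the reading of this breakdown explicit, here is a small Python sketch that tallies the three rows per proteome and picks out the proteome with no unreviewed entries. The rows are copied by hand from the result above; the names ROWS and summarize are hypothetical helpers, not part of any UniProt tooling.

```python
from collections import defaultdict

# Rows copied from the result above: (proteome IRI, reviewed flag, protein count)
ROWS = [
    ("http://purl.uniprot.org/proteomes/UP000000318#Chromosome", True, 4255),
    ("http://purl.uniprot.org/proteomes/UP000000318#Chromosome", False, 2),
    ("http://purl.uniprot.org/proteomes/UP000000625#Chromosome", True, 4313),
]


def summarize(rows):
    """Add up reviewed and unreviewed protein counts per proteome identifier."""
    summary = defaultdict(lambda: {"reviewed": 0, "unreviewed": 0})
    for iri, reviewed, count in rows:
        # Keep only the UP... identifier: drop the namespace and the '#fragment'
        proteome_id = iri.rsplit("/", 1)[1].split("#", 1)[0]
        summary[proteome_id]["reviewed" if reviewed else "unreviewed"] += count
    return dict(summary)


summary = summarize(ROWS)
fully_reviewed = [pid for pid, counts in summary.items() if counts["unreviewed"] == 0]
print(fully_reviewed)  # ['UP000000625']
```

The 2 unreviewed entries both sit in UP000000318, which is why UP000000625 is the proteome to query for reviewed entries only.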

And now, get the list of proteins for UP000000625#Chromosome

SPARQL query:

PREFIX proteome:<http://purl.uniprot.org/proteome/> 
PREFIX uniprotkb:<http://purl.uniprot.org/uniprot/> 
PREFIX up:<http://purl.uniprot.org/core/> 
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 

SELECT ?proteome ?protein
WHERE
{
    ?protein a up:Protein ;
                   up:reviewed true ;
                    up:proteome ?proteome .
    VALUES (?proteome) {(proteome:UP000000625#Chromosome)}
}

Unfortunately, I get an error message:

Encountered " "}" "} "" at line 13, column 1. Was expecting one of: ")" ... "true" ... "false" ... "UNDEF" ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

Replacing UP000000625#Chromosome with UP000000625 gives no results.

Question: does anyone know how to retrieve the proteins of a given UniProt proteome (UP000000625 in my case)?

3.7 years ago
me ▴ 740

There are two issues with this query. The first is that the PREFIX declaration is missing an s at the end of proteome, i.e.

PREFIX proteome:<http://purl.uniprot.org/proteome/>

should be

PREFIX proteome:<http://purl.uniprot.org/proteomes/>

The second issue is more frustrating: the '#' character used in the IRI fragment is also treated as the start of a comment in a SPARQL query, so everything after it on that line, including the closing brackets, is ignored.
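
To illustrate the comment problem, here is a rough Python sketch of how a '#' outside an <IRI> or a quoted string cuts off the rest of the line. It is a deliberately simplified stand-in for a real SPARQL tokenizer (it ignores escapes, multi-line strings, and '<' used as a comparison operator), and strip_sparql_comments is a hypothetical helper name.

```python
def strip_sparql_comments(query: str) -> str:
    """Drop '#' comments, leaving '#' inside <IRI> or quoted strings untouched.

    Simplified on purpose: state is reset on every line and escapes are ignored.
    """
    out_lines = []
    for line in query.splitlines():
        in_iri = False   # currently between '<' and '>'
        quote = None     # the quote character of an open string, if any
        kept = []
        for ch in line:
            if quote:
                if ch == quote:
                    quote = None
            elif in_iri:
                if ch == ">":
                    in_iri = False
            elif ch in ('"', "'"):
                quote = ch
            elif ch == "<":
                in_iri = True
            elif ch == "#":
                break    # the rest of the line is a comment
            kept.append(ch)
        out_lines.append("".join(kept).rstrip())
    return "\n".join(out_lines)


prefixed = "VALUES (?proteome) {(proteome:UP000000625#Chromosome)}"
full_iri = "VALUES (?proteome) {(<http://purl.uniprot.org/proteomes/UP000000625#Chromosome>)}"

print(strip_sparql_comments(prefixed))  # VALUES (?proteome) {(proteome:UP000000625
print(strip_sparql_comments(full_iri) == full_iri)  # True
```

With the prefixed name, the closing ")}" disappears into the comment. In the query from the question that leaves an unclosed "(" on line 12, which is consistent with the parser complaining about the "}" at line 13, column 1.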

This means the IRI has to be written out in full inside angle brackets, and the query becomes:

PREFIX uniprotkb:<http://purl.uniprot.org/uniprot/>
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>

SELECT ?proteome ?protein
WHERE
{
    ?protein a up:Protein ;
             up:reviewed true ;
             up:proteome ?proteome .
    VALUES (?proteome) {(<http://purl.uniprot.org/proteomes/UP000000625#Chromosome>)}
}

This version produces results.
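
For anyone who wants to run the corrected query from a script rather than the web form, here is a minimal Python sketch that only prepares the HTTP request. It assumes the endpoint from the question accepts the standard SPARQL protocol (a GET with a "query" parameter) and can return JSON results; neither is confirmed in this thread, and build_request is a hypothetical helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://sparql.uniprot.org/sparql"  # endpoint given in the question

QUERY = """
PREFIX up:<http://purl.uniprot.org/core/>

SELECT ?proteome ?protein
WHERE
{
    ?protein a up:Protein ;
             up:reviewed true ;
             up:proteome ?proteome .
    VALUES (?proteome) {(<http://purl.uniprot.org/proteomes/UP000000625#Chromosome>)}
}
"""


def build_request(query: str, endpoint: str = ENDPOINT) -> Request:
    """Prepare a SPARQL protocol GET request; nothing is sent yet."""
    # urlencode percent-encodes the query, so '#' travels as %23 instead of
    # being mistaken for a URL fragment.
    url = endpoint + "?" + urlencode({"query": query})
    # Assumption: the endpoint honours this Accept header.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(QUERY)

# To actually send it (requires network access):
# import json
# from urllib.request import urlopen
# with urlopen(request) as response:
#     rows = json.load(response)["results"]["bindings"]
```

Building the URL with urlencode rather than by string concatenation matters here for the same reason as in the query itself: an unencoded '#' would cut the URL short.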
