Question: UniProt SPARQL: retrieving proteins of a complete proteome (Escherichia coli K12)
Asked 18 months ago by amorgat (Switzerland):

My question: how can I retrieve all entries of the Escherichia coli K-12 proteome using the UniProt SPARQL endpoint?

endpoint: http://sparql.uniprot.org/sparql

Context: I want to get the entries of the UniProt ECOLI (Escherichia coli K-12) complete proteome. I expect to find only UniProtKB/Swiss-Prot proteins (reviewed entries).

Here is what I did.

First, I counted the proteins annotated with the keyword 'complete proteome' (keywords:181) for taxon:83333 (Escherichia coli K-12), grouped by review status:

PREFIX keywords:<http://purl.uniprot.org/keywords/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>

SELECT ?reviewed (count(distinct ?protein) as ?proteinCount)
WHERE
{
    ?protein a up:Protein ;
             up:organism taxon:83333 ;
             up:classifiedWith|(up:classifiedWith/rdfs:subClassOf) ?kw ;
             up:reviewed ?reviewed .
    ?kw a up:Concept .
    VALUES (?kw) { (keywords:181) }
}
GROUP BY ?reviewed

result:

| reviewed | proteinCount |
| "true"^^xsd:boolean | "4313"^^xsd:int |
| "false"^^xsd:boolean | "2"^^xsd:int |

This result is unexpected: it includes 2 UniProtKB/TrEMBL entries (up:reviewed false).

In fact, an organism may have several proteomes. Well, with E. coli, I should have anticipated that! Proteomes are well documented on the UniProt website (reference_proteome), and there are indeed 2 proteomes for Escherichia coli K-12.

Instead of keywords:181 (complete proteome), I should have used keywords:1185 (reference proteome) (KW-1185):

PREFIX keywords:<http://purl.uniprot.org/keywords/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>

SELECT ?reviewed (count(distinct ?protein) as ?proteinCount)
WHERE
{
    ?protein a up:Protein ;
             up:organism taxon:83333 ;
             up:classifiedWith|(up:classifiedWith/rdfs:subClassOf) ?kw ;
             up:reviewed ?reviewed .
    ?kw a up:Concept .
    VALUES (?kw) { (keywords:1185) }
}
GROUP BY ?reviewed

result:

| reviewed | proteinCount |
| "true"^^xsd:boolean | "4313"^^xsd:int |

bingo!

Let's display the proteome data for taxon:83333, counting proteins per proteome and review status:

PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX up:<http://purl.uniprot.org/core/>

SELECT ?proteome ?reviewed (count(distinct ?protein) as ?proteinCount)
WHERE
{
    ?protein a up:Protein ;
             up:organism taxon:83333 ;
             up:proteome ?proteome ;
             up:reviewed ?reviewed .
}
GROUP BY ?reviewed ?proteome

result:

| proteome | reviewed | proteinCount |
| http://purl.uniprot.org/proteomes/UP000000318#Chromosome | "true"^^xsd:boolean | "4255"^^xsd:int |
| http://purl.uniprot.org/proteomes/UP000000318#Chromosome | "false"^^xsd:boolean | "2"^^xsd:int |
| http://purl.uniprot.org/proteomes/UP000000625#Chromosome | "true"^^xsd:boolean | "4313"^^xsd:int |
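
The proteome column holds full IRIs in which the part after # appears to name a component of the proteome (here, Chromosome). As a minimal sketch using only the Python standard library, the IRIs from the result above can be split into a proteome identifier and a component name. The function name split_proteome_iri is hypothetical and not part of any UniProt tooling:

```python
from urllib.parse import urldefrag


def split_proteome_iri(iri):
    """Split a proteome IRI into (proteome identifier, component name)."""
    base, fragment = urldefrag(iri)
    # The identifier is the last path segment of the IRI without its fragment.
    proteome_id = base.rstrip("/").rsplit("/", 1)[-1]
    return proteome_id, fragment


# IRIs copied from the result listed above.
rows = [
    "http://purl.uniprot.org/proteomes/UP000000318#Chromosome",
    "http://purl.uniprot.org/proteomes/UP000000625#Chromosome",
]
for iri in rows:
    print(split_proteome_iri(iri))
```

An IRI without a fragment yields an empty component name, so the same helper works for either form.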

Now, let's get the list of proteins for UP000000625#Chromosome.

SPARQL query:

PREFIX proteome:<http://purl.uniprot.org/proteome/> 
PREFIX uniprotkb:<http://purl.uniprot.org/uniprot/> 
PREFIX up:<http://purl.uniprot.org/core/> 
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 

SELECT ?proteome ?protein
WHERE
{
    ?protein a up:Protein ;
                   up:reviewed true ;
                    up:proteome ?proteome .
    VALUES (?proteome) {(proteome:UP000000625#Chromosome)}
}

Unfortunately, I get an error message:

Encountered " "}" "} "" at line 13, column 1. Was expecting one of: ")" ... "true" ... "false" ... "UNDEF" ...

Replacing UP000000625#Chromosome with UP000000625 gives no results.

Question: does anyone know how to retrieve the proteins of a given UniProt proteome (UP000000625 in my case)?

Answer by me (Switzerland), written 18 months ago:

There are two issues with this query. The first is that the PREFIX declaration is missing an s at the end of proteome, i.e.

PREFIX proteome:<http://purl.uniprot.org/proteome/>

should be

PREFIX proteome:<http://purl.uniprot.org/proteomes/>

The second is more frustrating: the '#' character that introduces the IRI fragment is also read as the start of a comment in a SPARQL query, so everything after it on that line, including the closing ')}', is ignored.
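
To illustrate the comment problem, here is a simplified Python sketch of how a '#' outside angle brackets cuts a query line short. It is not a real SPARQL tokenizer (it ignores escapes, single-quoted strings, and '<' used as an operator), and strip_sparql_comments is a hypothetical helper written for this example only:

```python
def strip_sparql_comments(line):
    """Drop a '#' comment from one query line (simplified sketch).

    A '#' inside <...> or inside a double-quoted string is kept;
    anywhere else it starts a comment that runs to the end of the line.
    """
    in_iri = False
    in_string = False
    for i, ch in enumerate(line):
        if in_string:
            if ch == '"':
                in_string = False
        elif in_iri:
            if ch == ">":
                in_iri = False
        elif ch == '"':
            in_string = True
        elif ch == "<":
            in_iri = True
        elif ch == "#":
            # Everything from here to the end of the line is a comment.
            return line[:i].rstrip()
    return line


prefixed = "VALUES (?proteome) {(proteome:UP000000625#Chromosome)}"
full_iri = "VALUES (?proteome) {(<http://purl.uniprot.org/proteomes/UP000000625#Chromosome>)}"

print(strip_sparql_comments(prefixed))  # the closing ")}" is lost
print(strip_sparql_comments(full_iri))  # the line is left intact
```

With the prefixed form the parser never sees the closing ')' of the VALUES row, which is consistent with the error reported at the '}' on the following line.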

This means the proteome IRI has to be written out in full, between angle brackets, instead of as a prefixed name:

PREFIX uniprotkb:<http://purl.uniprot.org/uniprot/>
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>

SELECT ?proteome ?protein
WHERE
{
    ?protein a up:Protein ;
             up:reviewed true ;
             up:proteome ?proteome .
    VALUES (?proteome) {(<http://purl.uniprot.org/proteomes/UP000000625#Chromosome>)}
}

This query runs without error and returns the proteins of the proteome.
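
If the query is generated by a script, the same rule applies: wrap each proteome IRI in angle brackets rather than relying on a prefix. The following is a minimal Python sketch under stated assumptions. The function names proteome_query and request_url are hypothetical, the endpoint address is the one quoted in the question and is not checked here, and the URL follows the standard SPARQL protocol convention of passing the query in a 'query' GET parameter. No request is actually sent.

```python
from urllib.parse import urlencode

ENDPOINT = "http://sparql.uniprot.org/sparql"  # endpoint quoted in the question


def proteome_query(proteome_iris):
    """Build the query above for one or more full proteome IRIs."""
    # Angle brackets keep the '#' of each IRI from being read as a comment.
    values = " ".join(f"(<{iri}>)" for iri in proteome_iris)
    return (
        "PREFIX up:<http://purl.uniprot.org/core/>\n"
        "SELECT ?proteome ?protein\n"
        "WHERE\n"
        "{\n"
        "    ?protein a up:Protein ;\n"
        "             up:reviewed true ;\n"
        "             up:proteome ?proteome .\n"
        f"    VALUES (?proteome) {{ {values} }}\n"
        "}\n"
    )


def request_url(query, endpoint=ENDPOINT):
    """Return a GET URL carrying the query in the 'query' parameter."""
    # urlencode percent-encodes '#', '<' and '>' so the URL stays valid.
    return endpoint + "?" + urlencode({"query": query})


query = proteome_query(["http://purl.uniprot.org/proteomes/UP000000625#Chromosome"])
print(query)
print(request_url(query))
```

Percent-encoding matters here for the same reason as in the query itself: an unencoded '#' in a URL would start the URL fragment and truncate the query.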
