Question

How to retrieve EC numbers and KOs for proteins of several taxons?

5.9 years ago

cleb ▴ 70

This is cross-posted from here.

I would like to use UniProt's SPARQL endpoint to retrieve all proteins that

1) are reviewed (required)
2) are associated with taxonomy ID 562 or 3702 (required)
3) have a KO associated with them (optional)
4) have "evidence for the existence of a protein" at either the protein or the transcript level (required)
5) have an EC number associated with them (required)

I have so far (points 1 and 2):

PREFIX up:<http://purl.uniprot.org/core/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>

SELECT ?protein ?taxon ?name
WHERE
{        
        ?taxon a up:Taxon .
        ?taxon up:scientificName ?name .
        VALUES ?taxonlist { taxon:562 taxon:3702 }
        ?taxon rdfs:subClassOf ?taxonlist .

        ?protein a up:Protein .
        ?protein up:organism ?taxon . 
        ?protein up:reviewed true .  # have to be reviewed        

}

This, however, does not return anything for 3702. How can this be fixed and how can I incorporate points 3-5?

Additionally, is there now a way to connect UniProt's SPARQL endpoint with Rhea's SPARQL endpoint to retrieve all associated reactions and their stoichiometries (with ChEBI IDs) for the proteins selected above? Example 19 seems to suggest that this connection is possible, but I am not quite sure how to accomplish it.

sparql uniprot semantic-web • 1.7k views


score 2 · Accepted Answer · 2018-06-06

1) This is already correct in the question's query, with

 ?protein up:reviewed true .

2) The query in the question does not return anything for taxon:3702 because Arabidopsis thaliana has no rdfs:subClassOf descendants; it is a leaf node. This means the entries are linked directly to that taxon rather than via its ancestors. The fix is to change the query slightly so that it handles both the ancestor case and the direct case (the two sides of the UNION below).

    VALUES ?taxonlist { taxon:3702 taxon:562}
    {
        ?taxon rdfs:subClassOf ?taxonlist .
        ?protein up:organism ?taxon . 
    } UNION {
        ?protein up:organism ?taxonlist . 
    }
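To check that the UNION now picks up both taxa, a small standalone count can be run first. This is only a sketch built from the patterns above; the `?proteins` variable name and the per-taxon grouping are illustrative additions, not part of the original query.

```sparql
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>

# Count reviewed proteins per requested taxon, covering both cases.
SELECT ?taxonlist (COUNT(DISTINCT ?protein) AS ?proteins)
WHERE
{
    VALUES ?taxonlist { taxon:3702 taxon:562 }
    ?protein a up:Protein .
    ?protein up:reviewed true .
    {
        # entry is annotated with a descendant of the requested taxon
        ?taxon rdfs:subClassOf ?taxonlist .
        ?protein up:organism ?taxon .
    } UNION {
        # entry is annotated directly with the requested taxon (leaf node case)
        ?protein up:organism ?taxonlist .
    }
} GROUP BY ?taxonlist
```

If the fix works as described, both taxon:3702 and taxon:562 should appear in the result with a non-zero count.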

3) We use the cross-reference section, where links are expressed via rdfs:seeAlso. Because an entry can have more than one KO, we group them with a subquery.

OPTIONAL {
    SELECT ?protein (GROUP_CONCAT(?ko; SEPARATOR=", ") AS ?kos)
    WHERE{
      ?protein rdfs:seeAlso ?ko .
      ?ko up:database <http://purl.uniprot.org/database/KO>
    } GROUP BY ?protein
}

4) To restrict the protein existence evidence to the protein or transcript level, we add

{
     ?protein up:existence up:Evidence_at_Protein_Level_Existence .
} UNION {
    ?protein up:existence up:Evidence_at_Transcript_Level_Existence .
}
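The same restriction could probably also be written with a VALUES block instead of a UNION, which some readers find easier to extend. This is a sketch of an equivalent pattern; the `?existence` variable name is introduced here for illustration.

```sparql
# Allowed existence levels, listed once; add or remove lines as needed.
VALUES ?existence {
    up:Evidence_at_Protein_Level_Existence
    up:Evidence_at_Transcript_Level_Existence
}
?protein up:existence ?existence .
```

Both forms should match the same proteins; the VALUES form additionally exposes the matched level as `?existence` if it is added to the SELECT clause.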

5) To make sure the entry is annotated as an enzyme, we use the same subquery idea as for the KO links, but this time not OPTIONAL. Because an entry can have several ECs, the subquery merges them into a single value with GROUP_CONCAT. The property path with up:enzyme covers the different ways UniProt links an ?ec to an entry: directly, or via a component or domain.

SELECT ?protein (GROUP_CONCAT(?ec; SEPARATOR=", ") AS ?ecs)
WHERE{
  ?protein up:enzyme|((up:component|up:domain)/up:enzyme) ?ec
} GROUP BY ?protein
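If bare EC numbers are preferred over full IRIs, the subquery might be adapted as sketched below. It assumes that EC IRIs have the form `<http://purl.uniprot.org/enzyme/1.1.1.1>`; that prefix and the `?ecNumber` variable are assumptions made for this illustration and should be checked against actual results.

```sparql
SELECT ?protein (GROUP_CONCAT(DISTINCT ?ecNumber; SEPARATOR=", ") AS ?ecs)
WHERE{
  ?protein up:enzyme|((up:component|up:domain)/up:enzyme) ?ec .
  # assumption: the EC number is whatever follows the enzyme namespace
  BIND(STRAFTER(STR(?ec), "http://purl.uniprot.org/enzyme/") AS ?ecNumber)
} GROUP BY ?protein
```

DISTINCT is added so that an EC reached both directly and via a component or domain is listed only once.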

Combining it all in one query gives

PREFIX up:<http://purl.uniprot.org/core/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>

SELECT 
    ?protein 
    ?taxon 
    ?name
    ?kos
    ?ecs
WHERE
{   
    ?protein a up:Protein .
    ?protein up:reviewed true .  # have to be reviewed        
    ?taxon a up:Taxon .
    ?taxon up:scientificName ?name .
    VALUES ?taxonlist { taxon:3702  taxon:562 }
    {
        ?taxon rdfs:subClassOf ?taxonlist .
        ?protein up:organism ?taxon . 
    } UNION {
        ?protein up:organism ?taxonlist . 
    }

    {
        ?protein up:existence up:Evidence_at_Protein_Level_Existence .
    } UNION {
        ?protein up:existence up:Evidence_at_Transcript_Level_Existence .
    }
    {
       SELECT ?protein (GROUP_CONCAT(?ec; SEPARATOR=", ") AS ?ecs)
       WHERE{
           ?protein up:enzyme|((up:component|up:domain)/up:enzyme) ?ec
       } GROUP BY ?protein
    }
    OPTIONAL {
        SELECT ?protein (GROUP_CONCAT(?ko; SEPARATOR=", ") AS ?kos)
        WHERE{
            ?protein rdfs:seeAlso ?ko .
            ?ko up:database <http://purl.uniprot.org/database/KO>
        } GROUP BY ?protein
    } 
}

This can be tested at sparql.uniprot.org.
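On the second part of the question (connecting to Rhea), a federated query with SERVICE is one possible route. The sketch below is not a tested solution. It assumes that catalytic activity annotations in UniProt link to Rhea reaction IRIs via `up:annotation/up:catalyticActivity/up:catalyzedReaction`, and that the Rhea endpoint at `https://sparql.rhea-db.org/sparql` exposes `rh:equation`. It only fetches the textual equation and does not resolve stoichiometries or ChEBI IDs.

```sparql
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX rh:<http://rdf.rhea-db.org/>

SELECT ?protein ?rhea ?equation
WHERE
{
    ?protein a up:Protein .
    ?protein up:reviewed true .
    ?protein up:organism taxon:3702 .
    # assumption: catalytic activity annotations carry the Rhea reaction IRI
    ?protein up:annotation/up:catalyticActivity/up:catalyzedReaction ?rhea .
    # ask the Rhea endpoint for the equation of each reaction found above
    SERVICE <https://sparql.rhea-db.org/sparql> {
        ?rhea rh:equation ?equation .
    }
} LIMIT 10
```

The LIMIT keeps the first run small; once the pattern is confirmed to return rows, the filters from the combined query above could be merged in.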