Question: How to retrieve EC numbers and KOs for proteins of several taxons?
13 months ago, cleb60 wrote:

This is cross-posted from here.

I would like to use UniProt's SPARQL endpoint to retrieve all proteins that

  1. are reviewed (required)
  2. are associated with taxonomy ID 562 or 3702 (required)
  3. have a KO associated with them (optional)
  4. have "evidence for the existence of a protein" at either protein or transcript level (required)
  5. have an EC number associated with them (required)

I have so far (points 1 and 2):

PREFIX up:<http://purl.uniprot.org/core/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>

SELECT ?protein ?taxon ?name
WHERE
{        
        ?taxon a up:Taxon .
        ?taxon up:scientificName ?name .
        VALUES ?taxonlist { taxon:562 taxon:3702 }
        ?taxon rdfs:subClassOf ?taxonlist .

        ?protein a up:Protein .
        ?protein up:organism ?taxon . 
        ?protein up:reviewed true .  # have to be reviewed        

}

This, however, does not return anything for 3702. How can this be fixed and how can I incorporate points 3-5?

Additionally, is there now a way to connect UniProt's SPARQL endpoint with Rhea's SPARQL endpoint to retrieve all associated reactions and their stoichiometries (with ChEBI IDs) for the proteins selected above? Example 19 seems to suggest that this connection is possible, but I am not quite sure how to accomplish it.

13 months ago, me690 (Switzerland) wrote:

1) This is already correct in your query:

 ?protein up:reviewed true .

2) The query in the question does not return anything for taxon:3702 because no taxon is an rdfs:subClassOf Arabidopsis thaliana; it is a leaf node. Its entries are therefore linked directly to that taxon rather than to one of its descendants. This is fixed by changing the query slightly so that it handles both the descendant case and the direct case (the two sides of the UNION below):

    VALUES ?taxonlist { taxon:3702 taxon:562 }
    {
        ?taxon rdfs:subClassOf ?taxonlist .
        ?protein up:organism ?taxon .
    } UNION {
        ?protein up:organism ?taxonlist .
        BIND(?taxonlist AS ?taxon)  # keep ?taxon bound in the direct case
    }
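
If the list of taxa changes often, the block above can be generated instead of edited by hand. The following is only a sketch in Python: the function name taxon_pattern is made up for illustration, and the BIND line is an addition that keeps ?taxon bound when the protein is linked directly to a listed taxon.

```python
def taxon_pattern(taxon_ids):
    """Return the VALUES + UNION block for a list of NCBI taxonomy IDs.

    taxon_pattern is a hypothetical helper, not part of any UniProt tooling.
    """
    ids = [int(t) for t in taxon_ids]  # raises ValueError for non-numeric input
    values = " ".join(f"taxon:{t}" for t in ids)
    return (
        f"VALUES ?taxonlist {{ {values} }}\n"
        "{\n"
        "    ?taxon rdfs:subClassOf ?taxonlist .\n"
        "    ?protein up:organism ?taxon .\n"
        "} UNION {\n"
        "    ?protein up:organism ?taxonlist .\n"
        "    BIND(?taxonlist AS ?taxon)  # direct case: the listed taxon itself\n"
        "}"
    )


print(taxon_pattern([3702, 562]))
```

The int() conversion is there so that only plain numeric IDs can end up in the query text.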

3) KO links live in the cross-reference section, which is modelled with rdfs:seeAlso. Because an entry can have more than one KO, we group them with a subquery:

OPTIONAL {
    SELECT ?protein (GROUP_CONCAT(?ko; SEPARATOR=", ") AS ?kos)
    WHERE{
      ?protein rdfs:seeAlso ?ko .
      ?ko up:database <http://purl.uniprot.org/database/KO>
    } GROUP BY ?protein
}

4) To restrict the results to entries whose existence is supported by evidence at protein or transcript level, we add

{
     ?protein up:existence up:Evidence_at_Protein_Level_Existence .
} UNION {
    ?protein up:existence up:Evidence_at_Transcript_Level_Existence .
}

5) To make sure the entry is annotated as an enzyme, we use the same subquery idea as for the KO links, but this time without OPTIONAL. The GROUP_CONCAT in the subquery collapses the potentially many ECs into a single value. The property path on the up:enzyme line covers the different ways UniProt links an ?ec to an entry: directly on the protein, or via one of its components or domains.

SELECT ?protein (GROUP_CONCAT(?ec; SEPARATOR=", ") AS ?ecs)
WHERE{
  ?protein up:enzyme|((up:component|up:domain)/up:enzyme) ?ec
} GROUP BY ?protein
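
Since ?ec and ?ko are IRIs, the GROUP_CONCAT values come back as one string of full IRIs separated by ", ". A small Python sketch for turning such a string into a list of short identifiers follows; the helper name short_ids is hypothetical, and the sample value only illustrates the shape of the output rather than a real query result.

```python
def short_ids(concatenated, separator=", "):
    """Split a GROUP_CONCAT value and keep the last path segment of each IRI."""
    if not concatenated:
        return []  # OPTIONAL columns such as ?kos can be missing
    return [
        iri.rstrip("/").rsplit("/", 1)[-1]
        for iri in concatenated.split(separator)
    ]


# Illustrative value, shaped like the ?ecs column
ecs = "http://purl.uniprot.org/enzyme/1.1.1.1, http://purl.uniprot.org/enzyme/2.7.1.1"
print(short_ids(ecs))  # → ['1.1.1.1', '2.7.1.1']
```

The separator has to match the SEPARATOR used in the query.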

Combining everything into one query gives

PREFIX up:<http://purl.uniprot.org/core/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>

SELECT 
    ?protein 
    ?taxon 
    ?name
    ?kos
    ?ecs
WHERE
{   
    ?protein a up:Protein .
    ?protein up:reviewed true .  # have to be reviewed        
    ?taxon a up:Taxon .
    ?taxon up:scientificName ?name .
    VALUES ?taxonlist { taxon:3702  taxon:562 }
    {
        ?taxon rdfs:subClassOf ?taxonlist .
        ?protein up:organism ?taxon .
    } UNION {
        ?protein up:organism ?taxonlist .
        BIND(?taxonlist AS ?taxon)  # keep ?taxon bound in the direct case
    }

    {
        ?protein up:existence up:Evidence_at_Protein_Level_Existence .
    } UNION {
        ?protein up:existence up:Evidence_at_Transcript_Level_Existence .
    }
    {
       SELECT ?protein (GROUP_CONCAT(?ec; SEPARATOR=", ") AS ?ecs)
       WHERE{
           ?protein up:enzyme|((up:component|up:domain)/up:enzyme) ?ec
       } GROUP BY ?protein
    }
    OPTIONAL {
        SELECT ?protein (GROUP_CONCAT(?ko; SEPARATOR=", ") AS ?kos)
        WHERE{
            ?protein rdfs:seeAlso ?ko .
            ?ko up:database <http://purl.uniprot.org/database/KO>
        } GROUP BY ?protein
    } 
}

This can be tested at sparql.uniprot.org.
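
To run the query from a script rather than the web form, something along these lines should work. It is a minimal Python sketch using only the standard library. The endpoint URL https://sparql.uniprot.org/sparql is an assumption here, and build_request, rows and run are illustrative names, not an official client.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://sparql.uniprot.org/sparql"  # assumed endpoint URL


def build_request(query, endpoint=ENDPOINT):
    """Build a POST request that asks for SPARQL JSON results."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )


def rows(results):
    """Flatten a parsed SPARQL JSON result into one dict per row."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


def run(query):
    """Send the query and return the rows (needs network access)."""
    with urllib.request.urlopen(build_request(query)) as response:
        return rows(json.load(response))
```

Calling run() with the combined query as a string returns one dict per protein. Rows without a KO simply lack the "kos" key, because unbound OPTIONAL variables are left out of the JSON bindings.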


I am running out of space for the Rhea part, so we will make a separate Q&A.

written 13 months ago by me690

I have now opened a new question here. Thanks for helping out!

written 13 months ago by cleb60

Did you have a chance to look at the second question (no pressure, just very curious :) )? If so, is this connection possible? Alternatively, one could maybe also try to get all reactions (substrates, products and stoichiometric factors) for all the EC numbers. Thanks!

modified 13 months ago • written 13 months ago by cleb60

My chemistry is a bit limited, so I need my colleague to help with the stoichiometric factors, and it's production week, so time is hard to come by.

written 13 months ago by me690

Thanks for the reply. I opened a more specific question for this here. I guess one can infer directly from the schema how to access the stoichiometries, but all my attempts failed.

modified 13 months ago • written 13 months ago by cleb60

I added a new post here; it would be great if you could take a look, thanks!

written 13 months ago by cleb60