Question

Downloading dataset of PTM sites from UniProt

6.8 years ago

rshipman ▴ 30

Hello,

I am looking to put together a data set of post-translational modification (PTM) sites from UniProt. I would like to download this data from the website and store it in a text or CSV file. The information in question is in the image below, enclosed in a red box.

data in question

I am currently working in R with the package UniProt.ws and am having a hard time pinning down where in the package something along these lines can be done. Maybe there is another package or language that is better suited for this job; I am not sure.

What would be the best option here? Is it possible to pull this data down with an R script? I do not want to copy and paste all of these sites for each protein in question. I basically only want the information on PTM / Processing from UniProt.

Any help would be great.

Edit (for user me or those interested):

Thank you user me, this is what I was looking for; I just need some help with which data is pulled and how it is displayed. I have never used this software before, so it is new to me. Do you know of any tutorials that are directly related to using SPARQL with UniProt? It looks like this is quite a useful language.

So this looks good, but I am missing some information, mainly the glycosylation sites. I would like to pull the information in the image below: all the PTMs that were pulled plus the glyco sites. I am not sure why they did not get pulled with this script. Example: N-Linked (........) -- I believe this would fall into the "text" column.

ptm+glycosites

What the script you typed provides is what I need, but I need a bit more. This next image shows what I am hoping for in the final data set. I would also like the protein entry and name if possible. I tried playing with the code but was unable to see how it all works.

Wanted Dataset Layout

Again, thank you so much for the help! Your write up has been great and any resources you can point me in the direction of would be great, this tool is amazing! :)

R PTM uniprot • 3.3k views

updated 6.8 years ago by me ▴ 760 • written 6.8 years ago by rshipman ▴ 30

score 6 · Answer 1 · 2017-07-11

A SPARQL query that gets most of the data

The various REST services at UniProt are excellent, but they get cumbersome when you look at our data in an annotation-centric way instead of an entry-specific way. I suggest that you use a SPARQL query in the following style instead, at http://sparql.uniprot.org.

PREFIX up:<http://purl.uniprot.org/core/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX faldo:<http://biohackathon.org/resource/faldo#>
SELECT 
       (SUBSTR(STR(?protein), 33) AS ?primaryAccession)
       (SUBSTR(STR(?sequence), 34) AS ?sequenceAccession)
       ?name
       ?begin 
       ?text 
       (SUBSTR(STR(?evidence), 32) AS ?eco)
       ?source
WHERE
{
  ?protein a up:Protein ;
         up:organism taxon:9606 ;  #change the taxid if interested in non human or delete if interested in all
         up:annotation ?annotation ;
         rdfs:label ?name . #this comes from the UniRef graph but is just what we need
  VALUES ?annotationType {
       up:Glycosylation_Annotation 
       up:Modified_Residue_Annotation 
       #add any type of annotation as documented at http://www.uniprot.org/core/
  }
  ?annotation a ?annotationType;
            rdfs:comment ?text ;
            up:range/faldo:begin
            [ faldo:position ?begin ;
                             faldo:reference ?sequence ] .
  OPTIONAL {
    [] rdf:object ?annotation ;
       up:attribution ?attribution .
    ?attribution up:evidence ?evidence .
    OPTIONAL {
      ?attribution up:source ?source
    }
  }
}

I added the evidence code in case you are thinking of training some algorithm.

The query selects the glycosylation and modified residue annotations; if you need more, please edit your question and I will adapt this answer.

You can use the R SPARQL package to do most of the heavy lifting when parsing the outputs. The URIs in the output can be shortened to just accessions and ECO codes, either in the query or, of course, in your R code.
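
As a rough illustration of that post-processing step, here is a minimal sketch in Python (the same idea carries over to R). It assumes the endpoint accepts a standard SPARQL protocol GET request with a `query` parameter and returns CSV when asked via an `Accept: text/csv` header. The URL `https://sparql.uniprot.org/sparql`, the helper names and the sample row are assumptions made for illustration, not real UniProt output.

```python
import csv
import io
import urllib.parse
import urllib.request

# Assumed endpoint URL; check the UniProt SPARQL page for the current one.
ENDPOINT = "https://sparql.uniprot.org/sparql"

# URI prefixes that the SUBSTR offsets (33, 34, 32) in the query skip over.
PREFIXES = (
    "http://purl.uniprot.org/uniprot/",
    "http://purl.uniprot.org/isoforms/",
    "http://purl.obolibrary.org/obo/",
)


def build_request(query):
    """Build (but do not send) a SPARQL protocol GET request asking for CSV."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(url, headers={"Accept": "text/csv"})


def shorten(value):
    """Strip a known URI prefix, leaving e.g. an accession or an ECO code."""
    for prefix in PREFIXES:
        if value.startswith(prefix):
            return value[len(prefix):]
    return value


def parse_csv_results(text):
    """Parse SPARQL CSV results into dicts with shortened URIs."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: shorten(val) for key, val in row.items()} for row in reader]


# Made-up sample in the shape of SPARQL CSV results (not real UniProt data).
SAMPLE = (
    "protein,begin,text,evidence\r\n"
    "http://purl.uniprot.org/uniprot/P00000,42,Phosphoserine,"
    "http://purl.obolibrary.org/obo/ECO_0000269\r\n"
)

rows = parse_csv_results(SAMPLE)
```

To run this against the live service, the text to parse would come from something like `urllib.request.urlopen(build_request(query)).read().decode()`. Each SUBSTR offset in the query is simply the length of the matching URI prefix plus one.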

Selecting different "types" of annotation

Each type of annotation is given its own class, which you select with the predicate "a" (shorthand for rdf:type) in the SPARQL query.

?x a up:Modified_Residue_Annotation .

or

?y a up:Glycosylation_Annotation .

In the query above they are listed in the VALUES clause that binds ?annotationType.
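
If you want to generate that list from a script rather than edit the query by hand, a small helper along these lines could work. This is only a sketch in Python: the function name `values_clause` is hypothetical, and the two class names are the ones already used in the query above.

```python
def values_clause(variable, class_names, prefix="up"):
    """Render a SPARQL VALUES clause listing one class per line."""
    lines = [f"  VALUES ?{variable} {{"]
    lines += [f"    {prefix}:{name}" for name in class_names]
    lines.append("  }")
    return "\n".join(lines)


# Only the two annotation classes from the query above are used here.
clause = values_clause(
    "annotationType",
    ["Glycosylation_Annotation", "Modified_Residue_Annotation"],
)
print(clause)
```

Any further class names should be taken from the UniProt core documentation rather than guessed.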

SPARQL 1.1. allows

Protein names: which one to select

Protein names as recorded in UniProt are tricky. There are different names grouped in interesting ways. What you are looking for are the submitted and recommended names, with a preference for the recommended name in the case of a Swiss-Prot entry. There are a number of name types, but fullName is most likely the one you want.

As there is at most one recommendedName, that is easy to get into a query.

OPTIONAL {
    ?protein up:recommendedName/up:fullName ?name .  
 }

Then add ?name to the list of things you want to SELECT. However, any entry can have lots of submittedNames, so that is more complicated.

It can be done with a subquery.

  OPTIONAL {
     FILTER(!BOUND(?name))
     {
            SELECT ?protein
                 (GROUP_CONCAT(?fName; separator=', ') as ?name) 
            WHERE{
                 ?protein up:submittedName/up:fullName ?fName .
            } GROUP BY ?protein
      }
   }

This needs to come after the previous OPTIONAL. However, adding it craters the performance of the query, so the question is whether this information is worth it. The third option is to use ?protein rdfs:label ?name . This label comes from the UniRef graph, which has it as a shortcut to be able to regenerate the UniRefXML.
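
If the subquery is too slow, one possible workaround is to fetch recommended and submitted names in two separate, simpler queries and apply the preference on the client side. The sketch below, in Python, shows only that merging step. The function name, the input shapes and the accessions are made up for illustration, and this is a suggestion rather than something the endpoint requires.

```python
from collections import defaultdict


def choose_names(recommended, submitted_rows, separator=", "):
    """Prefer the recommended name; otherwise join all submitted names."""
    submitted = defaultdict(list)
    for accession, name in submitted_rows:
        submitted[accession].append(name)
    names = dict(recommended)
    for accession, candidates in submitted.items():
        if accession not in names:  # same role as FILTER(!BOUND(?name))
            names[accession] = separator.join(candidates)
    return names


# Made-up inputs: one entry with a recommended name, one with only submitted names.
recommended = {"P00001": "Example kinase"}
submitted_rows = [
    ("A0A000", "Name one"),
    ("A0A000", "Name two"),
    ("P00001", "Ignored submitted name"),
]
names = choose_names(recommended, submitted_rows)
```

The join with ", " mirrors the GROUP_CONCAT separator used in the subquery.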

Tutorials and further info

For SPARQL in general I recommend the book Learning SPARQL by Bob DuCharme, but you can also follow a tutorial I have given in collaboration with neXtProt, for which you can find the materials in this repository. There are also a bunch of videos on YouTube about why we provide a SPARQL endpoint for UniProt.

score 3 · Answer 2 · 2017-07-12

Evidence

OPTIONAL {
    [] rdf:object ?annotation ;
       up:attribution ?attribution .
    ?attribution up:evidence ?evidence .
    OPTIONAL {
        ?attribution up:source ?source
    }
}

Not every evidence has a source, but when one does, the two are related via the up:source predicate.

Getting the labels for the ECO codes can be done via a federated query to the EBI RDF platform, which has the full ECO ontology in its OLS part.

SERVICE<https://www.ebi.ac.uk/rdf/services/sparql>{
   ?evidence rdfs:label ?evidenceLabel .
}

Unfortunately when combining it with the above query we run into a bug in the SPARQL engine that we use :(

However, I can hack around it for you by adding this prefix at the top of the query

PREFIX ECO:<http://purl.obolibrary.org/obo/ECO_0000>

and then putting this at the end of the query.

VALUES (?evidenceCode ?evidenceLabel)
{
    (ECO:269 "Inferred from experiment")
    (ECO:314 "Inferred from direct assay")
    (ECO:353 "Inferred from physical interaction")
    (ECO:315 "Inferred from mutant phenotype")
    (ECO:316 "Inferred from genetic interaction")
    (ECO:270 "Inferred from expression pattern")
    (ECO:250 "Inferred from sequence or structural similarity")
    (ECO:266 "Inferred from sequence orthology")
    (ECO:247 "Inferred from sequence alignment")
    (ECO:255 "Inferred from sequence model")
    (ECO:317 "Inferred from genomic context")
    (ECO:318 "Inferred from biological aspect of ancestor")
    (ECO:319 "Inferred from biological aspect of descendant")
    (ECO:320 "Inferred from key residues")
    (ECO:321 "Inferred from rapid divergence")
    (ECO:245 "Inferred from reviewed computational analysis")
    (ECO:304 "Traceable author statement")
    (ECO:303 "Non-traceable author statement")
    (ECO:305 "Inferred by curator")
    (ECO:307 "No biological data available")
    (ECO:501 "Inferred from electronic annotation")
    (ECO:312 "Manually imported")
    (ECO:313 "Automatically imported")
    (ECO:256 "Automatically inferred from sequence model")
    (ECO:244 "Combinatorial evidence used in manual assertion")
    (ECO:213 "Combinatorial evidence used in automatic assertion")
    (ECO:260 "Match to InterPro member signature evidence used in manual assertion")
    (ECO:259 "Match to InterPro member signature evidence used in automatic assertion")
    }
     FILTER(sameTerm(?evidenceCode, ?evidence))
  }

This basically builds a temporary table inside the query and matches the labels as used in the UniProt.org website code base (that is where I got the list from ;)
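
The same lookup can also be done after download instead of inside the query. Here is a minimal sketch in Python, assuming the evidence column holds either full ECO URIs or bare codes such as ECO_0000269. The dictionary contains only a few rows copied from the VALUES table above, and the function name is hypothetical.

```python
ECO_PREFIX = "http://purl.obolibrary.org/obo/"

# A few rows copied from the VALUES table above; extend as needed.
ECO_LABELS = {
    "ECO_0000269": "Inferred from experiment",
    "ECO_0000250": "Inferred from sequence or structural similarity",
    "ECO_0000255": "Inferred from sequence model",
    "ECO_0000303": "Non-traceable author statement",
    "ECO_0000305": "Inferred by curator",
}


def evidence_label(evidence):
    """Return the label for an ECO URI or bare code, or None if unknown."""
    if evidence.startswith(ECO_PREFIX):
        evidence = evidence[len(ECO_PREFIX):]
    return ECO_LABELS.get(evidence)
```

Returning None for unknown codes keeps rows with unlisted evidence types in the data set instead of silently dropping them, which is what the FILTER in the query would do.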