I'm trying to get the taxonomic lineage from UniProt with the following SPARQL query (based on these two answers):
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix taxon: <http://purl.uniprot.org/taxonomy/>
prefix : <http://purl.uniprot.org/core/>
select ?ancestor ?name ?rank ?part_of_lineage
where {
    taxon:9597 rdfs:subClassOf ?ancestor .
    ?ancestor :scientificName ?name ;
              :partOfLineage ?part_of_lineage ;
              :rank ?rank .
} order by ?rank
This query yields 14 entries:
ancestor       name              rank           part_of_lineage
taxon:40674    Mammalia          :Class         true
taxon:9604     Hominidae         :Family        true
taxon:9596     Pan               :Genus         true
taxon:314293   Simiiformes       :Infraorder    false
taxon:33208    Metazoa           :Kingdom       true
taxon:9443     Primates          :Order         true
taxon:9526     Catarrhini        :Parvorder     true
taxon:7711     Chordata          :Phylum        true
taxon:207598   Homininae         :Subfamily     false
taxon:376913   Haplorrhini       :Suborder      true
taxon:89593    Craniata          :Subphylum     true
taxon:314295   Hominoidea        :Superfamily   false
taxon:2759     Eukaryota         :Superkingdom  true
taxon:314146   Euarchontoglires  :Superorder    true
You can try it with YASGUI.
Questions
Note that, unlike in the referenced answer, I used `rdfs:subClassOf` without `+`, because if I use `rdfs:subClassOf+`, I get this error message from UniProt:
Exception:virtuoso.jdbc4.VirtuosoException: TN...: Exceeded 1000000000 bytes in transitive temp memory. use t_distinct, t_max or more T_MAX_memory options to limit the search or increase the pool
Is it a bug in their storage backend, or am I misusing `rdfs:subClassOf+`?
As far as I understand, the `rdfs:subClassOf` relationship is _semantically_ transitive, but it should connect only directly related entities. So if you want the direct ancestor, you use it once; if you want all ancestors, you use the "property paths" feature with `rdfs:subClassOf+`.
But as far as I can see from the results above and this query:
describe <http://purl.uniprot.org/taxonomy/9597> from <http://sparql.uniprot.org/taxonomy>
each node in the UniProt taxonomy graph is a subclass of many other nodes. Why is it so and how can I get just the direct parent of a given taxon in this situation?
Having many ancestors, is there a way to order them (using SPARQL, without postprocessing results) _by taxonomic rank_ (not lexicographically as in the above query)? This would solve the previous question.
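For reference, the postprocessing I would like to avoid looks roughly like the minimal Python sketch below. The `RANK_ORDER` list and the `sort_by_rank` helper are my own assumptions for illustration: the list is just my guess at how the ranks returned above nest, not something read from the UniProt data.

```python
# Assumed nesting of the ranks seen in the query result, broadest first.
# This ordering is an assumption of mine, not taken from UniProt.
RANK_ORDER = [
    "Superkingdom", "Kingdom", "Phylum", "Subphylum", "Class",
    "Superorder", "Order", "Suborder", "Infraorder", "Parvorder",
    "Superfamily", "Family", "Subfamily", "Genus",
]
RANK_INDEX = {rank: position for position, rank in enumerate(RANK_ORDER)}

# (taxon id, scientific name, rank) rows copied from the query result.
ROWS = [
    (40674, "Mammalia", "Class"),
    (9604, "Hominidae", "Family"),
    (9596, "Pan", "Genus"),
    (314293, "Simiiformes", "Infraorder"),
    (33208, "Metazoa", "Kingdom"),
    (9443, "Primates", "Order"),
    (9526, "Catarrhini", "Parvorder"),
    (7711, "Chordata", "Phylum"),
    (207598, "Homininae", "Subfamily"),
    (376913, "Haplorrhini", "Suborder"),
    (89593, "Craniata", "Subphylum"),
    (314295, "Hominoidea", "Superfamily"),
    (2759, "Eukaryota", "Superkingdom"),
    (314146, "Euarchontoglires", "Superorder"),
]


def sort_by_rank(rows):
    """Return the rows ordered from the broadest rank to the narrowest."""
    return sorted(rows, key=lambda row: RANK_INDEX[row[2]])


if __name__ == "__main__":
    for taxon_id, name, rank in sort_by_rank(ROWS):
        print(f"{rank:<13} {name} (taxon:{taxon_id})")
```

What I am looking for is a way to get the same ordering from the SPARQL endpoint itself.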
If you open _Pan paniscus_ (9597) from the example above on UniProt, you will see that its lineage is much longer, but some nodes in it are grey. How is this lineage on the UniProt website related to the results of the query?
If you check the NCBI Taxonomy, the _abbreviated_ lineage is also 14 nodes:
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Pan
But not all of them coincide! So what do I get in those results? Some random subset of the lineage?
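To make the mismatch concrete, here is a small Python sketch that compares the two 14-node lists by name. Both lists are copied from this post; the `compare_lineages` helper is a hypothetical name of mine, and matching by scientific name alone is a simplifying assumption.

```python
# Scientific names returned by the SPARQL query above.
QUERY_RESULT = {
    "Mammalia", "Hominidae", "Pan", "Simiiformes", "Metazoa", "Primates",
    "Catarrhini", "Chordata", "Homininae", "Haplorrhini", "Craniata",
    "Hominoidea", "Eukaryota", "Euarchontoglires",
}

# Abbreviated lineage as shown by the NCBI Taxonomy.
NCBI_ABBREVIATED = {
    "Eukaryota", "Metazoa", "Chordata", "Craniata", "Vertebrata",
    "Euteleostomi", "Mammalia", "Eutheria", "Euarchontoglires", "Primates",
    "Haplorrhini", "Catarrhini", "Hominidae", "Pan",
}


def compare_lineages(query_names, ncbi_names):
    """Return (shared, only_in_query, only_in_ncbi) as sets of names."""
    shared = query_names & ncbi_names
    return shared, query_names - ncbi_names, ncbi_names - query_names


if __name__ == "__main__":
    shared, only_query, only_ncbi = compare_lineages(QUERY_RESULT, NCBI_ABBREVIATED)
    print("shared:", len(shared))
    print("only in query result:", sorted(only_query))
    print("only in NCBI abbreviated lineage:", sorted(only_ncbi))
```

By this comparison, 11 names are shared; Simiiformes, Hominoidea and Homininae appear only in the query result, while Vertebrata, Euteleostomi and Eutheria appear only in the NCBI abbreviated lineage.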
Finally, what does the `:partOfLineage` property mean? The documentation says: "True for taxa that can appear as part of an organism's lineage"
But I don't understand what it means. Aren't all nodes parts of some lineage?
P.S. I read the UniProt Taxonomy and Taxonomic lineage documentation, but it doesn't answer my questions.
UPDATE: Regarding my claim in question 5.
Here is the lineage from UniProt (9597):
- ✔︎ Eukaryota
- ✘ Opisthokonta
- ✔︎ Metazoa
- ✘ Eumetazoa
- ✘ Bilateria
- ✘ Deuterostomia
- ✔︎ Chordata
- ✔︎ Craniata
- ✘ Vertebrata
- ✘ Gnathostomata
- ✘ Teleostomi
- ✘ Euteleostomi
- ✘ Sarcopterygii
- ✘ Dipnotetrapodomorpha
- ✘ Tetrapoda
- ✘ Amniota
- ✔︎ Mammalia
- ✘ Theria
- ✘ Eutheria
- ✘ Boreoeutheria
- ✔︎ Euarchontoglires
- ✔︎ Primates
- ✔︎ Haplorrhini
- ✔︎ Simiiformes
- ✔︎ Catarrhini
- ✔︎ Hominoidea
- ✔︎ Hominidae
- ✔︎ Homininae
- ✔︎ Pan
Those in bold are the ones with `:partOfLineage` true. The checkmarks/crosses on the left show whether the taxon is present in or absent from the query result. Note that the result contains both types of nodes (not only those from the abbreviated lineage).