Question

NCBI Taxonomy Database in XML Format

3

11.8 years ago

Ctanes ▴ 70

I would like to get the full lineage tree downstream of a phylum (all subcategories down to species level) in XML format from the NCBI Taxonomy database. I tried NCBI eutils: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=1239&retmode=xml However, that request does not return the children of the node. Is there an easy way to obtain such a file?

Thanks

taxonomy xml • 5.8k views

score 2 · Answer 1 · 2013-09-09

11.8 years ago

Pierre Lindenbaum 166k

As far as I know, there is no such ready-made tree. You can build the whole tree yourself from the NCBI dump. See, for example, "Last Common Ancestor from NCBI Taxonomy using Java".
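A minimal sketch of that idea in Python, assuming "the NCBI dump" means the taxdump archive and its nodes.dmp file, where each row starts with tax_id, parent tax_id and rank separated by "|". The function names and the toy rows are made up for illustration; real use would read the lines from nodes.dmp instead of SAMPLE.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Tuple


def parse_nodes(lines: Iterable[str]) -> Tuple[Dict[int, List[int]], Dict[int, str]]:
    """Build parent -> children lists and tax_id -> rank from nodes.dmp-style rows."""
    children: Dict[int, List[int]] = defaultdict(list)
    ranks: Dict[int, str] = {}
    for line in lines:
        if not line.strip():
            continue
        # Only the first three "|"-separated fields are needed here.
        fields = [field.strip() for field in line.split("|")]
        tax_id, parent_id, rank = int(fields[0]), int(fields[1]), fields[2]
        ranks[tax_id] = rank
        if tax_id != parent_id:  # the root row lists itself as its parent
            children[parent_id].append(tax_id)
    return children, ranks


def descendants(children: Dict[int, List[int]], root: int) -> List[int]:
    """Return every tax ID below root, in breadth-first order."""
    found: List[int] = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            found.append(child)
            queue.append(child)
    return found


# Toy rows in the nodes.dmp layout (IDs are invented, not real taxa).
SAMPLE = [
    "1\t|\t1\t|\tno rank\t|",
    "10\t|\t1\t|\tphylum\t|",
    "100\t|\t10\t|\tclass\t|",
    "101\t|\t10\t|\tclass\t|",
    "1000\t|\t100\t|\tspecies\t|",
    "20\t|\t1\t|\tphylum\t|",
]

children, ranks = parse_nodes(SAMPLE)
subtree = descendants(children, 10)
```

With the subtree IDs in hand, the XML for each node could then be written out locally or fetched per ID with eutils.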

score 1 · Answer 2 · 2013-11-25

You could use the BioPortal SPARQL endpoint to obtain the children. The following SPARQL query returns the children, grandchildren and great-grandchildren of your root taxon. Each extra level needs another nested optional block, so you have to adapt the query to the maximum depth of the tree under scrutiny.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT *
WHERE {
    ?child rdfs:subClassOf <http://purl.obolibrary.org/obo/NCBITaxon_1239> .
    ?child rdfs:label ?childLabel .
    optional {
        ?grandchild rdfs:subClassOf ?child .
        ?grandchild rdfs:label ?grandchildLabel .
        optional {
            ?greatgrandchild rdfs:subClassOf ?grandchild .
            ?greatgrandchild rdfs:label ?greatgrandchildLabel .
        }
    }
}
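Since the nesting has to match the depth of the tree, the query text can be generated rather than written by hand. The sketch below is one way to do that in Python; the function name and the ?level1, ?level2, ... variable names are my own choices for illustration, not something the endpoint requires.

```python
def build_descendant_query(taxon_id: int, depth: int) -> str:
    """Return a SPARQL query with one nested OPTIONAL block per extra level."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    lines = [
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
        "SELECT DISTINCT *",
        "WHERE {",
    ]
    parent = f"<http://purl.obolibrary.org/obo/NCBITaxon_{taxon_id}>"
    for level in range(1, depth + 1):
        var = f"?level{level}"
        if level > 1:
            # Deeper levels are optional so that shallow branches still match.
            lines.append("    " * (level - 1) + "OPTIONAL {")
        indent = "    " * level
        lines.append(f"{indent}{var} rdfs:subClassOf {parent} .")
        lines.append(f"{indent}{var} rdfs:label {var}Label .")
        parent = var
    # Close the OPTIONAL blocks from the innermost outwards, then WHERE.
    for level in range(depth, 1, -1):
        lines.append("    " * (level - 1) + "}")
    lines.append("}")
    return "\n".join(lines)


query = build_descendant_query(1239, 3)
```

With depth set to 3 this produces the same shape as the hand-written query above (children, grandchildren, great-grandchildren); a deeper tree only needs a larger depth value.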

This query only gives you each node's URI and label. To obtain the full tree, you could combine a simpler SPARQL query, which returns the taxon IDs of the direct children, with eutils:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?taxonid
WHERE {
    ?child rdfs:subClassOf <http://purl.obolibrary.org/obo/NCBITaxon_1239> .
    ?child skos:notation ?taxonid .
}

If you run this query in your preferred browser and copy the resulting URL, you can reuse that URL, changing only the taxon ID, to iterate over the different subclasses. To submit SPARQL queries programmatically, you first need to get an API key.
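Instead of copying the URL from the browser, the same request URL can be assembled in code. This is a rough Python sketch that assumes the endpoint accepts the query and apikey parameters in the query string, as in the browser URL; the function and constant names are hypothetical.

```python
from urllib.parse import urlencode

# Endpoint as given in this answer; it may have moved since.
ENDPOINT = "http://sparql.bioontology.org/sparql"

# Doubled braces are literal braces once str.format() has run.
CHILD_QUERY = """PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?taxonid
WHERE {{
    ?child rdfs:subClassOf <http://purl.obolibrary.org/obo/NCBITaxon_{taxon_id}> .
    ?child skos:notation ?taxonid .
}}"""


def child_query_url(taxon_id: int, api_key: str) -> str:
    """Build the GET URL that asks for the direct children of one taxon."""
    query = CHILD_QUERY.format(taxon_id=taxon_id)
    # urlencode percent-encodes the query text, so no manual escaping is needed.
    return ENDPOINT + "?" + urlencode({"query": query, "apikey": api_key})


# "MY-KEY" is a placeholder, not a working key.
url = child_query_url(1239, "MY-KEY")
```

Calling child_query_url with each new taxon ID gives the URL for the next level down.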

The following pipeline would get what you want:

1. Get the children. You need to provide your API key and the taxonomy ID in the URL (the placeholders in braces and capitals):

curl "http://sparql.bioontology.org/sparql?query=PREFIX+omv%3A+%3Chttp%3A%2F%2Fomv.ontoware.org%2F2005%2F05%2Fontology%23%3E%0D%0APREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0D%0APREFIX+skos%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E%0D%0ASELECT+DISTINCT+%3Ftaxonid%0D%0AWHERE+%7B%0D%0A%09%3Fchild+rdfs%3AsubClassOf+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FNCBITaxon_{$TAXONID}%3E++.%0D%0A++++++++%3Fchild+skos%3Anotation+%3Ftaxonid+.%0D%0A%7D%0D%0A++++++++&apikey={YOUR API KEY HERE}"

2. Extract the taxon ID of each child.
3. Use eutils to get the XML of that child.
4. Repeat from step 1 for each child until the full tree is processed.
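The loop described above can be sketched in Python as a breadth-first walk. This is only an outline under assumptions: it assumes the endpoint can return results in the standard SPARQL JSON results format (it may return XML by default), and the network call is left as a pluggable fetch_children function so the traversal logic can be tried offline. All function names are hypothetical.

```python
import json
from collections import deque
from typing import Callable, Dict, Iterable, List

EFETCH = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def taxon_ids_from_json(payload: str) -> List[str]:
    """Extract the ?taxonid values from a SPARQL JSON results document."""
    bindings = json.loads(payload)["results"]["bindings"]
    return [row["taxonid"]["value"] for row in bindings if "taxonid" in row]


def efetch_url(taxon_id: str) -> str:
    """URL of the eutils XML record for one taxon (same form as in the question)."""
    return f"{EFETCH}?db=taxonomy&id={taxon_id}&retmode=xml"


def walk_tree(root: str,
              fetch_children: Callable[[str], Iterable[str]]) -> Dict[str, List[str]]:
    """Visit every taxon below root once and record its direct children."""
    tree: Dict[str, List[str]] = {}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node in tree:  # already processed
            continue
        kids = list(fetch_children(node))
        tree[node] = kids
        queue.extend(kids)
    return tree


# Offline stand-in for the SPARQL call, using invented IDs.
FAKE_CHILDREN = {"10": ["11", "12"], "11": ["110"]}
tree = walk_tree("10", lambda taxon: FAKE_CHILDREN.get(taxon, []))
urls = [efetch_url(taxon) for taxon in tree]
```

In a real run, fetch_children would request the child-query URL for the given taxon and pass the response body to taxon_ids_from_json; each URL in urls would then be fetched to get the XML for that node. It is worth pausing between requests to stay within the usage limits of both services.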