NCBI Taxonomy Database in XML Format
11.2 years ago
Ctanes ▴ 70

I would like to get the full lineage tree downstream of a phylum (all subcategories down to species level) in XML format from the NCBI Taxonomy database. I tried NCBI eutils: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=1239&retmode=xml. However, that request does not return the children of the node. Is there an easy way to obtain such a file?

Thanks

taxonomy xml • 5.4k views
11.2 years ago

As far as I know, there is no such ready-made tree. You can build the whole tree yourself from the NCBI taxonomy dump. See, for example, "Last Common Ancestor from NCBI Taxonomy using Java".
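As a rough illustration of building the subtree from the dump, here is a minimal Python sketch. It assumes the `nodes.dmp` and `names.dmp` files from the NCBI taxdump archive, whose fields are separated by tab-pipe-tab and whose lines end with tab-pipe. The `<taxon>` element and its `id`, `rank`, and `name` attributes are made up for this example and are not an NCBI schema.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def parse_dmp_line(line):
    """Split one taxdump line into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):  # every line ends with a trailing "\t|"
        line = line[:-2]
    return [field.strip() for field in line.split("\t|\t")]


def load_nodes(lines):
    """Return (children, rank) built from nodes.dmp lines."""
    children = defaultdict(list)
    rank = {}
    for line in lines:
        fields = parse_dmp_line(line)
        tax_id, parent_id, node_rank = fields[0], fields[1], fields[2]
        rank[tax_id] = node_rank
        if tax_id != parent_id:  # the root node lists itself as parent
            children[parent_id].append(tax_id)
    return children, rank


def load_names(lines):
    """Return {tax_id: scientific name} built from names.dmp lines."""
    names = {}
    for line in lines:
        fields = parse_dmp_line(line)
        if fields[3] == "scientific name":
            names[fields[0]] = fields[1]
    return names


def make_element(parent, tax_id, rank, names):
    """Create one <taxon> element (hypothetical layout, not an NCBI schema)."""
    attrs = {"id": tax_id, "rank": rank.get(tax_id, ""), "name": names.get(tax_id, "")}
    if parent is None:
        return ET.Element("taxon", attrs)
    return ET.SubElement(parent, "taxon", attrs)


def build_subtree(root_id, children, rank, names):
    """Build the XML subtree below root_id without recursion."""
    root = make_element(None, root_id, rank, names)
    stack = [(root_id, root)]
    while stack:
        current, element = stack.pop()
        for child in sorted(children.get(current, []), key=int):
            stack.append((child, make_element(element, child, rank, names)))
    return root
```

With the real files, something like `children, rank = load_nodes(open("nodes.dmp"))` and `names = load_names(open("names.dmp"))`, followed by `ET.tostring(build_subtree("1239", children, rank, names), encoding="unicode")`, should give the XML for everything below taxon 1239.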

11.0 years ago

You could use the BioPortal SPARQL endpoint to obtain the children. The following SPARQL query returns the children, grandchildren, and great-grandchildren of the given taxon. You need to adapt the nesting of the query to the maximum depth of the tree you are interested in.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT *
WHERE {
    ?child rdfs:subClassOf <http://purl.obolibrary.org/obo/NCBITaxon_1239> .
    ?child rdfs:label ?childLabel .
    OPTIONAL {
        ?grandchild rdfs:subClassOf ?child .
        ?grandchild rdfs:label ?grandchildLabel .
        OPTIONAL {
            ?greatgrandchild rdfs:subClassOf ?grandchild .
            ?greatgrandchild rdfs:label ?greatgrandchildLabel .
        }
    }
}
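Adapting the query to a different depth by hand gets tedious, so the nesting can be generated. The following Python sketch is only an illustration: the function name `nested_children_query` and the variable names `?level1`, `?level2`, and so on are my own choices, and I have not checked how deep a query the endpoint will accept.

```python
def nested_children_query(taxon_id, depth):
    """Build a SPARQL query with one nested OPTIONAL block per extra level."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    lines = [
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
        "SELECT DISTINCT *",
        "WHERE {",
    ]
    parent = f"<http://purl.obolibrary.org/obo/NCBITaxon_{taxon_id}>"
    for level in range(1, depth + 1):
        indent = "  " * level
        var = f"?level{level}"
        if level > 1:
            # Levels below the first are optional so that leaves still match.
            lines.append(f"{indent}OPTIONAL {{")
        lines.append(f"{indent}  {var} rdfs:subClassOf {parent} .")
        lines.append(f"{indent}  {var} rdfs:label {var}Label .")
        parent = var
    # Close the OPTIONAL blocks from the innermost one outwards.
    for level in range(depth, 1, -1):
        lines.append("  " * level + "}")
    lines.append("}")
    return "\n".join(lines)
```

Calling `nested_children_query(1239, 3)` produces a query equivalent to the one above, with `?level1` to `?level3` in place of `?child`, `?grandchild`, and `?greatgrandchild`.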

This query gives you only the URI and the label of each taxon. To obtain the full tree, you could combine a simpler SPARQL query, which returns just the taxonomy IDs of the direct children, with eutils:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?taxonid
WHERE {
    ?child rdfs:subClassOf <http://purl.obolibrary.org/obo/NCBITaxon_1239> .
    ?child skos:notation ?taxonid .
}

If you run this query in your preferred browser and copy the resulting URL, you can reuse that URL with a different taxonomy ID each time to iterate over the subclasses. To submit SPARQL queries programmatically, you first need to obtain an API key.

The following pipeline would get what you want:

  1. Get the children. You need to provide your API key and the taxonomy ID in the URL (the placeholders in braces and capitals):

    curl "http://sparql.bioontology.org/sparql?query=PREFIX+omv%3A+%3Chttp%3A%2F%2Fomv.ontoware.org%2F2005%2F05%2Fontology%23%3E%0D%0APREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0D%0APREFIX+skos%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E%0D%0ASELECT+DISTINCT+%3Ftaxonid%0D%0AWHERE+%7B%0D%0A%09%3Fchild+rdfs%3AsubClassOf+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FNCBITaxon_{$TAXONID}%3E++.%0D%0A++++++++%3Fchild+skos%3Anotation+%3Ftaxonid+.%0D%0A%7D%0D%0A++++++++&apikey={YOUR API KEY HERE}"
    
  2. Extract the taxon ID of each child

  3. Use eutils to get the XML of that child

  4. Repeat from step 1 for each child until the full tree is processed.
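The loop in steps 1 to 4 could be sketched in Python roughly as follows. The two endpoint URLs and the query are taken from this thread. Everything else is an assumption for illustration: the function names are mine, and `get_children` is a hypothetical callable that you would write yourself to download `children_url(...)` and extract the taxon IDs, since the format of the endpoint's response is not shown here.

```python
from collections import deque
from urllib.parse import urlencode

SPARQL_ENDPOINT = "http://sparql.bioontology.org/sparql"
EFETCH_ENDPOINT = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Doubled braces are literal braces; {taxon_id} is filled in by str.format.
CHILD_QUERY = """PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?taxonid
WHERE {{
    ?child rdfs:subClassOf <http://purl.obolibrary.org/obo/NCBITaxon_{taxon_id}> .
    ?child skos:notation ?taxonid .
}}"""


def children_url(taxon_id, api_key):
    """Step 1: URL that asks the SPARQL endpoint for the direct children."""
    query = CHILD_QUERY.format(taxon_id=taxon_id)
    return SPARQL_ENDPOINT + "?" + urlencode({"query": query, "apikey": api_key})


def efetch_url(taxon_id):
    """Step 3: eutils URL that returns the XML record of one taxon."""
    params = {"db": "taxonomy", "id": taxon_id, "retmode": "xml"}
    return EFETCH_ENDPOINT + "?" + urlencode(params)


def walk_tree(root_id, get_children):
    """Steps 2 and 4: breadth-first walk yielding (parent_id, taxon_id) pairs.

    get_children is a hypothetical callable mapping a taxon ID to a list of
    child taxon IDs, for example by fetching children_url() and parsing it.
    """
    seen = {root_id}
    queue = deque([(None, root_id)])
    while queue:
        parent, current = queue.popleft()
        yield parent, current
        for child in get_children(current):
            if child not in seen:  # guard against visiting a taxon twice
                seen.add(child)
                queue.append((current, child))
```

For each pair yielded by `walk_tree("1239", get_children)`, you would then download `efetch_url(taxon_id)` and attach the record under its parent. Note that this makes one request per taxon, so for a whole phylum it will be slow and you should pace the requests.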
