Question: NCBI Taxonomy Database in XML Format
6.9 years ago by
United States
Ctanes70 wrote:

I would like to get the full lineage tree downstream of a phylum (all subcategories down to species level) in XML format from the NCBI Taxonomy database. I tried using NCBI eutils; however, that search does not return the children of the nodes. Is there an easy way of obtaining such a file?


6.9 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum129k wrote:

As far as I know, there is no such tree. You can build the whole tree yourself from the NCBI Taxonomy dump. See, for example, "Last Common Ancestor from NCBI Taxonomy using Java".
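A minimal Python sketch of the "build it from the dump" idea (not the Java code referred to above). It assumes the `nodes.dmp` file from the NCBI taxdump archive, where each line holds `tax_id`, parent `tax_id` and rank separated by `|`. The function names `parse_nodes` and `descendants` are made up for this illustration.

```python
from collections import defaultdict, deque


def parse_nodes(lines):
    """Map each parent tax_id to the list of its direct child tax_ids."""
    children = defaultdict(list)
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 3 or not fields[0]:
            continue  # skip blank or malformed lines
        tax_id, parent_id = fields[0], fields[1]
        if tax_id != parent_id:  # the root node lists itself as parent
            children[parent_id].append(tax_id)
    return children


def descendants(children, root):
    """Return every tax_id below root, in breadth-first order."""
    found, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            found.append(child)
            queue.append(child)
    return found
```

With the real dump you would call `parse_nodes(open("nodes.dmp"))` and then `descendants(children, "<tax_id of your phylum>")`. The resulting IDs can then be written out in whatever XML layout you need.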

6.7 years ago by
Maastricht, the Netherlands
Andra Waagmeester3.2k wrote:

You could use the BioPortal SPARQL endpoint to obtain the children. The following SPARQL query returns the children, grandchildren and great-grandchildren of a node. You need to adapt the query to the maximum depth of the tree under scrutiny.

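PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
# Put the URI of the parent taxon class between the empty angle brackets.
SELECT * WHERE {
    ?child rdfs:subClassOf <> .
    ?child rdfs:label ?childLabel .
    OPTIONAL {
        ?grandchild rdfs:subClassOf ?child .
        ?grandchild rdfs:label ?grandchildLabel .
        OPTIONAL {
            ?greatgrandchild rdfs:subClassOf ?grandchild .
            ?greatgrandchild rdfs:label ?greatgrandchildLabel .
        }
    }
}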
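Since the nesting has to match the depth of the tree, writing the query by hand gets tedious. Here is a rough Python sketch that generates the nested OPTIONAL blocks for any depth. The helper name `build_query`, the `?levelN` variable names and the example root URI are all assumptions made for illustration, not something BioPortal prescribes.

```python
def build_query(root_uri, depth):
    """Build a SPARQL query that follows rdfs:subClassOf `depth` levels down."""
    lines = [
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
        "SELECT * WHERE {",
    ]
    parent = f"<{root_uri}>"
    for level in range(1, depth + 1):
        pad = "  " * level
        node = f"?level{level}"
        if level > 1:
            # every level below the first is optional, as in the query above
            lines.append(f"{pad}OPTIONAL {{")
        lines.append(f"{pad}  {node} rdfs:subClassOf {parent} .")
        lines.append(f"{pad}  {node} rdfs:label {node}Label .")
        parent = node
    # close the OPTIONAL blocks from the innermost outwards
    for level in range(depth, 1, -1):
        lines.append("  " * level + "}")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # hypothetical root URI, replace with the class URI of your phylum
    print(build_query("http://example.org/taxonomy/PHYLUM", 3))
```

A depth of 3 gives the same shape as the hand-written query: children, grandchildren and great-grandchildren.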

The SPARQL query above returns only the URI and label of each node. To obtain the full tree, you could combine a simpler SPARQL query with eutils:

The query:

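PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
# Put the URI of the parent taxon class between the empty angle brackets.
SELECT * WHERE {
    ?child rdfs:subClassOf <> .
    ?child skos:notation ?taxonid .
}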

If you run this query in your preferred browser and copy the resulting URL, you can use that URL to iterate over the different subclasses. To submit SPARQL queries programmatically, you first need to get an API key.
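Instead of copying the URL from the browser, the same URL-encoding can be done in code. A small Python sketch follows; the endpoint address used here is a placeholder (substitute the real SPARQL endpoint), `build_request_url` is a made-up helper name, and I am assuming the endpoint takes the query in a `query` parameter next to `apikey`.

```python
from urllib.parse import urlencode


def build_request_url(endpoint, query, apikey):
    """Return the endpoint URL with the SPARQL query and API key URL-encoded."""
    return endpoint + "?" + urlencode({"query": query, "apikey": apikey})


if __name__ == "__main__":
    example_query = "SELECT * WHERE { ?child ?p ?o . }"
    # placeholder endpoint and key, both hypothetical
    print(build_request_url("https://example.org/sparql", example_query, "MY_KEY"))
```

The printed URL can be passed to curl or to any HTTP client.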

The following pipeline would get what you want:

  1. Get the children. You need to provide the API key and the taxonomy ID in the URL (in curly brackets and capitals):

    curl "{$TAXONID}%3E++.%0D%0A++++++++%3Fchild+skos%3Anotation+%3Ftaxonid+.%0D%0A%7D%0D%0A++++++++&apikey={YOUR API KEY HERE}"
  2. Extract the taxon ID of each child

  3. Use eutils to get the XML of that child

  4. Repeat from step 1 until the full tree is processed.
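The four steps could be wired together roughly as below. This is only a sketch under assumptions: the SPARQL endpoint is asked for JSON in the standard SPARQL results layout, the `taxonid` variable comes from the query above, and `get_children` is a function you supply that performs step 1 for one taxon ID. The helper names are invented for this example. The efetch URL is the regular NCBI E-utilities one with `db=taxonomy`.

```python
from collections import deque
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def extract_taxon_ids(sparql_json):
    """Step 2: collect the ?taxonid values from a SPARQL JSON result."""
    bindings = sparql_json["results"]["bindings"]
    return [row["taxonid"]["value"] for row in bindings if "taxonid" in row]


def efetch_url(taxon_id):
    """Step 3: URL of the Taxonomy XML record for one taxon ID."""
    params = {"db": "taxonomy", "id": taxon_id, "retmode": "xml"}
    return EFETCH + "?" + urlencode(params)


def walk_tree(root_id, get_children):
    """Steps 1 and 4: visit the root and everything below it, breadth first.

    get_children(taxon_id) must return the child taxon IDs of one node,
    for example by running the SPARQL query and calling extract_taxon_ids.
    """
    seen, queue, order = {root_id}, deque([root_id]), []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in get_children(node):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


if __name__ == "__main__":
    # toy child lookup standing in for the SPARQL call
    toy = {"10": ["20"], "20": ["30"]}
    for taxon_id in walk_tree("10", lambda t: toy.get(t, [])):
        print(efetch_url(taxon_id))
```

For a large phylum this means one SPARQL call and one efetch call per node, so pause between requests to stay within NCBI's usage limits.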

