Question: Looking For A Command Line Version Of Ncbi Taxonomy 'Stratification' Tool
8
gravatar for Daniel
5.9 years ago by
Daniel3.5k
Cardiff University
Daniel3.5k wrote:

Hi all

I am building a pipeline for taxonomically identifying a blast result at each taxa level, with a bespoke reference dataset. I want it to be easily reproduceable, which it is, other than an ugly step where I have to take a list of taxonIDs and put it into the site:

http://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi

retrieve the output and carry on.

Does anyone know if I can get the code that is used? Clearly its calling a cgi from somewhere but I cant find it in the ncbi ftp. If not, is there an equivalent? Technically, I could pull apart the names.dmp and nodes.dmp but there is already a ncbi tool so I'm loathed to do that.

Thanks


N.B. The function is turning:

515482
515474

into

515482  |       Nitzschia dubiiformis   |       515482 2857 33852 33851 33850 33849 2836 33634 2759 131567
515474  |       Cocconeis stauroneiformis       |       515474 216715 216714 186023 33850 33849 2836 33634 2759 131567

EDIT: For the record, I have ~30,000 taxIDs that I am retrieving and gawbul's script, while elegant, won't complete. Im looking at Frederic's now for its multi-coring ability.

ncbi taxonomy pipeline tool blast • 4.5k views
ADD COMMENTlink modified 5.9 years ago by Steve Moss2.2k • written 5.9 years ago by Daniel3.5k
8
gravatar for Steve Moss
5.9 years ago by
Steve Moss2.2k
United Kingdom
Steve Moss2.2k wrote:

You can use BioPython to access any of NCBI's Entrez databases, specifically the taxonomy database in your case - http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc116

Or you can do the same sort of thing with BioPerl's Bio::DB::Taxonomy package - http://doc.bioperl.org/bioperl-live/Bio/DB/Taxonomy.html and Bio::Tree::Tree package - http://doc.bioperl.org/bioperl-live/Bio/Tree/Tree.html

Perl example:

use strict;
use warnings;
use Bio::DB::Taxonomy;
use Bio::Tree::Tree;

my @taxonids = ("515482", "515474");
my @lineages = ();

my $db = Bio::DB::Taxonomy->new(-source => 'entrez');

foreach my $taxonid (@taxonids) {
    my $taxon = $db->get_taxon(-taxonid => $taxonid);
    my $tree = Bio::Tree::Tree->new(-node => $taxon);
    my @taxa = $tree->get_nodes;
    my @tids = ();
    foreach my $t (@taxa) {
        unshift(@tids, $t->id());
    }
    push(@lineages, $taxonid . "\t|\t" . $taxon->scientific_name() . "\t|\t" . "@tids");
}

foreach my $lineage (@lineages) {
    print "$lineage\n";
}

Outputs:

515482  |       Nitzschia dubiiformis   |       515482 2857 33852 33851 33850 33849 2836 33634 2759 131567
515474  |       Cocconeis stauroneiformis       |       515474 216715 216714 186023 33850 33849 2836 33634 2759 131567

Python example:

#import entrez module
from Bio import Entrez

# set variables
taxids = [515482, 515474]

# set email
Entrez.email = "youremail@gmail.com"

# traverse ids
for taxid in taxids:
    handle = Entrez.efetch(db="taxonomy", id=taxid, mode="text", rettype="xml")
    records = Entrez.read(handle)
    for taxon in records:
        taxid = taxon["TaxId"]
        name = taxon["ScientificName"]
        tids = []
        for t in taxon["LineageEx"]:
            tids.insert(0, t["TaxId"])
        tids.insert(0, taxid)
        print "%s\t|\t%s\t|\t%s" % (taxid, name, " ".join(tids))

Outputs:

515482  |       Nitzschia dubiiformis   |       515482 2857 33852 33851 33850 33849 2836 33634 2759 131567
515474  |       Cocconeis stauroneiformis       |       515474 216715 216714 186023 33850 33849 2836 33634 2759 131567

These essentially just call http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi, which you could call and parse yourself? Perhaps iterate over the ids in the list, call the url for each and output as you need?

Calling the following for example:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=515482&mode=text&report=xml

Give us the following XML output:


http://www.ncbi.nlm.nih.gov/entrez/query/DTD/taxon.dtd">
<TaxaSet>

<Taxon>
  <TaxId>515482</TaxId>
  <ScientificName>Nitzschia dubiiformis</ScientificName>
  <ParentTaxId>2857</ParentTaxId>
  <Rank>species</Rank>

  <Division>Plants</Division>
  <GeneticCode>
    <GCId>1</GCId>
    <GCName>Standard</GCName>
  </GeneticCode>
  <MitoGeneticCode>
    <MGCId>1</MGCId>

    <MGCName>Standard</MGCName>
  </MitoGeneticCode>
  <Lineage>cellular organisms; Eukaryota; stramenopiles; Bacillariophyta; Bacillariophyceae; Bacillariophycidae; Bacillariales; Bacillariaceae; Nitzschia</Lineage>
  <LineageEx>
    <Taxon>
      <TaxId>131567</TaxId>
      <ScientificName>cellular organisms</ScientificName>

      <Rank>no rank</Rank>
    </Taxon>
    <Taxon>
      <TaxId>2759</TaxId>
      <ScientificName>Eukaryota</ScientificName>
      <Rank>superkingdom</Rank>
    </Taxon>

    <Taxon>
      <TaxId>33634</TaxId>
      <ScientificName>stramenopiles</ScientificName>
      <Rank>no rank</Rank>
    </Taxon>
    <Taxon>
      <TaxId>2836</TaxId>

      <ScientificName>Bacillariophyta</ScientificName>
      <Rank>phylum</Rank>
    </Taxon>
    <Taxon>
      <TaxId>33849</TaxId>
      <ScientificName>Bacillariophyceae</ScientificName>
      <Rank>class</Rank>

    </Taxon>
    <Taxon>
      <TaxId>33850</TaxId>
      <ScientificName>Bacillariophycidae</ScientificName>
      <Rank>no rank</Rank>
    </Taxon>
    <Taxon>

      <TaxId>33851</TaxId>
      <ScientificName>Bacillariales</ScientificName>
      <Rank>no rank</Rank>
    </Taxon>
    <Taxon>
      <TaxId>33852</TaxId>
      <ScientificName>Bacillariaceae</ScientificName>

      <Rank>family</Rank>
    </Taxon>
    <Taxon>
      <TaxId>2857</TaxId>
      <ScientificName>Nitzschia</ScientificName>
      <Rank>genus</Rank>
    </Taxon>

  </LineageEx>
  <CreateDate>2008/03/28</CreateDate>
  <UpdateDate>2008/03/28</UpdateDate>
  <PubDate>2012/01/08</PubDate>
</Taxon>

</TaxaSet>

Which you can then parse the lineage tax ids from quite simply. Pierre gives a great example in his post, using an XSLT stylesheet.

ADD COMMENTlink modified 5.9 years ago • written 5.9 years ago by Steve Moss2.2k
2

It's easy enough to amend these scripts to take the tax ids as input arguments from the command line, to fit into your pipeline nicely. In fact, I've added the updated forms of these scripts to my bitbucket with the names tax_identifier.pl/py - https://bitbucket.org/gawbul/bioinformatics-scripts/src

ADD REPLYlink modified 3.1 years ago • written 5.9 years ago by Steve Moss2.2k

You can easily get these scripts to pull a list of tax ids from the command line as arguments and process them in the same way!

ADD REPLYlink written 5.9 years ago by Steve Moss2.2k

It's easy enough to amend these scripts to take the tax ids as input arguments. In fact, I've added this scripts to my bitbucket repository with the names tax_identifier.pl/py https://bitbucket.org/gawbul/bioinformatics-scripts/src

ADD REPLYlink modified 3.1 years ago • written 5.9 years ago by Steve Moss2.2k

in python script,line 12 need to surround ids in quotes: taxids = ["515482", "515474"]

otherwise was getting this error:

 File "taxa.py", line 12, in <module>
    handle = Entrez.efetch(db="taxonomy", id=taxid, mode="text", rettype="xml")
  File "/usr/lib/python2.7/site-packages/biopython-1.64-py2.7-linux-x86_64.egg/Bio/Entrez/__init__.py", line 145, in efetch
    if ids.count(",") >= 200:
AttributeError: 'int' object has no attribute 'count'

ADD REPLYlink written 3.1 years ago by hn570

Which Python script? In my https://github.com/gawbul/bioinformatics-scripts repository? Could you submit an issue request on GitHub please?

ADD REPLYlink written 3.1 years ago by Steve Moss2.2k
2
gravatar for Frédéric Mahé
5.9 years ago by
Kaiserslautern, Germany
Frédéric Mahé2.7k wrote:

A few weeks ago, in response to a similar question, I posted two simple shell scripts: one that downloads an prepares the NCBI's taxonomic data, and one that takes a list of GIs and returns their complete taxonomy. As it runs locally, I find it easier to parallelize than an efetch-based script.

ADD COMMENTlink written 5.9 years ago by Frédéric Mahé2.7k
0
gravatar for Pierre Lindenbaum
5.9 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum102k wrote:

The following XSLT stylesheet does the job:


<xsl:stylesheet xmlns:xsl="&lt;a href=" http:="" www.w3.org="" 1999="" XSL="" Transform"="" rel="nofollow">http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="TaxaSet/Taxon">
      <xsl:value-of select="TaxId"/>
      <xsl:text>  |  </xsl:text>
      <xsl:apply-templates select="ScientificName"/>
      <xsl:text>  |  </xsl:text>
      <xsl:value-of select="TaxId"/>
      <xsl:for-each select="LineageEx/Taxon">
        <xsl:sort select="position()" data-type="number" order="descending"/>
        <xsl:text> </xsl:text>
        <xsl:value-of select="TaxId"/>
      </xsl:for-each>
      <xsl:text>
</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

usage:

xsltproc --novalid  stylesheet.xsl \
   "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=515482,515474&mode=text&report=xml"

Result:

515482  |  Nitzschia dubiiformis  |  515482 2857 33852 33851 33850 33849 2836 33634 2759 131567
515474  |  Cocconeis stauroneiformis  |  515474 216715 216714 186023 33850 33849 2836 33634 2759 13156

and for a large number of IDS:

while read line
  do
     xsltproc --novalid  stylesheet.xsl \
       "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=${line}&mode=text&report=xml"
  done < mylistof_ids.txt
ADD COMMENTlink modified 5.9 years ago • written 5.9 years ago by Pierre Lindenbaum102k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1509 users visited in the last hour