Looking For A Command Line Version Of Ncbi Taxonomy 'Stratification' Tool
4
8
Entering edit mode
9.7 years ago
Daniel ★ 3.8k

Hi all

I am building a pipeline for taxonomically identifying a blast result at each taxa level, with a bespoke reference dataset. I want it to be easily reproduceable, which it is, other than an ugly step where I have to take a list of taxonIDs and put it into the site:

http://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi

retrieve the output and carry on.

Does anyone know if I can get the code that is used? Clearly its calling a cgi from somewhere but I cant find it in the ncbi ftp. If not, is there an equivalent? Technically, I could pull apart the names.dmp and nodes.dmp but there is already a ncbi tool so I'm loathed to do that.

Thanks


N.B. The function is turning:

515482
515474

into

515482  |       Nitzschia dubiiformis   |       515482 2857 33852 33851 33850 33849 2836 33634 2759 131567
515474  |       Cocconeis stauroneiformis       |       515474 216715 216714 186023 33850 33849 2836 33634 2759 131567

EDIT: For the record, I have ~30,000 taxIDs that I am retrieving and gawbul's script, while elegant, won't complete. Im looking at Frederic's now for its multi-coring ability.

ncbi tool pipeline blast taxonomy • 8.5k views
ADD COMMENT
8
Entering edit mode
9.7 years ago

You can use BioPython to access any of NCBI's Entrez databases, specifically the taxonomy database in your case - http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc116

Or you can do the same sort of thing with BioPerl's Bio::DB::Taxonomy package - http://doc.bioperl.org/bioperl-live/Bio/DB/Taxonomy.html and Bio::Tree::Tree package - http://doc.bioperl.org/bioperl-live/Bio/Tree/Tree.html

Perl example:

use strict;
use warnings;
use Bio::DB::Taxonomy;
use Bio::Tree::Tree;

my @taxonids = ("515482", "515474");
my @lineages = ();

my $db = Bio::DB::Taxonomy->new(-source => 'entrez');

foreach my $taxonid (@taxonids) {
    my $taxon = $db->get_taxon(-taxonid => $taxonid);
    my $tree = Bio::Tree::Tree->new(-node => $taxon);
    my @taxa = $tree->get_nodes;
    my @tids = ();
    foreach my $t (@taxa) {
        unshift(@tids, $t->id());
    }
    push(@lineages, $taxonid . "\t|\t" . $taxon->scientific_name() . "\t|\t" . "@tids");
}

foreach my $lineage (@lineages) {
    print "$lineage\n";
}

Outputs:

515482  |       Nitzschia dubiiformis   |       515482 2857 33852 33851 33850 33849 2836 33634 2759 131567
515474  |       Cocconeis stauroneiformis       |       515474 216715 216714 186023 33850 33849 2836 33634 2759 131567

Python example:

#import entrez module
from Bio import Entrez

# set variables
taxids = [515482, 515474]

# set email
Entrez.email = "youremail@gmail.com"

# traverse ids
for taxid in taxids:
    handle = Entrez.efetch(db="taxonomy", id=taxid, mode="text", rettype="xml")
    records = Entrez.read(handle)
    for taxon in records:
        taxid = taxon["TaxId"]
        name = taxon["ScientificName"]
        tids = []
        for t in taxon["LineageEx"]:
            tids.insert(0, t["TaxId"])
        tids.insert(0, taxid)
        print "%s\t|\t%s\t|\t%s" % (taxid, name, " ".join(tids))

Outputs:

515482  |       Nitzschia dubiiformis   |       515482 2857 33852 33851 33850 33849 2836 33634 2759 131567
515474  |       Cocconeis stauroneiformis       |       515474 216715 216714 186023 33850 33849 2836 33634 2759 131567

These essentially just call http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi, which you could call and parse yourself? Perhaps iterate over the ids in the list, call the url for each and output as you need?

Calling the following for example:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=515482&mode=text&report=xml

Give us the following XML output:

Which you can then parse the lineage tax ids from quite simply. Pierre gives a great example in his post, using an XSLT stylesheet.

ADD COMMENT
2
Entering edit mode

It's easy enough to amend these scripts to take the tax ids as input arguments from the command line, to fit into your pipeline nicely. In fact, I've added the updated forms of these scripts to my bitbucket with the names tax_identifier.pl/py - https://bitbucket.org/gawbul/bioinformatics-scripts/src

ADD REPLY
0
Entering edit mode

You can easily get these scripts to pull a list of tax ids from the command line as arguments and process them in the same way!

ADD REPLY
0
Entering edit mode

It's easy enough to amend these scripts to take the tax ids as input arguments. In fact, I've added this scripts to my bitbucket repository with the names tax_identifier.pl/py https://bitbucket.org/gawbul/bioinformatics-scripts/src

ADD REPLY
0
Entering edit mode

in python script,line 12 need to surround ids in quotes: taxids = ["515482", "515474"]

otherwise was getting this error:

 File "taxa.py", line 12, in <module>
    handle = Entrez.efetch(db="taxonomy", id=taxid, mode="text", rettype="xml")
  File "/usr/lib/python2.7/site-packages/biopython-1.64-py2.7-linux-x86_64.egg/Bio/Entrez/__init__.py", line 145, in efetch
    if ids.count(",") >= 200:
AttributeError: 'int' object has no attribute 'count'
ADD REPLY
0
Entering edit mode

Which Python script? In my https://github.com/gawbul/bioinformatics-scripts repository? Could you submit an issue request on GitHub please?

ADD REPLY
2
Entering edit mode
9.7 years ago

A few weeks ago, in response to a similar question, I posted two simple shell scripts: one that downloads an prepares the NCBI's taxonomic data, and one that takes a list of GIs and returns their complete taxonomy. As it runs locally, I find it easier to parallelize than an efetch-based script.

ADD COMMENT
0
Entering edit mode

I get 404 error with that link.

ADD REPLY
0
Entering edit mode
9.7 years ago

The following XSLT stylesheet does the job:


<xsl:stylesheet xmlns:xsl="&lt;a href=" http:="" www.w3.org="" 1999="" XSL="" Transform"="" rel="nofollow">http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="TaxaSet/Taxon">
      <xsl:value-of select="TaxId"/>
      <xsl:text>  |  </xsl:text>
      <xsl:apply-templates select="ScientificName"/>
      <xsl:text>  |  </xsl:text>
      <xsl:value-of select="TaxId"/>
      <xsl:for-each select="LineageEx/Taxon">
        <xsl:sort select="position()" data-type="number" order="descending"/>
        <xsl:text> </xsl:text>
        <xsl:value-of select="TaxId"/>
      </xsl:for-each>
      <xsl:text>
</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

usage:

xsltproc --novalid  stylesheet.xsl \
   "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=515482,515474&mode=text&report=xml"

Result:

515482  |  Nitzschia dubiiformis  |  515482 2857 33852 33851 33850 33849 2836 33634 2759 131567
515474  |  Cocconeis stauroneiformis  |  515474 216715 216714 186023 33850 33849 2836 33634 2759 13156

and for a large number of IDS:

while read line
  do
     xsltproc --novalid  stylesheet.xsl \
       "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=${line}&mode=text&report=xml"
  done < mylistof_ids.txt
ADD COMMENT
0
Entering edit mode
3.1 years ago
gtrwst9 • 0

See also the ETE Toolkit:

http://etetoolkit.org/docs/latest/index.html

ADD COMMENT

Login before adding your answer.

Traffic: 1727 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6