Looking For A Command Line Version Of Ncbi Taxonomy 'Stratification' Tool
4
8
Entering edit mode
9.7 years ago
Daniel ★ 3.8k

Hi all

I am building a pipeline for taxonomically identifying a blast result at each taxa level, with a bespoke reference dataset. I want it to be easily reproduceable, which it is, other than an ugly step where I have to take a list of taxonIDs and put it into the site:

http://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi

retrieve the output and carry on.

Does anyone know if I can get the code that is used? Clearly its calling a cgi from somewhere but I cant find it in the ncbi ftp. If not, is there an equivalent? Technically, I could pull apart the names.dmp and nodes.dmp but there is already a ncbi tool so I'm loathed to do that.

Thanks

N.B. The function is turning:

515482
515474


into

515482  |       Nitzschia dubiiformis   |       515482 2857 33852 33851 33850 33849 2836 33634 2759 131567
515474  |       Cocconeis stauroneiformis       |       515474 216715 216714 186023 33850 33849 2836 33634 2759 131567


EDIT: For the record, I have ~30,000 taxIDs that I am retrieving and gawbul's script, while elegant, won't complete. Im looking at Frederic's now for its multi-coring ability.

ncbi tool pipeline blast taxonomy • 8.5k views
8
Entering edit mode
9.7 years ago

You can use BioPython to access any of NCBI's Entrez databases, specifically the taxonomy database in your case - http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc116

Or you can do the same sort of thing with BioPerl's Bio::DB::Taxonomy package - http://doc.bioperl.org/bioperl-live/Bio/DB/Taxonomy.html and Bio::Tree::Tree package - http://doc.bioperl.org/bioperl-live/Bio/Tree/Tree.html

Perl example:

use strict;
use warnings;
use Bio::DB::Taxonomy;
use Bio::Tree::Tree;

my @taxonids = ("515482", "515474");
my @lineages = ();

my $db = Bio::DB::Taxonomy->new(-source => 'entrez'); foreach my$taxonid (@taxonids) {
my $taxon =$db->get_taxon(-taxonid => $taxonid); my$tree = Bio::Tree::Tree->new(-node => $taxon); my @taxa =$tree->get_nodes;
my @tids = ();
foreach my $t (@taxa) { unshift(@tids,$t->id());
}
push(@lineages, $taxonid . "\t|\t" .$taxon->scientific_name() . "\t|\t" . "@tids");
}

foreach my $lineage (@lineages) { print "$lineage\n";
}


Outputs:

515482  |       Nitzschia dubiiformis   |       515482 2857 33852 33851 33850 33849 2836 33634 2759 131567
515474  |       Cocconeis stauroneiformis       |       515474 216715 216714 186023 33850 33849 2836 33634 2759 131567


Python example:

#import entrez module
from Bio import Entrez

# set variables
taxids = [515482, 515474]

# set email
Entrez.email = "youremail@gmail.com"

# traverse ids
for taxid in taxids:
handle = Entrez.efetch(db="taxonomy", id=taxid, mode="text", rettype="xml")
for taxon in records:
taxid = taxon["TaxId"]
name = taxon["ScientificName"]
tids = []
for t in taxon["LineageEx"]:
tids.insert(0, t["TaxId"])
tids.insert(0, taxid)
print "%s\t|\t%s\t|\t%s" % (taxid, name, " ".join(tids))


Outputs:

515482  |       Nitzschia dubiiformis   |       515482 2857 33852 33851 33850 33849 2836 33634 2759 131567
515474  |       Cocconeis stauroneiformis       |       515474 216715 216714 186023 33850 33849 2836 33634 2759 131567


These essentially just call http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi, which you could call and parse yourself? Perhaps iterate over the ids in the list, call the url for each and output as you need?

Calling the following for example:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=515482&mode=text&report=xml

Give us the following XML output:

Which you can then parse the lineage tax ids from quite simply. Pierre gives a great example in his post, using an XSLT stylesheet.

2
Entering edit mode

It's easy enough to amend these scripts to take the tax ids as input arguments from the command line, to fit into your pipeline nicely. In fact, I've added the updated forms of these scripts to my bitbucket with the names tax_identifier.pl/py - https://bitbucket.org/gawbul/bioinformatics-scripts/src

0
Entering edit mode

You can easily get these scripts to pull a list of tax ids from the command line as arguments and process them in the same way!

0
Entering edit mode

It's easy enough to amend these scripts to take the tax ids as input arguments. In fact, I've added this scripts to my bitbucket repository with the names tax_identifier.pl/py https://bitbucket.org/gawbul/bioinformatics-scripts/src

0
Entering edit mode

in python script,line 12 need to surround ids in quotes: taxids = ["515482", "515474"]

otherwise was getting this error:

 File "taxa.py", line 12, in <module>
handle = Entrez.efetch(db="taxonomy", id=taxid, mode="text", rettype="xml")
File "/usr/lib/python2.7/site-packages/biopython-1.64-py2.7-linux-x86_64.egg/Bio/Entrez/__init__.py", line 145, in efetch
if ids.count(",") >= 200:
AttributeError: 'int' object has no attribute 'count'

0
Entering edit mode

Which Python script? In my https://github.com/gawbul/bioinformatics-scripts repository? Could you submit an issue request on GitHub please?

2
Entering edit mode
9.7 years ago

A few weeks ago, in response to a similar question, I posted two simple shell scripts: one that downloads an prepares the NCBI's taxonomic data, and one that takes a list of GIs and returns their complete taxonomy. As it runs locally, I find it easier to parallelize than an efetch-based script.

0
Entering edit mode

I get 404 error with that link.

0
Entering edit mode
9.7 years ago

The following XSLT stylesheet does the job:


<xsl:stylesheet xmlns:xsl="&lt;a href=" http:="" www.w3.org="" 1999="" XSL="" Transform"="" rel="nofollow">http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:for-each select="TaxaSet/Taxon">
<xsl:value-of select="TaxId"/>
<xsl:text>  |  </xsl:text>
<xsl:apply-templates select="ScientificName"/>
<xsl:text>  |  </xsl:text>
<xsl:value-of select="TaxId"/>
<xsl:for-each select="LineageEx/Taxon">
<xsl:sort select="position()" data-type="number" order="descending"/>
<xsl:text> </xsl:text>
<xsl:value-of select="TaxId"/>
</xsl:for-each>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>


usage:

xsltproc --novalid  stylesheet.xsl \
"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=515482,515474&mode=text&report=xml"


Result:

515482  |  Nitzschia dubiiformis  |  515482 2857 33852 33851 33850 33849 2836 33634 2759 131567
515474  |  Cocconeis stauroneiformis  |  515474 216715 216714 186023 33850 33849 2836 33634 2759 13156


and for a large number of IDS:

while read line
do
xsltproc --novalid  stylesheet.xsl \
"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=\${line}&mode=text&report=xml"
done < mylistof_ids.txt

0
Entering edit mode
3.1 years ago
gtrwst9 • 0

http://etetoolkit.org/docs/latest/index.html