Looking For A Command Line Version Of Ncbi Taxonomy 'Stratification' Tool

8

Entering edit mode

13.5 years ago

Daniel ★ 4.0k

Hi all

I am building a pipeline for taxonomically identifying a blast result at each taxa level, with a bespoke reference dataset. I want it to be easily reproduceable, which it is, other than an ugly step where I have to take a list of taxonIDs and put it into the site: http://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi

retrieve the output and carry on.

Does anyone know if I can get the code that is used? Clearly its calling a cgi from somewhere but I cant find it in the ncbi ftp. If not, is there an equivalent? Technically, I could pull apart the names.dmp and nodes.dmp but there is already a ncbi tool so I'm loathed to do that.

Thanks

N.B. The function is turning:

515482
515474

into

515482  |       Nitzschia dubiiformis   |       515482 2857 33852 33851 33850 33849 2836 33634 2759 131567
515474  |       Cocconeis stauroneiformis       |       515474 216715 216714 186023 33850 33849 2836 33634 2759 131567

EDIT: For the record, I have ~30,000 taxIDs that I am retrieving and gawbul's script, while elegant, won't complete. Im looking at Frederic's now for its multi-coring ability.

taxonomy blast pipeline ncbi • 13k views

ADD COMMENT • link updated 2.0 years ago by Ram 45k • written 13.5 years ago by Daniel ★ 4.0k

8

Entering edit mode

13.5 years ago

Steve Moss 2.3k

You can use BioPython to access any of NCBI's Entrez databases, specifically the taxonomy database in your case - http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc116

Or you can do the same sort of thing with BioPerl's Bio::DB::Taxonomy package - http://doc.bioperl.org/bioperl-live/Bio/DB/Taxonomy.html and Bio::Tree::Tree package - http://doc.bioperl.org/bioperl-live/Bio/Tree/Tree.html

Perl example:

use strict;
use warnings;
use Bio::DB::Taxonomy;
use Bio::Tree::Tree;

my @taxonids = ("515482", "515474");
my @lineages = ();

my $db = Bio::DB::Taxonomy->new(-source => 'entrez');

foreach my $taxonid (@taxonids) {
    my $taxon = $db->get_taxon(-taxonid => $taxonid);
    my $tree = Bio::Tree::Tree->new(-node => $taxon);
    my @taxa = $tree->get_nodes;
    my @tids = ();
    foreach my $t (@taxa) {
        unshift(@tids, $t->id());
    }
    push(@lineages, $taxonid . "\t|\t" . $taxon->scientific_name() . "\t|\t" . "@tids");
}

foreach my $lineage (@lineages) {
    print "$lineage\n";
}

Outputs:

515482  |       Nitzschia dubiiformis   |       515482 2857 33852 33851 33850 33849 2836 33634 2759 131567
515474  |       Cocconeis stauroneiformis       |       515474 216715 216714 186023 33850 33849 2836 33634 2759 131567

Python example:

#import entrez module
from Bio import Entrez

# set variables
taxids = [515482, 515474]

# set email
Entrez.email = "youremail@gmail.com"

# traverse ids
for taxid in taxids:
    handle = Entrez.efetch(db="taxonomy", id=taxid, mode="text", rettype="xml")
    records = Entrez.read(handle)
    for taxon in records:
        taxid = taxon["TaxId"]
        name = taxon["ScientificName"]
        tids = []
        for t in taxon["LineageEx"]:
            tids.insert(0, t["TaxId"])
        tids.insert(0, taxid)
        print "%s\t|\t%s\t|\t%s" % (taxid, name, " ".join(tids))

Outputs:

515482  |       Nitzschia dubiiformis   |       515482 2857 33852 33851 33850 33849 2836 33634 2759 131567
515474  |       Cocconeis stauroneiformis       |       515474 216715 216714 186023 33850 33849 2836 33634 2759 131567

These essentially just call http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi, which you could call and parse yourself? Perhaps iterate over the ids in the list, call the url for each and output as you need?

Calling the following for example:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=515482&mode=text&report=xml

Give us the following XML output:

	<?xml version="1.0"?>
	<!DOCTYPE TaxaSet PUBLIC "-//NLM//DTD Taxon, 14th January 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/taxon.dtd">
	<TaxaSet>

	<Taxon>
	<TaxId>515482</TaxId>
	<ScientificName>Nitzschia dubiiformis</ScientificName>
	<ParentTaxId>2857</ParentTaxId>
	<Rank>species</Rank>

	<Division>Plants</Division>
	<GeneticCode>
	<GCId>1</GCId>
	<GCName>Standard</GCName>
	</GeneticCode>
	<MitoGeneticCode>
	<MGCId>1</MGCId>

	<MGCName>Standard</MGCName>
	</MitoGeneticCode>
	<Lineage>cellular organisms; Eukaryota; stramenopiles; Bacillariophyta; Bacillariophyceae; Bacillariophycidae; Bacillariales; Bacillariaceae; Nitzschia</Lineage>
	<LineageEx>
	<Taxon>
	<TaxId>131567</TaxId>
	<ScientificName>cellular organisms</ScientificName>

	<Rank>no rank</Rank>
	</Taxon>
	<Taxon>
	<TaxId>2759</TaxId>
	<ScientificName>Eukaryota</ScientificName>
	<Rank>superkingdom</Rank>
	</Taxon>

	<Taxon>
	<TaxId>33634</TaxId>
	<ScientificName>stramenopiles</ScientificName>
	<Rank>no rank</Rank>
	</Taxon>
	<Taxon>
	<TaxId>2836</TaxId>

	<ScientificName>Bacillariophyta</ScientificName>
	<Rank>phylum</Rank>
	</Taxon>
	<Taxon>
	<TaxId>33849</TaxId>
	<ScientificName>Bacillariophyceae</ScientificName>
	<Rank>class</Rank>

	</Taxon>
	<Taxon>
	<TaxId>33850</TaxId>
	<ScientificName>Bacillariophycidae</ScientificName>
	<Rank>no rank</Rank>
	</Taxon>
	<Taxon>

	<TaxId>33851</TaxId>
	<ScientificName>Bacillariales</ScientificName>
	<Rank>no rank</Rank>
	</Taxon>
	<Taxon>
	<TaxId>33852</TaxId>
	<ScientificName>Bacillariaceae</ScientificName>

	<Rank>family</Rank>
	</Taxon>
	<Taxon>
	<TaxId>2857</TaxId>
	<ScientificName>Nitzschia</ScientificName>
	<Rank>genus</Rank>
	</Taxon>

	</LineageEx>
	<CreateDate>2008/03/28</CreateDate>
	<UpdateDate>2008/03/28</UpdateDate>
	<PubDate>2012/01/08</PubDate>
	</Taxon>

	</TaxaSet>

view raw biostars-16263.xml hosted with ❤ by GitHub

Which you can then parse the lineage tax ids from quite simply. Pierre gives a great example in his post, using an XSLT stylesheet.

ADD COMMENT • link updated 5.8 years ago by Ram 45k • written 13.5 years ago by Steve Moss 2.3k

2

Entering edit mode

It's easy enough to amend these scripts to take the tax ids as input arguments from the command line, to fit into your pipeline nicely. In fact, I've added the updated forms of these scripts to my bitbucket with the names tax_identifier.pl/py - https://bitbucket.org/gawbul/bioinformatics-scripts/src

ADD REPLY • link 10.6 years ago by Steve Moss 2.3k

0

Entering edit mode

You can easily get these scripts to pull a list of tax ids from the command line as arguments and process them in the same way!

ADD REPLY • link 13.5 years ago by Steve Moss 2.3k

0

Entering edit mode

It's easy enough to amend these scripts to take the tax ids as input arguments. In fact, I've added this scripts to my bitbucket repository with the names tax_identifier.pl/py https://bitbucket.org/gawbul/bioinformatics-scripts/src

ADD REPLY • link 10.6 years ago by Steve Moss 2.3k

0

Entering edit mode

in python script,line 12 need to surround ids in quotes: taxids = ["515482", "515474"]

otherwise was getting this error:

 File "taxa.py", line 12, in <module>
    handle = Entrez.efetch(db="taxonomy", id=taxid, mode="text", rettype="xml")
  File "/usr/lib/python2.7/site-packages/biopython-1.64-py2.7-linux-x86_64.egg/Bio/Entrez/__init__.py", line 145, in efetch
    if ids.count(",") >= 200:
AttributeError: 'int' object has no attribute 'count'

ADD REPLY • link updated 5.8 years ago by Ram 45k • written 10.6 years ago by hn57 • 0

0

Entering edit mode

Which Python script? In my https://github.com/gawbul/bioinformatics-scripts repository? Could you submit an issue request on GitHub please?

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by Steve Moss 2.3k

2

Entering edit mode

13.5 years ago

Frédéric Mahé ★ 3.3k

A few weeks ago, in response to a similar question, I posted two simple shell scripts: one that downloads an prepares the NCBI's taxonomic data, and one that takes a list of GIs and returns their complete taxonomy. As it runs locally, I find it easier to parallelize than an efetch-based script.

ADD COMMENT • link updated 5.8 years ago by Ram 45k • written 13.5 years ago by Frédéric Mahé ★ 3.3k

0

Entering edit mode

I get 404 error with that link.

ADD REPLY • link 6.9 years ago by gtrwst9 • 0

0

Entering edit mode

13.5 years ago

Pierre Lindenbaum 166k

The following XSLT stylesheet does the job:


<xsl:stylesheet xmlns:xsl="&lt;a href=" http:="" www.w3.org="" 1999="" XSL="" Transform"="" rel="nofollow">http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="TaxaSet/Taxon">
      <xsl:value-of select="TaxId"/>
      <xsl:text>  |  </xsl:text>
      <xsl:apply-templates select="ScientificName"/>
      <xsl:text>  |  </xsl:text>
      <xsl:value-of select="TaxId"/>
      <xsl:for-each select="LineageEx/Taxon">
        <xsl:sort select="position()" data-type="number" order="descending"/>
        <xsl:text> </xsl:text>
        <xsl:value-of select="TaxId"/>
      </xsl:for-each>
      <xsl:text>
</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

usage:

xsltproc --novalid  stylesheet.xsl \
   "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=515482,515474&mode=text&report=xml"

Result:

515482  |  Nitzschia dubiiformis  |  515482 2857 33852 33851 33850 33849 2836 33634 2759 131567
515474  |  Cocconeis stauroneiformis  |  515474 216715 216714 186023 33850 33849 2836 33634 2759 13156

and for a large number of IDS:

while read line
  do
     xsltproc --novalid  stylesheet.xsl \
       "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=${line}&mode=text&report=xml"
  done < mylistof_ids.txt

ADD COMMENT • link 13.5 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

6.9 years ago

gtrwst9 • 0