Does a script exist that, given a species name, will give you the kingdom the input belongs to?
11.3 years ago
tyler.weirick ▴ 120

Does a script exist that, given a species name, will give you the kingdom the input belongs to? I am doing data analysis on very large sets of blast results and something like this would be a huge help. I am thinking about writing one myself, but if something like this exists I would rather not reinvent the wheel. I am unable to find anything on Google searching with terms like phylogeny, script taxonomy, etc. Has anyone heard of a program or package with this functionality?

taxonomy phylogeny script
11.3 years ago

From an NCBI taxon id (e.g. 9606 = human):

curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=9606&retmode=xml&rettype=full" |\
xmllint --xpath "/TaxaSet/Taxon/LineageEx/Taxon[Rank='kingdom']/ScientificName/text()" -


and from a taxon name ("Homo sapiens"):

$ (xmllint --xpath "/eSearchResult/IdList/Id/text()" "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=taxonomy&term=Homo+Sapiens[SCIN]" && echo) | while read L; do xmllint --xpath "/TaxaSet/Taxon/LineageEx/Taxon[Rank='kingdom']/ScientificName/text()" "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=${L}&retmode=xml&rettype=full" ; done
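For anyone who prefers to do this without curl/xmllint, here is a rough Python-stdlib sketch of the same idea. The efetch URL pattern mirrors the one above; the trimmed-down XML record in the demo is made up for illustration, and the live-fetch helper is only a sketch (NCBI's usage limits still apply):

```python
import urllib.request
import xml.etree.ElementTree as ET

# Same EUtilities endpoint the curl command above hits
EFETCH = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
          "?db=taxonomy&retmode=xml&rettype=full&id={taxid}")

def kingdom_from_taxonomy_xml(xml_text):
    """Same XPath idea as the xmllint call: find the lineage node whose
    Rank is 'kingdom' and return its scientific name."""
    root = ET.fromstring(xml_text)
    node = root.find("./Taxon/LineageEx/Taxon[Rank='kingdom']/ScientificName")
    return node.text if node is not None else None

def kingdom_for_taxid(taxid):
    """Fetch the taxonomy record from NCBI and extract the kingdom (network)."""
    with urllib.request.urlopen(EFETCH.format(taxid=taxid)) as handle:
        return kingdom_from_taxonomy_xml(handle.read())

# Offline demo on a trimmed-down taxonomy record for human:
SAMPLE = """<TaxaSet><Taxon><TaxId>9606</TaxId>
  <LineageEx>
    <Taxon><TaxId>2759</TaxId><ScientificName>Eukaryota</ScientificName><Rank>superkingdom</Rank></Taxon>
    <Taxon><TaxId>33208</TaxId><ScientificName>Metazoa</ScientificName><Rank>kingdom</Rank></Taxon>
  </LineageEx>
</Taxon></TaxaSet>"""
print(kingdom_from_taxonomy_xml(SAMPLE))  # Metazoa
```

The `Taxon[Rank='kingdom']` predicate needs Python 3.7+ (ElementTree's limited XPath support).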

11.3 years ago
SES 8.6k

I'll mention first that there are already tools for doing phylogenetic classifications of sequences, for example MEGAN (I haven't used this tool personally though). If something like that won't work, it's quite easy to roll your own solution using BioPerl. Here's an example scenario: parse your blast report using BioPerl's Bio::SearchIO, then use the species information in your hits to look up the taxonomic information at NCBI.

#!/usr/bin/env perl

use strict;
use warnings;
#use Bio::SearchIO;         # for parsing blast, which we aren't doing here
use Bio::DB::Taxonomy;      # for accessing NCBI's entrez Taxonomy database

## plug in some awesome code to parse your blast report here

my $db = Bio::DB::Taxonomy->new(-source => 'entrez');
my $taxonid = $db->get_taxonid('Homo sapiens');  ## you could get this from your blast report, or an ID of some form...
my $taxon = $db->get_taxon(-taxonid => $taxonid);

print "Taxon ID is ", $taxon->id, "\n";
print "Scientific name is ", $taxon->scientific_name, "\n";
print "Rank is ", $taxon->rank, "\n";
print "Division is ", $taxon->division, "\n";

if (defined $taxonid) { # is your species in the database?
    my $node = $db->get_Taxonomy_Node(-taxonid => $taxonid);
    my $kingdom = $node;
    for (1..25) {  # bounded walk up the lineage
        $kingdom = $db->get_Taxonomy_Node(-taxonid => $kingdom->parent_id);
        last if $kingdom->rank eq 'kingdom';
    }
    print "Kingdom is ", $kingdom->scientific_name, "\n";
}

Save this to a file and execute it with perl. It will output:

Taxon ID is 9606
Scientific name is Homo sapiens
Rank is species
Division is Primates
Kingdom is Metazoa

Note that this is just an example, and I probably wouldn't traverse the tree this way for classifying certain groups. The reason is that it may not be the most efficient approach for many searches (this search takes ~3 seconds), and you will have to take a different number of steps back to the same point depending on the lineage. It's possible to select just a single rank of interest (e.g., kingdom), of course, but this illustrates how you can get to any part of the tree you want. I added a comment in the code where I checked whether the species is in the database because you'll find that many are not, so don't assume every species is represented.

I think Pierre's solution is really cool (I couldn't come up with that), but I have to say that a Bio* approach is probably more reliable (and readable) than trying to construct URLs for each query, since there are a lot of checks going on behind the scenes in the Perl code above. You can also do the same thing in Biopython, or probably any Bio* package, and I'd like to see those examples personally because I'm not familiar with those methods.

EDIT: I've found that Bio::DB::EUtilities (or Bio::DB::SoapEUtilities) is faster than my example above, but I still don't think these methods (including Pierre's solution, which is really fast) are ideal for your problem. The reason is that NCBI asks you to limit queries to 3 per second and only run large jobs during certain hours or on weekends. When you say that you have "very large sets of blast results" I'm guessing you mean millions of queries, and that could take anywhere from a week to months to run. A better solution would be to download the taxonomy flat files, change the source in the code above from 'entrez' to 'flatfile' and do the search locally. That way you can split up your blast reports and run many jobs in parallel. You could probably modify Pierre's code and do the same thing with a bash script.
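To make the flat-file idea concrete, here is a minimal Python sketch. The file names (nodes.dmp, names.dmp) are the real ones from NCBI's taxdump archive, but the parsing is simplified and the tiny lineage in the demo is made up; treat this as an outline, not a drop-in tool:

```python
def load_nodes(path):
    """nodes.dmp rows are pipe-delimited: tax_id | parent_id | rank | ...
    Returns {tax_id: (parent_id, rank)}."""
    nodes = {}
    with open(path) as fh:
        for line in fh:
            fields = [f.strip() for f in line.split("|")]
            nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes

def kingdom_of(taxid, nodes, names):
    """Climb parent pointers until a kingdom-rank node (or the root)."""
    while taxid in nodes:
        parent, rank = nodes[taxid]
        if rank == "kingdom":
            return names.get(taxid)
        if parent == taxid:  # the root node points to itself
            return None
        taxid = parent
    return None

# Toy lineage: human (9606) -> Metazoa (33208, kingdom) -> Eukaryota -> root
nodes = {9606: (33208, "species"), 33208: (2759, "kingdom"),
         2759: (1, "superkingdom"), 1: (1, "no rank")}
names = {9606: "Homo sapiens", 33208: "Metazoa", 2759: "Eukaryota"}
print(kingdom_of(9606, nodes, names))  # Metazoa
```

With the dictionaries loaded once up front, each lookup is just a handful of dict hits, so millions of blast hits become feasible without touching NCBI's servers.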

