NCBI Taxonomy Comparison
13.8 years ago
Yannick Wurm ★ 2.5k

Hi all,

Let's say I want to know which taxonomic level groups Tribolium castaneum and Drosophila melanogaster. Insects, right?

Kind of. NCBI Taxonomy gives me the full lineages:

cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Protostomia; Panarthropoda; Arthropoda; Mandibulata; Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera; Endopterygota; Coleoptera; Polyphaga; Cucujiformia; Tenebrionoidea; Tenebrionidae; Tribolium

cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Protostomia; Panarthropoda; Arthropoda; Mandibulata; Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura; Cyclorrhapha; Schizophora; Acalyptratae; Ephydroidea; Drosophilidae; Drosophilinae; Drosophilini; Drosophilina; Drosophiliti; Drosophila; Sophophora; melanogaster group; melanogaster subgroup

So the most specific taxonomic grouping shared by the two is Endopterygota. That's where Coleoptera and Diptera separate.

Now let's say I have 10 pairs of such species and I want to see how close or distant they are. How can I do this easily (i.e. without long literature searches or coding an NCBI Taxonomy parser)?

Thanks!

yannick

taxonomy comparison tree • 6.1k views

Just a word of caution: the NCBI taxonomy isn't very reliable when it comes to the higher-level groupings, e.g. they didn't adopt the new animal taxonomy (which groups, for example, nematodes and insects together).


then where would you go for this kind of information, Michael? thanks, yannick


I ended up manually combining the NCBI taxonomy with the taxonomies from recent papers (Dunn et al., Nature 2008; Rogozin et al., Genome Biology and Evolution 2009).


Another word of caution (+1 for warning against the NCBI taxonomy) is regarding your definition of close & distant. Taxonomy is very biased in splitting taxa that are near us into many levels, while little creepy crawlers are lumped into much more inclusive groups.

13.8 years ago
Will 4.5k

Well, I'm not sure how to do this without at least a little bit of programming. But this is a simple problem. The Python code below should get you started:

# Lineage strings copied from NCBI Taxonomy, root first
s1 = 'cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Protostomia; Panarthropoda; Arthropoda; Mandibulata; Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera; Endopterygota; Coleoptera; Polyphaga; Cucujiformia; Tenebrionoidea; Tenebrionidae; Tribolium'

s2 = 'cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Protostomia; Panarthropoda; Arthropoda; Mandibulata; Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura; Cyclorrhapha; Schizophora; Acalyptratae; Ephydroidea; Drosophilidae; Drosophilinae; Drosophilini; Drosophilina; Drosophiliti; Drosophila; Sophophora; melanogaster group; melanogaster subgroup'

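def GetClosest(s1, s2):
    """Takes two NCBI lineage strings and returns the deepest group they share,
    plus the number of levels of s2 that lie below that group."""
    group_set = set(s1.split('; '))
    # list.reverse() works in place and returns None, so use reversed() instead
    for num, group in enumerate(reversed(s2.split('; '))):
        if group in group_set:
            return group, num
    return None, None  # the lineages share no group at all

print(GetClosest(s1, s2))  # ('Endopterygota', 17)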

Essentially you just need to split on '; ' and then find the first item that is in both lists. The only hiccup is that you need to read the list backwards, from the species end towards the root, so that the first hit is the most specific shared group.
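For several pairs at once, the same idea can also be applied from the root end: walk both lineages in parallel and stop at the first level where they differ. The sketch below is one possible way to do that, assuming the lineages are '; '-separated strings as above. The function name last_shared_group and the shortened example lineages are made up for illustration; they are not the full NCBI strings.

```python
def last_shared_group(lineage_a, lineage_b, sep="; "):
    """Return (group, steps_a, steps_b): the deepest group shared by both
    lineages and how many levels each lineage continues below it."""
    a = lineage_a.split(sep)
    b = lineage_b.split(sep)
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    if shared == 0:
        return None, len(a), len(b)
    return a[shared - 1], len(a) - shared, len(b) - shared


# Shortened, illustrative lineages (hypothetical excerpts, root side trimmed).
lineages = {
    "Tribolium": "Insecta; Endopterygota; Coleoptera; Tenebrionidae; Tribolium",
    "Drosophila": "Insecta; Endopterygota; Diptera; Drosophilidae; Drosophila",
    "Apis": "Insecta; Endopterygota; Hymenoptera; Apidae; Apis",
}
pairs = [("Tribolium", "Drosophila"), ("Drosophila", "Apis")]

for first, second in pairs:
    print(first, second, last_shared_group(lineages[first], lineages[second]))
```

The two step counts give a rough idea of how far each species sits below the shared group, but they depend on how finely NCBI splits each branch, so they are not a real evolutionary distance.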

Hope that helps,

Will

13.8 years ago

The following script downloads the XML files for the two taxa and extracts each lineage using an XSLT stylesheet. The two lineages are then compared side by side using paste, and we count the number of entries that differ.

TAX1=$1;
TAX2=$2;
xsltproc --novalid taxcmp.xsl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=${TAX1}&db=taxonomy&retmode=xml"  > tmp1.txt
xsltproc --novalid taxcmp.xsl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=${TAX2}&db=taxonomy&retmode=xml"  > tmp2.txt

paste tmp1.txt tmp2.txt | awk -F "[\t]" '{ if($1==$2) next; if(length($1)>0) i++; if(length($2)>0) i++; } END {print i+0;}'

The associated stylesheet is:


<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
        version='1.0'
        >
<xsl:output method="text"/>

<xsl:template match="/">
<xsl:for-each select="TaxaSet/Taxon/LineageEx/Taxon">
<xsl:value-of select="TaxId"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>

</xsl:stylesheet>
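
For readers who prefer Python to XSLT, roughly the same extraction can be sketched with the standard library. The XML below is a hand-written, heavily trimmed stand-in for an efetch response: only the TaxaSet/Taxon/LineageEx/Taxon/TaxId path used by the stylesheet is assumed, and the TaxIds and names in it are placeholders, not real records.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed sample; a real efetch response has many more fields.
SAMPLE_XML = """<TaxaSet>
  <Taxon>
    <TaxId>999</TaxId>
    <LineageEx>
      <Taxon><TaxId>1</TaxId><ScientificName>outer group</ScientificName></Taxon>
      <Taxon><TaxId>20</TaxId><ScientificName>inner group</ScientificName></Taxon>
    </LineageEx>
  </Taxon>
</TaxaSet>"""


def lineage_taxids(xml_text):
    """Return the TaxId of every ancestor listed under LineageEx, root first."""
    root = ET.fromstring(xml_text)  # root element is <TaxaSet>
    return [t.findtext("TaxId") for t in root.findall("Taxon/LineageEx/Taxon")]


print(lineage_taxids(SAMPLE_XML))  # ['1', '20']
```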

TEST:

sh  taxcmp.sh 9605 9606
1

sh  taxcmp.sh 7070 32351
22
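
The paste/awk step counts, position by position, every non-empty entry on lines where the two lineages disagree. A rough Python equivalent is sketched below; count_differences and the TaxId lists are hypothetical and chosen only to show the arithmetic.

```python
from itertools import zip_longest


def count_differences(lineage_a, lineage_b):
    """Count non-empty entries at every position where the lineages differ."""
    total = 0
    # Pad the shorter lineage with empty strings, as paste does with empty fields.
    for a, b in zip_longest(lineage_a, lineage_b, fillvalue=""):
        if a == b:
            continue
        total += (a != "") + (b != "")
    return total


a = ["1", "20", "300", "4000"]  # placeholder TaxIds
b = ["1", "20", "301"]
print(count_differences(a, b))  # 3
```

Here the third position contributes 2 (both sides present but different) and the fourth contributes 1 (only one lineage is that deep).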

that is some very impressive bash-fu!

13.8 years ago

OK, if you want to visualize the information, let's stay with XSLT and add Graphviz dot. The following stylesheet reads an NCBI XML file containing two taxa and generates input for dot. It counts the number of nodes in each lineage and recursively calls the template 'recursive' to print each lineage:

Usage:

xsltproc --novalid taxonomy2dot.xsl \
 "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=7070,32351&db=taxonomy&retmode=xml" |\
dot -ofile.jpg -Tjpg

Result:

Update: there was a bug in the stylesheet below. I have fixed it, but the image was rendered before the fix, so its last nodes were not printed.

[Image: Graphviz rendering of the two lineages]

The Stylesheet:


<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
        version='1.0'
        >
<xsl:output method="text"/>
<xsl:variable name="lineage1" select="/TaxaSet/Taxon[1]/LineageEx/Taxon"/> 
<xsl:variable name="count1" select="count($lineage1)"/>
<xsl:variable name="lineage2" select="/TaxaSet/Taxon[2]/LineageEx/Taxon"/> 
<xsl:variable name="count2" select="count($lineage2)"/>

<xsl:template match="/">
digraph G
{
<xsl:call-template name="recursive">
  <xsl:with-param name="index" select="number(1)"/>
</xsl:call-template>
}
</xsl:template>

<xsl:template match="Taxon">
<xsl:value-of select="concat('Tax',TaxId)"/>[label="<xsl:value-of select="ScientificName"/>"];
</xsl:template>

<xsl:template name="recursive">
<xsl:param name="index"/>
<xsl:variable name="tax1" select="$lineage1[$index]/TaxId"/>
<xsl:variable name="tax2" select="$lineage2[$index]/TaxId"/>

<xsl:choose>
  <xsl:when test="$index &gt; $count1 and $index &gt; $count2 "></xsl:when>
  <xsl:when test="$index &gt; $count1">
   <xsl:apply-templates select="$lineage2[$index]"/>
   <xsl:value-of select="concat('Tax',$lineage2[$index - 1]/TaxId,' -&gt; Tax',$tax2)"/>;
   <xsl:call-template name="recursive">
          <xsl:with-param name="index" select="$index +1"/>
        </xsl:call-template>
  </xsl:when>
  <xsl:when test="$index &gt; $count2">
   <xsl:apply-templates select="$lineage1[$index]"/>
   <xsl:value-of select="concat('Tax',$lineage1[$index - 1]/TaxId,' -&gt; Tax',$tax1)"/>;
  <xsl:call-template name="recursive">
          <xsl:with-param name="index" select="$index +1"/>
        </xsl:call-template>
  </xsl:when>
  <xsl:when test="$tax1 != $tax2">
     <xsl:apply-templates select="$lineage2[$index]"/>
     <xsl:value-of select="concat('Tax',$lineage2[$index - 1]/TaxId,' -&gt; Tax',$tax2)"/>;
     <xsl:apply-templates select="$lineage1[$index]"/>
     <xsl:value-of select="concat('Tax',$lineage1[$index - 1]/TaxId,' -&gt; Tax',$tax1)"/>;

    <xsl:call-template name="recursive">
          <xsl:with-param name="index" select="$index +1"/>
        </xsl:call-template>
     </xsl:when>
  <xsl:otherwise>
    <xsl:apply-templates select="$lineage1[$index]"/>
        <xsl:if test="$index &gt; number(1)">
     <xsl:value-of select="concat('Tax',$lineage1[$index - 1]/TaxId,' -&gt; Tax',$tax1)"/>;
    </xsl:if>
        <xsl:call-template name="recursive">
          <xsl:with-param name="index" select="$index +1"/>
        </xsl:call-template>
  </xsl:otherwise>

</xsl:choose>
</xsl:template>

</xsl:stylesheet>

But might a simpler hack be to download the XML, grep out the lineage info, then replace each ';' with '" -> "' for Graphviz?


With XSLT, you can rely on the structure of the document (its DTD) and you don't have to care about the amount or position of whitespace and tags.
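
For comparison, the simpler string-based route suggested above might look like the following in Python. This is only a sketch: lineages_to_dot is a made-up helper, the example lineages are shortened, and it assumes group names contain no double quotes.

```python
def lineages_to_dot(*lineages, sep="; "):
    """Build a Graphviz digraph linking each group to the next one in every
    lineage; edges shared by several lineages are emitted only once."""
    edges = []
    for lineage in lineages:
        names = lineage.split(sep)
        for parent, child in zip(names, names[1:]):
            edge = f'  "{parent}" -> "{child}";'
            if edge not in edges:
                edges.append(edge)
    return "digraph G {\n" + "\n".join(edges) + "\n}"


# Shortened, illustrative lineages.
print(lineages_to_dot(
    "Insecta; Endopterygota; Coleoptera",
    "Insecta; Endopterygota; Diptera",
))
```

The printed text can be piped into dot in the same way as the stylesheet output. Because it keys nodes on names rather than TaxIds, two different groups with the same name would be merged, which is one reason the TaxId-based XSLT version is more robust.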

13.8 years ago
Yannick Wurm ★ 2.5k

Thanks guys for the exhaustive responses... but perhaps the key point of my question should have been: how can I do this easily and visualize it easily, without coding?

I kind of figured a bullsh** hack that works with MEGAN: (Megan's first mission is to parse metagenomics blast results)

First create a file with one line per species, and a comma-delimited number (the number is irrelevant):

Drosophila melanogaster,10
Aedes aegypti,10
Anopheles gambiae,10
Apis mellifera,10
Solenopsis invicta,10
Nasonia vitripennis,10
Pediculus humanus,10
Acyrthosiphon pisum,10
Bombyx mori,10
Caenorhabditis elegans,10
Tribolium castaneum,10

Then open MEGAN

File Menu -> "Import CSV"

Tree Menu -> "Node labels on"

If you subsequently do Tree Menu -> "Show Intermediate Labels" twice, you get the following:

[Image: MEGAN tree of the species listed above, with intermediate labels shown]
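
If the species list is long, the input file from the first step can be written by a short script rather than by hand. This is a minimal sketch: the helper name species_csv and the three-species list are illustrative, and the count of 10 is the same arbitrary placeholder as above.

```python
import csv
import io


def species_csv(names, count=10):
    """Return CSV text with one 'species,count' line per name."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for name in names:
        writer.writerow([name, count])
    return buffer.getvalue()


species = ["Drosophila melanogaster", "Tribolium castaneum", "Apis mellifera"]
print(species_csv(species), end="")
```

Redirect the output to a file and import that file as described above.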


This would also work in iTOL, there you can also paste a list of names and get a tree. I think you have to replace " " by "_", though.

13.8 years ago
Science_Robot ★ 1.1k

Here's some code I wrote that does just that:

names are strings that look like: Kingdom;Phylum;Class;Order;Family;Genus;Genus species

def discrepancy(names1, names2):
    ''' Returns the 1-based level at which two taxonomy strings first differ '''
    names1 = names1.split(';')
    names2 = names2.split(';')
    counter = 1
    for i, j in zip(names1, names2):
        if i == j:
            counter += 1
        else:
            return counter
    # No difference found within the shared length
    return counter

counter is the first level at which the taxonomies differ, counting from 1.

A comparison between a;b;c;d;e;f;g and a;b;c;d;x;y;z shares four levels (a to d) and first differs at the fifth, so the function returns 5.
