Question

NCBI Taxonomy Comparison

13.8 years ago

Yannick Wurm ★ 2.5k

Hi all,

Let's say I want to know which taxonomic level groups Tribolium castaneum and Drosophila melanogaster. Insects, right?

Kind of. NCBI Taxonomy gives me the full lineages:

cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Protostomia; Panarthropoda; Arthropoda; Mandibulata; Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera; Endopterygota; Coleoptera; Polyphaga; Cucujiformia; Tenebrionoidea; Tenebrionidae; Tribolium

cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Protostomia; Panarthropoda; Arthropoda; Mandibulata; Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura; Cyclorrhapha; Schizophora; Acalyptratae; Ephydroidea; Drosophilidae; Drosophilinae; Drosophilini; Drosophilina; Drosophiliti; Drosophila; Sophophora; melanogaster group; melanogaster subgroup

So the lowest-level (most specific) taxonomic grouping shared by the two is Endopterygota. That's where Coleoptera and Diptera separate.

Now let's say I have 10 pairs of such species and I want to see how close or distant they are... How can I do this easily (i.e. without long literature searches or coding an NCBI Taxonomy parser)?

Thanks!

yannick

taxonomy comparison tree • 6.1k views

updated 5.6 years ago by Ram 43k • written 13.8 years ago by Yannick Wurm ★ 2.5k

Just a word of caution: the NCBI taxonomy isn't very reliable when it comes to the higher-level groupings, e.g. they didn't adopt the new animal taxonomy (which groups e.g. nematodes and insects together).

13.8 years ago by Michael Kuhn 5.0k

then where would you go for this kind of information, Michael? thanks, yannick

13.8 years ago by Yannick Wurm ★ 2.5k

I ended up combining the NCBI taxonomy manually with the taxonomies from recent papers (Dunn et al. Nature 2008, Rogozin et al. Genome Biology and Evolution 2009)

13.8 years ago by Michael Kuhn 5.0k

Another word of caution (+1 for warning against the NCBI taxonomy) is regarding your definition of close & distant. Taxonomy is very biased in splitting taxa that are near us into many levels, while little creepy crawlers are lumped into much more inclusive groups.

13.1 years ago by Rvosa ▴ 580

Ram · Answer 1 · 2010-06-28

Well, I'm not sure how to do this without at least a little bit of programming, but this is a simple problem. The Python code below should get you started:

s1 = 'cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Protostomia; Panarthropoda; Arthropoda; Mandibulata; Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera; Endopterygota; Coleoptera; Polyphaga; Cucujiformia; Tenebrionoidea; Tenebrionidae; Tribolium'

s2 = 'cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Protostomia; Panarthropoda; Arthropoda; Mandibulata; Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura; Cyclorrhapha; Schizophora; Acalyptratae; Ephydroidea; Drosophilidae; Drosophilinae; Drosophilini; Drosophilina; Drosophiliti; Drosophila; Sophophora; melanogaster group; melanogaster subgroup'

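def GetClosest(s1, s2):
    """Takes two lineage strings from the NCBI taxonomy and returns the closest common ancestor of both organisms, plus the number of levels of s2 below it"""
    group_set = set(s1.split('; '))
    # list.reverse() works in place and returns None, so use reversed() to walk s2 from the tip upwards
    for num, group in enumerate(reversed(s2.split('; '))):
        if group in group_set:
            return group, num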

Essentially you just need to split on the ; and then check to find the first item that is in both lists. The only hiccup is that you need to read the lists backwards.
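To run this over several pairs at once, a minimal sketch along the same lines might look like the following. The helper name `closest_shared_group` and the abbreviated lineages are made up for illustration (they are not the full NCBI strings). It also reports how many levels each lineage continues below the shared group, which gives a rough feel for close versus distant.

```python
from itertools import combinations

# Abbreviated lineages for illustration only, not the full NCBI strings.
LINEAGES = {
    "beetle": "Metazoa; Arthropoda; Insecta; Endopterygota; Coleoptera",
    "fly": "Metazoa; Arthropoda; Insecta; Endopterygota; Diptera",
    "worm": "Metazoa; Nematoda; Chromadorea",
}


def closest_shared_group(lineage_a, lineage_b, sep="; "):
    """Return (group, steps_a, steps_b): the deepest group present in both
    lineages and how many levels each lineage continues below it."""
    parts_a = lineage_a.split(sep)
    parts_b = lineage_b.split(sep)
    index_in_a = {group: i for i, group in enumerate(parts_a)}
    # Read the second lineage backwards, from the tip towards the root.
    for steps_b, group in enumerate(reversed(parts_b)):
        if group in index_in_a:
            steps_a = len(parts_a) - 1 - index_in_a[group]
            return group, steps_a, steps_b
    # Nothing shared at all: every level of both lineages is "below".
    return None, len(parts_a), len(parts_b)


for name_a, name_b in combinations(sorted(LINEAGES), 2):
    group, steps_a, steps_b = closest_shared_group(LINEAGES[name_a], LINEAGES[name_b])
    print(f"{name_a} vs {name_b}: {group} (+{steps_a} / +{steps_b} levels)")
```

Keep in mind the caveat from the comments: the number of levels is a property of how finely a lineage has been split, not a true evolutionary distance.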

Hope that helps,

Will

Ram · Answer 2 · 2010-06-29

The following script downloads the XML record for each of the two taxa and extracts the lineage from each using an XSLT stylesheet. The two lineages are then compared side by side using paste, and we count the number of entries that differ.

TAX1=$1;
TAX2=$2;
xsltproc --novalid taxcmp.xsl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=${TAX1}&db=taxonomy&retmode=xml"  > tmp1.txt
xsltproc --novalid taxcmp.xsl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=${TAX2}&db=taxonomy&retmode=xml"  > tmp2.txt

paste tmp1.txt tmp2.txt | awk -F "[\t]" '{ if($1==$2) next; if(length($1)>0) i++; if(length($2)>0) i++; } END {print i;}'
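For readers who find the awk one-liner hard to follow, here is a rough Python equivalent of the counting step only. It is a sketch: the function name `count_lineage_differences` is invented here, and the inputs are assumed to be two lists of TaxId strings, one per line of tmp1.txt and tmp2.txt.

```python
from itertools import zip_longest


def count_lineage_differences(ids_a, ids_b):
    """Mimic `paste | awk`: walk both lists position by position and count
    every non-empty entry that sits on a row where the two sides differ."""
    total = 0
    # zip_longest pads the shorter list with "", like paste pads with empty fields.
    for a, b in zip_longest(ids_a, ids_b, fillvalue=""):
        if a == b:
            continue
        total += bool(a) + bool(b)
    return total


# Placeholder ids, not real TaxIds.
print(count_lineage_differences(["1", "2", "3"], ["1", "2", "4", "5"]))
```

Note that the comparison is positional, so the count assumes both lineages list their shared ancestors at the same depth.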

The associated stylesheet is:


<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
        version='1.0'
        >
<xsl:output method="text"/>

<xsl:template match="/">
<xsl:for-each select="TaxaSet/Taxon/LineageEx/Taxon">
<xsl:value-of select="TaxId"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>

</xsl:stylesheet>
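If xsltproc is not at hand, the same extraction can be sketched in Python with the standard library. The element path mirrors the one used in the stylesheet (TaxaSet/Taxon/LineageEx/Taxon); the XML below is a hand-trimmed sample rather than a real efetch response, and `lineage_ids` is a name chosen for this example.

```python
import xml.etree.ElementTree as ET

# Hand-trimmed sample in the shape the stylesheet expects, not a full record.
SAMPLE = """<TaxaSet>
  <Taxon>
    <TaxId>7070</TaxId>
    <LineageEx>
      <Taxon><TaxId>131567</TaxId><ScientificName>cellular organisms</ScientificName></Taxon>
      <Taxon><TaxId>2759</TaxId><ScientificName>Eukaryota</ScientificName></Taxon>
    </LineageEx>
  </Taxon>
</TaxaSet>"""


def lineage_ids(xml_text):
    """Return the TaxId of every ancestor listed under LineageEx, root first."""
    root = ET.fromstring(xml_text)  # root element is <TaxaSet>
    return [taxon.findtext("TaxId") for taxon in root.findall("Taxon/LineageEx/Taxon")]


print(lineage_ids(SAMPLE))
```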

TEST:

sh  taxcmp.sh 9605 9606
1

sh  taxcmp.sh 7070 32351
22

Ram · Answer 3 · 2010-06-30

OK, if you want to visualize the information, let's still use XSLT, this time with Graphviz dot. The following stylesheet reads an NCBI XML file containing two taxa and generates input for dot. It counts the number of nodes in each lineage and recursively calls the template 'recursive' to print each lineage:

Usage:

xsltproc --novalid taxonomy2dot.xsl \
 "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=7070,32351&db=taxonomy&retmode=xml" |\
dot -ofile.jpg -Tjpg

Result:

Update: there was a bug in the stylesheet below. I have fixed it, but the image was generated before the fix, so the last nodes are not printed in it.

[image: dot rendering of the two lineages]

The Stylesheet:


<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
        version='1.0'
        >
<xsl:output method="text"/>
<xsl:variable name="lineage1" select="/TaxaSet/Taxon[1]/LineageEx/Taxon"/> 
<xsl:variable name="count1" select="count($lineage1)"/>
<xsl:variable name="lineage2" select="/TaxaSet/Taxon[2]/LineageEx/Taxon"/> 
<xsl:variable name="count2" select="count($lineage2)"/>

<xsl:template match="/">
digraph G
{
<xsl:call-template name="recursive">
  <xsl:with-param name="index" select="number(1)"/>
</xsl:call-template>
}
</xsl:template>

<xsl:template match="Taxon">
<xsl:value-of select="concat('Tax',TaxId)"/>[label="<xsl:value-of select="ScientificName"/>"];
</xsl:template>

<xsl:template name="recursive">
<xsl:param name="index"/>
<xsl:variable name="tax1" select="$lineage1[$index]/TaxId"/>
<xsl:variable name="tax2" select="$lineage2[$index]/TaxId"/>

<xsl:choose>
  <xsl:when test="$index &gt; $count1 and $index &gt; $count2 "></xsl:when>
  <xsl:when test="$index &gt; $count1">
   <xsl:apply-templates select="$lineage2[$index]"/>
   <xsl:value-of select="concat('Tax',$lineage2[$index - 1]/TaxId,' -&gt; Tax',$tax2)"/>;
   <xsl:call-template name="recursive">
          <xsl:with-param name="index" select="$index +1"/>
        </xsl:call-template>
  </xsl:when>
  <xsl:when test="$index &gt; $count2">
   <xsl:apply-templates select="$lineage1[$index]"/>
   <xsl:value-of select="concat('Tax',$lineage1[$index - 1]/TaxId,' -&gt; Tax',$tax1)"/>;
  <xsl:call-template name="recursive">
          <xsl:with-param name="index" select="$index +1"/>
        </xsl:call-template>
  </xsl:when>
  <xsl:when test="$tax1 != $tax2">
     <xsl:apply-templates select="$lineage2[$index]"/>
     <xsl:value-of select="concat('Tax',$lineage2[$index - 1]/TaxId,' -&gt; Tax',$tax2)"/>;
     <xsl:apply-templates select="$lineage1[$index]"/>
     <xsl:value-of select="concat('Tax',$lineage1[$index - 1]/TaxId,' -&gt; Tax',$tax1)"/>;

    <xsl:call-template name="recursive">
          <xsl:with-param name="index" select="$index +1"/>
        </xsl:call-template>
     </xsl:when>
  <xsl:otherwise>
    <xsl:apply-templates select="$lineage1[$index]"/>
        <xsl:if test="$index &gt; number(1)">
     <xsl:value-of select="concat('Tax',$lineage1[$index - 1]/TaxId,' -&gt; Tax',$tax1)"/>;
    </xsl:if>
        <xsl:call-template name="recursive">
          <xsl:with-param name="index" select="$index +1"/>
        </xsl:call-template>
  </xsl:otherwise>

</xsl:choose>
</xsl:template>

</xsl:stylesheet>
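The same idea (one node per taxon, one edge per parent-child step, shared ancestors drawn once) can be sketched in Python. This is not a port of the stylesheet: `lineages_to_dot` is a name made up here, each lineage is assumed to be a list of (TaxId, name) tuples ordered from the root, and the ids in the example are placeholders.

```python
def lineages_to_dot(lineages):
    """Build Graphviz dot text from lineages given as lists of (tax_id, name)."""
    labels = {}      # tax_id -> name, in first-seen order
    edges = []       # (parent_id, child_id), each emitted once
    seen_edges = set()
    for lineage in lineages:
        parent = None
        for tax_id, name in lineage:
            labels.setdefault(tax_id, name)
            if parent is not None and (parent, tax_id) not in seen_edges:
                seen_edges.add((parent, tax_id))
                edges.append((parent, tax_id))
            parent = tax_id
    lines = ["digraph G {"]
    lines += [f'  Tax{tax_id} [label="{name}"];' for tax_id, name in labels.items()]
    lines += [f"  Tax{parent} -> Tax{child};" for parent, child in edges]
    lines.append("}")
    return "\n".join(lines)


# Placeholder ids and shortened lineages, for illustration only.
beetle = [(1, "Insecta"), (2, "Endopterygota"), (3, "Coleoptera")]
fly = [(1, "Insecta"), (2, "Endopterygota"), (4, "Diptera")]
print(lineages_to_dot([beetle, fly]))
```

The printed text can be piped to dot in the same way as the xsltproc output above.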

Ram · Answer 4 · 2010-06-30

Thanks guys for the exhaustive responses... but perhaps the key point of my question should have been: "How can I do this easily and visualize it easily, without coding?"

I kind of figured out a bullsh** hack that works with MEGAN (MEGAN's main purpose is to parse metagenomics BLAST results):

First create a file with one line per species, and a comma-delimited number (the number is irrelevant):

Drosophila melanogaster,10
Aedes aegypti,10
Anopheles gambiae,10
Apis mellifera,10
Solenopsis invicta,10
Nasonia vitripennis,10
Pediculus humanus,10
Acyrthosiphon pisum,10
Bombyx mori,10
Caenorhabditis elegans,10
Tribolium castaneum,10
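If the species list is long, the file can be generated rather than typed. A small sketch, assuming only the two-column layout shown above; the function name `write_species_csv` and the default count of 10 are arbitrary choices for this example.

```python
import csv
import io

SPECIES = [
    "Drosophila melanogaster",
    "Tribolium castaneum",
    "Caenorhabditis elegans",
]


def write_species_csv(names, handle, count=10):
    """Write one 'name,count' line per species; the count is a dummy value."""
    writer = csv.writer(handle, lineterminator="\n")
    for name in names:
        writer.writerow([name, count])


# Written to an in-memory buffer here; pass an open file handle to save to disk.
buffer = io.StringIO()
write_species_csv(SPECIES, buffer)
print(buffer.getvalue(), end="")
```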

Then open MEGAN

File Menu -> "Import CSV"

Tree Menu -> "Node labels on"

If you subsequently do Tree Menu -> "Show Intermediate Labels" twice, you get the following:

[image: MEGAN tree of the species listed above]

Ram · Answer 5 · 2010-07-01

Here's some code I wrote that does just that:

names are strings that look like: Kingdom;Phylum;Class;Order;Family;Genus;Genus species

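def _discrepancy(names1, names2):
    ''' Returns integer reflecting taxonomic discrepancy '''
    names1 = names1.split(';')
    names2 = names2.split(';')
    counter = 1
    # Walk both lineages from the root, one level at a time
    for i, j in zip(names1, names2):
        if i == j:
            counter += 1
        else:
            return counter
    # No mismatch within the shared length: report the level just past it
    return counter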

counter is the first level (counting from 1) at which the two taxonomies differ.

A comparison between a;b;c;d;e;f;g and a;b;c;d;x;y;z would result in a discrepancy of 5: the first four levels match, so the fifth is the first to differ.
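The number of shared leading levels and the first differing level are off by one from each other, which is easy to mix up. A tiny check on the example strings, using a helper named `shared_levels` that is invented for this illustration:

```python
def shared_levels(names1, names2, sep=";"):
    """Count how many leading levels two lineage strings have in common."""
    shared = 0
    for a, b in zip(names1.split(sep), names2.split(sep)):
        if a != b:
            break
        shared += 1
    return shared


n_shared = shared_levels("a;b;c;d;e;f;g", "a;b;c;d;x;y;z")
first_difference = n_shared + 1  # 1-based position of the first mismatch
print(n_shared, first_difference)
```

Here the two strings share 4 leading levels (a to d) and first differ at level 5.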