Question: taxonomy comparison
 
 
 

Hi all,

Let's say I want to know which taxonomic level groups Tribolium castaneum and Drosophila melanogaster. Insects, right?

Kind of. NCBI Taxonomy gives me the full lineages:

cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Protostomia; Panarthropoda; Arthropoda; Mandibulata; Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera; Endopterygota; Coleoptera; Polyphaga; Cucujiformia; Tenebrionoidea; Tenebrionidae; Tribolium

cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Protostomia; Panarthropoda; Arthropoda; Mandibulata; Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura; Cyclorrhapha; Schizophora; Acalyptratae; Ephydroidea; Drosophilidae; Drosophilinae; Drosophilini; Drosophilina; Drosophiliti; Drosophila; Sophophora; melanogaster group; melanogaster subgroup

So the most specific taxonomic grouping shared by the two is Endopterygota. That's where Coleoptera and Diptera separate.

Now let's say I have 10 pairs of such species and I want to see how close or distant they are... How can I do this easily (i.e., without long literature searches or coding an NCBI Taxonomy parser)?

Thanks! yannick

 
 
 

Just a word of caution: the NCBI taxonomy isn't very reliable when it comes to the higher-level groupings; e.g., it hasn't adopted the new animal taxonomy (which groups, e.g., nematodes and insects together).

written 22 months ago by Michael Kuhn
 

Then where would you go for this kind of information, Michael? Thanks, yannick

written 22 months ago by Yannick Wurm
 

I ended up manually combining the NCBI taxonomy with the taxonomies from recent papers (Dunn et al., Nature 2008; Rogozin et al., Genome Biology and Evolution 2009).

written 22 months ago by Michael Kuhn
 

Another word of caution (+1 for warning against the NCBI taxonomy) concerns your definition of close & distant. Taxonomy is heavily biased: taxa that are near us are split into many levels, while little creepy crawlers are lumped into much more inclusive groups. Counting levels is therefore not a uniform measure of distance.

written 13 months ago by Rvosa

5 answers

 
 
 
 

Thanks, guys, for the exhaustive responses... but perhaps the key point of my question should have been: how can I do this easily and visualize it easily, without coding?

I kind of figured out a bullsh** hack that works with MEGAN (MEGAN's primary purpose is to parse metagenomics BLAST results):

First, create a file with one line per species, each followed by a comma and a number (the number itself is irrelevant):

Drosophila melanogaster,10
Aedes aegypti,10
Anopheles gambiae,10
Apis mellifera,10
Solenopsis invicta,10
Nasonia vitripennis,10
Pediculus humanus,10
Acyrthosiphon pisum,10
Bombyx mori,10
Caenorhabditis elegans,10
Tribolium castaneum,10
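
If the species list already lives in a script, the same file can be written programmatically. This is only a minimal sketch: the function name write_megan_csv and the file name species.csv are hypothetical, and the constant 10 is the same arbitrary placeholder count used above.

```python
import csv

# Species names taken from the list above (shortened here for brevity).
species = [
    "Drosophila melanogaster",
    "Apis mellifera",
    "Tribolium castaneum",
]


def write_megan_csv(names, path, count=10):
    """Write one 'name,count' line per species; the count is an arbitrary placeholder."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        for name in names:
            writer.writerow([name, count])


# "species.csv" is a hypothetical file name chosen for this sketch.
write_megan_csv(species, "species.csv")
```

The resulting file has the same shape as the hand-written one and can be imported in the next step.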

Then open MEGAN (http://biostar.stackexchange.com/questions/111/taxonomy-of-blast-hits)

File Menu -> "Import CSV"

Tree Menu -> "Node labels on"

If you subsequently do Tree Menu -> "Show Intermediate Labels" twice, you get the following:

[image: resulting tree in MEGAN]

 
 
 

This would also work in iTOL (http://itol.embl.de): there you can also paste a list of names and get a tree. I think you have to replace " " with "_", though.

written 22 months ago by Michael Kuhn
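
The space-to-underscore replacement mentioned in this comment could be done with a couple of lines of Python. A minimal sketch, assuming the comment's guess about underscores is right (it is stated there only as "I think"); the helper name to_underscore_names is hypothetical.

```python
def to_underscore_names(names):
    """Replace spaces with underscores in each species name."""
    return [name.strip().replace(" ", "_") for name in names]


# Names taken from the species list in the answer above.
species = ["Drosophila melanogaster", "Tribolium castaneum", "Apis mellifera"]

# Print one name per line, ready to be pasted into a text box.
print("\n".join(to_underscore_names(species)))
```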
 
 
 
 

Well, I'm not sure how to do this without at least a little bit of programming. But this is a simple problem. The Python code below should get you started:

from itertools import count
s1 = 'cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Protostomia; Panarthropoda; Arthropoda; Mandibulata; Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera; Endopterygota; Coleoptera; Polyphaga; Cucujiformia; Tenebrionoidea; Tenebrionidae; Tribolium'

s2 = 'cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Protostomia; Panarthropoda; Arthropoda; Mandibulata; Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura; Cyclorrhapha; Schizophora; Acalyptratae; Ephydroidea; Drosophilidae; Drosophilinae; Drosophilini; Drosophilina; Drosophiliti; Drosophila; Sophophora; melanogaster group; melanogaster subgroup'

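def GetClosest(s1, s2):
    """Takes two lineage strings from the NCBI taxonomy and returns the most
    specific group shared by both organisms, plus its distance from the tip of s2"""
    group_set = set(s1.split('; '))
    # list.reverse() works in place and returns None, so use reversed() instead
    for num, group in enumerate(reversed(s2.split('; '))):
        if group in group_set:
            return group, num
    return None, None  # the two lineages share no group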

Essentially, you just need to split on '; ' and then find the first item that is in both lists. The only hiccup is that you need to read the list backwards, so that the first match is the most specific shared group.

Hope that helps, Will
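
For several pairs of species, the same idea can be wrapped in a loop. The sketch below is an illustration rather than the code from this answer: it walks both lineages from the root instead of reading one backwards, and the name deepest_shared_group is hypothetical. The lineages are shortened excerpts of the ones quoted in the question.

```python
def deepest_shared_group(lineage1, lineage2, sep="; "):
    """Return (name, depth) of the most specific group shared by two lineages.

    depth is the number of shared levels counted from the start of the
    strings; (None, 0) means the lineages share nothing.
    """
    shared, depth = None, 0
    for a, b in zip(lineage1.split(sep), lineage2.split(sep)):
        if a != b:
            break
        shared, depth = a, depth + 1
    return shared, depth


# Shortened lineages for illustration only.
lineages = {
    "Tribolium castaneum": "Insecta; Dicondylia; Pterygota; Neoptera; Endopterygota; Coleoptera; Polyphaga",
    "Drosophila melanogaster": "Insecta; Dicondylia; Pterygota; Neoptera; Endopterygota; Diptera; Brachycera",
}

# Add as many (species1, species2) pairs as needed.
pairs = [("Tribolium castaneum", "Drosophila melanogaster")]

for sp1, sp2 in pairs:
    name, depth = deepest_shared_group(lineages[sp1], lineages[sp2])
    print(sp1, "/", sp2, "->", name, "(shared levels: %d)" % depth)
```

Keep in mind the caveat raised in the comments: the number of shared levels depends on how finely a group has been split, so it is only a rough indication of closeness.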

 
 
 
 
 
 

The following script downloads the XML record for each of the two taxa and extracts its lineage using an XSLT stylesheet. The two lineages are then placed side by side using paste, and awk counts the lineage entries that differ.

TAX1=$1;
TAX2=$2;
xsltproc --novalid taxcmp.xsl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=${TAX1}&db=taxonomy&retmode=xml"  > tmp1.txt
xsltproc --novalid taxcmp.xsl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=${TAX2}&db=taxonomy&retmode=xml"  > tmp2.txt

paste tmp1.txt tmp2.txt | awk -F "[\t]" '{ if($1==$2) next; if(length($1)>0) i++; if(length($2)>0) i++; } END {print i;}'

The associated stylesheet (taxcmp.xsl) is:

<?xml version='1.0'  encoding="ISO-8859-1" ?>
<xsl:stylesheet
        xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
        version='1.0'
        >
<xsl:output method="text"/>

<xsl:template match="/">
<xsl:for-each select="TaxaSet/Taxon/LineageEx/Taxon">
<xsl:value-of select="TaxId"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>

</xsl:stylesheet>

TEST:

sh  taxcmp.sh 9605 9606
1

sh  taxcmp.sh 7070 32351
22
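
For readers less familiar with paste and awk, the counting step can be sketched in Python. This is an illustration of the same logic under the assumption that each lineage is already available as a list of TaxId strings; the function name count_differences is hypothetical and the IDs in the example are placeholders, not real NCBI TaxIds.

```python
from itertools import zip_longest


def count_differences(lineage1, lineage2):
    """Count non-empty entries at positions where the two lineages differ.

    Mirrors the paste/awk step: the shorter lineage is padded with empty
    strings, and matching positions are skipped.
    """
    total = 0
    for a, b in zip_longest(lineage1, lineage2, fillvalue=""):
        if a == b:
            continue
        total += (a != "") + (b != "")
    return total


# Placeholder IDs: position 3 differs (2 entries), position 4 exists only
# in the second lineage (1 entry).
print(count_differences(["1", "2", "3"], ["1", "2", "4", "5"]))
```

One small difference from the shell version: for identical lineages this returns 0, whereas the awk command prints an empty line because its counter is never initialised.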
 
 
 

That is some very impressive bash-fu!

written 22 months ago by Will
 
 
 
 

OK, if you want to visualize the information, let's still use XSLT, this time with Graphviz dot. The following stylesheet reads an NCBI XML file containing two taxa and generates input for dot. It counts the number of nodes in each lineage and recursively calls the template 'recursive' to print each lineage:

Usage:

xsltproc --novalid taxonomy2dot.xsl \
 "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=7070,32351&db=taxonomy&retmode=xml" |\
dot -ofile.jpg -Tjpg

Result:

Update: there was a bug in the stylesheet below. I fixed it, but the image was generated before the fix, so the last nodes are missing from it.

See http://twitpic.com/219389

The Stylesheet:

<?xml version='1.0'  encoding="ISO-8859-1" ?>
<xsl:stylesheet
        xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
        version='1.0'
        >
<xsl:output method="text"/>
<xsl:variable name="lineage1" select="/TaxaSet/Taxon[1]/LineageEx/Taxon"/> 
<xsl:variable name="count1" select="count($lineage1)"/>
<xsl:variable name="lineage2" select="/TaxaSet/Taxon[2]/LineageEx/Taxon"/> 
<xsl:variable name="count2" select="count($lineage2)"/>

<xsl:template match="/">
digraph G
{
<xsl:call-template name="recursive">
  <xsl:with-param name="index" select="number(1)"/>
</xsl:call-template>
}
</xsl:template>

<xsl:template match="Taxon">
<xsl:value-of select="concat('Tax',TaxId)"/>[label=&quot;<xsl:value-of select="ScientificName"/>&quot;];
</xsl:template>

<xsl:template name="recursive">
<xsl:param name="index"/>
<xsl:variable name="tax1" select="$lineage1[$index]/TaxId"/>
<xsl:variable name="tax2" select="$lineage2[$index]/TaxId"/>

<xsl:choose>
  <xsl:when test="$index &gt; $count1 and $index &gt; $count2 "></xsl:when>
  <xsl:when test="$index &gt; $count1">
   <xsl:apply-templates select="$lineage2[$index]"/>
   <xsl:value-of select="concat('Tax',$lineage2[$index - 1]/TaxId,' -&gt; Tax',$tax2)"/>;
   <xsl:call-template name="recursive">
          <xsl:with-param name="index" select="$index +1"/>
        </xsl:call-template>
  </xsl:when>
  <xsl:when test="$index &gt; $count2">
   <xsl:apply-templates select="$lineage1[$index]"/>
   <xsl:value-of select="concat('Tax',$lineage1[$index - 1]/TaxId,' -&gt; Tax',$tax1)"/>;
  <xsl:call-template name="recursive">
          <xsl:with-param name="index" select="$index +1"/>
        </xsl:call-template>
  </xsl:when>
  <xsl:when test="$tax1 != $tax2">
     <xsl:apply-templates select="$lineage2[$index]"/>
     <xsl:value-of select="concat('Tax',$lineage2[$index - 1]/TaxId,' -&gt; Tax',$tax2)"/>;
     <xsl:apply-templates select="$lineage1[$index]"/>
     <xsl:value-of select="concat('Tax',$lineage1[$index - 1]/TaxId,' -&gt; Tax',$tax1)"/>;

    <xsl:call-template name="recursive">
          <xsl:with-param name="index" select="$index +1"/>
        </xsl:call-template>
     </xsl:when>
  <xsl:otherwise>
    <xsl:apply-templates select="$lineage1[$index]"/>
        <xsl:if test="$index &gt; number(1)">
     <xsl:value-of select="concat('Tax',$lineage1[$index - 1]/TaxId,' -&gt; Tax',$tax1)"/>;
    </xsl:if>
        <xsl:call-template name="recursive">
          <xsl:with-param name="index" select="$index +1"/>
        </xsl:call-template>
  </xsl:otherwise>

</xsl:choose>
</xsl:template>

</xsl:stylesheet>
 
 
 

very cool :) http://plindenbaum.blogspot.com/2010/06/xsltncbi-taxonomygraphviz-dot.html

written 22 months ago by Yannick Wurm
 

But might a simpler hack be to download the XML, grep out the lineage info, and then replace each ';' with '" -> "' for Graphviz?

written 22 months ago by Yannick Wurm
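
The string-replacement hack suggested in this comment could look roughly like the sketch below. It assumes the lineages are already available as '; '-separated strings of names (as in the question) rather than parsed from the XML, and the function name lineages_to_dot is hypothetical. Nodes are matched by name, not by TaxId, and names containing double quotes would need escaping.

```python
def lineages_to_dot(lineages, sep="; "):
    """Turn lineage strings into a Graphviz digraph with one edge per parent/child pair."""
    edges = []
    for lineage in lineages:
        names = [n.strip() for n in lineage.split(sep) if n.strip()]
        for edge in zip(names, names[1:]):
            if edge not in edges:  # shared ancestors are drawn only once
                edges.append(edge)
    body = "\n".join('  "{}" -> "{}";'.format(parent, child) for parent, child in edges)
    return "digraph G {\n" + body + "\n}"


# Shortened lineages for illustration only.
print(lineages_to_dot([
    "Neoptera; Endopterygota; Coleoptera",
    "Neoptera; Endopterygota; Diptera",
]))
```

The printed text can be piped to dot in the same way as the output of the stylesheet.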
 

With XSLT, you rely only on the structure of the document (its DTD), and you don't have to care about the amount or position of whitespace and tags.

written 22 months ago by Pierre Lindenbaum
 
 
 
 

Here's some code I wrote that does just that:

names are strings that look like: Kingdom;Phylum;Class;Order;Family;Genus;Genus species

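def _discrepancy(names1, names2):
    ''' Returns integer reflecting taxonomic discrepancy '''
    # names1 and names2 are passed in explicitly; the original referenced them
    # without defining them
    names1 = names1.split(';')
    names2 = names2.split(';')
    counter = 1
    for i, j in zip(names1, names2):
        if i == j:
            counter += 1
        else:
            return counter
    return counter  # no difference found within the shared length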

counter is the first level at which the taxonomies differ, counting from 1.

A comparison between a;b;c;d;e;f;g and a;b;c;d;x;y;z would result in a discrepancy of 5: the first four levels match and the fifth differs.

 
 
 