Question: NCBI Taxonomy Comparison
8
9.1 years ago by
Yannick Wurm2.3k
Queen Mary University London
Yannick Wurm2.3k wrote:

Hi all,

Let's say I want to know which taxonomic level groups Tribolium castaneum and Drosophila melanogaster. Insects, right?

Kind of. NCBI Taxonomy gives me the full lineages:

cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Protostomia; Panarthropoda; Arthropoda; Mandibulata; Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera; Endopterygota; Coleoptera; Polyphaga; Cucujiformia; Tenebrionoidea; Tenebrionidae; Tribolium

cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Protostomia; Panarthropoda; Arthropoda; Mandibulata; Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura; Cyclorrhapha; Schizophora; Acalyptratae; Ephydroidea; Drosophilidae; Drosophilinae; Drosophilini; Drosophilina; Drosophiliti; Drosophila; Sophophora; melanogaster group; melanogaster subgroup

So the lowest-level taxonomic group that still contains both is Endopterygota. That's where Coleoptera and Diptera separate.

Now let's say I have 10 pairs of such species and I want to see how close or distant they are... How can I do this easily (i.e. without long literature searches or coding an NCBI Taxonomy parser)?

Thanks!

yannick

comparison tree taxonomy • 4.2k views
modified 10 months ago by RamRS22k • written 9.1 years ago by Yannick Wurm2.3k
5

just a word of caution: the NCBI taxonomy isn't very reliable when it comes to the higher level groupings, e.g. they didn't adopt the new animal taxonomy (grouping e.g. nematodes and insects together).

written 9.1 years ago by Michael Kuhn5.0k
1

then where would you go for this kind of information, Michael? thanks, yannick

written 9.1 years ago by Yannick Wurm2.3k
1

I ended up combining the NCBI taxonomy manually with the taxonomies from recent papers (Dunn et al. Nature 2008, Rogozin et al. Genome Biology and Evolution 2009)

written 9.1 years ago by Michael Kuhn5.0k

Another word of caution (+1 for warning against the NCBI taxonomy) is regarding your definition of close & distant. Taxonomy is very biased in splitting taxa that are near us into many levels, while little creepy crawlers are lumped into much more inclusive groups.

written 8.3 years ago by Rvosa570
7
9.1 years ago by
Will4.5k
United States
Will4.5k wrote:

Well, I'm not sure how to do this without at least a little bit of programming. But this is a simple problem. The Python code below should get you started:

# Lineage strings copied from NCBI Taxonomy; levels are separated by '; '
s1 = 'cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Protostomia; Panarthropoda; Arthropoda; Mandibulata; Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera; Endopterygota; Coleoptera; Polyphaga; Cucujiformia; Tenebrionoidea; Tenebrionidae; Tribolium'

s2 = 'cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Protostomia; Panarthropoda; Arthropoda; Mandibulata; Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura; Cyclorrhapha; Schizophora; Acalyptratae; Ephydroidea; Drosophilidae; Drosophilinae; Drosophilini; Drosophilina; Drosophiliti; Drosophila; Sophophora; melanogaster group; melanogaster subgroup'

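def GetClosest(s1, s2):
    """Takes two strings from the NCBI taxonomy and returns the first common ancestor between both organisms, plus its distance (in levels) from the end of s2"""
    group_set = set(s1.split('; '))
    # list.reverse() works in place and returns None, so use reversed()
    # to walk s2 from its most specific group back toward the root
    for num, group in enumerate(reversed(s2.split('; '))):
        if group in group_set:
            return group, num
    # no shared group at all
    return None, None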

Essentially you just need to split on the ; and then check to find the first item that is in both lists. The only hiccup is that you need to read the lists backwards.
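For several species at once, the same idea can be applied to every pair. The following is only a minimal sketch: it reads the lineages from the root down instead of backwards, and the names `shared_lineage` and `lineages`, as well as the shortened lineage strings, are made up for illustration rather than taken from NCBI.

```python
from itertools import combinations


def shared_lineage(lineage_a, lineage_b, sep="; "):
    """Return the groups two lineage strings share, read from the root down."""
    shared = []
    for a, b in zip(lineage_a.split(sep), lineage_b.split(sep)):
        if a != b:
            break
        shared.append(a)
    return shared


# Shortened, illustrative lineages (not the full NCBI strings).
lineages = {
    "Tribolium": "Arthropoda; Hexapoda; Insecta; Endopterygota; Coleoptera; Tribolium",
    "Drosophila": "Arthropoda; Hexapoda; Insecta; Endopterygota; Diptera; Drosophila",
    "Acyrthosiphon": "Arthropoda; Hexapoda; Insecta; Paraneoptera; Hemiptera; Acyrthosiphon",
}

# Rank every pair by how many levels it shares: more shared levels = closer.
pairs = []
for name_a, name_b in combinations(lineages, 2):
    shared = shared_lineage(lineages[name_a], lineages[name_b])
    pairs.append((len(shared), name_a, name_b, shared[-1] if shared else None))

for depth, name_a, name_b, group in sorted(pairs, reverse=True):
    print(f"{name_a} vs {name_b}: {group} ({depth} shared levels)")
```

Keep in mind that the number of shared levels is only a rough proxy for distance, since some parts of the taxonomy are split into many more levels than others.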

Hope that helps,

Will

modified 10 months ago by RamRS22k • written 9.1 years ago by Will4.5k
6
9.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum121k wrote:

The following script downloads the XML records for the two taxa and extracts each lineage using an XSLT stylesheet. The two lineages are then compared side by side using paste, and we count the number of entries that differ.

TAX1=$1;
TAX2=$2;
xsltproc --novalid taxcmp.xsl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=${TAX1}&db=taxonomy&retmode=xml"  > tmp1.txt
xsltproc --novalid taxcmp.xsl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=${TAX2}&db=taxonomy&retmode=xml"  > tmp2.txt

paste tmp1.txt tmp2.txt | awk -F "[\t]" '{ if($1==$2) next; if(length($1)>0) i++; if(length($2)>0) i++; } END {print i;}'

The associated stylesheet is:


<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
        version='1.0'
        >
<xsl:output method="text"/>

<xsl:template match="/">
<xsl:for-each select="TaxaSet/Taxon/LineageEx/Taxon">
<xsl:value-of select="TaxId"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>

</xsl:stylesheet>

TEST:

sh  taxcmp.sh 9605 9606
1

sh  taxcmp.sh 7070 32351
22
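
For readers who prefer Python to xsltproc and awk, the counting step could be sketched roughly as follows. This is an assumption-laden illustration, not the script above: `lineage_taxids`, `count_differences` and `make_record` are hypothetical names, and `make_record` builds a tiny stand-in document with the same element path the stylesheet selects (TaxaSet/Taxon/LineageEx/Taxon/TaxId) instead of downloading anything.

```python
import xml.etree.ElementTree as ET
from itertools import zip_longest


def lineage_taxids(taxon_xml):
    """Extract the lineage TaxIds (root first) from one taxonomy XML record."""
    root = ET.fromstring(taxon_xml)  # root element is <TaxaSet>
    return [el.text for el in root.findall("Taxon/LineageEx/Taxon/TaxId")]


def count_differences(ids_a, ids_b):
    """Mimic paste + awk: pair rows, count non-empty cells in rows that differ."""
    total = 0
    for a, b in zip_longest(ids_a, ids_b, fillvalue=""):
        if a == b:
            continue
        total += bool(a) + bool(b)
    return total


def make_record(taxids):
    """Build a tiny stand-in for a downloaded record (illustration only)."""
    taxa = "".join(f"<Taxon><TaxId>{t}</TaxId></Taxon>" for t in taxids)
    return f"<TaxaSet><Taxon><LineageEx>{taxa}</LineageEx></Taxon></TaxaSet>"


xml_a = make_record([1, 2, 3])
xml_b = make_record([1, 2, 4, 5])
print(count_differences(lineage_taxids(xml_a), lineage_taxids(xml_b)))  # prints 3
```

Like the awk one-liner, this compares the lineages position by position, so a lineage with one extra intermediate level shifts every row after it.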
modified 10 months ago by RamRS22k • written 9.1 years ago by Pierre Lindenbaum121k

that is some very impressive bash-fu!

written 9.1 years ago by Will4.5k
4
9.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum121k wrote:

OK, if you want to visualize the information, let's still use XSLT, this time with Graphviz dot. The following stylesheet reads an NCBI XML file containing two taxa and generates input for dot. It counts the number of nodes in each lineage and recursively calls the template 'recursive' to print each lineage:

Usage:

xsltproc --novalid taxonomy2dot.xsl \
 "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=7070,32351&db=taxonomy&retmode=xml" |\
dot -ofile.jpg -Tjpg

Result:

Update: there was a bug in the stylesheet below. I fixed it, but the following image was generated before the fix, so its last nodes are not printed.

[Image: Graphviz rendering of the two lineages]

The Stylesheet:


<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
        version='1.0'
        >
<xsl:output method="text"/>
<xsl:variable name="lineage1" select="/TaxaSet/Taxon[1]/LineageEx/Taxon"/> 
<xsl:variable name="count1" select="count($lineage1)"/>
<xsl:variable name="lineage2" select="/TaxaSet/Taxon[2]/LineageEx/Taxon"/> 
<xsl:variable name="count2" select="count($lineage2)"/>

<xsl:template match="/">
digraph G
{
<xsl:call-template name="recursive">
  <xsl:with-param name="index" select="number(1)"/>
</xsl:call-template>
}
</xsl:template>

<xsl:template match="Taxon">
<xsl:value-of select="concat('Tax',TaxId)"/>[label="<xsl:value-of select="ScientificName"/>"];
</xsl:template>

<xsl:template name="recursive">
<xsl:param name="index"/>
<xsl:variable name="tax1" select="$lineage1[$index]/TaxId"/>
<xsl:variable name="tax2" select="$lineage2[$index]/TaxId"/>

<xsl:choose>
  <xsl:when test="$index &gt; $count1 and $index &gt; $count2 "></xsl:when>
  <xsl:when test="$index &gt; $count1">
   <xsl:apply-templates select="$lineage2[$index]"/>
   <xsl:value-of select="concat('Tax',$lineage2[$index - 1]/TaxId,' -&gt; Tax',$tax2)"/>;
   <xsl:call-template name="recursive">
          <xsl:with-param name="index" select="$index +1"/>
        </xsl:call-template>
  </xsl:when>
  <xsl:when test="$index &gt; $count2">
   <xsl:apply-templates select="$lineage1[$index]"/>
   <xsl:value-of select="concat('Tax',$lineage1[$index - 1]/TaxId,' -&gt; Tax',$tax1)"/>;
  <xsl:call-template name="recursive">
          <xsl:with-param name="index" select="$index +1"/>
        </xsl:call-template>
  </xsl:when>
  <xsl:when test="$tax1 != $tax2">
     <xsl:apply-templates select="$lineage2[$index]"/>
     <xsl:value-of select="concat('Tax',$lineage2[$index - 1]/TaxId,' -&gt; Tax',$tax2)"/>;
     <xsl:apply-templates select="$lineage1[$index]"/>
     <xsl:value-of select="concat('Tax',$lineage1[$index - 1]/TaxId,' -&gt; Tax',$tax1)"/>;

    <xsl:call-template name="recursive">
          <xsl:with-param name="index" select="$index +1"/>
        </xsl:call-template>
     </xsl:when>
  <xsl:otherwise>
    <xsl:apply-templates select="$lineage1[$index]"/>
        <xsl:if test="$index &gt; number(1)">
     <xsl:value-of select="concat('Tax',$lineage1[$index - 1]/TaxId,' -&gt; Tax',$tax1)"/>;
    </xsl:if>
        <xsl:call-template name="recursive">
          <xsl:with-param name="index" select="$index +1"/>
        </xsl:call-template>
  </xsl:otherwise>

</xsl:choose>
</xsl:template>

</xsl:stylesheet>
modified 10 months ago by RamRS22k • written 9.1 years ago by Pierre Lindenbaum121k
1

very cool :) http://plindenbaum.blogspot.com/2010/06/xsltncbi-taxonomygraphviz-dot.html

written 9.1 years ago by Yannick Wurm2.3k

but might a hack just be to download the xml, grep out the lineage info, then replace the ';' by '" -> "' for graphviz?

written 9.1 years ago by Yannick Wurm2.3k

with XSLT, you make certain that the structure of the document is still the same (DTD) and you don't care about the amount/position of the white spaces/tags.

written 9.0 years ago by Pierre Lindenbaum121k
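
The string-replacement idea from the comments above could look roughly like this in Python. It is a sketch under stated assumptions: `lineages_to_dot` is a hypothetical helper, the lineage strings are shortened for illustration, and the lineages are assumed to be already available as '; '-separated text rather than extracted from the XML.

```python
def lineages_to_dot(lineages, sep="; "):
    """Turn lineage strings into a Graphviz digraph, listing each edge once."""
    edges = []
    seen = set()
    for lineage in lineages:
        groups = lineage.split(sep)
        # Each consecutive pair of groups becomes one parent -> child edge.
        for parent, child in zip(groups, groups[1:]):
            if (parent, child) not in seen:
                seen.add((parent, child))
                edges.append(f'  "{parent}" -> "{child}";')
    return "digraph G {\n" + "\n".join(edges) + "\n}"


# Shortened, illustrative lineages (not the full NCBI strings).
dot_source = lineages_to_dot([
    "Insecta; Endopterygota; Coleoptera; Tribolium",
    "Insecta; Endopterygota; Diptera; Drosophila",
])
print(dot_source)
```

The printed text can be saved to a file and rendered with dot. Because nodes are identified by name here rather than by TaxId, two different taxa that happen to share a name would be merged into one node, and a name containing a double quote would need escaping.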
1
9.1 years ago by
Yannick Wurm2.3k
Queen Mary University London
Yannick Wurm2.3k wrote:

Thanks guys for the exhaustive responses... but perhaps the key part of my question should have been: how can I do this easily, and visualize it easily, without coding?

I figured out a bullsh** hack that works with MEGAN (whose primary purpose is to parse metagenomics BLAST results):

First create a file with one line per species, each followed by a comma and a number (the number is irrelevant):

Drosophila melanogaster,10
Aedes aegypti,10
Anopheles gambiae,10
Apis mellifera,10
Solenopsis invicta,10
Nasonia vitripennis,10
Pediculus humanus,10
Acyrthosiphon pisum,10
Bombyx mori,10
Caenorhabditis elegans,10
Tribolium castaneum,10

Then open MEGAN

File Menu -> "Import CSV"

Tree Menu -> "Node labels on"

If you subsequently do Tree Menu -> "Show Intermediate Labels" twice, you get the following:

[Image: MEGAN tree of the species listed above]

modified 10 months ago by RamRS22k • written 9.1 years ago by Yannick Wurm2.3k
1

This would also work in iTOL, there you can also paste a list of names and get a tree. I think you have to replace " " by "_", though.

modified 10 months ago by RamRS22k • written 9.1 years ago by Michael Kuhn5.0k
1
9.1 years ago by
Science_Robot1.1k
Gainesville, FL
Science_Robot1.1k wrote:

Here's some code I wrote that does just that:

names are strings that look like: Kingdom;Phylum;Class;Order;Family;Genus;Genus species

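def _discrepancy(names1, names2):
    ''' Returns integer reflecting taxonomic discrepancy '''
    # split each lineage string into its levels
    names1 = names1.split(';')
    names2 = names2.split(';')
    counter = 1
    # walk both lineages from the top level down
    for i, j in zip(names1, names2):
        if i == j:
            counter += 1
        else:
            return counter
    # returns None if no difference is found within the shared levels

counter is the first level (counting from 1) at which the taxonomies differ.

A comparison between a;b;c;d;e;f;g and a;b;c;d;x;y;z would result in a discrepancy of 5: the first four levels match, so the fifth is the first to differ.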

modified 10 months ago by RamRS22k • written 9.1 years ago by Science_Robot1.1k