Question: Parsing NCBI Taxonomic Tree?
17
7.4 years ago by
Prohan350
United States
Prohan350 wrote:

Hi, I'd like to assign taxonomies to some of my BLAST hits to NR. So I have the GIs.

I've figured out that the way to do this is by traversing the files in ftp://ftp.ncbi.nih.gov/pub/taxonomy, specifically gi_taxid_prot.dmp and taxdump.

Does anyone have any hints on how to do this? I basically don't understand how to parse the actual tree. I'm planning on doing this in Python.

Thanks

ncbi python taxonomy tree • 19k views
ADD COMMENTlink modified 18 months ago by Biostar ♦♦ 20 • written 7.4 years ago by Prohan350
24
7.4 years ago by
France, Montpellier, CIRAD
Frédéric Mahé2.9k wrote:

A couple months ago I wrote a short shell script that does the job:

#!/bin/bash

NAMES="names.dmp"
NODES="nodes.dmp"
GI_TO_TAXID="gi_taxid_nucl.dmp"  # use gi_taxid_prot.dmp for protein GIs
TAXONOMY=""
GI="${1}"

# Obtain the name corresponding to a taxid or the taxid of the parent taxa
get_name_or_taxid()
{
    grep --max-count=1 "^${1}"$'\t' "${2}" | cut --fields="${3}"
}

# Get the taxid corresponding to the GI number
TAXID=$(get_name_or_taxid "${GI}" "${GI_TO_TAXID}" "2")

# Loop until you reach the root of the taxonomy (i.e. taxid = 1)
while [[ "${TAXID}" -gt 1 ]] ; do
    # Obtain the scientific name corresponding to a taxid
    NAME=$(get_name_or_taxid "${TAXID}" "${NAMES}" "3")
    # Obtain the parent taxa taxid
    PARENT=$(get_name_or_taxid "${TAXID}" "${NODES}" "3")
    # Build the taxonomy path
    TAXONOMY="${NAME};${TAXONOMY}"
    TAXID="${PARENT}"
done

echo -e "${GI}\t${TAXONOMY}"

exit 0

For instance, if you have a table of blast results:

cut -d "|" -f 2 myblast.table | sed -e '/^$/d' | grep -v "^#" | while read GI ; do bash get_ncbi_taxonomy.sh "$GI" ; done

It is not very fast, but it can be easily parallelized:

xargs --arg-file=GI.list --max-procs=8 -I '{}' bash get_ncbi_taxonomy.sh '{}'

With 8 cores, you can process 500-1000 GIs per minute. If you have tens or hundreds of thousands of GIs, it would be more efficient to index everything (Python dictionary?).
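The dictionary idea can be sketched in Python as follows. This is only a minimal sketch, not part of the original scripts: the function names (`read_dmp`, `load_parents`, `load_names`, `lineage`) are made up for illustration, and it assumes the usual taxdump layout where fields are separated by tab, pipe, tab. Each file is read once, after which every lineage lookup is a handful of dictionary accesses instead of repeated greps.

```python
def read_dmp(lines):
    """Yield the list of fields for each row of an NCBI .dmp file."""
    for line in lines:
        # Rows look like "9913\t|\t9903\t|\tspecies\t|\n".
        # Trailing empty fields are dropped, which is fine here because
        # only the leading columns are used.
        yield line.rstrip("\t|\n").split("\t|\t")


def load_parents(nodes_lines):
    """Map each taxid to its parent taxid (columns 1 and 2 of nodes.dmp)."""
    return {int(f[0]): int(f[1]) for f in read_dmp(nodes_lines)}


def load_names(names_lines):
    """Map each taxid to its scientific name, skipping synonyms and misspellings."""
    names = {}
    for fields in read_dmp(names_lines):
        if fields[3] == "scientific name":
            names[int(fields[0])] = fields[1]
    return names


def lineage(taxid, parents, names):
    """Return the names from just below the root down to `taxid`."""
    path = []
    # Stop at the root (taxid 1), as the shell loop does.
    while taxid in parents and taxid != 1:
        path.append(names.get(taxid, str(taxid)))
        taxid = parents[taxid]
    return path[::-1]
```

The loaders take any iterable of lines, so an open file handle works: pass `open("nodes.dmp")` to `load_parents` and `open("names.dmp")` to `load_names`, then call `lineage(taxid, parents, names)` for each taxid of interest.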

There is also a companion script that downloads and prepares NCBI's files:

#!/bin/bash

## Download NCBI's taxonomic data and GI (GenBank ID) taxonomic
## assignments.

## Variables
NCBI="ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/"
TAXDUMP="taxdump.tar.gz"
TAXID="gi_taxid_nucl.dmp.gz"
NAMES="names.dmp"
NODES="nodes.dmp"
DMP=$(echo {citations,division,gencode,merged,delnodes}.dmp)
USELESS_FILES="${TAXDUMP} ${DMP} gc.prt readme.txt"

## Download taxdump
rm -rf ${USELESS_FILES} "${NODES}" "${NAMES}"
wget "${NCBI}${TAXDUMP}" && \
    tar zxvf "${TAXDUMP}" && \
    rm -rf ${USELESS_FILES}

## Limit search space to scientific names
grep "scientific name" "${NAMES}" > "${NAMES/.dmp/_reduced.dmp}" && \
    rm -f "${NAMES}" && \
    mv "${NAMES/.dmp/_reduced.dmp}" "${NAMES}"

## Download gi_taxid_nucl
rm -f "${TAXID/.gz/}"*  # keep the glob outside the quotes so it expands
wget "${NCBI}${TAXID}" && \
    gunzip "${TAXID}"

exit 0
ADD COMMENTlink modified 3.7 years ago • written 7.4 years ago by Frédéric Mahé2.9k
1

Impressive use of Bash and xargs there! But re-grepping the nodes file is not scalable, as you state.

ADD REPLYlink written 7.4 years ago by Torst900
1

If you mean that multiplying concurrent accesses to the same file is not scalable, you're right. For a very large number of GI requests, it would be better to transform the nodes and names files into indexed databases (sqlite or a pickled Python object). But for my level of use, these shell scripts are more than enough.
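As a rough illustration of the sqlite option, the sketch below indexes the GI-to-taxid table with Python's standard `sqlite3` module. The table name `gi_taxid` and both function names are hypothetical, and the input is assumed to be the two-column, tab-separated layout of the gi_taxid_*.dmp files.

```python
import sqlite3


def build_gi_index(lines, db_path):
    """Load 'gi<TAB>taxid' rows into an SQLite table keyed on gi."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS gi_taxid ("
        "gi INTEGER PRIMARY KEY, taxid INTEGER NOT NULL)"
    )
    # Generator: rows are streamed, so the whole file is never held in memory.
    rows = (tuple(int(x) for x in line.split()) for line in lines if line.strip())
    with con:  # commits once after all rows are inserted
        con.executemany("INSERT OR REPLACE INTO gi_taxid VALUES (?, ?)", rows)
    return con


def taxid_for_gi(con, gi):
    """Return the taxid for a GI, or None if the GI is not in the index."""
    row = con.execute("SELECT taxid FROM gi_taxid WHERE gi = ?", (gi,)).fetchone()
    return row[0] if row else None
```

The index is built once on disk (for example from an open handle on the unzipped dump file) and then reused across runs; the primary key on `gi` makes each lookup an index search rather than a scan of the file.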

ADD REPLYlink written 7.2 years ago by Frédéric Mahé2.9k
1

This is really impressive bash scripting. It seems to work great for me. Now just need to understand how it works! Thanks a ton.

ADD REPLYlink written 7.1 years ago by Prohan350
1

very useful information!

ADD REPLYlink written 6.0 years ago by deepthithomaskannan250
1

[SOLVED] Great work, thanks a lot. I have been testing it and I've found a disturbing behavior. As get_name_or_taxid() is getting the first instance of its first argument, it may sometimes pull synonyms or misspellings from names.dmp.

ADD REPLYlink modified 5.0 years ago • written 5.1 years ago by Pablo Sanchez10

I never had that problem. I reviewed and updated the above code, but I don't think it would solve your problem. Could you please give an example of problematic GI?

ADD REPLYlink written 5.0 years ago by Frédéric Mahé2.9k

Yes. I was trying GI 115495057, using gi_taxid_prot.dmp instead of gi_taxid_nucl.dmp. The output of your script is:

115495057 biota; Eucarya; Fungi/Metazoa group; Animalia; Eumetazoa; Bilateria; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Dipnotetrapodomorpha; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Boreoeutheria; Laurasiatheria; Cetartiodactyla; Artiodactyla; Pecora; Bovidae; Bovinae; Bos; Bos Tauurus;

While the lineage for cattle (taxonomy ID 9913) at NCBI's taxonomy browser is:

cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Dipnotetrapodomorpha; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Boreoeutheria; Laurasiatheria; Cetartiodactyla; Ruminantia; Pecora; Bovidae; Bovinae; Bos; Bos taurus

Some of the taxonomic categories are labelled as synonyms or misspellings in names.dmp, and the results I get seem to be the first occurrence in the list regardless of its status.

ADD REPLYlink written 5.0 years ago by Pablo Sanchez10

I just tried with the GI 115495057, and the output is correct:

115495057 cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Dipnotetrapodomorpha; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Boreoeutheria; Laurasiatheria; Cetartiodactyla; Ruminantia; Pecora; Bovidae; Bovinae; Bos; Bos taurus

Did you apply the companion script that reduces names.dmp to only scientific names? That operation gets rid of all synonyms and misspellings.

# limit search space to scientific names
NAMES="names.dmp"
grep "scientific name" ${NAMES} > ${NAMES/.dmp/_reduced.dmp}
mv ${NAMES/.dmp/_reduced.dmp} ${NAMES}
ADD REPLYlink modified 5.0 years ago • written 5.0 years ago by Frédéric Mahé2.9k

My bad... as I had already downloaded the taxdump files, I stopped reading your post after "There is also a companion script that downloads NCBI's files" and I didn't notice the step that restricts the search to scientific names. It is working fine for me now. Again, thanks for the script.

ADD REPLYlink written 5.0 years ago by Pablo Sanchez10

Hi I executed the above "get_ncbi_taxonomy.sh" & I got an error. Am I missing something?

myblast.table contained the following data:

gi|472256744|
gi|461490773|
gi|71143482|
gi|461490773|

I got the following error:

raghul@raghul-Studio-1749:~/db/tax-dump$ cut -d "|" -f 2 myblast.table | sed -e '/^$/d' | grep -v "^#" | while read GI ; do bash get_ncbi_taxonomy.sh "$GI" ; done
tax: line 19: [: : integer expression expected 472256744
get_ncbi_taxonomy.sh: line 19: [: : integer expression expected 461490773
get_ncbi_taxonomy.sh: line 19: [: : integer expression expected 71143482
get_ncbi_taxonomy.sh: line 19: [: : integer expression expected 461490773

thank u raghul

ADD REPLYlink modified 5.9 years ago • written 5.9 years ago by Raghul200
2

Hi Raghul, this is a bug caused by tab characters. I corrected the script (using $'\t') to avoid that.

ADD REPLYlink written 5.9 years ago by Frédéric Mahé2.9k

Hello, the script doesn't work for me. It appears that the GI is correctly returned, but not the TAXONOMY. All ncbi files were downloaded according to the companion script.

kschoonv@molfyl2:~> cut -d "|" -f 2 2blastx | sed -e '/^$/d' | grep -v "^#" | while read GI ; do bash get_ncbi_taxonomy.sh "$GI" ; done
802670096
kschoonv@molfyl2:~>

What's going on here?

ADD REPLYlink written 3.7 years ago by karel.schoonvaere0

Hello,

I've just tried the scripts and they work correctly (apart from a change in NCBI's FTP URL: I updated the companion script). It seems that you are using protein GIs as queries. You need to replace gi_taxid_nucl.dmp with gi_taxid_prot.dmp.

ADD REPLYlink written 3.7 years ago by Frédéric Mahé2.9k
11
7.4 years ago by
jhc2.8k
Germany
jhc2.8k wrote:

If you just want to link GIs to taxon names, parse the "gi_taxid_prot.dmp" to extract the taxids of your hits, and translate them to scientific names using the "names.dmp" file included in "taxdump.tar.gz".
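A minimal Python sketch of that first step, under the assumption that the GI dump is a two-column, tab-separated file (GI, then taxid). The function names are hypothetical. Because the dump is very large, the sketch streams it line by line and keeps only the GIs that appear among the hits.

```python
def gi_from_subject_id(subject_id):
    """Extract the GI from an ID such as 'gi|472256744|' (second '|' field)."""
    return int(subject_id.split("|")[1])


def taxids_for_gis(lines, wanted_gis):
    """Scan 'gi<TAB>taxid' lines once and return {gi: taxid} for the wanted GIs."""
    # Compare as strings so int() is only called on matching rows.
    wanted = {str(gi) for gi in wanted_gis}
    found = {}
    for line in lines:
        gi, _, taxid = line.partition("\t")
        if gi in wanted:
            found[int(gi)] = int(taxid)
            if len(found) == len(wanted):
                break  # every requested GI has been seen
    return found
```

The resulting taxids can then be translated with a taxid-to-name mapping built from the "scientific name" rows of names.dmp.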

If you are also interested in getting the taxonomy tree of the selected species, you will need to parse the parent-child relationships in "nodes.dmp". For this, you could use the ETE Python toolkit to load the whole NCBI taxonomy tree, and then prune it to the selected taxa. Actually, there is an example showing how to do exactly that.

P.S. I would recommend using the latest ETE version (ete2a1). Some functions are still beta, but the pruning and traversing methods are much faster when dealing with such huge trees (>500k nodes).

UPDATE!: ete2a1 is no longer maintained, use the main branch "ete2". I have also uploaded to github the basic script that I usually use to query the NCBI taxonomy tree (https://github.com/jhcepas/ncbi_taxonomy).

ADD COMMENTlink modified 7.0 years ago • written 7.4 years ago by jhc2.8k
1

This is a very nice tool! It also generates a tabular file containing the information of the hierarchy of the taxonomy of each species that might be used in additional analyses.

ADD REPLYlink written 5.4 years ago by Cacau410
4
7.4 years ago by
Michael Kuhn5.0k
EMBL Heidelberg
Michael Kuhn5.0k wrote:

One way would be to parse the nodes.dmp file and keep track of the tree in Python. If you only have a fixed set of taxon ids, you could also paste them into iTOL and use the resulting tree with a Newick parser. Lastly, you could try my fork of the Google Code taxonomy repository. This needs more set-up (SQLAlchemy and a parsed NCBI taxonomy), but then is faster for repeated queries.
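For the first option, here is a small sketch of keeping track of the tree in plain Python. It assumes `parents` is a dictionary mapping every taxid to its parent taxid (the first two columns of nodes.dmp, with the root pointing to itself); the function name `pruned_newick` is made up for illustration. It keeps only the selected taxa and their ancestors and writes the result as a Newick string labelled with taxids.

```python
from collections import defaultdict


def pruned_newick(parents, taxids):
    """Return a Newick string for the tree spanning `taxids` and the root."""
    keep = set()
    for taxid in taxids:
        # Walk up until reaching a node already kept (the root is its own parent).
        while taxid not in keep:
            keep.add(taxid)
            taxid = parents[taxid]

    children = defaultdict(list)
    root = None
    for taxid in keep:
        parent = parents[taxid]
        if parent == taxid:
            root = taxid
        else:
            children[parent].append(taxid)
    if root is None:
        raise ValueError("no taxids given")

    def render(node):
        kids = sorted(children.get(node, []))
        if not kids:
            return str(node)
        return "(" + ",".join(render(kid) for kid in kids) + ")" + str(node)

    return render(root) + ";"
```

Internal nodes with a single child are kept rather than collapsed, so each leaf's path in the output mirrors its full NCBI lineage. The recursion depth equals the lineage depth, which stays small for this taxonomy.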

ADD COMMENTlink written 7.4 years ago by Michael Kuhn5.0k
4
3.7 years ago by
jhc2.8k
Germany
jhc2.8k wrote:

The ETE toolkit (v2.3+) lets you query the NCBI taxonomy database very easily. You can dump annotated trees by querying with taxids or species names, or get extended taxon information. Both an API and a command line tool are available.

ADD COMMENTlink written 3.7 years ago by jhc2.8k
2
2.1 years ago by
-_-780
Canada
-_-780 wrote:

The whole NCBI taxonomy database is not that big. I have written some code to convert NCBI taxdump into lineages identified by tax ids, https://github.com/zyxue/ncbitax2lin. You may find it useful.

ADD COMMENTlink written 2.1 years ago by -_-780
1
4.4 years ago by
luanax8520
European Union
luanax8520 wrote:

Is there any way to use this script with a list of organism names instead of GIs or taxids?

ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by luanax8520