Parsing NCBI Taxonomic Tree?
10.0 years ago
Prohan ▴ 350

Hi, I'd like to assign taxonomies to some of my BLAST hits to NR. So I have the GIs.

I've figured that the way to do this is by traversing the files in: ftp://ftp.ncbi.nih.gov/pub/taxonomy - specifically: gi_taxid_prot.dmp and taxdmp

Does anyone have any hints on how to do this? I basically don't understand how to parse the actual tree. I'm planning on doing this in Python.

Thanks

ncbi taxonomy python tree • 22k views

Is there any way to use this script with an array of organism names instead of GIs or taxids?

10.0 years ago

A couple months ago I wrote a short shell script that does the job:

#!/bin/bash

NAMES="names.dmp"
NODES="nodes.dmp"
GI_TO_TAXID="gi_taxid_nucl.dmp"
TAXONOMY=""
GI="${1}"

# Obtain the name corresponding to a taxid or the taxid of the parent taxa
get_name_or_taxid()
{
    grep --max-count=1 "^${1}"$'\t' "${2}" | cut --fields="${3}"
}

# Get the taxid corresponding to the GI number
TAXID=$(get_name_or_taxid "${GI}" "${GI_TO_TAXID}" "2")

# Loop until you reach the root of the taxonomy (i.e. taxid = 1)
while [[ "${TAXID}" -gt 1 ]] ; do
    # Obtain the scientific name corresponding to a taxid
    NAME=$(get_name_or_taxid "${TAXID}" "${NAMES}" "3")
    # Obtain the parent taxa taxid
    PARENT=$(get_name_or_taxid "${TAXID}" "${NODES}" "3")
    # Build the taxonomy path
    TAXONOMY="${NAME};${TAXONOMY}"
    TAXID="${PARENT}"
done

echo -e "${GI}\t${TAXONOMY}"

exit 0


For instance, if you have a table of blast results:

cut -d "|" -f 2 myblast.table | sed -e '/^$/d' | grep -v "^#" | while read GI ; do bash get_ncbi_taxonomy.sh "$GI" ; done


It is not very fast, but it can be easily parallelized:

xargs --arg-file=GI.list --max-procs=8 -I '{}' bash get_ncbi_taxonomy.sh '{}'


With 8 cores, you can process 500-1000 GIs per minute. If you have tens or hundreds of thousands of GIs, it would be more efficient to index everything (python dictionary?).
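For what it's worth, the "python dictionary" idea could look roughly like the sketch below. It assumes the standard taxdump layout (fields separated by tab-pipe-tab, with the taxid first, then the parent taxid in nodes.dmp or the name in names.dmp). The function names (read_dmp, load_parents, load_scientific_names, lineage) are made up for this example, and the data is a toy tree, not the real NCBI one.

```python
import io


def read_dmp(handle):
    """Yield the list of fields for each row of a taxdump-style file."""
    for line in handle:
        line = line.rstrip("\n")
        if line.endswith("\t|"):
            line = line[:-2]  # drop the end-of-row marker
        if line:
            yield line.split("\t|\t")


def load_parents(handle):
    """Map each taxid to its parent taxid (first two fields of nodes.dmp)."""
    return {int(f[0]): int(f[1]) for f in read_dmp(handle)}


def load_scientific_names(handle):
    """Map each taxid to its scientific name, skipping synonyms and misspellings."""
    return {int(f[0]): f[1] for f in read_dmp(handle) if f[3] == "scientific name"}


def lineage(taxid, parents, names):
    """Return the names from just below the root down to taxid."""
    path = []
    while taxid in parents and taxid != 1:
        path.append(names.get(taxid, str(taxid)))
        taxid = parents[taxid]
    path.reverse()
    return path


# Toy stand-ins for nodes.dmp and names.dmp (hypothetical, heavily shortened).
NODES_TEXT = (
    "1\t|\t1\t|\tno rank\t|\n"
    "9903\t|\t1\t|\tgenus\t|\n"
    "9913\t|\t9903\t|\tspecies\t|\n"
)
NAMES_TEXT = (
    "1\t|\troot\t|\t\t|\tscientific name\t|\n"
    "9903\t|\tBos\t|\t\t|\tscientific name\t|\n"
    "9913\t|\tBos Tauurus\t|\t\t|\tmisspelling\t|\n"
    "9913\t|\tBos taurus\t|\t\t|\tscientific name\t|\n"
)

parents = load_parents(io.StringIO(NODES_TEXT))
names = load_scientific_names(io.StringIO(NAMES_TEXT))
print(";".join(lineage(9913, parents, names)))
```

With the real files you would pass open("nodes.dmp") and open("names.dmp") instead of the StringIO objects. Both dictionaries are built once, so each lookup afterwards only costs a few dictionary accesses instead of a grep per level.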

There is also a companion script that downloads and prepares NCBI's files:

#!/bin/bash

## assignation.

## Variables
NCBI="ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/"
TAXDUMP="taxdump.tar.gz"
TAXID="gi_taxid_nucl.dmp.gz"
NAMES="names.dmp"
NODES="nodes.dmp"
DMP=$(echo {citations,division,gencode,merged,delnodes}.dmp)
USELESS_FILES="${TAXDUMP} ${DMP} gc.prt readme.txt"

## Download taxdump
rm -rf ${USELESS_FILES} "${NODES}" "${NAMES}"
wget "${NCBI}${TAXDUMP}" && \
tar zxvf "${TAXDUMP}" && \
rm -rf ${USELESS_FILES}

## Limit search space to scientific names
grep "scientific name" "${NAMES}" > "${NAMES/.dmp/_reduced.dmp}" && \
rm -f "${NAMES}" && \
mv "${NAMES/.dmp/_reduced.dmp}" "${NAMES}"

## Download gi_taxid_nucl
rm -f "${TAXID/.gz/}"*
wget "${NCBI}${TAXID}" && \
gunzip "${TAXID}"

exit 0

Impressive use of Bash and xargs there! But re-grepping the nodes file is not scalable, as you state.

If you mean that multiplying concurrent accesses to the same file is not something scalable, you're right. For a very large number of GI requests, it would be better to transform the nodes and names files into indexed databases (sqlite or python pickled object). But for my level of use, these shell scripts are more than enough.

This is really impressive bash scripting. It seems to work great for me. Now I just need to understand how it works! Thanks a ton.

Very useful information!

[SOLVED] Great work, thanks a lot. I have been testing it and I've found a disturbing behavior. As get_name_or_taxid() is getting the first instance of its first argument, it may sometimes pull synonyms or misspellings from names.dmp.

I never had that problem. I reviewed and updated the above code, but I don't think it would solve your problem. Could you please give an example of a problematic GI?

Yes. I was trying GI 115495057, using gi_taxid_prot.dmp instead of gi_taxid_nucl.dmp.
The output of your script is:

115495057 biota; Eucarya; Fungi/Metazoa group; Animalia; Eumetazoa; Bilateria; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Dipnotetrapodomorpha; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Boreoeutheria; Laurasiatheria; Cetartiodactyla; Artiodactyla; Pecora; Bovidae; Bovinae; Bos; Bos Tauurus;

While the lineage for cattle (taxonomy ID 9913) at NCBI's taxonomy browser is:

cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Dipnotetrapodomorpha; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Boreoeutheria; Laurasiatheria; Cetartiodactyla; Ruminantia; Pecora; Bovidae; Bovinae; Bos; Bos taurus

Some of the taxonomic categories are labelled as synonyms or misspellings in names.dmp, and the results I get seem to be the first occurrence in the list, independently of its status.

I just tried with the GI 115495057, and the output is correct:

115495057 cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Dipnotetrapodomorpha; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Boreoeutheria; Laurasiatheria; Cetartiodactyla; Ruminantia; Pecora; Bovidae; Bovinae; Bos; Bos taurus

Did you apply the companion script that reduces names.dmp to only scientific names? That operation gets rid of all synonyms and misspellings.

# limit search space to scientific names
NAMES="names.dmp"
grep "scientific name" ${NAMES} > ${NAMES/.dmp/_reduced.dmp}
mv ${NAMES/.dmp/_reduced.dmp} ${NAMES}

My bad... as I had downloaded the taxdump files already, I stopped reading your post after "There is also a companion script that downloads NCBI's files" and I didn't notice the step to search only for scientific names. It is working fine for me now. Again, thanks for the script.

Hi, I executed the above "get_ncbi_taxonomy.sh" and I got an error. Am I missing something? myblast.table contained the following data:

gi|472256744|
gi|461490773|
gi|71143482|
gi|461490773|

I got the following error:

raghul@raghul-Studio-1749:~/db/tax-dump$ cut -d "|" -f 2 myblast.table | sed -e '/^$/d' | grep -v "^#" | while read GI ; do bash get_ncbi_taxonomy.sh "$GI" ; done
get_ncbi_taxonomy.sh: line 19: [: : integer expression expected
472256744
get_ncbi_taxonomy.sh: line 19: [: : integer expression expected
461490773
get_ncbi_taxonomy.sh: line 19: [: : integer expression expected
71143482
get_ncbi_taxonomy.sh: line 19: [: : integer expression expected
461490773

Thank you,
raghul


Hi Raghul, this is a bug caused by tab characters. I corrected the script (using $'\t') to avoid that.

Hello, the script doesn't work for me. It appears that the GI is correctly returned, but not the TAXONOMY. All NCBI files were downloaded according to the companion script.

kschoonv@molfyl2:~> cut -d "|" -f 2 2blastx | sed -e '/^$/d' | grep -v "^#" | while read GI ; do bash get_ncbi_taxonomy.sh "$GI" ; done
802670096
kschoonv@molfyl2:~>

What's going on here?


Hello,

I've just tried the scripts and they work correctly (apart from a change in NCBI's FTP URL, which I have updated in the companion script). It seems that you are using protein GIs as queries. You need to replace gi_taxid_nucl.dmp with gi_taxid_prot.dmp.

10.0 years ago
jhc ★ 2.9k

If you just want to link GIs to taxon names, parse the "gi_taxid_prot.dmp" to extract the taxids of your hits, and translate them to scientific names using the "names.dmp" file included in "taxdump.tar.gz".
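As a rough illustration of the first step, here is one way to pull the taxids for a handful of GIs out of a GI-to-taxid dump without loading the whole file into memory. It assumes the two-column, tab-separated layout (GI, then taxid) that the shell script above also relies on. The function name taxids_for_gis is invented for this sketch, and the sample rows other than the GI 115495057 / taxid 9913 pair discussed in this thread are made up.

```python
import io


def taxids_for_gis(handle, wanted_gis):
    """Scan a 'GI<TAB>taxid' file once and return {gi: taxid} for the wanted GIs."""
    remaining = {int(gi) for gi in wanted_gis}
    found = {}
    for line in handle:
        if not remaining:
            break  # every requested GI has been seen, stop reading
        gi_text, _, taxid_text = line.partition("\t")
        if not taxid_text.strip():
            continue  # skip blank or malformed rows
        gi = int(gi_text)
        if gi in remaining:
            found[gi] = int(taxid_text)
            remaining.discard(gi)
    return found


# Toy stand-in for gi_taxid_prot.dmp (hypothetical rows).
GI_TAXID_TEXT = "100\t9606\n115495057\t9913\n200\t10090\n"

hits = taxids_for_gis(io.StringIO(GI_TAXID_TEXT), [115495057, 999])
print(hits)
```

GIs that are missing from the file are simply absent from the returned dictionary, so it is worth checking for those before translating taxids to names.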

If you are also interested in getting the taxonomy tree of the selected species, you will need to parse the parent-child relationships in "nodes.dmp". For this, you could use the ETE Python toolkit to load the whole NCBI taxonomy tree, and then prune it to the selected taxa. Actually, there is an example showing how to do exactly that.
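If you would rather see what such pruning involves before installing anything, the sketch below does a plain-Python version. This is not ETE's implementation or API, only an illustration under some assumptions: parents is a taxid-to-parent dictionary built from nodes.dmp, names maps taxids to scientific names, and the helper names (pruned_children, to_newick) are hypothetical. It keeps every ancestor of the selected taxa and writes the result as Newick.

```python
from collections import defaultdict


def pruned_children(parents, selected):
    """Return {parent: set(children)} covering only the selected taxids and their ancestors."""
    children = defaultdict(set)
    for taxid in selected:
        node = taxid
        # Climb towards the root, which is its own parent in nodes.dmp.
        while node in parents and parents[node] != node:
            parent = parents[node]
            if node in children[parent]:
                break  # the rest of this path was already recorded
            children[parent].add(node)
            node = parent
    return children


def to_newick(node, children, names):
    """Serialise the subtree below node, labelling internal nodes as well as leaves."""
    label = names.get(node, str(node)).replace(" ", "_")
    kids = sorted(children.get(node, ()))
    if not kids:
        return label
    return "(" + ",".join(to_newick(kid, children, names) for kid in kids) + ")" + label


# Toy data (a real run would load these from nodes.dmp and names.dmp).
parents = {1: 1, 9903: 1, 9913: 9903, 9915: 9903, 9606: 1}
names = {1: "root", 9903: "Bos", 9913: "Bos taurus", 9915: "Bos indicus", 9606: "Homo sapiens"}

children = pruned_children(parents, [9913, 9915])
newick = to_newick(1, children, names) + ";"
print(newick)
```

Names containing parentheses, commas or colons would need quoting before they are valid Newick labels, which this sketch does not handle.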

P.S. I would recommend using the latest ETE version (ete2a1). Some functions are still beta, but the pruning and traversing methods are much faster when dealing with such huge (>500k nodes) trees.

UPDATE!: ete2a1 is no longer maintained; use the main branch "ete2". I have also uploaded to GitHub the basic script that I usually use to query the NCBI taxonomy tree (https://github.com/jhcepas/ncbi_taxonomy).


This is a very nice tool! It also generates a tabular file containing the information of the hierarchy of the taxonomy of each species that might be used in additional analyses.

6.3 years ago
jhc ★ 2.9k

The ETE toolkit (v2.3+) lets you query the NCBI taxonomy database very easily. You can dump annotated trees by querying with taxids or species names, or get extended taxa information. There is an API and a command line tool available.

10.0 years ago

One way would be to parse the nodes.dmp file and keep track of the tree in Python. If you only have a fixed set of taxon ids, you could also paste them into iTOL and use the resulting tree with a Newick parser. Lastly, you could try my fork of the Google Code taxonomy repository. This needs more set-up (SQLAlchemy and a parsed NCBI taxonomy), but then is faster for repeated queries.
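To give an idea of why a database pays off for repeated queries, here is a small sketch using Python's built-in sqlite3 module. It is not the SQLAlchemy-based repository mentioned above, just a stand-in with an assumed two-table schema (nodes and names); the function names build_index and lineage_from_db are invented for the example, and the rows are toy data.

```python
import sqlite3


def build_index(conn, node_rows, name_rows):
    """Create and fill the two lookup tables; taxid is the primary key of both."""
    conn.execute("CREATE TABLE nodes (taxid INTEGER PRIMARY KEY, parent INTEGER NOT NULL)")
    conn.execute("CREATE TABLE names (taxid INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.executemany("INSERT INTO nodes VALUES (?, ?)", node_rows)
    conn.executemany("INSERT INTO names VALUES (?, ?)", name_rows)
    conn.commit()


def lineage_from_db(conn, taxid):
    """Walk from taxid up to the root (taxid 1) with one indexed lookup per level."""
    query = (
        "SELECT nodes.parent, names.name FROM nodes "
        "LEFT JOIN names ON names.taxid = nodes.taxid WHERE nodes.taxid = ?"
    )
    path = []
    while taxid != 1:
        row = conn.execute(query, (taxid,)).fetchone()
        if row is None:
            break  # unknown taxid
        parent, name = row
        path.append(name if name is not None else str(taxid))
        if parent == taxid:
            break  # guard against a self-referencing node
        taxid = parent
    path.reverse()
    return path


# In-memory database with toy rows; a real index would live in a file on disk.
conn = sqlite3.connect(":memory:")
build_index(
    conn,
    [(1, 1), (9903, 1), (9913, 9903)],
    [(1, "root"), (9903, "Bos"), (9913, "Bos taurus")],
)
print(lineage_from_db(conn, 9913))
```

The tables are built once from nodes.dmp and the scientific names in names.dmp, after which any number of scripts can query the same file without re-parsing the dumps.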

4.7 years ago
-_- ▴ 940

The whole NCBI taxonomy database is not that big. I have written some code to convert NCBI taxdump into lineages identified by tax ids, https://github.com/zyxue/ncbitax2lin. You may find it useful.