Question: Download whole dataset from NCBI Taxonomy
stackf0330 wrote:

Hello. Where can I download the NCBI taxonomy data file from the NCBI database?

The file I am looking for should contain the following:

1. Taxonomy ID

2. Common Name

3. Scientific Name



If anyone can provide me with the link, I'd be grateful. Thanks & regards.

taxonomy ncbi • 6.2k views
ADD COMMENTlink modified 11 days ago by sebastian.zn0 • written 3.8 years ago by stackf0330

Hi there, and thank you stackf03 for this thread. I need an automated way to include the TaxaDB in a small thesis project. I hope you are still active members and can help me with the following questions:

  1. NCBI keeps uploading the whole TaxaDB to their FTP address in the form shown in this thread. Do you know of any other source for this data, or better yet, one in a different format? I need an automated way to import (and update) the taxa section of our DB. The dmp files are hard to handle (NCBI uses MySQL, but these dump files do not come directly from MySQL).

  2. If there is no other source for the data itself, is there any software that uses TaxaDB as part of its functioning? I will give this one a try: Taxadb. I would appreciate hearing about any other tool.

  3. The common name is stored in the names file (names.dmp): each name that a tax_id has gets its own row, and each row indicates the name class. I mention this in case someone else finds this thread and wonders whether the common name is there or not.

Thanks.
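
For anyone with the same import problem, here is a minimal sketch of turning a .dmp file into plain tab-separated text that most database loaders accept. It assumes the layout described in the readme shipped with the dump (fields separated by tab, pipe, tab, and each row ending with tab, pipe). The function name `dmp_to_tsv` is made up for illustration.

```shell
# Convert an NCBI *.dmp file to plain tab-separated text on stdout.
# Assumption: fields are separated by <tab>|<tab> and rows end with <tab>|.
# usage: dmp_to_tsv FILE
dmp_to_tsv() {
  awk 'BEGIN { FS = "\t[|]\t"; OFS = "\t" }
    {
      sub(/\t[|]$/, "", $0)   # strip the row terminator; awk re-splits the fields
      $1 = $1                 # rebuild the record so fields are joined with OFS
      print
    }' "$1"
}
```

Something like `dmp_to_tsv names.dmp > names.tsv` could then be re-run after each new dump is downloaded, and the resulting file loaded with whatever bulk-import command your database provides.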

ADD REPLYlink written 11 days ago by sebastian.zn0
Phil S.660 wrote:

This is your site! And the file you want to download is this one.

 

HTH

ADD COMMENTlink written 3.8 years ago by Phil S.660

Just to be clear, the linked archive contains more than one file.

@stackf03: You would want to take a look at the readme that goes with that dump.

ADD REPLYlink written 3.8 years ago by genomax71k

Does this contain the taxID, scientific name, and common name?

ADD REPLYlink written 3.8 years ago by stackf0330

It contains this:

 

nodes.dmp
---------

This file represents taxonomy nodes. The description for each node includes 
the following fields:

	tax_id					-- node id in GenBank taxonomy database
 	parent tax_id				-- parent node id in GenBank taxonomy database
 	rank					-- rank of this node (superkingdom, kingdom, ...) 
 	embl code				-- locus-name prefix; not unique
 	division id				-- see division.dmp file
 	inherited div flag  (1 or 0)		-- 1 if node inherits division from parent
 	genetic code id				-- see gencode.dmp file
 	inherited GC  flag  (1 or 0)		-- 1 if node inherits genetic code from parent
 	mitochondrial genetic code id		-- see gencode.dmp file
 	inherited MGC flag  (1 or 0)		-- 1 if node inherits mitochondrial gencode from parent
 	GenBank hidden flag (1 or 0)            -- 1 if name is suppressed in GenBank entry lineage
 	hidden subtree root flag (1 or 0)       -- 1 if this subtree has no sequence data yet
 	comments				-- free-text comments and citations

names.dmp
---------
Taxonomy names file has these fields:

	tax_id					-- the id of node associated with this name
	name_txt				-- name itself
	unique name				-- the unique variant of this name if name not unique
	name class				-- (synonym, common name, ...)

 

From these you can put your parts together.
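
As a rough sketch of putting the parts together, the function below builds one line per tax_id with its scientific name and a common name, using only names.dmp. It assumes fields are separated by tab, pipe, tab and rows end with tab, pipe, as the readme for the dump describes. The function name `taxid_name_table` and the rule for choosing among several common names are my own choices, not anything NCBI prescribes.

```shell
# Print "tax_id<TAB>scientific name<TAB>common name" from a names.dmp file.
# A tax_id can have several common names; this sketch prefers the
# "genbank common name", falls back to the first "common name",
# and leaves the third column empty if there is neither.
# usage: taxid_name_table FILE
taxid_name_table() {
  awk 'BEGIN { FS = "\t[|]\t"; OFS = "\t" }
    {
      sub(/\t[|]$/, "", $4)   # drop the row terminator from the name class
      if ($4 == "scientific name")
        sci[$1] = $2
      else if ($4 == "genbank common name")
        common[$1] = $2
      else if ($4 == "common name" && !($1 in common))
        common[$1] = $2
    }
    END {
      for (id in sci) print id, sci[id], common[id]
    }' "$1" | sort -n
}
```

You would call it as `taxid_name_table names.dmp > taxid_names.tsv`. It keeps every scientific name in memory, which should be fine on a normal workstation but is worth knowing for a file this size.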

ADD REPLYlink written 3.8 years ago by Phil S.660

Thanks for this.

So basically, I need the taxonomy names file, which is names.dmp!

May I know which tool you use to open this file, please? :)

ADD REPLYlink written 3.8 years ago by stackf0330

That should be a text file. It would likely be large so you may not want to open it in a standard editor. It would be best to use awk to pull out the fields you need.

ADD REPLYlink written 3.8 years ago by genomax71k

I managed to open it with the Sublime Text editor. It contains this:

1 | all |  | synonym |
1 | root |  | scientific name |
2 | Bacteria | Bacteria <prokaryote> | scientific name |
2 | Monera | Monera <Bacteria> | in-part |
2 | Procaryotae | Procaryotae <Bacteria> | in-part |
2 | Prokaryota | Prokaryota <Bacteria> | in-part |
2 | Prokaryotae | Prokaryotae <Bacteria> | in-part |
2 | bacteria | bacteria <blast2> | blast name |
2 | eubacteria |  | genbank common name |
2 | not Bacteria Haeckel 1894 |  | synonym |
2 | prokaryote | prokaryote <Bacteria> | in-part |
2 | prokaryotes | prokaryotes <Bacteria> | in-part |
6 | Azorhizobium |  | scientific name |
6 | Azorhizobium Dreyfus et al. 1988 emend. Lang et al. 2013 |  | authority |
6 | Azotirhizobium |  | misspelling |
7 | ATCC 43989 |  | type material |
7 | Azorhizobium caulinodans |  | scientific name |
7 | Azorhizobium caulinodans Dreyfus et al. 1988 |  | synonym |
7 | Azotirhizobium caulinodans |  | equivalent name |
7 | CCUG 26647 |  | type material |
7 | DSM 5975 |  | type material |
7 | IFO 14845 |  | type material |
7 | JCM 20966 |  | type material |
7 | LMG 6465 |  | type material |
7 | NBRC 14845 |  | type material |
7 | ORS 571 |  | type material |
9 | Acyrthosiphon pisum symbiont P |  | includes |
9 | Buchnera aphidicola |  | scientific name |
9 | Buchnera aphidicola Munson et al. 1991 |  | synonym |
10 | "Cellvibrio" Winogradsky 1929 |  | synonym |
10 | Cellvibrio |  | scientific name |
10 | Cellvibrio (ex Winogradsky 1929) Blackall et al. 1986 emend. Humphry et al. 2003 |  | synonym |

 

ADD REPLYlink written 3.8 years ago by stackf0330

Does this make sense?

ADD REPLYlink written 3.8 years ago by stackf0330

The common name is in this file as well, but only for taxa that have one: look for rows whose name class (the fourth field) is "common name" or "genbank common name", such as "eubacteria" for tax_id 2 in your sample. TaxID and name are the first two fields here.

The following should give you the records (taxID, name) labelled "scientific name" in names.dmp:

$  awk -F "|" '$4 ~ /scientific/ {print $1"\t"$2}' names.dmp > sci_names
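
A slightly stricter variant of the command above, as a sketch: it splits on the full delimiter so the output carries no stray tabs, and it matches the name class exactly, so the same function works for "genbank common name" too. It assumes fields are separated by tab, pipe, tab and rows end with tab, pipe, as the readme for the dump describes. The function name `names_by_class` is made up for illustration.

```shell
# Print "tax_id<TAB>name" for every row whose name class equals CLASS exactly.
# Assumption: fields are separated by <tab>|<tab> and rows end with <tab>|.
# usage: names_by_class CLASS FILE
names_by_class() {
  awk -v cls="$1" 'BEGIN { FS = "\t[|]\t" }
    {
      sub(/\t[|]$/, "", $4)   # drop the row terminator from the last field
      if ($4 == cls) print $1 "\t" $2
    }' "$2"
}
```

For example, `names_by_class "scientific name" names.dmp > sci_names` and `names_by_class "genbank common name" names.dmp > common_names`.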

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by genomax71k