7.2 years ago
stackf03 ▴ 40

Hello. I would like to know where I can download the NCBI taxonomy data file from the NCBI database.

The file I am looking for should contain the following:

1. Taxonomy ID
2. Common Name
3. Scientific Name

If anyone can provide me the link, I'd be grateful. Thanks & Regards.

NCBI Taxonomy • 11k views

Hi there, and thank you, stackf03, for this thread. I need an automated way to include the TaxaDB in a small thesis project. I hope you are still active members and can help with the following questions:

1. NCBI keeps uploading the whole TaxaDB to their FTP address in the form shown in this thread. Do you know of any other source for this data or, better yet, a different format? I need an automated way to import (and update) the taxa section of our DB. The dmp files are hard to handle (NCBI uses MySQL, but these dump files are not directly from MySQL).

2. If there is no other source for the data itself, is there any software that uses TaxaDB as part of its functioning? I will give this one a try: Taxadb. I would appreciate hearing about any other tool.

3. The 'common name' is stored in the names file: each name that a tax_id has gets its own row, and each row indicates the name class. I mention this in case someone else finds this thread and wonders whether the common name is there or not.

Thanks.
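
For the automated import asked about above, one possible approach is to load names.dmp into a small SQLite table and rebuild it whenever a new dump is downloaded. This is only a minimal sketch: the table layout, the function names and the file paths are my own assumptions, and it assumes every row has the four fields listed in the readme, separated by `|`.

```python
import sqlite3


def parse_dmp_line(line):
    """Split one .dmp row on '|' and trim the padding around each field."""
    row = line.strip()
    if row.endswith("|"):  # rows end with a trailing delimiter
        row = row[:-1]
    return [field.strip() for field in row.split("|")]


def load_names(conn, lines):
    """(Re)build a `names` table from an iterable of names.dmp lines."""
    conn.execute("DROP TABLE IF EXISTS names")
    conn.execute(
        "CREATE TABLE names (tax_id INTEGER, name_txt TEXT, "
        "unique_name TEXT, name_class TEXT)"
    )
    # Skip blank lines; keep the four fields described in the readme.
    rows = (parse_dmp_line(line)[:4] for line in lines if line.strip())
    conn.executemany("INSERT INTO names VALUES (?, ?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_names_tax_id ON names (tax_id)")
    conn.commit()


def load_names_file(db_path, dmp_path):
    """Convenience wrapper; both paths are placeholders chosen by the caller."""
    conn = sqlite3.connect(db_path)
    try:
        with open(dmp_path, encoding="utf-8") as handle:
            load_names(conn, handle)
    finally:
        conn.close()
```

Calling something like `load_names_file("taxonomy.sqlite", "names.dmp")` after each download would replace the table in one go, which is simpler than diffing two dumps, though it may be slow for a file of this size.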

7.2 years ago
Phil S. ▴ 700

This is your site! And the file you want to download is this one.

HTH


Just to be clear, the linked file is an archive that contains several files, not a single file.

@stackf03: You would want to take a look at the readme that goes with that dump.


Does this contain the taxID, scientific name and common name?


It contains this:

nodes.dmp
---------

This file represents taxonomy nodes. The description for each node includes
the following fields:

tax_id                              -- node id in GenBank taxonomy database
parent tax_id                       -- parent node id in GenBank taxonomy database
rank                                -- rank of this node (superkingdom, kingdom, ...)
embl code                           -- locus-name prefix; not unique
division id                         -- see division.dmp file
inherited div flag (1 or 0)         -- 1 if node inherits division from parent
genetic code id                     -- see gencode.dmp file
inherited GC flag (1 or 0)          -- 1 if node inherits genetic code from parent
mitochondrial genetic code id       -- see gencode.dmp file
inherited MGC flag (1 or 0)         -- 1 if node inherits mitochondrial gencode from parent
GenBank hidden flag (1 or 0)        -- 1 if name is suppressed in GenBank entry lineage
hidden subtree root flag (1 or 0)   -- 1 if this subtree has no sequence data yet

names.dmp
---------
Taxonomy names file has these fields:

tax_id                  -- the id of node associated with this name
name_txt                -- name itself
unique name             -- the unique variant of this name if name not unique
name class              -- (synonym, common name, ...)


From there you can put the parts together.
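
As a rough illustration of putting the two files together, the sketch below maps each tax_id to its scientific name (from names.dmp) and to its parent and rank (the first three nodes.dmp fields in the readme), then walks the parent links upwards. The function names are made up for this example, and it assumes both files are `|`-delimited with a trailing `|` on every row.

```python
def parse_dmp_line(line):
    """Split one .dmp row on '|' and trim the padding around each field."""
    row = line.strip()
    if row.endswith("|"):  # rows end with a trailing delimiter
        row = row[:-1]
    return [field.strip() for field in row.split("|")]


def scientific_names(names_lines):
    """Map tax_id -> scientific name (name class is the fourth field)."""
    result = {}
    for line in names_lines:
        fields = parse_dmp_line(line)
        if len(fields) >= 4 and fields[3] == "scientific name":
            result[fields[0]] = fields[1]
    return result


def node_info(nodes_lines):
    """Map tax_id -> (parent tax_id, rank) from nodes.dmp rows."""
    result = {}
    for line in nodes_lines:
        fields = parse_dmp_line(line)
        if len(fields) >= 3:
            result[fields[0]] = (fields[1], fields[2])
    return result


def lineage(tax_id, nodes, names):
    """Follow parent links upwards, returning (name, rank) pairs."""
    path = []
    seen = set()  # guards against a node that lists itself as parent
    while tax_id in nodes and tax_id not in seen:
        seen.add(tax_id)
        parent, rank = nodes[tax_id]
        path.append((names.get(tax_id, "?"), rank))
        tax_id = parent
    return path
```

Both mapping functions accept any iterable of lines, so an open file handle can be passed directly without reading the whole file into memory first.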


Thanks for this.

So basically, I would need the taxonomy names file, which is names.dmp!

May I know which tool you use to open this file, please? :)


That should be a text file. It would likely be large so you may not want to open it in a standard editor. It would be best to use awk to pull out the fields you need.
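
If awk is not at hand, the same idea of streaming through the file instead of opening it can be sketched in Python. The example below just counts how often each name class occurs, which is a quick way to see what kinds of names the file holds. The function name is hypothetical, and the `|` delimiter with a trailing `|` per row is assumed.

```python
from collections import Counter


def parse_dmp_line(line):
    """Split one .dmp row on '|' and trim the padding around each field."""
    row = line.strip()
    if row.endswith("|"):  # rows end with a trailing delimiter
        row = row[:-1]
    return [field.strip() for field in row.split("|")]


def count_name_classes(lines):
    """Tally the fourth field (name class) one line at a time."""
    counts = Counter()
    for line in lines:
        fields = parse_dmp_line(line)
        if len(fields) >= 4:
            counts[fields[3]] += 1
    return counts
```

Passing an open file handle, for example `count_name_classes(open("names.dmp", encoding="utf-8"))`, reads one line at a time, so memory use should stay small however large the file is.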


I have managed to open it with the Sublime Text editor. It contains this:

1 | all |  | synonym |
1 | root |  | scientific name |
2 | Bacteria | Bacteria <prokaryote> | scientific name |
2 | Monera | Monera <Bacteria> | in-part |
2 | Procaryotae | Procaryotae <Bacteria> | in-part |
2 | Prokaryota | Prokaryota <Bacteria> | in-part |
2 | Prokaryotae | Prokaryotae <Bacteria> | in-part |
2 | bacteria | bacteria <blast2> | blast name |
2 | eubacteria |  | genbank common name |
2 | not Bacteria Haeckel 1894 |  | synonym |
2 | prokaryote | prokaryote <Bacteria> | in-part |
2 | prokaryotes | prokaryotes <Bacteria> | in-part |
6 | Azorhizobium |  | scientific name |
6 | Azorhizobium Dreyfus et al. 1988 emend. Lang et al. 2013 |  | authority |
6 | Azotirhizobium |  | misspelling |
7 | ATCC 43989 |  | type material |
7 | Azorhizobium caulinodans |  | scientific name |
7 | Azorhizobium caulinodans Dreyfus et al. 1988 |  | synonym |
7 | Azotirhizobium caulinodans |  | equivalent name |
7 | CCUG 26647 |  | type material |
7 | DSM 5975 |  | type material |
7 | IFO 14845 |  | type material |
7 | JCM 20966 |  | type material |
7 | LMG 6465 |  | type material |
7 | NBRC 14845 |  | type material |
7 | ORS 571 |  | type material |
9 | Acyrthosiphon pisum symbiont P |  | includes |
9 | Buchnera aphidicola |  | scientific name |
9 | Buchnera aphidicola Munson et al. 1991 |  | synonym |
10 | "Cellvibrio" Winogradsky 1929 |  | synonym |
10 | Cellvibrio |  | scientific name |
10 | Cellvibrio (ex Winogradsky 1929) Blackall et al. 1986 emend. Humphry et al. 2003 |  | synonym |


Does this make sense?


Looks like you won't get the "common name" from this file. Look at the other files included in the archive. TaxID and scientific names are the first two fields here. Unless you don't need the common name.

The following should give you the records (taxID, name) labelled as "scientific name" in names.dmp:

$ awk -F "|" '$4 ~ /scientific/ {print $1"\t"$2}' names.dmp > sci_names
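
The sample posted above does include a `genbank common name` row for taxID 2, and the readme lists `common name` as a name class, so the same filtering idea may extend to common names. Here is a minimal Python sketch that collects taxID, scientific name and common name in one pass. The function name and the rule of preferring `genbank common name` over `common name` are my own assumptions, not something the readme prescribes.

```python
def parse_dmp_line(line):
    """Split one .dmp row on '|' and trim the padding around each field."""
    row = line.strip()
    if row.endswith("|"):  # rows end with a trailing delimiter
        row = row[:-1]
    return [field.strip() for field in row.split("|")]


def name_rows(lines):
    """Return (tax_id, scientific name, common name) tuples in file order."""
    table = {}
    for line in lines:
        fields = parse_dmp_line(line)
        if len(fields) < 4:
            continue
        tax_id, name, _unique, name_class = fields[:4]
        entry = table.setdefault(tax_id, {"scientific": "", "common": ""})
        if name_class == "scientific name":
            entry["scientific"] = name
        elif name_class == "genbank common name":
            entry["common"] = name  # assumed to be the preferred common name
        elif name_class == "common name" and not entry["common"]:
            entry["common"] = name  # fallback when nothing is set yet
    return [(tid, e["scientific"], e["common"]) for tid, e in table.items()]
```

Each tuple can be written out with `"\t".join(row)` to get a three-column file similar to `sci_names`. Taxa without any common name simply get an empty third column.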