Question

Convert list of Accession Numbers to Full Taxonomy

3


5.1 years ago

dylan.lawrence ▴ 90

I swear this question has been asked and never satisfyingly answered for over a decade.

I know the simplest answer is to perform a bunch of Entrez queries, but to quote many infomercials, "There's got to be a better way."

Here's the setup: I have a file of bare accession numbers extracted from a BLAST search, and I want to convert these to full taxonomies, e.g.

GCA_000005845.2 --> Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia

Or something similar to that. Is there any approach that works in bulk? I have a copy of the BLAST taxonomy file, but that seems to be useful only when applied during a BLAST search. Do I just have to redo all my searches with taxonomy specified?

accession numbers ncbi taxonomy • 6.0k views

ADD COMMENT • link updated 2.2 years ago by Snorre • 0 • written 5.1 years ago by dylan.lawrence ▴ 90

score 3 · Answer 1 · 2019-03-03


5.1 years ago

GenoMax 141k

Use NCBI Entrez Direct (EDirect):

$ esearch -db assembly -query "GCA_000005845" | elink -target taxonomy | efetch -format native -mode xml | grep ScientificName | awk -F ">|<" 'BEGIN{ORS=", ";}{print $3;}'

Escherichia coli str. K-12 substr. MG1655, cellular organisms, Bacteria, Proteobacteria, Gammaproteobacteria, Enterobacterales, Enterobacteriaceae, Escherichia, Escherichia coli, Escherichia coli K-12,

If your other accession numbers are not genomic assemblies then you would need to switch databases.

$ esearch -db nuccore -query "NG_047018" | elink -target taxonomy | efetch -format native -mode xml | grep ScientificName | awk -F ">|<" 'BEGIN{ORS=", ";}{print $3;}'

Homo sapiens, cellular organisms, Eukaryota, Opisthokonta, Metazoa, Eumetazoa, Bilateria, Deuterostomia, Chordata, Craniata, Vertebrata, Gnathostomata, Teleostomi, Euteleostomi, Sarcopterygii, Dipnotetrapodomorpha, Tetrapoda, Amniota, Mammalia, Theria, Eutheria, Boreoeutheria, Euarchontoglires, Primates, Haplorrhini, Simiiformes, Catarrhini, Hominoidea, Hominidae, Homininae, Homo,
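The grep/awk step above prints every ScientificName in the order it appears, so the organism name comes first and unranked entries such as "cellular organisms" are mixed in. If you want the semicolon-delimited, rank-ordered string from the question, one option is to parse the XML instead. This is only a sketch: the function name and WANTED_RANKS are made up for illustration, the rank labels are an assumption about what NCBI emits (the top-level label in particular may differ), and SAMPLE is a hand-trimmed stand-in for real efetch output, not a real response.

```python
import xml.etree.ElementTree as ET

# Ranks to keep, in output order. Assumption: these labels match the <Rank>
# values in the XML you get back; adjust them if NCBI uses different ones.
WANTED_RANKS = ("superkingdom", "phylum", "class", "order", "family", "genus")


def lineage_from_taxonomy_xml(xml_text, ranks=WANTED_RANKS):
    """Return 'name;name;...' for the first Taxon in a taxonomy XML document."""
    root = ET.fromstring(xml_text)          # root element is <TaxaSet>
    taxon = root.find("Taxon")              # first top-level record
    by_rank = {}
    for node in taxon.findall("LineageEx/Taxon"):
        by_rank[node.findtext("Rank")] = node.findtext("ScientificName")
    # Keep only the requested ranks, in the requested order.
    return ";".join(by_rank[r] for r in ranks if r in by_rank)


# Hand-trimmed illustration of the XML layout (TaxIds and other fields omitted).
SAMPLE = """<TaxaSet><Taxon>
  <ScientificName>Escherichia coli str. K-12 substr. MG1655</ScientificName>
  <LineageEx>
    <Taxon><ScientificName>cellular organisms</ScientificName><Rank>no rank</Rank></Taxon>
    <Taxon><ScientificName>Bacteria</ScientificName><Rank>superkingdom</Rank></Taxon>
    <Taxon><ScientificName>Proteobacteria</ScientificName><Rank>phylum</Rank></Taxon>
    <Taxon><ScientificName>Gammaproteobacteria</ScientificName><Rank>class</Rank></Taxon>
    <Taxon><ScientificName>Enterobacterales</ScientificName><Rank>order</Rank></Taxon>
    <Taxon><ScientificName>Enterobacteriaceae</ScientificName><Rank>family</Rank></Taxon>
    <Taxon><ScientificName>Escherichia</ScientificName><Rank>genus</Rank></Taxon>
    <Taxon><ScientificName>Escherichia coli</ScientificName><Rank>species</Rank></Taxon>
  </LineageEx>
</Taxon></TaxaSet>"""

print(lineage_from_taxonomy_xml(SAMPLE))
# Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia
```

In practice you would save the efetch output to a file and pass its contents to the function instead of SAMPLE.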

ADD COMMENT • link 5.1 years ago by GenoMax 141k

1


Is there a general database? My accessions have a huge number of prefixes.

ADD REPLY • link 5.1 years ago by dylan.lawrence ▴ 90

0


What did you blast against to get these accessions? Most everything that is not a genome/assembly (i.e. non G* numbers) should be covered by the main nucleotide database (nuccore).

ADD REPLY • link 5.1 years ago by GenoMax 141k
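For a mixed list, the rule of thumb in the reply above (G* accessions go to assembly, the rest to nuccore) could be applied before querying, so each group is sent to the right database. A rough sketch follows; pick_db and group_by_db are hypothetical helper names, and the GCA_/GCF_ pattern is my assumption about what "G* numbers" covers. Protein accessions are not handled here.

```python
import re

# Assumption: assembly accessions look like GCA_ or GCF_ followed by digits
# and an optional ".version". Everything else is routed to nuccore.
ASSEMBLY_RE = re.compile(r"^GC[AF]_\d+(\.\d+)?$")


def pick_db(accession):
    """Guess the Entrez database for one accession."""
    return "assembly" if ASSEMBLY_RE.match(accession.strip()) else "nuccore"


def group_by_db(accessions):
    """Bucket accessions by database, skipping blank lines."""
    groups = {}
    for acc in accessions:
        acc = acc.strip()
        if acc:
            groups.setdefault(pick_db(acc), []).append(acc)
    return groups


print(group_by_db(["GCA_000005845.2", "NG_047018", "NC_003197.2"]))
# {'assembly': ['GCA_000005845.2'], 'nuccore': ['NG_047018', 'NC_003197.2']}
```

Each group can then be fed to the matching esearch -db command.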

0

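
Entrez is also accessible from Python through the Biopython package, e.g.

from Bio import Entrez
Entrez.email = "you@example.com"  # placeholder; NCBI asks for a contact address
handle = Entrez.efetch(db="nucleotide", id="NC_003197.2", rettype="gb", retmode="text")
print(handle.read())  # GenBank record; the ORGANISM block holds the lineage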

ADD REPLY • link 2.2 years ago by Snorre • 0

score 1 · Answer 2 · 2019-03-03

Yeah, I wrote a thing to do this (which really is just doing Entrez queries behind the scenes).

https://github.com/jrjhealey/PYlogeny

Disclaimer: it’s still at version, like, 0.0001, so at the moment I only have it working with RefSeq, but it would be easy to generalise. I’d be happy to take contributions, or if you let me know what you need I’ll work on making it more fully fledged (which I intend to do over time anyway).

An alternative approach is to download the taxonomy dump (taxdump) files from NCBI, which allows faster, more parallel lookups, but you have to re-download the dump fairly often to keep it up to date.
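To give an idea of the taxdump route: the dump contains nodes.dmp (tax_id, parent tax_id, rank, ...) and names.dmp (tax_id, name, unique name, name class), with fields separated by tab-pipe-tab. Walking parent links from a taxid to the root gives the lineage. The sketch below is not what PYlogeny does; the function names are invented for illustration, the rows are a toy subset containing only the leading columns (real nodes.dmp lines have more), and you still need a separate accession-to-taxid mapping (NCBI distributes accession2taxid files for that) before this step.

```python
def parse_dmp_line(line):
    """Split one .dmp line; fields are separated by TAB|TAB and end with TAB|."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def load_nodes(lines):
    """Map tax_id -> (parent_id, rank) from nodes.dmp lines."""
    nodes = {}
    for line in lines:
        tax_id, parent_id, rank = parse_dmp_line(line)[:3]
        nodes[tax_id] = (parent_id, rank)
    return nodes


def load_names(lines):
    """Map tax_id -> scientific name from names.dmp lines."""
    names = {}
    for line in lines:
        tax_id, name, _unique, name_class = parse_dmp_line(line)[:4]
        if name_class == "scientific name":
            names[tax_id] = name
    return names


def lineage(tax_id, nodes, names, ranks=None):
    """Walk up to the root; return names (root first), optionally rank-filtered."""
    path = []
    while tax_id in nodes:
        parent_id, rank = nodes[tax_id]
        if ranks is None or rank in ranks:
            path.append(names.get(tax_id, tax_id))
        if parent_id == tax_id:  # the root node is its own parent
            break
        tax_id = parent_id
    return ";".join(reversed(path))


# Toy rows: (tax_id, parent_id, rank, name), formatted like trimmed .dmp lines.
ROWS = [
    ("1", "1", "no rank", "root"),
    ("131567", "1", "no rank", "cellular organisms"),
    ("2", "131567", "superkingdom", "Bacteria"),
    ("1224", "2", "phylum", "Proteobacteria"),
    ("1236", "1224", "class", "Gammaproteobacteria"),
    ("91347", "1236", "order", "Enterobacterales"),
    ("543", "91347", "family", "Enterobacteriaceae"),
    ("561", "543", "genus", "Escherichia"),
]
node_lines = [f"{t}\t|\t{p}\t|\t{r}\t|\n" for t, p, r, _ in ROWS]
name_lines = [f"{t}\t|\t{n}\t|\t\t|\tscientific name\t|\n" for t, _, _, n in ROWS]

nodes = load_nodes(node_lines)
names = load_names(name_lines)
main_ranks = {"superkingdom", "phylum", "class", "order", "family", "genus"}
print(lineage("561", nodes, names, ranks=main_ranks))
# Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia
```

With the real files you would pass open file handles, e.g. load_nodes(open("nodes.dmp")), and keep both dictionaries in memory so every lookup after loading is local.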