Convert list of Accession Numbers to Full Taxonomy
5.7 years ago

I swear this question has been asked repeatedly for over a decade and never satisfactorily answered.

I know the simplest answer is to perform a bunch of Entrez queries, but to quote many infomercials, "There's got to be a better way."

Here's the setup: I have a file of bare accession numbers extracted from a BLAST search, and I want to convert these to full taxonomies, e.g.

GCA_000005845.2 --> Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia

Or something similar to that. Is there any approach that can do this in bulk? I have a copy of the BLAST taxonomy file, but that seems to be useful only if applied during a BLAST search. Do I just have to redo all my searches with taxonomy specified?

accession numbers ncbi taxonomy • 6.6k views
5.7 years ago
GenoMax 146k

You can do this with NCBI Entrez Direct (EDirect).

$ esearch -db assembly -query "GCA_000005845" | elink -target taxonomy | efetch -format native -mode xml | grep ScientificName | awk -F ">|<" 'BEGIN{ORS=", ";}{print $3;}'

Escherichia coli str. K-12 substr. MG1655, cellular organisms, Bacteria, Proteobacteria, Gammaproteobacteria, Enterobacterales, Enterobacteriaceae, Escherichia, Escherichia coli, Escherichia coli K-12,

If your other accession numbers are not genomic assemblies then you would need to switch databases.

$ esearch -db nuccore -query "NG_047018" | elink -target taxonomy | efetch -format native -mode xml | grep ScientificName | awk -F ">|<" 'BEGIN{ORS=", ";}{print $3;}'

Homo sapiens, cellular organisms, Eukaryota, Opisthokonta, Metazoa, Eumetazoa, Bilateria, Deuterostomia, Chordata, Craniata, Vertebrata, Gnathostomata, Teleostomi, Euteleostomi, Sarcopterygii, Dipnotetrapodomorpha, Tetrapoda, Amniota, Mammalia, Theria, Eutheria, Boreoeutheria, Euarchontoglires, Primates, Haplorrhini, Simiiformes, Catarrhini, Hominoidea, Hominidae, Homininae, Homo,
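The comma-separated output above puts the organism first and includes unranked nodes such as "cellular organisms". To get closer to the semicolon-delimited form asked for in the question, one option is to save the XML from efetch and filter it by rank. Here is a minimal Python sketch, assuming the taxonomy XML nests each ancestor under LineageEx/Taxon with ScientificName and Rank children. The function name `ranked_lineage`, the default rank list, and the embedded sample (hand-abbreviated, not real efetch output) are illustrative assumptions, and the rank labels in real records may differ from the ones used here.

```python
import xml.etree.ElementTree as ET

# Ranks to keep, in output order. Illustrative; adjust to the labels in your records.
WANTED_RANKS = ("superkingdom", "phylum", "class", "order", "family", "genus")


def ranked_lineage(xml_text, wanted=WANTED_RANKS):
    """Return one 'A;B;C' lineage string per top-level Taxon in a TaxaSet."""
    root = ET.fromstring(xml_text)
    lineages = []
    for taxon in root.findall("Taxon"):
        by_rank = {}
        for ancestor in taxon.findall("LineageEx/Taxon"):
            by_rank[ancestor.findtext("Rank")] = ancestor.findtext("ScientificName")
        # The record itself may carry one of the wanted ranks too.
        by_rank.setdefault(taxon.findtext("Rank"), taxon.findtext("ScientificName"))
        lineages.append(";".join(by_rank[r] for r in wanted if r in by_rank))
    return lineages


# Hand-abbreviated sample in the assumed shape, not real efetch output.
SAMPLE = """<TaxaSet><Taxon>
  <ScientificName>Escherichia coli</ScientificName><Rank>species</Rank>
  <LineageEx>
    <Taxon><ScientificName>cellular organisms</ScientificName><Rank>no rank</Rank></Taxon>
    <Taxon><ScientificName>Bacteria</ScientificName><Rank>superkingdom</Rank></Taxon>
    <Taxon><ScientificName>Proteobacteria</ScientificName><Rank>phylum</Rank></Taxon>
    <Taxon><ScientificName>Gammaproteobacteria</ScientificName><Rank>class</Rank></Taxon>
    <Taxon><ScientificName>Enterobacterales</ScientificName><Rank>order</Rank></Taxon>
    <Taxon><ScientificName>Enterobacteriaceae</ScientificName><Rank>family</Rank></Taxon>
    <Taxon><ScientificName>Escherichia</ScientificName><Rank>genus</Rank></Taxon>
  </LineageEx>
</Taxon></TaxaSet>"""

print(ranked_lineage(SAMPLE)[0])
# Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia
```

Filtering by rank rather than by position keeps the output columns aligned across organisms whose lineages have different depths, at the cost of dropping unranked clades.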

Is there a general database? My accessions have a huge number of prefixes.

What did you blast against to get these accessions? Most everything that is not a genome/assembly (i.e. non G* numbers) should be covered by the main nucleotide database (nuccore).
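
For a mixed list, that rule of thumb can be applied before running any queries. The sketch below is only a heuristic based on the statement above: GCA_/GCF_ accessions go to the assembly database and everything else is assumed to be in nuccore. The function names are made up for illustration, and protein accessions (which would not be in nuccore) are not handled.

```python
from collections import defaultdict


def entrez_db_for(accession):
    """Guess the Entrez database for one accession (heuristic, not exhaustive)."""
    # GCA_/GCF_ are assembly accessions; everything else is sent to nuccore.
    if accession.upper().startswith(("GCA_", "GCF_")):
        return "assembly"
    return "nuccore"


def group_by_db(accessions):
    """Group accessions by guessed database, skipping blank lines."""
    groups = defaultdict(list)
    for acc in accessions:
        acc = acc.strip()
        if acc:
            groups[entrez_db_for(acc)].append(acc)
    return dict(groups)


print(group_by_db(["GCA_000005845.2", "NG_047018", "NC_003197.2"]))
# {'assembly': ['GCA_000005845.2'], 'nuccore': ['NG_047018', 'NC_003197.2']}
```

Each group can then be fed to the matching `esearch -db ...` pipeline.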

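Entrez access is also available from Python through the Biopython package, e.g.

from Bio import Entrez
Entrez.email = "you@example.org"  # placeholder; NCBI asks for a contact address
handle = Entrez.efetch(db="nucleotide", id="NC_003197.2", rettype="gb", retmode="text")
data = handle.read()  # GenBank flat file text, which includes the organism lineage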
5.7 years ago
Joe 21k

Yeah, I wrote a thing to do this (which really is just doing Entrez queries behind the scenes).

https://github.com/jrjhealey/PYlogeny

Disclaimer: it’s still version... like... 0.0001, so at the moment I only have it working with RefSeq, but it would be easy to generalise. I’d be happy to take contributions, or if you let me know what you need I’ll work on making it more fully fledged (which I intend to do over time anyway).

An alternative approach is to download the taxonomy dump (taxdump) files from NCBI, which would allow faster and more parallel lookups because everything happens locally, but it does require re-downloading the dump fairly frequently to stay up to date.
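
As a rough idea of what a local lookup could look like, here is a minimal Python sketch that walks parent links in nodes.dmp and names.dmp style rows. It assumes the usual taxdump layout (fields separated by tab-pipe-tab, with tax_id, parent tax_id, and rank as the first columns of nodes.dmp). The helper names, the rank list, and the toy rows are illustrative and not a copy of the real files. It also starts from a taxid, so accessions still need mapping to taxids first, for example with the Entrez lookups above or NCBI's accession2taxid files.

```python
def parse_dmp_line(line):
    """Split one taxdump row into stripped fields."""
    return [f.strip() for f in line.rstrip("\n").rstrip("|").split("|")]


def load_nodes(lines):
    """Return (parent, rank) dicts keyed by taxid string."""
    parent, rank = {}, {}
    for line in lines:
        tax_id, parent_id, node_rank = parse_dmp_line(line)[:3]
        parent[tax_id], rank[tax_id] = parent_id, node_rank
    return parent, rank


def load_names(lines):
    """Keep only the 'scientific name' entry for each taxid."""
    names = {}
    for line in lines:
        tax_id, name, _unique, name_class = parse_dmp_line(line)[:4]
        if name_class == "scientific name":
            names[tax_id] = name
    return names


def lineage(tax_id, parent, rank, names, wanted):
    """Walk up to the root and join the names found at the wanted ranks."""
    found = {}
    while tax_id in parent:
        found.setdefault(rank[tax_id], names.get(tax_id, tax_id))
        if parent[tax_id] == tax_id:  # the root node is its own parent
            break
        tax_id = parent[tax_id]
    return ";".join(found[r] for r in wanted if r in found)


def row(*fields):
    """Build a toy row in the assumed taxdump layout."""
    return "\t|\t".join(fields) + "\t|\n"


# Toy rows for illustration only, abbreviated relative to the real hierarchy.
NODES = [row("1", "1", "no rank"), row("2", "1", "superkingdom"),
         row("1224", "2", "phylum"), row("1236", "1224", "class"),
         row("91347", "1236", "order"), row("543", "91347", "family"),
         row("561", "543", "genus"), row("562", "561", "species")]
NAMES = [row("1", "root", "", "scientific name"),
         row("2", "Bacteria", "", "scientific name"),
         row("1224", "Proteobacteria", "", "scientific name"),
         row("1236", "Gammaproteobacteria", "", "scientific name"),
         row("91347", "Enterobacterales", "", "scientific name"),
         row("543", "Enterobacteriaceae", "", "scientific name"),
         row("561", "Escherichia", "", "scientific name"),
         row("562", "Escherichia coli", "", "scientific name"),
         row("562", "E. coli", "", "common name")]

RANKS = ("superkingdom", "phylum", "class", "order", "family", "genus")
parent, rank = load_nodes(NODES)
names = load_names(NAMES)
print(lineage("562", parent, rank, names, RANKS))
# Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia
```

With the real files, the same loaders could be pointed at open file handles instead of the toy lists, since both are just iterables of lines.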
