Convert list of Accession Numbers to Full Taxonomy
3.2 years ago

I know the simplest answer is to perform a bunch of Entrez queries, but to quote many infomercials, "There's got to be a better way."

Here's the setup: I have a file of bare accession numbers extracted from a BLAST search, and I want to convert each one to its full taxonomy, e.g.

GCA_000005845.2 --> Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia

Or something similar to that. Is there any way to do this in bulk? I have a copy of the BLAST taxonomy file, but that only seems to be useful if applied during a BLAST search. Do I just have to redo all my searches with taxonomy specified?

accession numbers ncbi taxonomy • 3.5k views
3.2 years ago
GenoMax 115k

Using NCBI Entrez direct.

$ esearch -db assembly -query "GCA_000005845" | elink -target taxonomy | efetch -format native -mode xml | grep ScientificName | awk -F ">|<" 'BEGIN{ORS=", ";}{print $3;}'

Escherichia coli str. K-12 substr. MG1655, cellular organisms, Bacteria, Proteobacteria, Gammaproteobacteria, Enterobacterales, Enterobacteriaceae, Escherichia, Escherichia coli, Escherichia coli K-12,


If your other accession numbers are not genomic assemblies then you would need to switch databases.

$ esearch -db nuccore -query "NG_047018" | elink -target taxonomy | efetch -format native -mode xml | grep ScientificName | awk -F ">|<" 'BEGIN{ORS=", ";}{print $3;}'

Homo sapiens, cellular organisms, Eukaryota, Opisthokonta, Metazoa, Eumetazoa, Bilateria, Deuterostomia, Chordata, Craniata, Vertebrata, Gnathostomata, Teleostomi, Euteleostomi, Sarcopterygii, Dipnotetrapodomorpha, Tetrapoda, Amniota, Mammalia, Theria, Eutheria, Boreoeutheria, Euarchontoglires, Primates, Haplorrhini, Simiiformes, Catarrhini, Hominoidea, Hominidae, Homininae, Homo,
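
The grep/awk step prints the organism's own name first and keeps every unranked node, so it does not quite match the semicolon-separated form asked for in the question. One option is to parse the XML instead. Below is a minimal Python sketch, assuming the efetch XML has been saved or piped in and follows the usual TaxaSet/Taxon/LineageEx layout. The function name, the list of ranks to keep, and the hand-trimmed sample are illustrative assumptions, not part of Entrez Direct.

```python
import xml.etree.ElementTree as ET

# Ranks to keep (an assumption; adjust as needed). Both "superkingdom" and
# "domain" are listed because the top-level rank label may differ.
KEEP = ("domain", "superkingdom", "phylum", "class", "order", "family", "genus")


def lineage_strings(xml_text):
    """Map each top-level taxon name to a 'Rank1;Rank2;...' lineage string."""
    root = ET.fromstring(xml_text)  # root element is expected to be <TaxaSet>
    result = {}
    for taxon in root.findall("Taxon"):
        name = taxon.findtext("ScientificName")
        # LineageEx lists ancestors from root to leaf, so order is preserved.
        kept = [
            node.findtext("ScientificName")
            for node in taxon.findall("LineageEx/Taxon")
            if node.findtext("Rank") in KEEP
        ]
        result[name] = ";".join(kept)
    return result


# Hand-trimmed sample for illustration only (real records carry more fields).
SAMPLE = """<TaxaSet><Taxon>
  <ScientificName>Escherichia coli str. K-12 substr. MG1655</ScientificName>
  <LineageEx>
    <Taxon><ScientificName>cellular organisms</ScientificName><Rank>no rank</Rank></Taxon>
    <Taxon><ScientificName>Bacteria</ScientificName><Rank>superkingdom</Rank></Taxon>
    <Taxon><ScientificName>Proteobacteria</ScientificName><Rank>phylum</Rank></Taxon>
    <Taxon><ScientificName>Gammaproteobacteria</ScientificName><Rank>class</Rank></Taxon>
    <Taxon><ScientificName>Enterobacterales</ScientificName><Rank>order</Rank></Taxon>
    <Taxon><ScientificName>Enterobacteriaceae</ScientificName><Rank>family</Rank></Taxon>
    <Taxon><ScientificName>Escherichia</ScientificName><Rank>genus</Rank></Taxon>
    <Taxon><ScientificName>Escherichia coli</ScientificName><Rank>species</Rank></Taxon>
  </LineageEx>
</Taxon></TaxaSet>"""

for organism, lineage in lineage_strings(SAMPLE).items():
    print(organism, "-->", lineage)
```

Filtering on rank rather than position means unranked clades (such as the long run in the Homo sapiens output above) drop out without any special handling.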


Is there a general database? My accessions have a huge number of prefixes.


What did you blast against to get these accessions? Most everything that is not a genome/assembly (i.e. non G* numbers) should be covered by the main nucleotide database (nuccore).
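
With a mixed list, the split described here could be automated before querying. This is a rough Python sketch, assuming that GCA_/GCF_ prefixes mark assemblies and that everything else goes to nuccore. The function names are made up for illustration, and protein accessions (which would need a different database) are not handled.

```python
def entrez_db_for(accession):
    """Guess the Entrez database from the accession prefix (simplified rule)."""
    # GCA_ (GenBank) and GCF_ (RefSeq) are assembly accessions.
    if accession.upper().startswith(("GCA_", "GCF_")):
        return "assembly"
    # Assumption: anything else is a nucleotide record.
    return "nuccore"


def group_by_db(accessions):
    """Group accessions by target database, skipping blank lines."""
    groups = {}
    for acc in accessions:
        acc = acc.strip()
        if not acc:
            continue
        groups.setdefault(entrez_db_for(acc), []).append(acc)
    return groups


if __name__ == "__main__":
    # The two accessions used in the answer above, plus one more nucleotide record.
    print(group_by_db(["GCA_000005845.2", "NG_047018", "NC_003197.2"]))
```

Each group can then be fed to the matching esearch command, one database at a time.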


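Also available as a Python package (Biopython), e.g.

from Bio import Entrez
Entrez.email = "you@example.org"  # placeholder; NCBI asks for a contact address
data = Entrez.efetch(db="nucleotide", id="NC_003197.2", rettype="gb", retmode="text")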
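
To get from a nucleotide accession to a lineage with Biopython, one possible route is esummary (to read the record's TaxId) followed by an efetch against the taxonomy database. The sketch below assumes the parsed taxonomy record exposes a LineageEx list of ScientificName/Rank entries. The helper names, the rank list, and the email argument are illustrative, and the network part is untested here.

```python
RANKS = ("domain", "superkingdom", "phylum", "class", "order", "family", "genus")


def format_lineage(tax_record, ranks=RANKS):
    """Join the ranked ancestors of a parsed taxonomy record with ';'."""
    return ";".join(
        node["ScientificName"]
        for node in tax_record["LineageEx"]
        if node["Rank"] in ranks
    )


def fetch_lineage(accession, email):
    """Look up one nucleotide accession online (requires Biopython and network)."""
    # Imported here so format_lineage can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address

    # Step 1: the nucleotide summary carries the record's taxonomy ID.
    handle = Entrez.esummary(db="nucleotide", id=accession)
    summary = Entrez.read(handle)
    handle.close()
    taxid = str(int(summary[0]["TaxId"]))

    # Step 2: fetch the taxonomy record and format its lineage.
    handle = Entrez.efetch(db="taxonomy", id=taxid, retmode="xml")
    records = Entrez.read(handle)
    handle.close()
    return format_lineage(records[0])
```

For a long list this makes two requests per accession, so it is still "a bunch of Entrez queries". Passing several comma-separated IDs per call would cut the number of requests.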

3.2 years ago
Joe 20k

Yeah, I wrote a thing to do this (which really is just doing Entrez queries behind the scenes).

https://github.com/jrjhealey/PYlogeny

Disclaimer, it’s still version...like....0.0001, so at the moment I only have it working with RefSeq, but it would be easy to generalise it. I’d be happy to take contributions or if you let me know what you need I’ll work on making it more fully fledged (which I intend to do over time anyway).