Question: Convert list of Accession Numbers to Full Taxonomy
dylan.lawrence wrote, 21 months ago:

I swear this question has been asked and never satisfyingly answered for over a decade.

I know the simplest answer is to perform a bunch of Entrez queries, but to quote many infomercials, "There's got to be a better way."

Here's the setup: I have a file of bare accession numbers extracted from a BLAST search, and I want to convert them to full taxonomies, e.g.

GCA_000005845.2 --> Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia

Or something similar to that. Is there any approach that works in bulk? I have a copy of the BLAST taxonomy file, but that only seems to be useful if applied during a BLAST search. Do I just have to redo all my searches with taxonomy specified?

GenoMax (United States) wrote, 21 months ago:

Using NCBI Entrez Direct (EDirect):

$ esearch -db assembly -query "GCA_000005845" | elink -target taxonomy | efetch -format native -mode xml | grep ScientificName | awk -F ">|<" 'BEGIN{ORS=", ";}{print $3;}'

Escherichia coli str. K-12 substr. MG1655, cellular organisms, Bacteria, Proteobacteria, Gammaproteobacteria, Enterobacterales, Enterobacteriaceae, Escherichia, Escherichia coli, Escherichia coli K-12,

If your other accession numbers are not genomic assemblies then you would need to switch databases.

$ esearch -db nuccore -query "NG_047018" | elink -target taxonomy | efetch -format native -mode xml | grep ScientificName | awk -F ">|<" 'BEGIN{ORS=", ";}{print $3;}'

Homo sapiens, cellular organisms, Eukaryota, Opisthokonta, Metazoa, Eumetazoa, Bilateria, Deuterostomia, Chordata, Craniata, Vertebrata, Gnathostomata, Teleostomi, Euteleostomi, Sarcopterygii, Dipnotetrapodomorpha, Tetrapoda, Amniota, Mammalia, Theria, Eutheria, Boreoeutheria, Euarchontoglires, Primates, Haplorrhini, Simiiformes, Catarrhini, Hominoidea, Hominidae, Homininae, Homo,
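
For a whole file of accessions, the one-liners above could be wrapped in a loop. The following is only a sketch under stated assumptions: the function names `lineage_from_xml` and `accessions_to_lineage` are made up for illustration, the input is assumed to hold one accession per line, and every accession in the file is assumed to live in the same database. It reuses the same grep/awk parsing as above but joins names with `;` and drops the trailing separator. As in the outputs above, the first name is the organism itself, followed by the lineage from the root down.

```shell
# Join every <ScientificName> value in efetch taxonomy XML (read from stdin)
# into one ';'-separated line, in the order the names appear.
lineage_from_xml() {
    grep '<ScientificName>' |
        awk -F '>|<' '{ printf "%s%s", (NR > 1 ? ";" : ""), $3 } END { print "" }'
}

# Read accessions from stdin (one per line) and print "accession<TAB>lineage".
# $1 is the Entrez database to search, e.g. assembly or nuccore.
accessions_to_lineage() {
    db=$1
    while IFS= read -r acc; do
        [ -n "$acc" ] || continue          # skip blank lines
        # esearch gets /dev/null as stdin so it cannot swallow
        # the rest of the accession list that the loop is reading.
        lineage=$(esearch -db "$db" -query "$acc" < /dev/null |
                  elink -target taxonomy |
                  efetch -format native -mode xml |
                  lineage_from_xml)
        printf '%s\t%s\n' "$acc" "$lineage"
    done
}

# Example (needs EDirect installed and network access):
#   accessions_to_lineage nuccore < accessions.txt > lineages.tsv
```

This still issues one set of Entrez queries per accession, so it is a convenience wrapper rather than a faster method.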

Is there a general database? My accessions have a huge number of prefixes.

Reply written 21 months ago by dylan.lawrence

What did you BLAST against to get these accessions? Almost everything that is not a genome/assembly (i.e. non G* numbers, unlike the GCA_ accession above) should be covered by the main nucleotide database (nuccore).

Reply written 21 months ago by GenoMax
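
The rule of thumb in this reply (assembly accessions go to the assembly database, everything else to nuccore) could be sketched as a small helper that picks the database from the accession prefix. The function name `db_for_accession` is hypothetical, and treating only `GCA_` and `GCF_` as assembly prefixes is an assumption. Protein accessions, for instance, would not be found in nuccore and would need their own case.

```shell
# Pick an Entrez database name from an accession's prefix.
# Assumption: only GCA_/GCF_ accessions are assemblies; all others are
# treated as nucleotide records, as suggested in the reply above.
db_for_accession() {
    case "$1" in
        GCA_*|GCF_*) echo assembly ;;
        *)           echo nuccore ;;
    esac
}

# Example:
#   esearch -db "$(db_for_accession NG_047018)" -query "NG_047018"
```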
Joe (United Kingdom) wrote, 21 months ago:

Yeah, I wrote a thing to do this (which really is just doing Entrez queries behind the scenes).

https://github.com/jrjhealey/PYlogeny

Disclaimer, it’s still version...like....0.0001, so at the moment I only have it working with RefSeq, but it would be easy to generalise it. I’d be happy to take contributions or if you let me know what you need I’ll work on making it more fully fledged (which I intend to do over time anyway).

An alternative approach is to download the taxonomy dump (taxdump) file from NCBI, which would allow faster, more parallel lookups, but does require re-downloading an up-to-date copy fairly frequently.
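
A rough sketch of the taxdump route, assuming `nodes.dmp` and `names.dmp` have been extracted from NCBI's `taxdump.tar.gz` and that you already have one taxid per line on stdin. The function name `taxid_to_lineage` is made up for illustration. Mapping accessions to taxids is a separate step not shown here. NCBI publishes accession2taxid tables for sequence accessions, which is one possible source.

```shell
# Resolve taxids (stdin, one per line) to ';'-joined lineages, root first.
# usage: taxid_to_lineage nodes.dmp names.dmp < taxids.txt
# Both .dmp files separate fields with <TAB>|<TAB> and end lines with <TAB>|.
taxid_to_lineage() {
    awk -F '\t[|]\t' '
        # nodes.dmp: field 1 is the tax_id, field 2 its parent tax_id
        FILENAME == ARGV[1] { parent[$1] = $2; next }
        # names.dmp: keep only the scientific name for each tax_id
        FILENAME == ARGV[2] {
            if ($4 ~ /^scientific name/) name[$1] = $2
            next
        }
        # stdin: climb parent links until the root (tax_id 1),
        # prepending each name so the lineage reads from the top down
        {
            id = $1; out = ""
            while ((id in parent) && id != 1) {
                out = (out == "" ? name[id] : name[id] ";" out)
                id = parent[id]
            }
            print $1 "\t" out
        }
    ' "$1" "$2" -
}
```

Both files are loaded into memory once, so a long list of taxids is resolved in a single pass with no network calls. Taxids missing from `nodes.dmp` come back with an empty lineage.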
