Question: Convert list of Accession Numbers to Full Taxonomy
dylan.lawrence wrote (4 months ago):

I swear this question has been asked and never satisfyingly answered for over a decade.

I know the simplest answer is to perform a bunch of Entrez queries, but to quote many infomercials, "There's got to be a better way."

Here's the setup: I have a file of bare accession numbers extracted from a BLAST search, and I want to convert them to full taxonomies, e.g.

GCA_000005845.2 --> Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia

Or something similar to that. Is there any approach that works in bulk? I have a copy of the BLAST taxonomy file, but that seems to be useful only when applied during a BLAST search. Do I just have to redo all my searches with taxonomy specified?

genomax (United States) wrote (4 months ago):

You can do this with NCBI Entrez Direct:

$ esearch -db assembly -query "GCA_000005845" | elink -target taxonomy | efetch -format native -mode xml | grep ScientificName | awk -F ">|<" 'BEGIN{ORS=", ";}{print $3;}'

Escherichia coli str. K-12 substr. MG1655, cellular organisms, Bacteria, Proteobacteria, Gammaproteobacteria, Enterobacterales, Enterobacteriaceae, Escherichia, Escherichia coli, Escherichia coli K-12,

If your other accession numbers are not genomic assemblies then you would need to switch databases.

$ esearch -db nuccore -query "NG_047018" | elink -target taxonomy | efetch -format native -mode xml | grep ScientificName | awk -F ">|<" 'BEGIN{ORS=", ";}{print $3;}'

Homo sapiens, cellular organisms, Eukaryota, Opisthokonta, Metazoa, Eumetazoa, Bilateria, Deuterostomia, Chordata, Craniata, Vertebrata, Gnathostomata, Teleostomi, Euteleostomi, Sarcopterygii, Dipnotetrapodomorpha, Tetrapoda, Amniota, Mammalia, Theria, Eutheria, Boreoeutheria, Euarchontoglires, Primates, Haplorrhini, Simiiformes, Catarrhini, Hominoidea, Hominidae, Homininae, Homo,
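For a whole file of accessions, the same pipeline can be wrapped in a loop. The following is only a rough sketch, not a tested bulk solution: the function names are made up for illustration, it assumes that GCA_/GCF_ prefixes mean assemblies and that everything else lives in nuccore, and it drops the version suffix to mirror the queries above. It still issues one Entrez query per accession, so it will be slow for long lists.

```shell
#!/usr/bin/env bash

# Pick the Entrez database from the accession prefix.
# Assumption: GCA_/GCF_ accessions are assemblies; anything else is tried in nuccore.
db_for_accession() {
    case "$1" in
        GCA_*|GCF_*) echo "assembly" ;;
        *)           echo "nuccore" ;;
    esac
}

# Print "accession<TAB>lineage" for every accession in a file (one per line).
lineages_from_file() {
    local acc db query lineage
    while IFS= read -r acc; do
        [ -z "$acc" ] && continue
        db=$(db_for_accession "$acc")
        query=${acc%.*}   # drop a trailing ".N" version, as in the examples above
        # < /dev/null keeps esearch from swallowing the rest of the input file
        lineage=$(esearch -db "$db" -query "$query" < /dev/null \
            | elink -target taxonomy \
            | efetch -format native -mode xml \
            | grep ScientificName \
            | awk -F ">|<" 'BEGIN{ORS=", ";}{print $3;}')
        printf '%s\t%s\n' "$acc" "$lineage"
    done < "$1"
}

# Usage (accessions.txt is a hypothetical file name):
#   lineages_from_file accessions.txt > lineages.tsv
```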

dylan.lawrence replied (4 months ago):

Is there a general database? My accessions have a huge number of prefixes.

genomax replied (4 months ago):

What did you blast against to get these accessions? Most everything that is not a genome/assembly (i.e. non G* numbers) should be covered by the main nucleotide database (nuccore).
jrj.healey (United Kingdom) wrote (4 months ago):

Yeah, I wrote a thing to do this (which really is just doing Entrez queries behind the scenes).

https://github.com/jrjhealey/PYlogeny

Disclaimer, it’s still version...like....0.0001, so at the moment I only have it working with RefSeq, but it would be easy to generalise it. I’d be happy to take contributions or if you let me know what you need I’ll work on making it more fully fledged (which I intend to do over time anyway).

An alternative approach is to download the taxonomy dump (taxdump) files from NCBI, which would allow faster and more parallel lookups, but it does mean re-downloading the dump fairly frequently to keep it up to date.
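As a rough sketch of the taxdump route: once the archive is unpacked, a lineage can be built by walking parent links in nodes.dmp and looking up scientific names in names.dmp. The function name below is hypothetical, and the sketch assumes you already have a taxid for each accession (mapping accessions to taxids is a separate step, not shown). It keeps every node on the path, unranked ones included, so the output is longer than the six-rank string in the question.

```shell
#!/usr/bin/env bash

# usage: taxid_lineage nodes.dmp names.dmp TAXID
# Prints the lineage from the top rank down to TAXID, joined with ";".
# taxdump fields are separated by "<TAB>|<TAB>" and each line ends with "<TAB>|".
taxid_lineage() {
    awk -F'\t[|]\t?' -v id="$3" '
        FNR == NR { parent[$1] = $2; next }          # first file: nodes.dmp
        $4 == "scientific name" { name[$1] = $2 }    # second file: names.dmp
        END {
            out = ""
            # walk upward until the root node (taxid 1), which is its own parent
            while ((id in parent) && id != "1") {
                out = (out == "" ? name[id] : name[id] ";" out)
                if (parent[id] == id) break           # guard against self-loops
                id = parent[id]
            }
            print out
        }' "$1" "$2"
}

# Usage (562 is the taxid for Escherichia coli):
#   taxid_lineage nodes.dmp names.dmp 562
```

This re-reads both dump files on every call, which is fine for a handful of taxids; for thousands of lookups it would make more sense to load the tables once and loop over the taxids inside a single awk run.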
