Question: Trying to match Accession numbers to Taxonomy IDs
gravatar for New2programming
9 weeks ago by
liverpool university
New2programming0 wrote:

Currently writing a sequence alignment programme in javascript and i want to be able to match the genbank accession numbers to species names and for a list of 500+ accessions. Is there any easy way to do this rather than changing over to python?

ADD COMMENTlink modified 9 weeks ago by Pierre Lindenbaum102k • written 9 weeks ago by New2programming0

in javascript

context needed: is it a cmd-line program or is it a web-page ?

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by Pierre Lindenbaum102k

its on a webpage so i'm work in java then convert to html format

ADD REPLYlink written 9 weeks ago by New2programming0

in java ????? I hope you mean javascript ?!

ADD REPLYlink written 9 weeks ago by Pierre Lindenbaum102k

Java is to javascript as ham is to hamster, right? :-p

ADD REPLYlink written 9 weeks ago by WouterDeCoster24k

yup javascript! as the name suggests this is all new territory for me!

ADD REPLYlink written 8 weeks ago by New2programming0
gravatar for Sej Modha
9 weeks ago by
Sej Modha2.3k
Glasgow, UK
Sej Modha2.3k wrote:

A simplified version of the above example with the edirect unix eutils would be:

esearch -db nuccore -query FJ070571|elink -target taxonomy |efetch -format uid
ADD COMMENTlink written 9 weeks ago by Sej Modha2.3k

Dear community,

I don't think the following question deserve a new post, given that it's largely inspired from Sej Modha answer, thus I am commenting on his answer :

Question title : "Retrieving blast staxids from saccver with ncbi E-utilities"

I am parsing blast+ output (custom outfmt 6 with saccver column).
I am interested in retrieving the staxids info for each hit.
The blast was done against a custom database, thus I cannot request the staxids column directly in the output.
But I can map saccver with staxids.

My issue is that I did not manage to do it efficiently yet, let me explain :

The output contains a large number of hits, representing many different saccver.
Ideally, I need to be able to query thousands of saccver at once with efetch / esearch and get a result where each saccver is associated with the corresponding staxids (a Python dictionary for example).

If I loop on my list of saccver, the result is fine but slow : Example for a single query :

time esearch -db protein \
             -query XP_001778611.1 \
             | elink -target taxonomy \
             | efetch -format uid;

real 0m2.332s

Trying the same as above with 5 queries, hoping that the results are in the same order as the input :
Truth :
saccver, staxids
XP_001778611.1, 3218
XP_006409976.1, 72664
NP_001149654.2, 4577
XP_014499952.1, 3916
XP_016732373.1, 3635

time esearch -db protein \
             -query XP_001778611.1,NP_001149654.2,XP_014499952.1,XP_006409976.1,XP_016732373.1 \
             | elink -target taxonomy \
             | efetch -format uid;

real 0m2.290s

Sadly, staxids results are sorted decreasingly, thus I don't see a way to map them to input saccver list here.

Last thing I tried was querying in the xml protein_db output, but I still fail to map saccver to staxids here :

time efetch -db protein \
            -id XP_001778611.1,XP_006409976.1,NP_001149654.2 \
            -format xml \
            | xtract -pattern Bioseq-set_seq-set \
                     -group Seqdesc_source \
                     -sep "@" \
                     -element Object-id_id \
                     -group Textseq-id \
                     -sep "@" \
                     -element Textseq-id_accession;

3218 72664 4577 XM_001778559 XP_001778611 XM_006409913 XP_006409976 NM_001156182 NP_001149654
real 0m1.204s

Any suggestions ?
Best regards

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by erwan.scaon330

If you have a very large set of IDs, maybe you could run a local search against the accession to taxonomy data using the files available on

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by Sej Modha2.3k

That seems clever, I'll try it this afternoon

ADD REPLYlink written 6 weeks ago by erwan.scaon330

Ok, I should be able to handle it from there, thanks for the tips :

time zgrep -f list_5.txt \
           prot.accession2taxid.gz \
           | awk '{print $2"\t"$3}' $_

XP_001778611.1 3218
NP_001149654.2 4577
XP_006409976.1 72664
XP_014499952.1 3916
XP_016732373.1 3635

real 1m33.959s

ADD REPLYlink written 6 weeks ago by erwan.scaon330

Glad to know! An independent awk matching should be quicker than grep.

ADD REPLYlink written 6 weeks ago by Sej Modha2.3k

The last hurdle :

My protein database was downloaded from ncbi ftp :

$ wget*.protein.faa.gz &>> wget.verbose;
$ cat *faa.gz > plant_refseq.faa.gz;

Then, I launch a blastx against this plant_refseq database.

When i query all saccver in my blastx output vs prot.accession2taxid.gz, some are not found.
Here are some examples :

The first saccver problem (XP_004986996.1) is easy to solve :
In my blast, the subject was "XP_004986996.1".
But in the file prot.accession2taxid.gz, it was updated to the latest version = XP_004986996.2.

 $ zgrep 'XP_004986996' prot.accession2taxid.gz;
 accession  accession.version   taxid   gi
 XP_004986996   XP_004986996.2  4555    1256755570

Thus my search should be done on accession (XP_004986996.) and not accession+version (XP_004986996.1).

What's worrying is the last 2 saccver examples :
They cannot be found with the above correction.

I can query them in my initial file :

$ zgrep 'XP_013748334.1' ~/Desktop/RTDG/BISCEm/Data/plant_refseq.faa.gz;
>XP_013748334.1 PREDICTED: uncharacterized protein LOC106451027 isoform X1 [Brassica napus]

If you run an ncbi search with the term "XP_013748334", you find nothing.
If you try with "LOC106451027", you get :
On this page we find :
Accession : XP_013748335
REFSEQ: accession XM_013892881.2

Let's try to query with this newly found accession :

$ zgrep 'XP_013748335' prot.accession2taxid.gz 
XP_013748335    XP_013748335.1  3708    923623754

$ esearch -db protein \
          -query XP_013748335 \
          | elink -target taxonomy \
          | efetch -format uid;

But it seems pretty hard to traceback / implement from my initial blastx result.

I did read, but I still do not figure how prot.accession2taxid was constructed : Is it logical that protein with "uncharacterized", "probable", etc... status are not included (which is the case in my example) ?

To rephrase, it seems that entries in prot.accession2taxid and plant_refseq.faa.gz have not been updated at the same time (things in plant_refseq.faa.gz being older it seems). Is there a way to be on the same page for both?

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by erwan.scaon330

It looks like that record has been removed from I would download dead_prot.accession2taxid files and combine it with the to ensure that you capture all accession numbers.

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by Sej Modha2.3k

Ok, it seems to be a clever suggestion once again. Thanks ! (I lack a bit of knowledge on how ncbi data are organized / how to query them efficiently)

ADD REPLYlink written 6 weeks ago by erwan.scaon330

and if you think you have pdb accession numbers in the customised dataset then download the and combine them with the other files.

ADD REPLYlink written 6 weeks ago by Sej Modha2.3k
gravatar for Pierre Lindenbaum
9 weeks ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum102k wrote:

Here is a javascript code using CORS / XML / XPATH / NCBI-ESummary . Tested with firefox.

ADD COMMENTlink written 9 weeks ago by Pierre Lindenbaum102k
gravatar for erwan.scaon
9 weeks ago by
Limoges - CBRS - France
erwan.scaon330 wrote:

This may be a starting point for you : Using efetch + xtract (I guess you can call it inside your javascript code ?)

An example with this Pinus taeda nucleotide sequence :

efetch -db nuccore \
       -id $an \
       -format xml \
       | xtract -pattern Bioseq-set_seq-set \
                -division Seq-entry \
                -group Bioseq \
                -if Textseq-id_accession -equals $an \
                -element Textseq-id_accession \
                -branch Bioseq_descr \
                -block Org-ref_db \
                -element Object-id_id;

Output = 3352, which is the taxonomy ID for Pinus taeda (

Ps : This command was not thoroughly tested, thus the efetch parsing might need some polishing to be successful for all possibles cases.

ADD COMMENTlink modified 9 weeks ago • written 9 weeks ago by erwan.scaon330
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 657 users visited in the last hour