Question

Trying to match Accession numbers to Taxonomy IDs

0

Entering edit mode

6.5 years ago

New2programming • 0

Currently writing a sequence alignment programme in javascript and i want to be able to match the genbank accession numbers to species names and for a list of 500+ accessions. Is there any easy way to do this rather than changing over to python?

alignment gene genbank conversion taxonomy • 4.9k views

ADD COMMENT • link updated 6.5 years ago by Pierre Lindenbaum 161k • written 6.5 years ago by New2programming • 0

0

Entering edit mode

in javascript

context needed: is it a cmd-line program or is it a web-page ?

ADD REPLY • link 6.5 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

its on a webpage so i'm work in java then convert to html format

ADD REPLY • link 6.5 years ago by New2programming • 0

0

Entering edit mode

in java ????? I hope you mean javascript ?!

ADD REPLY • link 6.5 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Java is to javascript as ham is to hamster, right? :-p

ADD REPLY • link 6.5 years ago by WouterDeCoster 47k

0

Entering edit mode

yup javascript! as the name suggests this is all new territory for me!

ADD REPLY • link 6.5 years ago by New2programming • 0

score 1 · Answer 1 · 2017-10-09

1

Entering edit mode

6.5 years ago

Sej Modha 5.3k

A simplified version of the above example with the edirect unix eutils would be:

esearch -db nuccore -query FJ070571|elink -target taxonomy |efetch -format uid

ADD COMMENT • link 6.5 years ago by Sej Modha 5.3k

0

Entering edit mode

Dear community,

I don't think the following question deserve a new post, given that it's largely inspired from Sej Modha answer, thus I am commenting on his answer :

Question title : "Retrieving blast staxids from saccver with ncbi E-utilities"

I am parsing blast+ output (custom outfmt 6 with saccver column).
I am interested in retrieving the staxids info for each hit.
The blast was done against a custom database, thus I cannot request the staxids column directly in the output.
But I can map saccver with staxids.

My issue is that I did not manage to do it efficiently yet, let me explain :

The output contains a large number of hits, representing many different saccver.
Ideally, I need to be able to query thousands of saccver at once with efetch / esearch and get a result where each saccver is associated with the corresponding staxids (a Python dictionary for example).

If I loop on my list of saccver, the result is fine but slow : Example for a single query :

time esearch -db protein \
             -query XP_001778611.1 \
             | elink -target taxonomy \
             | efetch -format uid;

3218
real 0m2.332s

Trying the same as above with 5 queries, hoping that the results are in the same order as the input :
Truth :
saccver, staxids
XP_001778611.1, 3218
XP_006409976.1, 72664
NP_001149654.2, 4577
XP_014499952.1, 3916
XP_016732373.1, 3635

time esearch -db protein \
             -query XP_001778611.1,NP_001149654.2,XP_014499952.1,XP_006409976.1,XP_016732373.1 \
             | elink -target taxonomy \
             | efetch -format uid;

72664
4577
3916
3635
3218
real 0m2.290s

Sadly, staxids results are sorted decreasingly, thus I don't see a way to map them to input saccver list here.

Last thing I tried was querying in the xml protein_db output, but I still fail to map saccver to staxids here :

time efetch -db protein \
            -id XP_001778611.1,XP_006409976.1,NP_001149654.2 \
            -format xml \
            | xtract -pattern Bioseq-set_seq-set \
                     -group Seqdesc_source \
                     -sep "@" \
                     -element Object-id_id \
                     -group Textseq-id \
                     -sep "@" \
                     -element Textseq-id_accession;

3218 72664 4577 XM_001778559 XP_001778611 XM_006409913 XP_006409976 NM_001156182 NP_001149654
real 0m1.204s

Any suggestions ?
Best regards

ADD REPLY • link 6.5 years ago by erwan.scaon ▴ 930

1

Entering edit mode

If you have a very large set of IDs, maybe you could run a local search against the accession to taxonomy data using the files available on ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/

ADD REPLY • link 6.5 years ago by Sej Modha 5.3k

0

Entering edit mode

That seems clever, I'll try it this afternoon

ADD REPLY • link 6.5 years ago by erwan.scaon ▴ 930

1

Entering edit mode

Ok, I should be able to handle it from there, thanks for the tips :

time zgrep -f list_5.txt \
           prot.accession2taxid.gz \
           | awk '{print $2"\t"$3}' $_

XP_001778611.1 3218
NP_001149654.2 4577
XP_006409976.1 72664
XP_014499952.1 3916
XP_016732373.1 3635

real 1m33.959s

ADD REPLY • link 6.5 years ago by erwan.scaon ▴ 930

0

Entering edit mode

Glad to know! An independent awk matching should be quicker than grep.

ADD REPLY • link 6.5 years ago by Sej Modha 5.3k

0

Entering edit mode

The last hurdle :

My protein database was downloaded from ncbi ftp :

$ wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plant/*.protein.faa.gz &>> wget.verbose;
$ cat *faa.gz > plant_refseq.faa.gz;

Then, I launch a blastx against this plant_refseq database.

When i query all saccver in my blastx output vs prot.accession2taxid.gz, some are not found.
Here are some examples :
XP_004986996.1
XP_012703670.2
XP_013748334.1

The first saccver problem (XP_004986996.1) is easy to solve :
In my blast, the subject was "XP_004986996.1".
But in the file prot.accession2taxid.gz, it was updated to the latest version = XP_004986996.2.

 $ zgrep 'XP_004986996' prot.accession2taxid.gz;
 accession  accession.version   taxid   gi
 XP_004986996   XP_004986996.2  4555    1256755570

Thus my search should be done on accession (XP_004986996.) and not accession+version (XP_004986996.1).

What's worrying is the last 2 saccver examples :
They cannot be found with the above correction.

I can query them in my initial file :

$ zgrep 'XP_013748334.1' ~/Desktop/RTDG/BISCEm/Data/plant_refseq.faa.gz;
>XP_013748334.1 PREDICTED: uncharacterized protein LOC106451027 isoform X1 [Brassica napus]

If you run an ncbi search with the term "XP_013748334", you find nothing.
If you try with "LOC106451027", you get : https://www.ncbi.nlm.nih.gov/protein/92362375.
On this page we find :
Accession : XP_013748335
REFSEQ: accession XM_013892881.2

Let's try to query with this newly found accession :

$ zgrep 'XP_013748335' prot.accession2taxid.gz 
XP_013748335    XP_013748335.1  3708    923623754

$ esearch -db protein \
          -query XP_013748335 \
          | elink -target taxonomy \
          | efetch -format uid;
3708

But it seems pretty hard to traceback / implement from my initial blastx result.

I did read ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/README, but I still do not figure how prot.accession2taxid was constructed : Is it logical that protein with "uncharacterized", "probable", etc... status are not included (which is the case in my example) ?

To rephrase, it seems that entries in prot.accession2taxid and plant_refseq.faa.gz have not been updated at the same time (things in plant_refseq.faa.gz being older it seems). Is there a way to be on the same page for both?

ADD REPLY • link 6.5 years ago by erwan.scaon ▴ 930

1

Entering edit mode

It looks like that record has been removed from https://www.ncbi.nlm.nih.gov/protein/XP_013748334.1?report=genpept. I would download dead_prot.accession2taxid files and combine it with the ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz to ensure that you capture all accession numbers.

ADD REPLY • link 6.5 years ago by Sej Modha 5.3k

0

Entering edit mode

Ok, it seems to be a clever suggestion once again. Thanks ! (I lack a bit of knowledge on how ncbi data are organized / how to query them efficiently)

ADD REPLY • link 6.5 years ago by erwan.scaon ▴ 930

0

Entering edit mode

and if you think you have pdb accession numbers in the customised dataset then download the ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/pdb.accession2taxid.gz and combine them with the other files.

ADD REPLY • link 6.5 years ago by Sej Modha 5.3k

score 1 · Answer 2 · 2017-10-09

1

Entering edit mode

6.5 years ago

Pierre Lindenbaum 161k

Here is a javascript code using CORS / XML / XPATH / NCBI-ESummary . Tested with firefox.

ADD COMMENT • link 6.5 years ago by Pierre Lindenbaum 161k

score 0 · Answer 3 · 2017-10-09

This may be a starting point for you : Using efetch + xtract (I guess you can call it inside your javascript code ?)

An example with this Pinus taeda nucleotide sequence : https://www.ncbi.nlm.nih.gov/nuccore/FJ070571

an='FJ070571';
efetch -db nuccore \
       -id $an \
       -format xml \
       | xtract -pattern Bioseq-set_seq-set \
                -division Seq-entry \
                -group Bioseq \
                -if Textseq-id_accession -equals $an \
                -element Textseq-id_accession \
                -branch Bioseq_descr \
                -block Org-ref_db \
                -element Object-id_id;

Output = 3352, which is the taxonomy ID for Pinus taeda (https://www.ncbi.nlm.nih.gov/taxonomy/?term=3352)

Ps : This command was not thoroughly tested, thus the efetch parsing might need some polishing to be successful for all possibles cases.