Question: Trying to match Accession numbers to Taxonomy IDs
0
gravatar for New2programming
9 weeks ago by
liverpool university
New2programming0 wrote:

Currently writing a sequence alignment programme in javascript and i want to be able to match the genbank accession numbers to species names and for a list of 500+ accessions. Is there any easy way to do this rather than changing over to python?

ADD COMMENTlink modified 9 weeks ago by Pierre Lindenbaum102k • written 9 weeks ago by New2programming0

in javascript

context needed: is it a cmd-line program or is it a web-page ?

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by Pierre Lindenbaum102k

its on a webpage so i'm work in java then convert to html format

ADD REPLYlink written 9 weeks ago by New2programming0

in java ????? I hope you mean javascript ?!

ADD REPLYlink written 9 weeks ago by Pierre Lindenbaum102k

Java is to javascript as ham is to hamster, right? :-p

ADD REPLYlink written 9 weeks ago by WouterDeCoster24k

yup javascript! as the name suggests this is all new territory for me!

ADD REPLYlink written 8 weeks ago by New2programming0
1
gravatar for Sej Modha
9 weeks ago by
Sej Modha2.3k
Glasgow, UK
Sej Modha2.3k wrote:

A simplified version of the above example with the edirect unix eutils would be:

esearch -db nuccore -query FJ070571|elink -target taxonomy |efetch -format uid
ADD COMMENTlink written 9 weeks ago by Sej Modha2.3k

Dear community,

I don't think the following question deserve a new post, given that it's largely inspired from Sej Modha answer, thus I am commenting on his answer :

Question title : "Retrieving blast staxids from saccver with ncbi E-utilities"

I am parsing blast+ output (custom outfmt 6 with saccver column).
I am interested in retrieving the staxids info for each hit.
The blast was done against a custom database, thus I cannot request the staxids column directly in the output.
But I can map saccver with staxids.

My issue is that I did not manage to do it efficiently yet, let me explain :

The output contains a large number of hits, representing many different saccver.
Ideally, I need to be able to query thousands of saccver at once with efetch / esearch and get a result where each saccver is associated with the corresponding staxids (a Python dictionary for example).

If I loop on my list of saccver, the result is fine but slow : Example for a single query :

time esearch -db protein \
             -query XP_001778611.1 \
             | elink -target taxonomy \
             | efetch -format uid;

3218
real 0m2.332s

Trying the same as above with 5 queries, hoping that the results are in the same order as the input :
Truth :
saccver, staxids
XP_001778611.1, 3218
XP_006409976.1, 72664
NP_001149654.2, 4577
XP_014499952.1, 3916
XP_016732373.1, 3635

time esearch -db protein \
             -query XP_001778611.1,NP_001149654.2,XP_014499952.1,XP_006409976.1,XP_016732373.1 \
             | elink -target taxonomy \
             | efetch -format uid;

72664
4577
3916
3635
3218
real 0m2.290s

Sadly, staxids results are sorted decreasingly, thus I don't see a way to map them to input saccver list here.

Last thing I tried was querying in the xml protein_db output, but I still fail to map saccver to staxids here :

time efetch -db protein \
            -id XP_001778611.1,XP_006409976.1,NP_001149654.2 \
            -format xml \
            | xtract -pattern Bioseq-set_seq-set \
                     -group Seqdesc_source \
                     -sep "@" \
                     -element Object-id_id \
                     -group Textseq-id \
                     -sep "@" \
                     -element Textseq-id_accession;

3218 72664 4577 XM_001778559 XP_001778611 XM_006409913 XP_006409976 NM_001156182 NP_001149654
real 0m1.204s

Any suggestions ?
Best regards

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by erwan.scaon330
1

If you have a very large set of IDs, maybe you could run a local search against the accession to taxonomy data using the files available on ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by Sej Modha2.3k

That seems clever, I'll try it this afternoon

ADD REPLYlink written 6 weeks ago by erwan.scaon330
1

Ok, I should be able to handle it from there, thanks for the tips :

time zgrep -f list_5.txt \
           prot.accession2taxid.gz \
           | awk '{print $2"\t"$3}' $_

XP_001778611.1 3218
NP_001149654.2 4577
XP_006409976.1 72664
XP_014499952.1 3916
XP_016732373.1 3635

real 1m33.959s

ADD REPLYlink written 6 weeks ago by erwan.scaon330

Glad to know! An independent awk matching should be quicker than grep.

ADD REPLYlink written 6 weeks ago by Sej Modha2.3k

The last hurdle :

My protein database was downloaded from ncbi ftp :

$ wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plant/*.protein.faa.gz &>> wget.verbose;
$ cat *faa.gz > plant_refseq.faa.gz;

Then, I launch a blastx against this plant_refseq database.

When i query all saccver in my blastx output vs prot.accession2taxid.gz, some are not found.
Here are some examples :
XP_004986996.1
XP_012703670.2
XP_013748334.1

The first saccver problem (XP_004986996.1) is easy to solve :
In my blast, the subject was "XP_004986996.1".
But in the file prot.accession2taxid.gz, it was updated to the latest version = XP_004986996.2.

 $ zgrep 'XP_004986996' prot.accession2taxid.gz;
 accession  accession.version   taxid   gi
 XP_004986996   XP_004986996.2  4555    1256755570

Thus my search should be done on accession (XP_004986996.) and not accession+version (XP_004986996.1).

What's worrying is the last 2 saccver examples :
They cannot be found with the above correction.

I can query them in my initial file :

$ zgrep 'XP_013748334.1' ~/Desktop/RTDG/BISCEm/Data/plant_refseq.faa.gz;
>XP_013748334.1 PREDICTED: uncharacterized protein LOC106451027 isoform X1 [Brassica napus]

If you run an ncbi search with the term "XP_013748334", you find nothing.
If you try with "LOC106451027", you get : https://www.ncbi.nlm.nih.gov/protein/92362375.
On this page we find :
Accession : XP_013748335
REFSEQ: accession XM_013892881.2

Let's try to query with this newly found accession :

$ zgrep 'XP_013748335' prot.accession2taxid.gz 
XP_013748335    XP_013748335.1  3708    923623754

$ esearch -db protein \
          -query XP_013748335 \
          | elink -target taxonomy \
          | efetch -format uid;
3708

But it seems pretty hard to traceback / implement from my initial blastx result.

I did read ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/README, but I still do not figure how prot.accession2taxid was constructed : Is it logical that protein with "uncharacterized", "probable", etc... status are not included (which is the case in my example) ?

To rephrase, it seems that entries in prot.accession2taxid and plant_refseq.faa.gz have not been updated at the same time (things in plant_refseq.faa.gz being older it seems). Is there a way to be on the same page for both?

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by erwan.scaon330
1

It looks like that record has been removed from https://www.ncbi.nlm.nih.gov/protein/XP_013748334.1?report=genpept. I would download dead_prot.accession2taxid files and combine it with the ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz to ensure that you capture all accession numbers.

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by Sej Modha2.3k

Ok, it seems to be a clever suggestion once again. Thanks ! (I lack a bit of knowledge on how ncbi data are organized / how to query them efficiently)

ADD REPLYlink written 6 weeks ago by erwan.scaon330

and if you think you have pdb accession numbers in the customised dataset then download the ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/pdb.accession2taxid.gz and combine them with the other files.

ADD REPLYlink written 6 weeks ago by Sej Modha2.3k
1
gravatar for Pierre Lindenbaum
9 weeks ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum102k wrote:

Here is a javascript code using CORS / XML / XPATH / NCBI-ESummary . Tested with firefox.

ADD COMMENTlink written 9 weeks ago by Pierre Lindenbaum102k
0
gravatar for erwan.scaon
9 weeks ago by
erwan.scaon330
Limoges - CBRS - France
erwan.scaon330 wrote:

This may be a starting point for you : Using efetch + xtract (I guess you can call it inside your javascript code ?)

An example with this Pinus taeda nucleotide sequence : https://www.ncbi.nlm.nih.gov/nuccore/FJ070571

an='FJ070571';
efetch -db nuccore \
       -id $an \
       -format xml \
       | xtract -pattern Bioseq-set_seq-set \
                -division Seq-entry \
                -group Bioseq \
                -if Textseq-id_accession -equals $an \
                -element Textseq-id_accession \
                -branch Bioseq_descr \
                -block Org-ref_db \
                -element Object-id_id;

Output = 3352, which is the taxonomy ID for Pinus taeda (https://www.ncbi.nlm.nih.gov/taxonomy/?term=3352)

Ps : This command was not thoroughly tested, thus the efetch parsing might need some polishing to be successful for all possibles cases.

ADD COMMENTlink modified 9 weeks ago • written 9 weeks ago by erwan.scaon330
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 657 users visited in the last hour