Question: Accession number to taxonomy id after blasting
1
gravatar for Earendil
2.9 years ago by
Earendil20
Sweden
Earendil20 wrote:

I have performed a batch blastn search against ncbi nt without specifying staxids while doing having output format as -outfmt 6. Am I doomed to repeat the blast again in the correct way this time or is there a workaround like someway to parse taxdb?

Thank you.

blast • 3.4k views
ADD COMMENTlink modified 2.9 years ago by jeremy.cox.290 • written 2.9 years ago by Earendil20

Assuming the following still works: Automatically Getting The Ncbi Taxonomy Id From The Genbank Identifier

You could also use taxonomy.sh from BBMap. You will first need to create a local copy of the tree using taxtree.sh and names.dump file from NCBI FTP.

Description:  Creates tree.taxtree from names.dmp and nodes.dmp.
These are in taxdmp.zip available at ftp://ftp.ncbi.nih.gov/pub/taxonomy/
The taxtree file is needed for programs that can deal with taxonomy, 
like Seal and SortByTaxa.

Usage:  taxtree.sh names.dmp nodes.dmp tree.taxtree.gz
ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by genomax71k
2
gravatar for Sej Modha
2.9 years ago by
Sej Modha4.4k
Glasgow, UK
Sej Modha4.4k wrote:

You can download ncbi taxonomy ID mapping for accession number if you have a really long list of accession that you would like to fetch taxonomy ID for.

ADD COMMENTlink written 2.9 years ago by Sej Modha4.4k

That's exactly what I did, and thought that it is not the right file for the Nt database (I took the nucl_gb.accession2taxid.gz only) because I got some results that didn't make sense, but I am investigating it again right now. It could be that my problem is something else after all.

ADD REPLYlink written 2.9 years ago by Earendil20
1

worth looking at the dead_* files just in case.

ADD REPLYlink written 2.9 years ago by Sej Modha4.4k
0
gravatar for jeremy.cox.2
2.9 years ago by
jeremy.cox.290
United States
jeremy.cox.290 wrote:

It is hard to understand your exact problem without an example BLAST outformat 6 line. However, I routinely do this sort of conversion, and with some minimal programming you can convert this.

If your database is built from NCBI database, then the Accession number should be in the reference sequence name.

EXAMPLE:

ST-E00106:201:HCNTGCCXX:6:2110:30645:11646/1 gi|9626372|ref|NC_001422.1| 98.67 150 2 0 1 150 2430 2281 5e-69 267

The part in bold (column #2) gives me the name of the aligned sequence, which contains the Accession number NC_001422.1. Note that this output using an old database that still included gid numbers - so your example might be different.

In Python, I would use this command to extract the Accession number:

accession = line.split("\t")[1].split("|")[3]
percentIdentity = float(line.split("\t")[2])

NCBI provides a list of accession numbers and corresponding taxon IDs. ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/ You can grep or search this file in other ways to get the answer you need. Typically, in Python, I make a dictionary of the Accession numbers I need, and then read the whole file, extracting only the information I need. This code would depend on the exact files and formats you have.

You can also go here ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/ and download a taxdump archive. Inside this archive is a nodes.dmp and names.dmp, which define the relationships in the NCBI taxonomy tree.

So, you must first convert the Accession number to a taxon ID, and that taxon ID may be a species, maybe a strain, or maybe something else; some sequences are assigned to a phylum, for example. With a little work, you can find out your species or genus.

I have a personal code library for the steps involving parsing the taxonomy tree, after the accession number is converted to a taxon ID. Be warned, I never wrote it thinking someone else would use it! Note that giLookup() function could be modified to do an accessionLookup() fairly easily. You need to download the files above and get the PATH variable configured correctly for it to work.

ADD COMMENTlink modified 2.9 years ago • written 2.9 years ago by jeremy.cox.290

Use GitHub Gist to share code. Biostars understands GitHub.

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by genomax71k

How does that work? Do yu need to embed the Gist URL?

ADD REPLYlink written 2.9 years ago by Santosh Anand4.9k

Just provide/embed URL for your public gist code in text of your post.

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by genomax71k

It worked like magic. Thanks

ADD REPLYlink written 2.9 years ago by jeremy.cox.290

Your answer and script is very elaborate and thank you for this. However, since I didn't state the reason why I want the taxIDs in the first place, I am going to say it now. I only wanted to know if a sequence with a hit was found to be fungal. Thus, I downloaded the taxID list of fungal taxa (fungi and all subtaxa) with esearch and efetch from ncbi. Then, I downloaded the file ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz and finally I made a python script which would take the accnr from each of my hits, see which is its' taxid from nucl_gb.accession2taxid and see if that taxid was fungal, from the list I made before.

My problem was that after retrieving the "fungal" reads (because I was blasting reads) with my script, after evaluating the results with 2 other online tools (kegg and kaiju), the tools indicated that I had 25% fungi in this instead. That made me think that I chose the wrong file (nucl_gb.accession2taxid.gz) for the database that I am blasting to (the NT database, ftp://ftp.ncbi.nih.gov/blast/db/nt*).

Apparently, it seems that neither my script is wrong nor the file is inappropriate for the database and so I believe that the difference of blastn vs kegg + kaiju is due to the databases + search algorithms that they use, although I would like to hear opinions on this.

ADD REPLYlink written 2.9 years ago by Earendil20

BLAST results can be limited to a number of GI or seqid by providing a list of GI or seqid when you run blast. For more info checkout BLAST options.

ADD REPLYlink written 2.9 years ago by Sej Modha4.4k

Thank you for your reply, I didn't know about that option. However it would not be what I want as this limits the search space for the database to the specified GIs and I what the search to be done to all of the database, but have the results to be limited to fungi. I could do that afterwards by specifying in the -outfmt 6 option the "staxids" paramater. Sadly I didn't do that and it would be too time consuming to perform blast again for me right now.

ADD REPLYlink written 2.8 years ago by Earendil20

Hi! Can you share the code that you used for converting the accession ids to taxids from the blast output file?

ADD REPLYlink written 2.3 years ago by Promi10
1

You can use following one liner with esearch unix utils. Update the database type and accession number in the following esearch command to obtain taxonomy ID for specific accession number.

esearch -db nucleotide -query 'AccessionNumber'|elink -target taxonomy|efetch -format uid
ADD REPLYlink written 2.3 years ago by Sej Modha4.4k

Thanks for the reply Sej Modha.

I tried to install: cd ~ perl -MNet::FTP -e \ '$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1); $ftp->login; $ftp->binary; $ftp->get("/entrez/entrezdirect/edirect.zip");' unzip -u -q edirect.zip rm edirect.zip export PATH=$PATH:$HOME/edirect ./edirect/setup.sh

and followed the further instructions.

But I tried with a single example ID, it doesn't work. Could you check and let me know?

esearch -db protein -query 345010482|elink -target taxonomy|efetch -format uid > result

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by Promi10

I am not sure why it doesn't work. Maybe coz you are using GIs instead of accession number and GIs are now phased out by NCBI. https://ncbiinsights.ncbi.nlm.nih.gov/2016/07/15/ncbi-is-phasing-out-sequence-gis-heres-what-you-need-to-know/

ADD REPLYlink written 2.3 years ago by Sej Modha4.4k

Dear Jeremy,

Do you have a code than can extract the TaxID based on a list of accession ID?

Thanks!

ADD REPLYlink written 5 months ago by Ming30

Following code can help you with that.

esearch -db nuccore  -query NC_001422.1|elink -target taxonomy|efetch -format uid
ADD REPLYlink modified 5 months ago • written 5 months ago by Sej Modha4.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1951 users visited in the last hour