Accession number to taxonomy id after blasting
2
1
Entering edit mode
5.9 years ago
Earendil ▴ 50

I have performed a batch blastn search against ncbi nt without specifying staxids while doing having output format as -outfmt 6. Am I doomed to repeat the blast again in the correct way this time or is there a workaround like someway to parse taxdb?

Thank you.

blast • 7.0k views
0
Entering edit mode

Assuming the following still works: Automatically Getting The Ncbi Taxonomy Id From The Genbank Identifier

You could also use taxonomy.sh from BBMap. You will first need to create a local copy of the tree using taxtree.sh and names.dump file from NCBI FTP.

Description:  Creates tree.taxtree from names.dmp and nodes.dmp.
These are in taxdmp.zip available at ftp://ftp.ncbi.nih.gov/pub/taxonomy/
The taxtree file is needed for programs that can deal with taxonomy,
like Seal and SortByTaxa.

Usage:  taxtree.sh names.dmp nodes.dmp tree.taxtree.gz

2
Entering edit mode
5.9 years ago
Sej Modha 5.2k

You can download ncbi taxonomy ID mapping for accession number if you have a really long list of accession that you would like to fetch taxonomy ID for.

0
Entering edit mode

That's exactly what I did, and thought that it is not the right file for the Nt database (I took the nucl_gb.accession2taxid.gz only) because I got some results that didn't make sense, but I am investigating it again right now. It could be that my problem is something else after all.

1
Entering edit mode

worth looking at the dead_* files just in case.

0
Entering edit mode
5.9 years ago
jeremy.cox.2 ▴ 110

It is hard to understand your exact problem without an example BLAST outformat 6 line. However, I routinely do this sort of conversion, and with some minimal programming you can convert this.

If your database is built from NCBI database, then the Accession number should be in the reference sequence name.

EXAMPLE:

ST-E00106:201:HCNTGCCXX:6:2110:30645:11646/1 gi|9626372|ref|NC_001422.1| 98.67 150 2 0 1 150 2430 2281 5e-69 267

The part in bold (column #2) gives me the name of the aligned sequence, which contains the Accession number NC_001422.1. Note that this output using an old database that still included gid numbers - so your example might be different.

In Python, I would use this command to extract the Accession number:

accession = line.split("\t")[1].split("|")[3]
percentIdentity = float(line.split("\t")[2])


NCBI provides a list of accession numbers and corresponding taxon IDs. ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/ You can grep or search this file in other ways to get the answer you need. Typically, in Python, I make a dictionary of the Accession numbers I need, and then read the whole file, extracting only the information I need. This code would depend on the exact files and formats you have.

You can also go here ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/ and download a taxdump archive. Inside this archive is a nodes.dmp and names.dmp, which define the relationships in the NCBI taxonomy tree.

So, you must first convert the Accession number to a taxon ID, and that taxon ID may be a species, maybe a strain, or maybe something else; some sequences are assigned to a phylum, for example. With a little work, you can find out your species or genus.

I have a personal code library for the steps involving parsing the taxonomy tree, after the accession number is converted to a taxon ID. Be warned, I never wrote it thinking someone else would use it! Note that giLookup() function could be modified to do an accessionLookup() fairly easily. You need to download the files above and get the PATH variable configured correctly for it to work.

0
Entering edit mode

Use GitHub Gist to share code. Biostars understands GitHub.

0
Entering edit mode

How does that work? Do yu need to embed the Gist URL?

0
Entering edit mode

Just provide/embed URL for your public gist code in text of your post.

0
Entering edit mode

It worked like magic. Thanks

0
Entering edit mode

Your answer and script is very elaborate and thank you for this. However, since I didn't state the reason why I want the taxIDs in the first place, I am going to say it now. I only wanted to know if a sequence with a hit was found to be fungal. Thus, I downloaded the taxID list of fungal taxa (fungi and all subtaxa) with esearch and efetch from ncbi. Then, I downloaded the file ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz and finally I made a python script which would take the accnr from each of my hits, see which is its' taxid from nucl_gb.accession2taxid and see if that taxid was fungal, from the list I made before.

My problem was that after retrieving the "fungal" reads (because I was blasting reads) with my script, after evaluating the results with 2 other online tools (kegg and kaiju), the tools indicated that I had 25% fungi in this instead. That made me think that I chose the wrong file (nucl_gb.accession2taxid.gz) for the database that I am blasting to (the NT database, ftp://ftp.ncbi.nih.gov/blast/db/nt*).

Apparently, it seems that neither my script is wrong nor the file is inappropriate for the database and so I believe that the difference of blastn vs kegg + kaiju is due to the databases + search algorithms that they use, although I would like to hear opinions on this.

0
Entering edit mode

BLAST results can be limited to a number of GI or seqid by providing a list of GI or seqid when you run blast. For more info checkout BLAST options.

0
Entering edit mode

Thank you for your reply, I didn't know about that option. However it would not be what I want as this limits the search space for the database to the specified GIs and I what the search to be done to all of the database, but have the results to be limited to fungi. I could do that afterwards by specifying in the -outfmt 6 option the "staxids" paramater. Sadly I didn't do that and it would be too time consuming to perform blast again for me right now.

0
Entering edit mode

Hi! Can you share the code that you used for converting the accession ids to taxids from the blast output file?

1
Entering edit mode

You can use following one liner with esearch unix utils. Update the database type and accession number in the following esearch command to obtain taxonomy ID for specific accession number.

esearch -db nucleotide -query 'AccessionNumber'|elink -target taxonomy|efetch -format uid

0
Entering edit mode

Thanks for the reply Sej Modha.

I tried to install: cd ~ perl -MNet::FTP -e \ '$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1);$ftp->login; $ftp->binary;$ftp->get("/entrez/entrezdirect/edirect.zip");' unzip -u -q edirect.zip rm edirect.zip export PATH=$PATH:$HOME/edirect ./edirect/setup.sh

and followed the further instructions.

But I tried with a single example ID, it doesn't work. Could you check and let me know?

esearch -db protein -query 345010482|elink -target taxonomy|efetch -format uid > result

0
Entering edit mode

I am not sure why it doesn't work. Maybe coz you are using GIs instead of accession number and GIs are now phased out by NCBI. https://ncbiinsights.ncbi.nlm.nih.gov/2016/07/15/ncbi-is-phasing-out-sequence-gis-heres-what-you-need-to-know/

0
Entering edit mode

Dear Jeremy,

Do you have a code than can extract the TaxID based on a list of accession ID?

Thanks!

0
Entering edit mode

esearch -db nuccore  -query NC_001422.1|elink -target taxonomy|efetch -format uid