Accession number to taxonomy id after blasting

Entering edit mode

8.6 years ago

Earendil ▴ 50

I have performed a batch blastn search against ncbi nt without specifying staxids while doing having output format as -outfmt 6. Am I doomed to repeat the blast again in the correct way this time or is there a workaround like someway to parse taxdb?

Thank you.

blast • 11k views

ADD COMMENT • link updated 8.6 years ago by jeremy.cox.2 ▴ 130 • written 8.6 years ago by Earendil ▴ 50

Entering edit mode

Assuming the following still works: Automatically Getting The Ncbi Taxonomy Id From The Genbank Identifier

You could also use taxonomy.sh from BBMap. You will first need to create a local copy of the tree using taxtree.sh and names.dump file from NCBI FTP.

Description:  Creates tree.taxtree from names.dmp and nodes.dmp.
These are in taxdmp.zip available at ftp://ftp.ncbi.nih.gov/pub/taxonomy/
The taxtree file is needed for programs that can deal with taxonomy, 
like Seal and SortByTaxa.

Usage:  taxtree.sh names.dmp nodes.dmp tree.taxtree.gz

ADD REPLY • link 8.6 years ago by GenoMax 151k

Entering edit mode

8.6 years ago

Sej Modha 5.3k

You can download ncbi taxonomy ID mapping for accession number if you have a really long list of accession that you would like to fetch taxonomy ID for.

ADD COMMENT • link 8.6 years ago by Sej Modha 5.3k

Entering edit mode

That's exactly what I did, and thought that it is not the right file for the Nt database (I took the nucl_gb.accession2taxid.gz only) because I got some results that didn't make sense, but I am investigating it again right now. It could be that my problem is something else after all.

ADD REPLY • link 8.6 years ago by Earendil ▴ 50

Entering edit mode

worth looking at the dead_* files just in case.

ADD REPLY • link 8.6 years ago by Sej Modha 5.3k

Entering edit mode

8.6 years ago

jeremy.cox.2 ▴ 130

It is hard to understand your exact problem without an example BLAST outformat 6 line. However, I routinely do this sort of conversion, and with some minimal programming you can convert this.

If your database is built from NCBI database, then the Accession number should be in the reference sequence name.

EXAMPLE:

ST-E00106:201:HCNTGCCXX:6:2110:30645:11646/1 gi|9626372|ref|NC_001422.1| 98.67 150 2 0 1 150 2430 2281 5e-69 267

The part in bold (column #2) gives me the name of the aligned sequence, which contains the Accession number NC_001422.1. Note that this output using an old database that still included gid numbers - so your example might be different.

In Python, I would use this command to extract the Accession number:

accession = line.split("\t")[1].split("|")[3]
percentIdentity = float(line.split("\t")[2])

NCBI provides a list of accession numbers and corresponding taxon IDs. ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/ You can grep or search this file in other ways to get the answer you need. Typically, in Python, I make a dictionary of the Accession numbers I need, and then read the whole file, extracting only the information I need. This code would depend on the exact files and formats you have.

You can also go here ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/ and download a taxdump archive. Inside this archive is a nodes.dmp and names.dmp, which define the relationships in the NCBI taxonomy tree.

So, you must first convert the Accession number to a taxon ID, and that taxon ID may be a species, maybe a strain, or maybe something else; some sequences are assigned to a phylum, for example. With a little work, you can find out your species or genus.

I have a personal code library for the steps involving parsing the taxonomy tree, after the accession number is converted to a taxon ID. Be warned, I never wrote it thinking someone else would use it! Note that giLookup() function could be modified to do an accessionLookup() fairly easily. You need to download the files above and get the PATH variable configured correctly for it to work.


	### Code by Jeremy.Cox@cchmc.org, provided as freeware without warranty ###
	### A library for processing blast hits which provide NCBI gid / accession numbers to extract meaningful taxonomy information.###
	### Warning: I did not write this for public consumption, so watch out for unintended bugs! ###

	### This code written BEFORE the switch to Accession numbers away from gid ###
	### Python 2.7 ###

	import re
	import sys
	import os
	from subprocess import Popen, PIPE

	PATH="/data/aplab/microbiome_genome_only/ncbi_gi_taxid/"
	# ncbi gi dump files location
	# PATH="./"

	LINNAEUS_FILTER = ["species","genus","family","order","class","phylum","kingdom","superkingdom"]

	def buildNames():
	inFile = open(PATH+"names.dmp")

	result = dict()

	for line in inFile:
	line = line.replace("\t","")
	line = line.replace("\|\n","")
	splits = line.split("\|")
	if splits[-1] == 'scientific name':
	key = int( line.split("\|")[0] )
	name = line.split("\|")[1]
	result[key] = name

	return result

	print >> sys.stderr, 'taxa_library_cluster initializing :: Building Names database'
	g_names = buildNames()

	def buildNodes():
	inFile = open(PATH+"nodes.dmp")

	result = dict()
	levels = dict()

	for line in inFile:
	#print line
	line = line.replace("\t","")
	line = line.replace("\|\n","")
	splits = line.split("\|")
	#print splits
	taxaID = int(splits[0])
	parentID = int(splits[1])
	level = splits[2]
	result[taxaID] = (parentID, level)
	levels[level]=1

	return (result,levels)

	print >> sys.stderr, 'taxa_library_cluster initializing :: Building Nodes database'
	(g_nodes, g_levels) = buildNodes()

	def buildLineage( x , names, nodes, filter = LINNAEUS_FILTER ):
	k = int(x)
	result={}

	while k != 1 and k != 131567:
	if k in nodes:
	(next, level) = nodes[k]
	if filter == None or level in filter:
	result[level] = (k,names[k])
	k = next
	else:
	# https://docs.python.org/3/library/exceptions.html#exception-hierarchy
	raise AssertionError("Looking up taxa %d :: ancestor %d is not in taxonomy" % (x,k))
	break

	return result

	def giListLookup( giList ):
	#takes a list of gi and returns dictionary mapping gi to taxaID
	taxonomy = {}
	for line in open(PATH+"gi_taxid_nucl.dmp"):
	(gi, taxid) = line.split()
	if gi in giList:
	taxonomy[gi] = int(taxid)
	return taxonomy

	def giLookup( gid ):
	result = -1
	'''
	#commands = ["grep", "-Pm", "1", '"^'+str(gid)+'\'+'t"', PATH+"gi_taxid_nucl.dmp", "\|", "cut", "-f2"]
	commands = ["grep", "-Pm", "1", '"^'+str(gid)+'"', PATH+"gi_taxid_nucl.dmp", "\|", "cut", "-f2"]

	print >> sys.stderr, commands
	#temp = ""
	#for item in commands:
	# temp += item+" "
	#print >> sys.stderr, temp
	print >> sys.stderr, " ".join(commands)

	proc = Popen(commands, stdout=PIPE, bufsize=1)
	#proc = Popen(temp, stdout=PIPE, bufsize=1)
	for line in iter(proc.stdout.readline, b''):
	result = int(line)
	proc.communicate()
	'''
	commands = ["grep", "-Pm", "1", '"^'+str(gid)+'"', PATH+"gi_taxid_nucl.dmp", "\|", "cut", "-f2", ">", "temp"]
	os.system(" ".join(commands))
	for line in open("temp", "r"):
	result = int(line)
	return result

	def getSpeciesID( taxaID ):
	result = 0
	currID = taxaID
	if not taxaID in g_nodes:
	result = -1
	else:
	while currID != 1 and currID != 131567:
	if g_nodes[currID][1] == "species":
	result = currID
	break
	currID = g_nodes[currID][0]
	return result

	def getGID( text ):
	splits = text[1:].split("\|")
	for k in range(len(splits)-1):
	if re.search("gi",splits[k]):
	return splits[k+1]
	return "-1"

	if __name__== "__main__":
	query = 4932
	queryName = "Saccharomyces cerevisiae"

	names = buildNames()
	(nodes, levels) = buildNodes()

	taxonomy = buildLineage(query, names, nodes)

	result = str(query)+"\t"+queryName+"\|"

	for level in LINNAEUS_FILTER:
	if level in taxonomy:
	result += "\t"+str(taxonomy[level][0])+"\t"+taxonomy[level][1]+"\|"
	else:
	result += "\tnone\tnone"

	print result

view raw taxonomyLibrary.py hosted with ❤ by GitHub

ADD COMMENT • link 8.6 years ago by jeremy.cox.2 ▴ 130

Entering edit mode

Use GitHub Gist to share code. Biostars understands GitHub.

ADD REPLY • link 8.6 years ago by GenoMax 151k

Entering edit mode

How does that work? Do yu need to embed the Gist URL?

ADD REPLY • link 8.6 years ago by Santosh Anand 5.8k

Entering edit mode

Just provide/embed URL for your public gist code in text of your post.

ADD REPLY • link 8.6 years ago by GenoMax 151k

Entering edit mode

It worked like magic. Thanks

ADD REPLY • link 8.6 years ago by jeremy.cox.2 ▴ 130

Entering edit mode

Your answer and script is very elaborate and thank you for this. However, since I didn't state the reason why I want the taxIDs in the first place, I am going to say it now. I only wanted to know if a sequence with a hit was found to be fungal. Thus, I downloaded the taxID list of fungal taxa (fungi and all subtaxa) with esearch and efetch from ncbi. Then, I downloaded the file ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz and finally I made a python script which would take the accnr from each of my hits, see which is its' taxid from nucl_gb.accession2taxid and see if that taxid was fungal, from the list I made before.

My problem was that after retrieving the "fungal" reads (because I was blasting reads) with my script, after evaluating the results with 2 other online tools (kegg and kaiju), the tools indicated that I had 25% fungi in this instead. That made me think that I chose the wrong file (nucl_gb.accession2taxid.gz) for the database that I am blasting to (the NT database, ftp://ftp.ncbi.nih.gov/blast/db/nt*).

Apparently, it seems that neither my script is wrong nor the file is inappropriate for the database and so I believe that the difference of blastn vs kegg + kaiju is due to the databases + search algorithms that they use, although I would like to hear opinions on this.

ADD REPLY • link 8.6 years ago by Earendil ▴ 50

Entering edit mode

BLAST results can be limited to a number of GI or seqid by providing a list of GI or seqid when you run blast. For more info checkout BLAST options.

ADD REPLY • link 8.6 years ago by Sej Modha 5.3k

Entering edit mode

Thank you for your reply, I didn't know about that option. However it would not be what I want as this limits the search space for the database to the specified GIs and I what the search to be done to all of the database, but have the results to be limited to fungi. I could do that afterwards by specifying in the -outfmt 6 option the "staxids" paramater. Sadly I didn't do that and it would be too time consuming to perform blast again for me right now.

ADD REPLY • link 8.6 years ago by Earendil ▴ 50

Entering edit mode

Hi! Can you share the code that you used for converting the accession ids to taxids from the blast output file?

ADD REPLY • link 8.0 years ago by Promi ▴ 10

Entering edit mode

You can use following one liner with esearch unix utils. Update the database type and accession number in the following esearch command to obtain taxonomy ID for specific accession number.

esearch -db nucleotide -query 'AccessionNumber'|elink -target taxonomy|efetch -format uid

ADD REPLY • link 8.0 years ago by Sej Modha 5.3k

Entering edit mode

Thanks for the reply Sej Modha.

I tried to install: cd ~ perl -MNet::FTP -e \ '$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1); $ftp->login; $ftp->binary; $ftp->get("/entrez/entrezdirect/edirect.zip");' unzip -u -q edirect.zip rm edirect.zip export PATH=$PATH:$HOME/edirect ./edirect/setup.sh

and followed the further instructions.

But I tried with a single example ID, it doesn't work. Could you check and let me know?

esearch -db protein -query 345010482|elink -target taxonomy|efetch -format uid > result

ADD REPLY • link 8.0 years ago by Promi ▴ 10

Entering edit mode

I am not sure why it doesn't work. Maybe coz you are using GIs instead of accession number and GIs are now phased out by NCBI. https://ncbiinsights.ncbi.nlm.nih.gov/2016/07/15/ncbi-is-phasing-out-sequence-gis-heres-what-you-need-to-know/

ADD REPLY • link 8.0 years ago by Sej Modha 5.3k

Entering edit mode

Dear Jeremy,

Do you have a code than can extract the TaxID based on a list of accession ID?

Thanks!

ADD REPLY • link 6.2 years ago by Ming ▴ 110

Entering edit mode

Following code can help you with that.

esearch -db nuccore  -query NC_001422.1|elink -target taxonomy|efetch -format uid

ADD REPLY • link 6.2 years ago by Sej Modha 5.3k