It is hard to understand your exact problem without an example BLAST outformat 6 line. However, I routinely do this sort of conversion, and with some minimal programming you can convert this.
If your database was built from an NCBI database, then the Accession number should be in the reference sequence name. For example:
ST-E00106:201:HCNTGCCXX:6:2110:30645:11646/1 gi|9626372|ref|NC_001422.1| 98.67 150 2 0 1 150 2430 2281 5e-69 267
Column #2 (gi|9626372|ref|NC_001422.1|) gives me the name of the aligned sequence, which contains the Accession number NC_001422.1.
Note that this output was produced using an old database that still included GI numbers, so your example might look different.
In Python, I would use these commands to extract the Accession number (4th "|"-separated piece of column 2) and the percent identity (column 3):
accession = line.split("\t")[1].split("|")[3]
percentIdentity = float(line.split("\t")[2])
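The one-liners above can be wrapped in a small function that also copes with newer databases whose subject IDs are a bare accession with no gi|...| prefix. This is only a sketch: the name parse_blast_hit is made up for illustration, and taking the last "|"-separated piece is an assumption that holds for IDs like the example above but may not hold for every database.

```python
def parse_blast_hit(line):
    """Return (accession, percent_identity) from one BLAST outfmt 6 line."""
    fields = line.rstrip("\n").split("\t")
    subject = fields[1]  # column 2: subject (reference) sequence ID
    if "|" in subject:
        # Old-style ID such as gi|9626372|ref|NC_001422.1|
        # Assumption: the accession is the last non-empty piece.
        parts = [p for p in subject.split("|") if p]
        accession = parts[-1]
    else:
        # Newer databases often report the accession directly.
        accession = subject
    percent_identity = float(fields[2])  # column 3
    return accession, percent_identity
```

Calling it on each line of the BLAST output gives you the accession to look up and the identity value to filter on.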
NCBI provides a list of accession numbers and corresponding taxon IDs. ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/
You can grep or search this file in other ways to get the answer you need.
Typically, in Python, I make a dictionary of the Accession numbers I need, and then read the whole file, extracting only the information I need. This code would depend on the exact files and formats you have.
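As a rough sketch of that approach: the accession2taxid files are tab-separated with a header line and the columns accession, accession.version, taxid, gi. The function name lookup_taxids and the file name in the usage note are my own placeholders, so adjust them to whatever you downloaded.

```python
import gzip

def lookup_taxids(accessions, accession2taxid_path):
    """Map versioned accessions (e.g. NC_001422.1) to NCBI taxon IDs."""
    wanted = set(accessions)
    found = {}
    path = str(accession2taxid_path)
    # The files on the FTP site are gzipped; also accept an unpacked copy.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        next(handle, None)  # skip the header line
        for row in handle:
            cols = row.rstrip("\n").split("\t")
            if len(cols) < 3:
                continue  # skip malformed lines
            acc_version, taxid = cols[1], cols[2]
            if acc_version in wanted:
                found[acc_version] = int(taxid)
                if len(found) == len(wanted):
                    break  # everything found, stop reading early
    return found
```

For example, lookup_taxids({"NC_001422.1"}, "nucl_gb.accession2taxid.gz") would scan the file once and return a dictionary with whichever of the requested accessions it found. The files are large, so reading them line by line like this avoids loading everything into memory.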
You can also go here ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/ and download a taxdump archive. Inside this archive are nodes.dmp and names.dmp, which define the parent-child relationships and the names in the NCBI taxonomy tree.
So, you must first convert the Accession number to a taxon ID, and that taxon ID may be a species, maybe a strain, or maybe something else; some sequences are assigned to a phylum, for example. With a little work, you can find out your species or genus.
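The "little work" is mostly walking up the tree from your taxon ID until you hit the rank you want. Here is a minimal sketch, not my library code: it assumes the standard nodes.dmp layout (fields separated by tab-pipe-tab, with tax_id, parent tax_id and rank as the first three fields), and the function names load_nodes and find_rank are made up for illustration.

```python
def load_nodes(nodes_path):
    """Read nodes.dmp into two dicts: taxid -> parent taxid, taxid -> rank."""
    parents, ranks = {}, {}
    with open(nodes_path) as handle:
        for row in handle:
            # Each line ends with "\t|\n"; fields are separated by "\t|\t".
            cols = row.rstrip("\t|\n").split("\t|\t")
            taxid = int(cols[0])
            parents[taxid] = int(cols[1])
            ranks[taxid] = cols[2]
    return parents, ranks

def find_rank(taxid, target_rank, parents, ranks):
    """Walk up from taxid; return the ancestor (or self) at target_rank, else None."""
    while True:
        if ranks.get(taxid) == target_rank:
            return taxid
        parent = parents.get(taxid)
        # The root node is its own parent; an unknown taxid has no parent.
        if parent is None or parent == taxid:
            return None
        taxid = parent
```

So find_rank(taxid, "species", parents, ranks) returns the species-level taxon ID for a strain, and returns None for a sequence assigned above species level (a phylum, say). The scientific name for the resulting taxon ID can then be looked up in names.dmp.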
I have a personal code library for the steps involving parsing the taxonomy tree, after the accession number is converted to a taxon ID. Be warned, I never wrote it thinking someone else would use it! Note that the giLookup() function could be modified to do an accessionLookup() fairly easily. You need to download the files above and get the PATH variable configured correctly for it to work.