I'm interested in getting the scientific names of my BLAST hits from searches run locally. I see that the blast+ search applications have an -outfmt option which can take sscinames (it seems to be new in BLAST+ 2.2.28), but even using nt from NCBI (no luck with local databases either) I get N/A for this specifier. The same happens for '%S' with blastdbcmd:
$ blastdbcmd -db nt -entry 229577210 -outfmt '%a || %g || %T || %S || %t'
NM_001743.4 || 229577210 || 9606 || N/A || Homo sapiens calmodulin 2 (phosphorylase kinase, delta) (CALM2), mRNA
Until now I've been using taxids in a very convoluted way: I get the GIs from my hits, query the BLAST database with blastdbcmd to get the taxid, and then query a local copy of the NCBI taxonomy database with BioPerl to get the scientific name. Now that blast+ seems to be able to output the scientific name directly, I would like to simplify things. I have already simplified things a little with the new output format specifier staxids, so I can now get the taxid directly from the BLAST output.
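For reference, the taxid-to-name step of that workaround can be sketched without BioPerl. This is only a rough sketch under assumptions: it assumes the taxonomy copy is the names.dmp file from NCBI's taxdump archive (fields separated by tab, pipe, tab, with the name class in the fourth field), and that the BLAST tabular output has staxids as its last column. The function name add_sciname is made up for illustration.

```shell
# add_sciname NAMES_DMP < blast_tabular_output
# Appends the scientific name as an extra column to tab-separated BLAST
# output whose LAST column is staxids.
# Assumption: NAMES_DMP is names.dmp from NCBI's taxdump archive.
add_sciname() {
    names_dmp=$1
    awk -F'\t' -v OFS='\t' '
        # First input (names.dmp): split on tabs, so the "|" separators land
        # in the even fields; $1 = taxid, $3 = name, $7 = name class.
        NR == FNR {
            if ($7 == "scientific name") name[$1] = $3
            next
        }
        # Second input (BLAST output on stdin): staxids may hold several
        # taxids separated by ";", so translate each one in turn.
        {
            n = split($NF, ids, ";")
            out = ""
            for (i = 1; i <= n; i++) {
                sci = (ids[i] in name) ? name[ids[i]] : "N/A"
                out = out (i > 1 ? ";" : "") sci
            }
            print $0, out
        }
    ' "$names_dmp" -
}
```

It would be used along the lines of `blastn ... -outfmt "6 qseqid sseqid pident staxids" | add_sciname names.dmp`. Taxids missing from names.dmp are reported as N/A rather than dropped, so the number of output rows always matches the number of hits.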
So my question is:
- Is there a way to build local BLAST databases so that 'sscinames' can be used to output the scientific name in blast+ results?
On a side note: if there is a way, it seems odd that NCBI's nt is not built using it. At least that is the case for the version I got from Jul 11 2013.
Thanks in advance,
EDIT: I found I can now use staxids to simplify my life a little. Also added some question formatting and updated nt to the version from Jul 11.
Generally the sequence headers are taken from the FASTA sequences, so if the FASTA header has the information, the BLAST output will display it. makeblastdb is used to create a local database.
Sorry, but I think it is more complicated than that. For example, the taxid won't be parsed from the FASTA header. If you want your locally built BLAST database to have taxid information for each record, you need to provide a file that maps each GI to a taxid. You can do this using -taxid_map. My question is how I can now include scientific names when building a BLAST database, so that I can use the new output format specifier sscinames.
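As a rough illustration of the -taxid_map step only, a map file lists one sequence identifier and its taxid per line. The sketch below assumes the simplest case, where every sequence in the FASTA file belongs to the same organism; the helper name single_taxid_map and all file and database names are placeholders. It sets taxids only and makes no claim about sscinames.

```shell
# single_taxid_map FASTA TAXID
# Prints "sequence_id taxid" for every record in FASTA, using the first
# whitespace-delimited token of each header as the sequence identifier.
# Hypothetical helper: assumes all records share one taxid.
single_taxid_map() {
    fasta=$1
    taxid=$2
    awk -v taxid="$taxid" '/^>/ { print substr($1, 2), taxid }' "$fasta"
}

# Usage (requires BLAST+ on the PATH; names below are placeholders):
#   single_taxid_map my_seqs.fasta 9606 > taxid_map.txt
#   makeblastdb -in my_seqs.fasta -dbtype nucl -parse_seqids \
#       -taxid_map taxid_map.txt -out my_db
#   blastdbcmd -db my_db -entry all -outfmt '%a || %T'
```

If the taxid was stored, the last command should print it in the '%T' column for each record. Databases with mixed organisms would need a map built from whatever per-sequence taxid source is available.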
Since the input has to include the information for it to be available in the BLAST database, I suspect this is one of the cases where you have to build the BLAST database from ASN.1 format data. However, as you have noticed, it appears that the BLAST databases provided by NCBI, at least 'nt' and 'nr', are missing the additional information for '%S' (and '%L').
This could be related to compatibility with the legacy NCBI BLAST programs, it might be a decision made because of the resulting increase in database file size, or it could be that the methods used to create these databases have problems including this information. In any case, it looks like your best bet is to contact the BLAST folks at NCBI (see http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs) and ask which of their databases contain this information and how to create your own databases containing it.