Scientific Names In Blast Output And Databases
8.4 years ago
Carlos Borroto ★ 2.0k

Hi,

I'm interested in getting the scientific names of my blast hits from local runs. I see the blast+ search apps have an -outfmt option which can take sscinames (it seems new in BLAST+ 2.2.28), but even using nt from NCBI (no luck with local databases either) I get N/A for this specifier. Similarly for '%S' in the -outfmt of blastdbcmd.

For example:

$ blastdbcmd -db nt -entry 229577210 -outfmt '%a || %g || %T || %S || %t'
NM_001743.4 || 229577210 || 9606 || N/A || Homo sapiens calmodulin 2 (phosphorylase kinase, delta) (CALM2), mRNA

Until now I've been using taxids in a very convoluted way: I get the GIs from my hits, query the blast db using blastdbcmd to get the taxid, and then query a local copy of the NCBI taxonomy database with bioperl to get the scientific name. Now that blast+ seems to be able to output the scientific name directly, I would like to simplify things. I've already been able to simplify things a little using the also-new output format specifier staxids, so I can now get the taxid directly from the blast output.

So my question is:

• Is there a way to build local blast databases so that 'sscinames' can be used to output the scientific name in blast+ results?

As a side note: if there is a way, it seems odd that NCBI's nt is not built using it. At least that is the case for the version I got from Jul 11 2013.

Thanks in advance,
Carlos

EDIT: I found I can now use staxids to simplify my life a little. Some additional question formatting. NT updated to version from Jul 11.

blast blast+ • 26k views

Generally the sequence headers are taken from the fasta sequences. So if the fasta header has the info then the blast output will display it. makeblastdb is used to create a local database.

Sorry, but I think it is more complicated than that. For example, the taxid won't be parsed from the fasta header. If you want your locally built blast database to have taxid information for each record, you need to provide a gi-to-taxid map file. You can do this using the makeblastdb option -taxid_map. My question is how I can now include scientific names when building a blast database, so I can use the new output format specifier sscinames.
Thanks,
Carlos

Since the input has to include the information for it to be available in the BLAST database, I suspect this is one of the cases where you have to build the BLAST database from ASN.1 format data. However, as you have noticed, it appears that the BLAST databases provided by NCBI, at least 'nt' and 'nr', are missing the additional information for '%S' (and '%L'). This could be related to compatibility with the legacy NCBI BLAST programs, it might be a decision made because of the resulting increase in database file size, or it could be that the methods used to create these databases have problems with including this information. In any case, it looks like your best bet is to contact the BLAST folks at NCBI (see http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs) and see if they can help with further information about which of their databases contain this information, and how to create your own databases containing this data.

8.3 years ago
Carlos Borroto ★ 2.0k

I was indeed able to find my answer in the NCBI BLAST documentation: BLAST Command Line Applications User Manual

Basically, the only taxonomy information stored directly in the BLAST database is the taxid. The rest needs to be pulled from an additional database also provided by NCBI:

ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz

As per the documentation, make sure the taxdb database is in the path defined by the BLASTDB environment variable. After that you will be able to ask for several additional pieces of taxonomic information in the tabular output. For example:

$ blastn -db nt -num_threads 24 -max_target_seqs 1 -outfmt '6 qseqid sseqid evalue bitscore sgi sacc staxids sscinames scomnames stitle' -query 229577210.fasta
gi|229577210|ref|NM_001743.4|   gi|229577210|ref|NM_001743.4|   0.0     2418    229577210       NM_001743       9606    Homo sapiens    human   Homo sapiens calmodulin 2 (phosphorylase kinase, delta) (CALM2), mRNA
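
Since the answer hinges on taxdb being reachable through BLASTDB, a quick sanity check is to look for the two taxdb files on that path. This is only a sketch: `find_taxdb` is a made-up helper name, not part of BLAST+, and it assumes the archive unpacks to `taxdb.bti` and `taxdb.btd` (the two file extensions reported elsewhere in this thread) and that BLASTDB is a colon-separated list of directories.

```shell
#!/usr/bin/env bash
# Sketch only: find_taxdb is a hypothetical helper, not a BLAST+ command.
# It prints the first directory on a colon-separated path list that holds
# both taxdb.bti and taxdb.btd, and returns non-zero if none does.
# With no argument it falls back to the BLASTDB environment variable.
find_taxdb() {
    local path_list="${1:-${BLASTDB:-}}"
    local dir
    local IFS=':'                      # split the list on colons
    for dir in $path_list; do
        [ -n "$dir" ] || continue      # skip empty entries such as a leading ':'
        if [ -f "$dir/taxdb.bti" ] && [ -f "$dir/taxdb.btd" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}
```

Something like `find_taxdb || echo "taxdb not found on BLASTDB"` before a long blastn run would catch a missing or misplaced taxdb early.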


I know this was 22 months ago but hopefully people still look at this. I was wondering the same thing as you, Carlos. I currently have 10,000 sequences that I would like to blast against the nt database. I am only interested in the taxid, so I set up my BLASTDB, downloaded taxdb and put it in the same path.

When I use the following code I only get my queryid

$ blastn -db nt -max_target_seqs 1 -outfmt '6 qseqid staxids' -query blast.fasta -task blastn

Did I forget something? How do I get a taxid output?

I know this was 14 months ago, and I am still looking at this. I've got exactly the same problem as you: I downloaded taxdb.tar.gz and decompressed it into the BLASTDB path, and then I got N/A in the sscinames column. Would you kindly tell me your solution?

Hey @ntmarshall406 and @qingxiangg, did you find a solution to the problem you had? I am facing the same problem and would appreciate it if you could post the solution.

Use 'ssciname' instead of 'sscinames', trimming the trailing 's'. It worked, but I don't know why.

Hi, I was facing the same problem and then I realized that the blast command must be in the same directory as the nr database and the 2 taxdb files (.bti, .btd). It solved the problem. Good luck!

This reply was VERY USEFUL. I would never have figured this out on my own. I had been trying to get blastn working for two days straight until I found your comment hidden below a few answers that I had already read multiple times.

Hi, I am facing the same problem. Did anyone figure it out?
EDITED: It worked. I forgot to provide the input of "-taxid_map tax_id" in the makeblastdb command.

@Carlos, I owe you a beer!

How would one get this to work with -remote? I am interested in submitting my jobs directly to the NCBI servers and using their data to generate the scientific names. However, right now it only returns NA. Any suggestions?

6.4 years ago
tmkurobe ▴ 100

Copied from NCBI BLAST instruction (link)

update_blastdb.pl --decompress nt
update_blastdb.pl taxdb
gunzip -cd taxdb.tar.gz

1. Download preformatted database files from NCBI using "update_blastdb.pl". You cannot use fasta sequences for creating the database files because they don't carry taxonomy ID information.
2. Download the gzipped taxdb archive, and then
3. unzip it.
4. Add the path of the blast database files to BLASTDB:

export BLASTDB=$BLASTDB:/media/Data/path/to/your/database/files/


It worked for me.


tar -xzf taxdb.tar.gz
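
One small wrinkle with the `export BLASTDB=$BLASTDB:/some/dir` pattern used in this answer: if BLASTDB was unset, the result starts with a bare colon, and re-sourcing the profile appends the same directory again. I'm not sure whether BLAST+ minds either of those, so the sketch below simply avoids them. `append_blastdb_dir` is a hypothetical helper name, and `/data/blastdb` in the usage note is a placeholder path.

```shell
#!/usr/bin/env bash
# Sketch only: append_blastdb_dir is a hypothetical helper, not a BLAST+ command.
# It adds one directory to the colon-separated BLASTDB variable, without
# creating an empty leading entry and without adding duplicates.
append_blastdb_dir() {
    local new_dir="$1"
    case ":${BLASTDB:-}:" in
        *":$new_dir:"*) ;;                       # already listed, nothing to do
        "::") BLASTDB="$new_dir" ;;              # BLASTDB was unset or empty
        *) BLASTDB="$BLASTDB:$new_dir" ;;        # append to the existing list
    esac
    export BLASTDB
}
```

In a shell profile this would be called as `append_blastdb_dir /data/blastdb`, once for each directory holding database or taxdb files.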

6 months ago
bdy8 ▴ 70

Hi All

I know this is an older post but for anyone searching and trying to get this to work now I thought I would put how I got all this to work (after much exasperation and my computer nearly going through the window). Please note I downloaded and followed all instructions for installation of BLAST from the NCBI manual.

I was having trouble with this (due to firewalls at my institution), but the method I went with was as follows (this should obviously be done in the BLASTDB directory you have created).

1. Downloading the nt database files and unpacking them.

wget "ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.??.tar.gz" # NB: can take a while.
for file in *.gz # for loop unzips and then removes the zipped file.
do
    tar -zxvpf "$file" && rm "$file" # only remove the archive if extraction succeeded
done

2. As @tmkurobe mentions, then downloading the taxdb database (i.e. files with taxonomy in them). NB: this needs to be in the same folder as your nt database (i.e. BLASTDB).

update_blastdb.pl taxdb
tar -xzf taxdb.tar.gz

3. Updating your path variable (in .zshrc, .bash_profile, whatever you use) with the respective path to your BLASTDB directory, which has all the nt files and the taxdb files.

export BLASTDB=$BLASTDB:/path/to/BLASTDB/directory

4. Restarting the terminal so that the BLASTDB path you have in your .zshrc (or anything else) takes hold and can be used (I'm currently blanking on the unix command to reload it from within the terminal, apologies).

5. Running the blast command. Everything now seems to be working well and taxonomies are being assigned, yay :).

blastn -db /Users/benyoung/blast/BLASTDB/nt \
-query /path/to/query/fasta \
-out /path/to/output/location \
-max_target_seqs 5 \
-max_hsps 5 \
-word_size 60 \
-outfmt "6 qseqid sseqid evalue pident stitle staxids sscinames scomnames sblastnames sskingdoms salltitles stitle"


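
To check whether the taxonomy lookup actually worked after a run like the one above, one option is to count how many rows of the tabular output still show N/A in a name column. A minimal sketch, assuming tab-separated `-outfmt 6` output; `count_na_names` is a made-up helper name, and the column number depends on your own -outfmt string (with the one above, sscinames is column 7).

```shell
#!/usr/bin/env bash
# Sketch only: count_na_names is a hypothetical helper, not a BLAST+ command.
# Usage: count_na_names FILE COLUMN
# Counts rows of a tab-separated BLAST table whose COLUMN is exactly "N/A".
count_na_names() {
    local file="$1" col="$2"
    awk -F '\t' -v c="$col" '
        $c == "N/A" { na++ }
        END { printf "%d of %d rows have N/A in column %d\n", na, NR, c }
    ' "$file"
}
```

If every row comes back as N/A, that points at taxdb not being found at all rather than at individual records lacking a taxid.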
Update 29th May.

After some more work I found this wonderful package from zyxue on github. I am not going to go into detail, but the elevator pitch: you take the accession numbers and an additional taxa file from NCBI (that you need to download) and it will output the full taxonomy in a really nice format. All info on how to run it is on the github (use link below).

https://github.com/zyxue/ncbitax2lin

6.9 years ago

Step-by-step guide to building your taxdb, including a simple (but hacky) way of generating your taxid_map.txt file (gi or accession, and NCBI species ID): http://www.verdantforce.com/2014/12/building-blast-databases-with-taxonomy.html
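
Before handing a taxid_map.txt to makeblastdb -taxid_map, it can save a rebuild to check that every line really is a sequence ID followed by a numeric taxid. The sketch below assumes the two-column, whitespace-separated layout described above; `check_taxid_map` is a hypothetical helper name, not part of BLAST+.

```shell
#!/usr/bin/env bash
# Sketch only: check_taxid_map is a hypothetical helper, not a BLAST+ command.
# Usage: check_taxid_map FILE
# Reports every line that is not "<sequence id> <numeric taxid>" and
# returns non-zero if any such line was found.
check_taxid_map() {
    awk '
        NF != 2 || $2 !~ /^[0-9]+$/ {
            printf "line %d: expected \"<sequence id> <taxid>\", got: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

A typical use would be `check_taxid_map taxid_map.txt && makeblastdb ...` with your usual makeblastdb options, so the database is only built from a clean map.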