Question: Scientific Names In Blast Output And Databases
Carlos Borroto wrote (4.4 years ago):

Hi,

I'm interested in getting the scientific names of hits from BLAST searches run locally. I see that the blast+ search applications have an -outfmt option that accepts sscinames (this seems to be new in BLAST+ 2.2.28), but even when using nt from NCBI (no luck with local databases either) I get N/A for this specifier. The same happens with '%S' in the -outfmt of blastdbcmd.

For example:

$ blastdbcmd -db nt -entry 229577210 -outfmt '%a || %g || %T || %S || %t'
NM_001743.4 || 229577210 || 9606 || N/A || Homo sapiens calmodulin 2 (phosphorylase kinase, delta) (CALM2), mRNA

Until now I've been using taxids in a very convoluted way: I get the GIs from my hits, query the BLAST database with blastdbcmd to get the taxid, and then query a local copy of the NCBI taxonomy database with BioPerl to get the scientific name. Now that blast+ seems able to output the scientific name directly, I would like to simplify things. I can already simplify a little with the new output format specifier staxids, which gives me the taxid directly in the blast output.
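The taxid-to-name step of that workflow can also be sketched in plain shell. This is only an illustration, assuming a local copy of names.dmp from the NCBI taxonomy dump; the function name taxid_to_sciname is hypothetical and not part of BLAST+ or BioPerl.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read taxids on stdin (one per line) and print
# "taxid<TAB>scientific name" for each one found in names.dmp.
# Assumes the NCBI taxdump layout, where fields are separated by <TAB>|<TAB>:
#   tax_id | name_txt | unique name | name class |
taxid_to_sciname() {
    local names_dmp="$1"
    awk '
        BEGIN { FS = "\t[|]\t" }
        # First input (stdin): remember the requested taxids.
        NR == FNR { wanted[$1] = 1; next }
        # Second input (names.dmp): keep only scientific names of requested taxids.
        ($1 in wanted) && $4 ~ /^scientific name/ { print $1 "\t" $2 }
    ' - "$names_dmp"
}
```

A call such as `printf '9606\n' | taxid_to_sciname names.dmp` would look up a single taxid. Results come out in names.dmp order, not in input order.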

So my question is:

  • Is there a way to build local blast databases in a way so 'sscinames' can be used to output the scientific name in blast+ results?

On a side note: if there is a way, it seems odd that NCBI's nt is not built using it. At least that is the case for the version I got from Jul 11 2013.

Thanks in advance,
Carlos
EDIT: I found I can now use staxids to simplify my life a little. Some additional question formatting. NT updated to version from Jul 11.

blast blast+ • 14k views
modified 2.3 years ago by tmkurobe • written 4.4 years ago by Carlos Borroto

Generally the sequence headers are taken from the fasta sequences. So if the fasta header has the info then blast output will display it. makeblastdb is used to create a local database.

modified 4.4 years ago • written 4.4 years ago by Bharat Iyengar

Sorry, but I think it is more complicated than that. For example, the taxid won't be parsed from the FASTA header. If you want your locally built BLAST database to have taxid information for each record, you need to provide a GI-to-taxid map file, which you can do with the makeblastdb option -taxid_map. My question is how I can now include scientific names when building a BLAST database, so that I can use the new output format specifier sscinames.

Thanks, Carlos
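The -taxid_map approach mentioned above could look roughly like this. It is a minimal sketch under stated assumptions: the file names (my_seqs.fasta, taxid_map.txt, my_db), the sequence IDs, and the toy sequences are all made up for illustration, and the map file is assumed to hold one "sequence-id taxid" pair per line.

```shell
#!/usr/bin/env bash
# Build a tiny nucleotide database with per-sequence taxids.
# All names and sequences below are hypothetical examples.
workdir=$(mktemp -d)

cat > "$workdir/my_seqs.fasta" <<'EOF'
>seq1 example record one
ACGTACGTACGTACGTACGT
>seq2 example record two
TTGACCATGGCATTGACCAT
EOF

# One line per sequence: <sequence id> <NCBI taxid>
printf 'seq1 9606\nseq2 10090\n' > "$workdir/taxid_map.txt"

if command -v makeblastdb >/dev/null 2>&1; then
    # -parse_seqids is needed so the IDs in the map can be matched to sequences.
    makeblastdb -in "$workdir/my_seqs.fasta" -dbtype nucl -parse_seqids \
        -taxid_map "$workdir/taxid_map.txt" -out "$workdir/my_db"
else
    echo "makeblastdb not found; skipping database build" >&2
fi
```

This only stores taxids in the database; it does not by itself answer the question of where the scientific names come from.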

written 4.4 years ago by Carlos Borroto

Since the input has to include the information for it to be available in the BLAST database, I suspect this is one of the cases where you have to build the BLAST database from ASN.1 format data. However as you have noticed it appears that the BLAST databases provided by NCBI, at least 'nt' and 'nr' are missing the additional information for '%S' (and '%L').

This could be related to compatibility with the legacy NCBI BLAST programs, it might be a decision made because of the resulting increase in database file size, or it could be that the methods used to create these databases have problems including this information. In any case, it looks like your best bet is to contact the BLAST folks at NCBI (see http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs) and see if they can help with further information about which of their databases contain this information, and how to create your own databases containing this data.

modified 4.3 years ago • written 4.4 years ago by Hamish
Carlos Borroto wrote (4.3 years ago):

I was indeed able to find my answer in the NCBI BLAST documentation:
BLAST Command Line Applications User Manual

Basically the only taxonomy information stored directly in the BLAST database is the taxid. The rest needs to be pulled from an additional database also provided by NCBI:
ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz

As per the documentation, make sure the taxdb database is in a path defined by the BLASTDB environment variable. After that you will be able to request several additional taxonomic fields in the tabular output. For example:

$ blastn -db nt -num_threads 24 -max_target_seqs 1 -outfmt '6 qseqid sseqid evalue bitscore sgi sacc staxids sscinames scomnames stitle' -query 229577210.fasta
gi|229577210|ref|NM_001743.4|   gi|229577210|ref|NM_001743.4|   0.0     2418    229577210       NM_001743       9606    Homo sapiens    human   Homo sapiens calmodulin 2 (phosphorylase kinase, delta) (CALM2), mRNA
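Since several later replies still get N/A, a quick sanity check of the setup may help. The sketch below only inspects the BLASTDB environment variable and looks for the two taxdb files (taxdb.btd and taxdb.bti); the function name find_taxdb is hypothetical, and BLAST+ may consult other locations (for example the current directory) that this check does not cover.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first directory listed in BLASTDB that
# contains both taxdb files, or return non-zero if none does.
# Falls back to the current directory when BLASTDB is unset or empty.
find_taxdb() {
    local dirs="${BLASTDB:-.}" dir
    local IFS=':'                      # BLASTDB is a colon-separated list
    for dir in $dirs; do
        [ -z "$dir" ] && continue      # skip empty entries such as a leading ':'
        if [ -f "$dir/taxdb.btd" ] && [ -f "$dir/taxdb.bti" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

if dir=$(find_taxdb); then
    echo "taxdb found in: $dir"
else
    echo "taxdb.btd/taxdb.bti not found in any BLASTDB directory" >&2
fi
```

If the check fails, extracting taxdb.tar.gz into one of the BLASTDB directories is the first thing to try.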
modified 4.3 years ago • written 4.3 years ago by Carlos Borroto

I know this was 22 months ago, but hopefully people still look at this. I was wondering the same thing as you, Carlos. I currently have 10,000 sequences that I would like to blast against the nt database, and I am only interested in the taxid. I set up my BLASTDB, downloaded taxdb, and put it in the same path.

When I use the following command I only get my query ID:

$ blastn -db nt -max_target_seqs 1 -outfmt '6 qseqid staxids' -query blast.fasta -task blastn

Did I forget something? How do I get the taxid in the output?

written 2.5 years ago by ntmarshall406

I know this was 14 months ago, but I am still looking at this. I have exactly the same problem as you: I downloaded taxdb.tar.gz and decompressed it into the BLASTDB path, but I still get N/A in the sscinames column. Would you kindly share your solution?

written 15 months ago by qingxiangg

Hey @ntmarshall406 and @qingxiangg, did you find a solution to the problem you had? I am facing the same problem and would appreciate it if you could post the solution.

written 8 months ago by vishwaas170420

Use 'ssciname' instead of 'sscinames', dropping the trailing 's'. It worked, but I don't know why.

written 6 months ago by yachenhu10

Hi, I am facing the same problem. Did anyone figure it out?

EDITED:

It worked. I had forgotten to pass "-taxid_map tax_id" to the makeblastdb command.

modified 6 months ago • written 6 months ago by AsoInfo220

Hi, I was facing the same problem and then realized that the blast command must be run from the same directory as the nr database and the two taxdb files (.bti, .btd). That solved the problem. Good luck!

written 10 weeks ago by hodayabeer

@Carlos, I owe you a beer!

written 4 months ago by tlorin210

How would one get this to work with -remote? I am interested in submitting my jobs directly to the NCBI servers and using their data to generate the scientific names. However, right now it only returns NA. Any suggestions?

written 16 days ago by travis.m.couture30
tmkurobe wrote (2.3 years ago):

Copied from the NCBI BLAST instructions (link):

$ update_blastdb.pl --decompress nt

$ update_blastdb.pl taxdb

$ gunzip -cd taxdb.tar.gz | tar xvf -

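  1. Download the preformatted database files from NCBI using update_blastdb.pl. A database built from plain FASTA sequences will not work as-is, because FASTA carries no taxonomy ID information (unless a taxid map is supplied to makeblastdb with -taxid_map, as discussed earlier in this thread).
  2. Download the taxdb archive (taxdb.tar.gz).
  3. Extract it.
  4. Add the directory containing the BLAST database files to BLASTDB: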

export BLASTDB=$BLASTDB:/media/Data/path/to/your/database/files/

It worked for me.
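One small caveat with the export line above: if BLASTDB is not set yet, the resulting value starts with a colon, which leaves an empty first entry. A minimal sketch that avoids this, using a hypothetical helper named add_blastdb_dir (not part of BLAST+):

```shell
#!/usr/bin/env bash
# Hypothetical helper: append a directory to BLASTDB without creating an
# empty leading entry, and without adding the same directory twice.
add_blastdb_dir() {
    local dir="$1"
    case ":${BLASTDB:-}:" in
        *":$dir:"*) ;;                                   # already listed
        *) export BLASTDB="${BLASTDB:+$BLASTDB:}$dir" ;;  # append, colon only if needed
    esac
}
```

It would be called as `add_blastdb_dir /media/Data/path/to/your/database/files` in place of the plain export.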

modified 2.3 years ago • written 2.3 years ago by tmkurobe

FWIW, I had to use

$ tar -xzf taxdb.tar.gz

written 2.2 years ago by cedric.laczny30
ryan.m.harrison wrote (2.9 years ago):

Step-by-step guide to building your taxdb, including a simple (but hacky) way of generating your taxid_map.txt file (GI or accession, and NCBI species ID): http://www.verdantforce.com/2014/12/building-blast-databases-with-taxonomy.html

written 2.9 years ago by ryan.m.harrison