Question: Problems With Blast And Nr Database
Asked 8.7 years ago by Daniel Standage (Davis, California, USA):

I'm familiar with the BLAST family of software: I've used both the old interface (blastall, formatdb, et al) and the new interface (blastx, makeblastdb, et al). However, I've always used it with in-house databases. I've never tried downloading and using NCBI's non-redundant database...which is what I'm trying to do now.

Turns out someone in our lab recently downloaded the nr and nt databases using the update_blastdb.pl script, so that saves me that trouble. However, I am having issues when I try to run BLAST against the database.

I created a FASTA file containing a single query sequence, maybe several hundred bp long. When I run a simple command like one of the two below, it runs with no end in sight (and consumes a lot of RAM too).

$ blastall -p blastx -i test.fasta -d /data/blast/db/nr -m 7
^C
$ blastall -p blastn -i test.fasta -d /data/blast/db/nt -m 7
^C

So I thought, 'OK, maybe I'm supposed to point it at the alias file', and tried the following commands, which end immediately in an error.

$ blastall -p blastx -i test.fasta -d /data/blast/db/nr.pal -m 7
[blastall] FATAL ERROR: AT1G51370.2: Database /data/blast/db/nr.pal was not found or does not exist
$ blastall -p blastn -i test.fasta -d /data/blast/db/nt.pal -m 7
[blastall] FATAL ERROR: AT1G51370.2: Database /data/blast/db/nt.pal was not found or does not exist

I've run fastacmd to make sure the databases are working correctly and I don't see any problems.

$ fastacmd -d /data/blast/db/nr -I
Database: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
excluding environmental samples from WGS projects 
           10,688,764 sequences; 3,647,636,407 total letters

File names:
/data/blast/db/nr.00
   Date: Mar 25, 2010  5:42 PM    Version: 4    Longest sequence: 36,805 res
/data/blast/db/nr.01
   Date: Mar 25, 2010  5:42 PM    Version: 4    Longest sequence: 35,213 res
/data/blast/db/nr.02
   Date: Mar 25, 2010  5:42 PM    Version: 4    Longest sequence: 33,423 res
/data/blast/db/nr.03
   Date: Mar 25, 2010  5:42 PM    Version: 4    Longest sequence: 33,423 res

$ fastacmd -d /data/blast/db/nt -I
Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS,
GSS,environmental samples or phase 0, 1 or 2 HTGS sequences) 
           11,257,610 sequences; 30,637,862,539 total letters

File names:
/data/blast/db/nt.00
   Date: Mar 25, 2010  2:13 PM    Version: 4    Longest sequence: 7,215,267 bp
/data/blast/db/nt.01
   Date: Mar 25, 2010  2:13 PM    Version: 4    Longest sequence: 9,105,828 bp
/data/blast/db/nt.02
   Date: Mar 25, 2010  2:13 PM    Version: 4    Longest sequence: 7,074,893 bp
/data/blast/db/nt.03
   Date: Mar 25, 2010  5:42 PM    Version: 4    Longest sequence: 6,365,727 bp
/data/blast/db/nt.04
   Date: Mar 25, 2010  5:42 PM    Version: 4    Longest sequence: 27,905,053 bp
/data/blast/db/nt.05
   Date: Mar 25, 2010  5:42 PM    Version: 4    Longest sequence: 13,033,779 bp
/data/blast/db/nt.06
   Date: Mar 25, 2010  2:13 PM    Version: 4    Longest sequence: 8,545,929 bp
/data/blast/db/nt.07
   Date: Mar 25, 2010  5:42 PM    Version: 4    Longest sequence: 10,467,782 bp
/data/blast/db/nt.08
   Date: Mar 25, 2010  5:42 PM    Version: 4    Longest sequence: 10,341,314 bp

Any ideas what the issue might be?


Your first commands should be correct. How long have you let them run? Searching the nr/nt databases can take a long time; you should probably try a smaller database first as a proof of concept.

Reply written 8.7 years ago by Michael Schubert
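
The proof-of-concept suggestion above could be sketched roughly as follows. This is only a sketch under stated assumptions: `small.fasta` and `smalldb` are hypothetical names for a small protein FASTA file and the database built from it, and the `run` helper is an illustration that prints each command and executes it only when `DRY_RUN=0` is set. With the legacy toolkit, `formatdb -p T` builds a protein database, and `-d` takes the database base name without any extension, as in the first commands of the question.

```shell
#!/bin/sh
# Dry-run helper (illustrative): print the command; execute only if DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# Hypothetical inputs: a small protein FASTA file and a database base name.
SMALL_FASTA=small.fasta
SMALL_DB=smalldb

# Build a small protein database with the legacy toolkit (-p T = protein).
run formatdb -i "$SMALL_FASTA" -p T -n "$SMALL_DB"

# Search it with the same options used against nr; -d is the base name.
run blastall -p blastx -i test.fasta -d "$SMALL_DB" -m 7
```

If the small search finishes quickly, the installation and command line are probably fine, and the long run against nr is more likely a matter of database size than of a misconfiguration.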

How did your co-worker get update_blastdb.pl working? We are having difficulties: http://www.biostars.org/post/show/50506/what-is-the-best-way-to-download-genbank-locally/#50561

Reply written 7.0 years ago by diltsjeri
Answer written 8.7 years ago by Pawel Szczesny (Poland):

As far as I have noticed, NCBI's databases at their current sizes are hardly compatible with single-core usage. In my recent tests, a BLASTP search of a ~300 aa sequence against the NR database took ca. 10 minutes on a machine with 16 cores and 72 GB of RAM. It's not very informative, but it should at least give you an idea of what BLAST requires with the current databases.

The other issue is that the old C-based NCBI toolkit is significantly slower than the new one, which is written in C++ and referred to as the BLAST+ applications. Make sure you're using the most recent version of BLAST+ (the older ones had some stability problems).

Newest software, lots of RAM and parallelization are probably the cure for your problems.
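
As a rough illustration of that advice, the legacy commands from the question map onto BLAST+ approximately as shown below. This is a sketch under assumptions, not a tuned setup: the default thread count of 8 is an arbitrary placeholder, and `blast_plus_cmd` is a hypothetical helper that only prints the command line so it can be inspected before running. `-outfmt 5` requests XML output (the counterpart of `-m 7` in `blastall`), and `-num_threads` sets the number of CPU threads.

```shell
#!/bin/sh
# Hypothetical helper: print a BLAST+ command line for inspection.
# Args: program, query file, database base name, thread count (default 8).
blast_plus_cmd() {
  prog=$1
  query=$2
  db=$3
  threads=${4:-8}
  echo "$prog -query $query -db $db -outfmt 5 -num_threads $threads"
}

# BLAST+ counterparts of the two blastall commands in the question.
blast_plus_cmd blastx test.fasta /data/blast/db/nr
blast_plus_cmd blastn test.fasta /data/blast/db/nt

# To actually run one (assumes BLAST+ is installed and on PATH):
#   eval "$(blast_plus_cmd blastx test.fasta /data/blast/db/nr)" > result.xml
```

Two caveats: with the legacy toolkit, the comparable option is `blastall -a`, which sets the number of processors; and BLAST+ `blastn` defaults to the megablast task, so adding `-task blastn` is needed to approximate the behavior of `blastall -p blastn`.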


Thanks. I was just surprised a single sequence would take this long based on my previous experience--of course I never worked with databases as big as NR.

Reply written 8.7 years ago by Daniel Standage