downloading (and using) an NCBI BLAST database
2.2 years ago
wiscoyogi ▴ 40

question: how do I convert the downloaded .tar.gz and .tar.gz.md5 files into a database that blastn can work with?

I downloaded the nr database to do some command-line BLAST with the ncbi-blast package, using the following command: perl update_blastdb.pl —decompress nr

The database download completed, but I got an error saying that 'decompress' was not found, and I was left with a lot of tar.gz and tar.gz.md5 files (indexed 0 to 55).

I tried running BLAST with blastn -query dummy.fasta -db $path/to/nr_db -num_threads 8 -out dummy.out and it failed with "BLAST Database error: No alias or index file found for nucleotide database".

Then I tried to uncompress the downloaded database files with gunzip -cd *.tar.gz | tar xvf - . This yielded a log file > 6 GB within 5 minutes of running, so I killed the job.

I'm just trying to run BLAST, and I'm not following how to go from several gzipped tar files to a usable database, since BLAST cannot read the gzipped files directly.

thanks!

blastn blast • 5.9k views

It is odd that you were not able to decompress the downloaded files. Yes, they all need to be decompressed and need to stay in one directory. The nr database is close to 300-400 GB worth of data, so the uncompress job will run for a while and will need enough free space to be available. You should not need to capture logs unless your download has somehow been corrupted and is generating error messages; in that case you will need to re-download the data.
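Before kicking off the uncompress job, it is worth confirming that the space is actually there. A minimal sanity check, run from the download directory:

```shell
# nr unpacks to several hundred GB, so the target filesystem
# needs at least that much free space.
df -h .     # free space on the filesystem you will extract into
du -sh .    # current size of the download directory
```

If the "Avail" column of df is smaller than a few hundred GB, extract onto a different filesystem.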


Thanks for your response. Is there a good way to decompress the downloaded files without using the --decompress flag?

for f in *.tar.gz ; do tar xvzf "$f" ; done

Also, your "decompress not found" error comes from the wrong Unicode character in perl update_blastdb.pl —decompress nr. The — may look like a double dash, but it is an em dash, which some auto-correction (Word, an email client) produces from --. Retype the command by hand with a plain --decompress and it should work.
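One way to see the difference is to look at the raw bytes: the em dash that auto-correction produces is a three-byte UTF-8 sequence, not two hyphens.

```shell
# '—' (em dash, U+2014) encodes to the bytes e2 80 94 in UTF-8,
# while '--' is the two ASCII hyphens 2d 2d.
printf '%s' '—decompress' | od -An -tx1 | head -1
printf '%s' '--decompress' | od -An -tx1 | head -1
```

If the first byte shown is e2 rather than 2d, the dash in your pasted command is the wrong character.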

2.2 years ago
Michael 54k

Hope I am not stating the obvious. I am not sure what you want to achieve, but it is important to check which kind of database you need, to save yourself time and disk space on the download.

Even if you had managed to extract the files, the following command will never work:

blastn -query dummy.fasta -db $path/to/nr_db -num_threads 8 -out dummy.out
("BLAST Database error: No alias or index file found for nucleotide database")

First, nr is a protein database, so it will never work with blastn. You need to use either blastp or blastx, or, if you want a nucleotide db, download nt instead. Second, once downloaded, the databases are simply called nr or nt. So check what you need first.
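As a rule of thumb, the program is determined by the query type and the database type. A tiny illustrative helper (pick_blast is my own name, not part of BLAST+) that encodes the standard pairings:

```shell
# Map query type / database type to the right BLAST program.
# nr is a protein database, nt is a nucleotide database.
pick_blast() {
    case "$1/$2" in              # $1 = query type, $2 = db type ("nucl" or "prot")
        nucl/nucl) echo blastn  ;;   # nucleotide query vs nt
        nucl/prot) echo blastx  ;;   # translated nucleotide query vs nr
        prot/prot) echo blastp  ;;   # protein query vs nr
        prot/nucl) echo tblastn ;;   # protein query vs translated nt
    esac
}
pick_blast nucl prot    # a nucleotide query against nr calls for blastx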


I downloaded the nt database, then submitted a job with -db nt (not the path to nt), and it is now running. Thank you. Not specifying the path of the database and just writing 'nt' was non-obvious (to me at least).

Separate question: do you have recommendations for speeding this up, or for reducing the number of reads per file? This is going to take a while.


If you are already using multiple threads (depending on how many cores you have access to), then you are already going "pedal to the metal" and can't speed this up any further on your hardware. If you are not using multiple cores, then use -num_threads 8 (with a number appropriate for the cores you have).


I agree; there is not much you can do if you really need the blastn-vs-nt strategy. See this discussion: https://biostars.org/p/492476/, especially point 3: reduce the database size.

2.2 years ago
Mensur Dlakic ★ 27k

Presumably you have a bunch of files with names like this:

nr.13.tar.gz

Each file will have a different number (from 0 up to 55 in your case). To unpack one of them:

tar -zxvof nr.13.tar.gz

Repeat that for all the files (a short script will automate it) and you will have the database unpacked. You may want to delete the archives after unpacking, unless you know for sure that disk space is not an issue.
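A sketch of such a script, assuming GNU md5sum is available and you run it inside the download directory. The checksum step uses the .md5 files that came with the archives, so a corrupted download is caught before extraction:

```shell
# Verify each volume against its NCBI-supplied checksum, then unpack it.
unpack_all() {
    for f in *.tar.gz; do
        md5sum -c "$f.md5" || return 1   # stop on a corrupted download
        tar -zxf "$f"                    # extract this volume in place
    done
}
```

After a clean run you can delete the .tar.gz and .md5 files if disk space is tight.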


what about the nr.*.tar.gz.md5 files?


The .md5 files don't need to be unpacked at all. They are checksums, there for file-integrity purposes in case something goes wrong during the download, and they can be deleted after the corresponding .tar.gz files have been verified and decompressed.


Thanks for the additional information! Helpful context.

2.2 years ago

Downloading and maintaining local copies of BLAST databases has a substantial learning curve, as you are discovering. You may wish to try BIRCH, which has a complete set of automated tools for BLAST databases, run through an easy-to-use graphical interface. BIRCH's blastdbkit generates disk-usage reports for BLAST databases; downloads, verifies, and decompresses the archives; and has an update mechanism that only fetches database files newer than those currently installed. Because database files are huge, downloads can fail; simply restart blastdbkit and the download will pick up where it left off.

The video "Installing BLAST databases on your own computer" will take you through the considerations associated with local BLAST databases and show blastdbkit in action.
