Hello, everyone! This is my first post on this blog. I've been attempting to write Python code to download every nt chunk from NCBI (https://ftp.ncbi.nlm.nih.gov/blast/db/) (nt.00.tar.gz, nt.01.tar.gz, etc.) and their md5 files (nt.00.tar.gz.md5, etc.) and extract them all into a database directory called "nt" so I can use this command:
blastn -query sequences.fasta -db nt -task 'blastn' -num_threads 48 -evalue '0.001' -max_target_seqs '1' -outfmt "6 qseqid qlen sseqid slen salltitles pident qcovhsp evalue staxid ssciname sblastname sskingdom staxids" -out hits.txt
However, I'm getting this error:
BLAST Database error: No alias or index file found for nucleotide database [data/vinicius/nt] in search path [/data/vinicius/SRR13426333::]
I'm aware that the update_blastdb.pl script from BLAST already exists to accomplish this, but I'd like to create something similar in Python.
What does this mean? You need to download all nt files. You can't do only some. The preformatted index needs all file pieces to be in the same directory. The nt.nal file defines the alias and file pieces. Is that file in your directory? If you have all pieces downloaded, then simply provide the full path to the folder containing the files in -db /full_path_to/nt.
Isn't already complete? I think the chunking is just to handle database updates.
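The nt.nal question raised above can be checked from Python before running blastn. The following is only a sketch: it assumes the alias file names its volumes on a DBLIST line and that every nucleotide volume ships a .nsq sequence file. The function name missing_volumes and its parameters are hypothetical.

```python
import shlex
from pathlib import Path


def missing_volumes(db_dir, alias_name="nt.nal", required_ext=".nsq"):
    """Return volumes listed in the alias file that have no sequence file in db_dir.

    Assumption: volumes are listed on a line starting with DBLIST, and each
    volume has a file with the extension given in required_ext.
    """
    db_dir = Path(db_dir)
    alias_path = db_dir / alias_name
    if not alias_path.is_file():
        raise FileNotFoundError(f"{alias_path} not found")

    volumes = []
    for line in alias_path.read_text().splitlines():
        line = line.strip()
        if line.startswith("DBLIST"):
            # shlex.split handles both quoted and unquoted volume names
            volumes.extend(shlex.split(line)[1:])

    return [v for v in volumes if not (db_dir / f"{v}{required_ext}").is_file()]
```

An empty list means every volume named in the alias file was found; anything else is a chunk that was not downloaded or did not extract properly.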
That is simply the FASTA version of nt, not the pre-formatted database files, which are the pieces OP was referring to. Sounds like OP wants to run blast searches with nt.
Yep! And updating your database with nt parts is much simpler. You can save time by not downloading the entire database each time.
Yes, I intend to write Python code to handle database updates. Blast already has one, but it was written in Perl. But thank you so much for your advice!
I'm downloading all of the nt files and storing them in a directory called "nt," which also contains nt.nal. My code downloads and extracts all of the files that update_blastdb.pl downloads and extracts. I believe the issue is occurring during the extraction of the nt files. My friend suggested that the Python package I'm using might be corrupting the files being extracted. This was the command:
So, I will try using the subprocess module to call the Linux tar command and see if that works, like this:
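As a rough sketch of that idea (not the snippet from the post), calling the system tar through subprocess could look like the following. It assumes a tar executable is on the PATH; the function name extract_with_tar and the example paths are hypothetical.

```python
import subprocess
from pathlib import Path


def extract_with_tar(archive, dest_dir):
    """Extract a .tar.gz archive into dest_dir using the system tar command."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # -x extract, -z gunzip, -v verbose, -f archive file, -C target directory.
    # check=True raises CalledProcessError if tar exits with a non-zero status.
    subprocess.run(
        ["tar", "-xzvf", str(archive), "-C", str(dest)],
        check=True,
    )


# Hypothetical usage:
# extract_with_tar("/full_path_to/nt/nt.00.tar.gz", "/full_path_to/nt")
```

Because check=True is set, a failed extraction stops the script instead of passing silently, which makes it easier to tell which chunk is the problem.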
I'm also using a path like -db /full_path_to/nt. I don't believe the issue is with the blastn command.
That is possible. Perhaps you are already matching the MD5 sums. If not, do that as well in case the download itself is corrupting the files.
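A minimal sketch of that MD5 check in Python follows. It assumes each .md5 file holds the hex digest as its first whitespace-separated token (the usual md5sum layout); the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(archive, md5_file=None):
    """Compare an archive against its .md5 file (default: <archive>.md5).

    Assumption: the digest is the first token in the .md5 file.
    """
    archive = Path(archive)
    md5_path = Path(md5_file) if md5_file else archive.with_name(archive.name + ".md5")
    expected = md5_path.read_text().split()[0].lower()
    return md5_of_file(archive) == expected
```

Running this on every chunk before extraction separates download problems from extraction problems: if the digest matches, the archive on disk is the one NCBI published.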
Maybe you should try a different tar command? At least run it in verbose mode and use gzip on a single nt.xx.tar.gz file:
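For a Python-side version of the same single-file check, the sketch below reads one archive from start to finish with the standard tarfile module and prints each member, without writing anything to disk. It is a stand-in for a verbose listing, not the command from the reply; the function name check_archive is hypothetical.

```python
import tarfile


def check_archive(archive):
    """Read every member of a .tar.gz end to end and return (name, size) pairs.

    A damaged gzip stream or tar structure raises an exception while reading.
    """
    members = []
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar:
            if member.isfile():
                handle = tar.extractfile(member)
                # Read the whole member so the compressed data is fully decoded
                while handle.read(1024 * 1024):
                    pass
            print(member.name, member.size)
            members.append((member.name, member.size))
    return members
```

If this runs cleanly on a chunk that still fails after extraction, the archive itself is probably fine and the extraction step deserves a closer look.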