7 days ago
Vinícius • 0

Hello, everyone! This is my first post on this blog. I've been attempting to write Python code to download every nt chunk from NCBI (https://ftp.ncbi.nlm.nih.gov/blast/db/) (nt.00.tar.gz, nt.01.tar.gz, etc.) and their md5 files (nt.00.tar.gz.md5, etc.) and extract them all within a database called "nt" so that I can use this command:

blastn -query sequences.fasta -db nt -task 'blastn' -num_threads 48 -evalue '0.001' -max_target_seqs '1' -outfmt "6 qseqid qlen sseqid slen salltitles pident qcovhsp evalue staxid ssciname sblastname sskingdom staxids" -out hits.txt

However, I'm getting this error:

BLAST Database error: No alias or index file found for nucleotide database [data/vinicius/nt] in search path [/data/vinicius/SRR13426333::]

I'm aware that the update_blastdb.pl script that ships with BLAST already does this, but I'd like to create something similar in Python.


extract them all within a database called "nt"

What does this mean? You need to download all of the nt files; you can't use only some of them. The preformatted database needs all of its pieces to be in the same directory. The nt.nal file defines the alias and lists those pieces. Is that file in your directory?

If you have all the pieces downloaded, then simply provide the full path to the folder containing the files, ending with the database name, as in -db /full_path_to/nt.
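A quick way to check the point about nt.nal is to read its DBLIST line and confirm that every listed piece is present. The sketch below is only an illustration: the function name `missing_volumes` is hypothetical, and it assumes that each nucleotide piece ships a sequence file named `<piece>.nsq` and that the alias file lists pieces on a single `DBLIST` line.

```python
import shlex
from pathlib import Path


def missing_volumes(db_dir, db_name="nt"):
    """Return the pieces named in <db_name>.nal that have no .nsq file in db_dir."""
    alias_file = Path(db_dir) / f"{db_name}.nal"
    if not alias_file.is_file():
        raise FileNotFoundError(f"No alias file at {alias_file}")

    volumes = []
    for line in alias_file.read_text().splitlines():
        if line.startswith("DBLIST"):
            # shlex.split copes with both quoted and unquoted piece names.
            volumes = shlex.split(line)[1:]
            break

    # Assumption: each nucleotide piece has a sequence file named <piece>.nsq.
    return [v for v in volumes if not (Path(db_dir) / f"{v}.nsq").is_file()]
```

An empty list from `missing_volumes("/full_path_to")` would suggest that all pieces named in the alias file were extracted into that folder.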


Isn't https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz already complete? I think the chunking is just to handle database updates.


That is simply the FASTA version of nt, not the pre-formatted database files, which are the pieces OP was referring to. It sounds like OP wants to run BLAST searches against nt.


Yep! And updating your database with nt parts is much simpler. You can save time by not downloading the entire database each time.
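One way to avoid re-downloading everything is to compare the digest the server currently publishes for each archive with the .md5 file kept from the previous run. This is a minimal sketch under stated assumptions, not how the Perl script necessarily works: `chunks_needing_download` is a hypothetical helper, `remote_md5` is assumed to be a dict you have already filled from the server's .md5 files, and the local .md5 files are assumed to start with the hex digest.

```python
from pathlib import Path


def chunks_needing_download(remote_md5, local_dir):
    """Return archive names whose stored digest is missing or differs from the remote one.

    remote_md5 maps archive names (e.g. "nt.00.tar.gz") to digests just read
    from the server; local_dir holds the .md5 files saved by the last update.
    """
    stale = []
    for archive_name, remote_digest in sorted(remote_md5.items()):
        local_md5_file = Path(local_dir) / f"{archive_name}.md5"
        if local_md5_file.is_file():
            fields = local_md5_file.read_text().split()
            if fields and fields[0] == remote_digest:
                continue  # unchanged since the last update, skip it
        stale.append(archive_name)
    return stale
```

Only the archives returned here would then need to be fetched and extracted again.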


Yes, I intend to write Python code to handle database updates. Blast already has one, but it was written in Perl. But thank you so much for your advice!


I'm downloading all of the nt files and storing them in a directory called "nt", which also contains nt.nal. My code downloads and extracts all of the files that update_blastdb.pl downloads and extracts. I believe the issue occurs during the extraction of the nt files. My friend suggested that the Python package I'm using might be corrupting the files as they are extracted. This was the code:

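import tarfile

# Fragment from inside a class method; path, directory_path and file_name are defined elsewhere.
if self.database == "nt":
    print("Extracting", file_name)
    # The with block closes the archive even if extraction fails.
    with tarfile.open(path, "r:gz") as tar:
        tar.extractall(path=directory_path)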


So I will try using the subprocess package to call the Linux tar command and see if that works, like this:

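import subprocess

print("Extracting", file_name)
# -C extracts into directory_path rather than the current working directory.
tar = ["tar", "-xf", path, "-C", directory_path]
print("The command used was: " + " ".join(tar))
subprocess.run(tar, check=True)  # check=True raises if tar exits with an error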


I'm also using a path like -db /full_path_to/nt. I don't believe the issue is with the blastn command.


That is possible.

Perhaps you are already verifying the MD5 sums. If not, do that as well, in case the download itself is corrupting the files.
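The MD5 check can be done in Python with the standard library. The sketch below is one possible approach: the helper names are hypothetical, and it assumes each .md5 file begins with the hex digest (the usual md5sum layout of digest followed by file name).

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash the file in blocks so large archives never sit in memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def archive_matches_md5(archive_path, md5_path):
    # Assumption: the first whitespace-separated field of the .md5 file is the digest.
    expected = Path(md5_path).read_text().split()[0].lower()
    return md5_of_file(archive_path) == expected
```

Calling `archive_matches_md5("nt.00.tar.gz", "nt.00.tar.gz.md5")` before extraction would separate a bad download from a bad extraction.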


Maybe you should try a different tar command? At least run it in verbose mode and with gzip decompression on a single nt.xx.tar.gz file:

tar = "tar -xvzf " + path
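A rough Python counterpart to the verbose listing is to ask tarfile for the members of a single archive before extracting anything. This is only a sketch: `list_archive_members` is a hypothetical helper, and a damaged archive may surface as an exception here rather than as a clean listing.

```python
import tarfile


def list_archive_members(archive_path):
    """Return (name, size) for every member of a gzip-compressed tar archive."""
    # getmembers() walks the whole archive, which means the gzip stream
    # is decompressed from start to finish.
    with tarfile.open(archive_path, "r:gz") as tar:
        return [(member.name, member.size) for member in tar.getmembers()]
```

Comparing this listing with the files that end up in the database folder shows whether anything was lost during extraction.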