Question

Downloading NT chunks from NCBI and creating a BLAST database

15 months ago

Vinícius • 0

Hello, everyone! This is my first post on this blog. I've been trying to write Python code that downloads every nt chunk from NCBI (https://ftp.ncbi.nlm.nih.gov/blast/db/), i.e. nt.00.tar.gz, nt.01.tar.gz, etc., along with their md5 files (nt.00.tar.gz.md5, etc.), and extracts them all into a database called "nt" so I can use this command:

blastn -query sequences.fasta -db nt -task 'blastn' -num_threads 48 -evalue '0.001' -max_target_seqs '1' -outfmt "6 qseqid qlen sseqid slen salltitles pident qcovhsp evalue staxid ssciname sblastname sskingdom staxids" -out hits.txt

However, I'm getting this error:

BLAST Database error: No alias or index file found for nucleotide database [data/vinicius/nt] in search path [/data/vinicius/SRR13426333::]

I'm aware that the update_blastdb.pl script that ships with BLAST already does this, but I'd like to create something similar in Python.
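A minimal sketch of the "find every nt chunk" step, for reference. It assumes the directory listing served at the URL above contains file names of the form nt.<digits>.tar.gz; the function name list_nt_chunks is hypothetical and not part of any BLAST tool.

```python
import re


def list_nt_chunks(listing_html):
    """Return the sorted, de-duplicated nt archive names found in a listing."""
    # \b keeps names such as env_nt.00.tar.gz out; the lookahead skips .md5 files
    names = re.findall(r"\bnt\.\d+\.tar\.gz(?!\.md5)", listing_html)
    return sorted(set(names))
```

The listing text itself could be fetched with urllib.request.urlopen(url).read().decode(), and each archive name then has a matching .md5 file next to it.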

nucleotide blast nt blastn database_update.pl • 1.4k views

ADD COMMENT • link updated 15 months ago by lennykovac ▴ 90 • written 15 months ago by Vinícius • 0

extract them all within a database called "nt"

What does this mean? You need to download all of the nt files; a subset will not work. The preformatted index needs all of the file pieces to be in the same directory. The nt.nal file defines the alias and lists the file pieces. Is that file in your directory?

If you have all of the pieces downloaded, simply provide the full path to the folder containing the files, as in -db /full_path_to/nt (the trailing nt is the database name shared by the files, not a directory).
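A rough sketch of that check in Python. It assumes nt.nal contains a DBLIST line naming the volumes (quoted or unquoted) and that each nucleotide volume has a .nsq file; the function name missing_volumes is hypothetical.

```python
import shlex
from pathlib import Path


def missing_volumes(db_dir, alias_name="nt.nal"):
    """Return the volumes named in the alias file that have no .nsq file."""
    db_dir = Path(db_dir)
    volumes = []
    # Raises FileNotFoundError if the alias file itself is absent
    for line in (db_dir / alias_name).read_text().splitlines():
        if line.startswith("DBLIST"):
            # shlex.split handles both quoted and unquoted volume names
            volumes.extend(shlex.split(line)[1:])
    return [v for v in volumes if not (db_dir / (v + ".nsq")).exists()]
```

An empty list means every volume listed in the alias file is present in that directory.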

ADD REPLY • link 15 months ago by GenoMax 141k

Isn't

https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz

already complete? I think the chunking is just there to handle database updates.

ADD REPLY • link 15 months ago by LChart 3.9k

That is simply the FASTA version of nt, not the pre-formatted database files, which are the pieces OP was referring to. It sounds like OP wants to run BLAST searches against nt.

ADD REPLY • link 15 months ago by GenoMax 141k

Yep! And updating your database with nt parts is much simpler. You can save time by not downloading the entire database each time.
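One way to decide which parts to fetch again, as a sketch only. It assumes each chunk's .md5 file is kept next to the chunk after download and that the checksum is the first whitespace-separated field of that file; needs_download is a hypothetical helper, not how the Perl script is known to work.

```python
from pathlib import Path


def needs_download(local_md5_path, remote_md5_text):
    """True if no .md5 is stored locally or its checksum differs from the remote one."""
    local = Path(local_md5_path)
    if not local.exists():
        return True
    # Compare only the hex digest, ignoring the file name that follows it
    return local.read_text().split()[0] != remote_md5_text.split()[0]
```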

ADD REPLY • link 15 months ago by Vinícius • 0

Yes, I intend to write Python code to handle database updates. Blast already has one, but it was written in Perl. But thank you so much for your advice!

ADD REPLY • link 15 months ago by Vinícius • 0

I'm downloading all of the nt files and storing them in a directory called "nt", which also contains nt.nal. My code downloads and extracts all of the files that update_blastdb.pl downloads and extracts. I believe the issue occurs during the extraction of the nt files. My friend suggested that the Python package I'm using might be corrupting the files being extracted. This was the code:

import tarfile

if self.database == "nt":
    print("Extracting", file_name)
    # The context manager closes the archive even if extraction fails
    with tarfile.open(path, "r:gz") as tar:
        tar.extractall(path=directory_path)

So I will try using the subprocess module to call the Linux tar command and see if that works, like this:

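import subprocess

print("Extracting", file_name)
# Pass the arguments as a list (no shell) and extract into directory_path,
# not into whatever the current working directory happens to be
tar = ["tar", "-xf", path, "-C", directory_path]
print("The command used was:", " ".join(tar))
subprocess.run(tar, check=True)  # raises CalledProcessError if tar fails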

I'm also using a path like -db /full_path_to/nt. I don't believe the issue is with the blastn command.

ADD REPLY • link 15 months ago by Vinícius • 0

That is possible.

Perhaps you are already matching the MD5 sums. If not, do that as well, in case the download itself is corrupting the files.
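A minimal sketch of that MD5 check in Python, assuming each .md5 file starts with the hex digest (as md5sum output does); the function name md5_matches is hypothetical.

```python
import hashlib


def md5_matches(archive_path, md5_path, chunk_size=1 << 20):
    """Compare the MD5 of archive_path with the digest stored in md5_path."""
    digest = hashlib.md5()
    with open(archive_path, "rb") as fh:
        # Read in blocks so large archives are never loaded into memory whole
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    with open(md5_path) as fh:
        expected = fh.read().split()[0]
    return digest.hexdigest() == expected.lower()
```

Running this before extraction separates a bad download from a bad extraction: if the digests match, the archive on disk is the one NCBI published.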

ADD REPLY • link 15 months ago by GenoMax 141k

Maybe you should try a different tar command? At least run it in verbose mode and with gzip decompression (-z) on a single nt.xx.tar.gz file:

tar = "tar -xvzf " + path

ADD REPLY • link 15 months ago by lennykovac ▴ 90