BLAST nr download and build - "nr.gz: invalid compressed data--format violated"
2.2 years ago

I am trying to download and build the NCBI BLAST nr database, but I keep running into the same problem. The download appears to succeed: I receive no error messages and the downloaded file has the expected size. Still, when I try to use the database in DIAMOND, I get the following error message:

Error: Error reading input stream at line 360378522: Invalid character (>) in sequence

Printing the lines around that position with the following command suggests that something is wrong with the fasta.gz file itself:

zcat nr.gz | sed -n 360378500,360378540p

The problematic part looks like this:

>MBY6275551.1 hypothetical protein [Symbiobacterium thermophilum]
MRWSDVPVENKAQIAWGALVVVYLLGAQRPELIRPPVVAVFLLSAAAAFLELWLRSRGWPHLLWYDCIVWSALLTGMVVV
TGGRGSEVWAAYILMSLTAPVVLRRVAPYILLGVNVTAYGLIYLLYNPFGAPLDWGLLFLRIGTIFLVAYVVDRSTARER
QSHARAVALARSRVSELVQARDAERRRIAHDIHDWLGTGIIAPLRRLEIAARQSDVESCRRHVEEAADSLRRAHAELRRL
MENLHPHLLEQMGLAEALRAYLTDWGEEHGVAVHYHLTPGPEPPADAALALYRIQQEALNNCAKHADASQVWVTLELGAN
QVRLTLRDDGHGRPGRPGRGlacgTonARAVATIDPDPACSARQVGRWFYRAGKTGWQPLTRTSAGGCRDVHRVEPDFPEA
LCATAVRDILuFPkedLAELRRL
RQVGR>WJoIST TCP 71-13kei]NVFSIVGRWLPKQLGLKLAHCYEKNVEF guGIPH
MDCKCVa marETGNQGNSQGCEKV]OVseudal TNNRRPS
>GLDEF resYPKEGoPGTRLQVAAKATSYYPNQEMNILYSGGLDEERFPWTLNERVANDMRRmin1PNRAGRNRVANVVDAQLEETTPNISTAE

After this, I get the following error message:

gzip: nr.gz: invalid compressed data--format violated

This is the same error message I get when I run

gunzip -t nr.gz

Interestingly, deleting and re-downloading the database did not solve the problem, and the position of the first invalid character varied between runs, so the problem does not seem to be with the original file on the server. So far I have tested three FTP download methods (FileZilla, wget and aria2) and got the same result with all three. What am I doing wrong? What other methods can I try?

The commands I used to get the databases:

wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
aria2c --continue=true --max-connection-per-server=4 --min-split-size=1M ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz

For FileZilla, I simply navigated to the FTP site "ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz" and downloaded the nr.gz file.

2.1 years ago

FOLLOW UP:

I finally solved this problem by downloading the file through FileZilla with "Transfer type" set to "Binary" instead of "Automatic".


I had a similar issue with a file downloaded through FileZilla. Your solution helped me out! Thank you very much!

2.2 years ago
Mensur Dlakic ★ 27k

Seems like the file is corrupted, even though all your download procedures appear to be done properly. I suggest you fetch the matching MD5 file from the same directory, and verify the integrity of your download with md5sum.

https://www.hiroom2.com/2017/05/08/linux-compare-md5sum-checksum/
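
A minimal sketch of that check, assuming the checksum file sits next to the download and uses the usual `md5sum` layout (digest first, then the file name). The helper name `verify_md5` is made up for illustration, and `nr.gz.md5` is an assumed file name; check the directory listing for the actual one.

```shell
#!/bin/sh
# verify_md5 FILE MD5FILE   (hypothetical helper, illustration only)
# Succeeds when FILE's MD5 digest equals the first field of MD5FILE.
verify_md5() {
    file=$1
    md5file=$2
    # Digest published alongside the download (first whitespace-separated field).
    expected=$(awk '{print $1; exit}' "$md5file")
    # Digest of the local copy.
    actual=$(md5sum "$file" | awk '{print $1}')
    if [ "$expected" = "$actual" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (expected $expected, got $actual)" >&2
        return 1
    fi
}

# Usage, assuming both files are in the current directory
# (nr.gz.md5 is an assumed name for the matching checksum file):
#   verify_md5 nr.gz nr.gz.md5
```

If the checksum file lists the same file name as the local copy, `md5sum -c nr.gz.md5` performs the same comparison in one step. A mismatch means the local copy differs from what the server published, so the file should be downloaded again before building the database.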
