I am trying to download and build the NCBI BLAST nr database, but I keep running into the same problem. The download appears to succeed: I don't receive any error messages, and the size of the downloaded file is appropriate. Still, when I try to use the database in DIAMOND, I get the following error message:
Error: Error reading input stream at line 360378522: Invalid character (>) in sequence
Inspecting the lines around that position with the following command suggests that something is wrong with the gzipped FASTA file (nr.gz):
zcat nr.gz | sed -n 360378500,360378540p
The problematic part looks like this:
>MBY6275551.1 hypothetical protein [Symbiobacterium thermophilum]
MRWSDVPVENKAQIAWGALVVVYLLGAQRPELIRPPVVAVFLLSAAAAFLELWLRSRGWPHLLWYDCIVWSALLTGMVVV
TGGRGSEVWAAYILMSLTAPVVLRRVAPYILLGVNVTAYGLIYLLYNPFGAPLDWGLLFLRIGTIFLVAYVVDRSTARER
QSHARAVALARSRVSELVQARDAERRRIAHDIHDWLGTGIIAPLRRLEIAARQSDVESCRRHVEEAADSLRRAHAELRRL
MENLHPHLLEQMGLAEALRAYLTDWGEEHGVAVHYHLTPGPEPPADAALALYRIQQEALNNCAKHADASQVWVTLELGAN
QVRLTLRDDGHGRPGRPGRGlacgTonARAVATIDPDPACSARQVGRWFYRAGKTGWQPLTRTSAGGCRDVHRVEPDFPEA
LCATAVRDILuFPkedLAELRRL
RQVGR>WJoIST TCP 71-13kei]NVFSIVGRWLPKQLGLKLAHCYEKNVEF guGIPH
MDCKCVa marETGNQGNSQGCEKV]OVseudal TNNRRPS
>GLDEF resYPKEGoPGTRLQVAAKATSYYPNQEMNILYSGGLDEERFPWTLNERVANDMRRmin1PNRAGRNRVANVVDAQLEETTPNISTAE
After this, I get the following error message:
gzip: nr.gz: invalid compressed data--format violated
This is the same error message I get when I run
gunzip -t nr.gz
Interestingly, deleting and re-downloading the database did not solve the problem, and the position of the first invalid character varied between runs, so the problem doesn't seem to be with the original file. I have tested three FTP download methods so far (FileZilla, wget and aria2) and got the same end result with all three. What am I doing wrong? What other methods can I try?
The commands I used to get the databases:
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
aria2c --continue=true --max-connection-per-server=4 --min-split-size=1M ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
For FileZilla, I simply navigated to the FTP site "ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz" and downloaded the nr.gz file.
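Since the corruption moves between runs, one way to tell a damaged transfer from a damaged source file would be to compare each download against a published checksum. The following is only a minimal sketch: it assumes an MD5 checksum file (called nr.gz.md5 here, which is an assumption rather than something confirmed above) is available next to the archive, that md5sum from GNU coreutils is installed, and verify_md5 is a hypothetical helper name.

```shell
# Hypothetical helper: compare a file's MD5 digest with the first field
# of a checksum file in the usual "<digest>  <filename>" layout.
# Prints OK on a match; prints MISMATCH and returns 1 otherwise.
verify_md5() {
    file=$1
    sumfile=$2
    # Expected digest: first whitespace-separated field of the first line.
    expected=$(awk '{print $1; exit}' "$sumfile")
    # Actual digest of the downloaded file.
    actual=$(md5sum "$file" | awk '{print $1}')
    if [ "$expected" = "$actual" ]; then
        echo "OK"
    else
        echo "MISMATCH"
        return 1
    fi
}

# Example call (assumes both files are in the current directory):
# verify_md5 nr.gz nr.gz.md5
```

If the digest differs from download to download, the damage is most likely happening in transit or on the local disk rather than on the server.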
I had a similar issue with the file downloaded via FileZilla. Your solution helped me out! Thank you very much!