NCBI NR protein db nr.gz FASTA inflate error?
12 weeks ago
nina.maryn ▴ 20

Hello all, I'm trying to download the nr.gz FASTA file from NCBI and build a DIAMOND database from it. I originally used wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz

to download the nr.gz file. The download seemed to work, but when I run $ diamond makedb --in nr.gz -d nr

I get the following error:

#CPU threads: 64
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Database input file: /global/scratch/users/*****/*****/nr.gz
Opening the database file...  [0.028s]
Loading sequences...  [1.93s]
Error: Inflate error.

I then ran $ fixgz nr.gz nr.fixed.gz, ran diamond makedb again on the result, and got the same error:

#CPU threads: 64
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Database input file: /global/scratch/users/*****/*****/nr.fixed.gz
Opening the database file...  [0.031s]
Loading sequences...  [0.118s]
Error: Inflate error.

I've also tried to gunzip nr.gz and nr.fixed.gz, and I get gzip: nr.fixed.gz: invalid compressed data--format violated

How do I successfully download the nr.gz file? It's huge and it sounds like ftp is often unstable, so the file gets corrupted? I've tried doing it multiple times with the same result. Is there an older version of nr.gz I could use?

nr.gz alignment Megan6 db DIAMOND NCBI • 1.0k views
ADD COMMENT

Before you do anything, the integrity of your file can be tested using the -t switch:

gunzip -t nr.gz

Beware that it will take a long time. If the file is corrupted, I suggest you try downloading it with aria2. In my hands it is much faster than wget because it uses multiple connections, and also has the ability to restart so there should be no issues with corruption.
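
A minimal sketch of that integrity check, wrapped in a function so it can gate the database build. The function name check_gz is hypothetical; the only tool it relies on is gzip -t, which decompresses the stream without writing any output and fails if the data or the trailing checksum is damaged.

```shell
# check_gz FILE: report whether FILE passes gzip's built-in integrity test.
# "gzip -t" reads and decompresses the whole stream but writes nothing to
# disk, so it is safe to run on a very large archive (it is just slow).
check_gz() {
    if gzip -t "$1" 2>/dev/null; then
        echo "OK: $1"
    else
        echo "CORRUPT: $1"
        return 1
    fi
}

# Usage (nr.gz is the file from this thread; expect a long run time):
#   check_gz nr.gz && diamond makedb --in nr.gz -d nr
```

Chaining with && means diamond makedb only starts when the archive tests clean, which avoids finding the corruption partway through loading sequences.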

ADD REPLY

Thanks! I'm downloading aria2 now. What options have you used in the past? I'm checking out the documentation, but in case you already have a line of code that would be helpful :)

ADD REPLY

That error suggests that your file is likely corrupt. Try @mensur's suggestion to confirm file integrity. It looks like you are using a central compute resource, so the download should not run into problems. NCBI FTP is not unstable; if anything, it may be your local firewall that is causing the problem.

I just tested a fresh download of nr.gz and diamond started making the indexes without any error so the file at NCBI seems to be fine. Be sure to allocate enough RAM for this task if you are using a cluster.

ADD REPLY

How long did it take you to download the gz file? I'm trying aria2 now and it's running, but I'm curious how long I should expect it to take (it was nearly 10 hours using wget).

ADD REPLY

It was under 30 min using wget.

ADD REPLY

So I'm running $ aria2c -x16 -k1M "ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz" and it's been running for over an hour. The size of nr.gz is 117317190541 bytes.

The output says it's only at like 30%. Am I doing something wrong?

ADD REPLY

Probably. Again this could be due to multiple factors.

  1. As Mensur Dlakic said, NCBI may be limiting bandwidth for your download. It looks like you are using 16 connections; you should have used fewer.
  2. Your local cluster admins may be limiting bandwidth to ensure that you don't saturate the external cluster network connection.
  3. It could be something upstream e.g. at the core router of your institution where they may be limiting bandwidth used for specific processes.

If you are at 30% now then another 2.5h should see the download complete.
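
The estimate above is a simple linear extrapolation, which can be sketched as follows. The function name remaining_hours is hypothetical, and the 1 hour of elapsed time is an assumption (the thread only says "over an hour"), so treat the result as a lower bound; it also assumes the transfer rate stays roughly constant.

```shell
# remaining_hours PERCENT_DONE ELAPSED_HOURS
# Linear extrapolation: remaining = elapsed * (100 - done) / done.
remaining_hours() {
    awk -v p="$1" -v t="$2" 'BEGIN { printf "%.2f\n", t * (100 - p) / p }'
}

# 30% complete after an assumed 1 hour of downloading.
remaining_hours 30 1   # prints 2.33
```

With slightly more than one hour elapsed, the same formula lands close to the 2.5 h quoted above.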

ADD REPLY

The command I use with aria2:

aria2c --continue=true --max-connection-per-server=4 --min-split-size=1M ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz

It took ~26 minutes (~66 MB/s). You may be tempted to use more than 4 connections, but NCBI may not like that and could throttle your IP address.

ADD REPLY

Hm, it says that my speed is 14MiB/s on average. Is there a way to speed it up? I tried using a 60M minimum speed and it crashed.

ADD REPLY

I have already answered both of your last two questions; please read what I wrote. Sometimes less is more, so your 16 connections are probably causing NCBI to slow down your download. See also GenoMax's reply, which lists other potential factors that may be specific to your internet connection.

Now, if it took you 10+ hours last time and it will likely be less than 5 with aria2, that's still a significant speed-up. Sometimes we just need to accept things as they are.

ADD REPLY

I did read your answers. I'm not asking because I'm concerned about time; I'm concerned that the slow download is connected to why my file is being corrupted. After the download completed, the file was still corrupted: gunzip -t nr.gz returned a formatting error.

ADD REPLY

With all due respect, you specifically asked "Is there a way to speed it up?", which seems more of a concern about speed than file corruption.

Both GenoMax and I downloaded today's copy of nr.gz without any issues, so I don't think anything is wrong with the file. That leaves software on your side (what is your gzip version? mine is 1.6), the integrity of your hard disk, or something with your internet connection as has already been pointed out. Maybe consulting your local admins will help you troubleshoot it.

ADD REPLY
12 weeks ago
jwojwo ▴ 20

I was having a similar issue, but with the nt database.

It seems to be an FTP bug, as stated here https://github.com/sherrillmix/taxonomizr/issues/27 and here https://bugs.launchpad.net/ubuntu/+source/wget/+bug/1921064.

I can confirm that replacing FTP with HTTP in the URL solves the issue (either using wget or aria2). (e.g., wget http://ftp.ncbi.nlm.nih.gov/blast/db/nt.00.tar.gz)

wget http://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz

should work for you.
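
If the FTP URLs are already baked into a script, the swap described above can be done mechanically. This is a small sketch; the helper name ftp_to_http is hypothetical, and it only rewrites the URL scheme, on the assumption (reported in this answer) that the same path is served over HTTP.

```shell
# ftp_to_http URL: print URL with a leading "ftp://" replaced by "http://".
# URLs that do not start with "ftp://" are printed unchanged.
ftp_to_http() {
    case "$1" in
        ftp://*) printf 'http://%s\n' "${1#ftp://}" ;;
        *)       printf '%s\n' "$1" ;;
    esac
}

url=$(ftp_to_http "ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz")
echo "$url"
# Then download as in the answer above, e.g.:
#   wget --continue "$url"
```

The --continue flag lets wget pick up a partial file after an interruption instead of starting over.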

ADD COMMENT
12 weeks ago
buchfink ▴ 200

Another solution: use update_blastdb.pl to download the database in BLAST format. This script downloads the database in chunks and verifies the MD5 hash of each file. You can then use blastdbcmd to convert it to FASTA format, or use the latest Diamond version, which directly supports BLAST databases.
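
The same kind of MD5 check can be run by hand on a single downloaded file. The sketch below is not what update_blastdb.pl does internally; it is a generic check that assumes GNU md5sum is installed and that a checksum file in md5sum format is available (for example an nr.gz.md5 next to nr.gz, which is an assumption to confirm in the server's directory listing). The function name verify_md5 is hypothetical.

```shell
# verify_md5 CHECKSUM_FILE: check the files listed in CHECKSUM_FILE, which
# must be in md5sum format ("<hash>  <filename>"). File names are resolved
# relative to the directory that holds the checksum file.
# Returns 0 if every listed file matches, non-zero otherwise.
verify_md5() {
    ( cd "$(dirname "$1")" && md5sum -c "$(basename "$1")" >/dev/null 2>&1 )
}

# Usage, assuming nr.gz and a matching nr.gz.md5 sit in the same directory:
#   verify_md5 nr.gz.md5 && echo "nr.gz matches its checksum"
```

A checksum mismatch points at the transfer or the disk rather than at gzip or DIAMOND, which narrows down where to look.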

ADD COMMENT

"latest Diamond version which directly supports BLAST databases"

buchfink: Can this be prominently noted on the GitHub main page? I thought I remembered this being the case, but it was not on the landing page, so I did not recommend it to the OP.

ADD REPLY
