Question

cannot download refseq genomes

1

3.2 years ago

sapuizait ▴ 10

Dear all

I have been trying to download all complete bacterial genomes (specifically their .faa amino acid sequences) from RefSeq in order to create a DIAMOND database. However, only a very small portion of them download successfully!

What I do is:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
awk -F '\t' '{if($12=="Complete Genome") print $20}' assembly_summary.txt > assembly_summary_complete_genomes.txt
awk 'BEGIN{FS=OFS="/";filesuffix="protein.faa.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}' assembly_summary_complete_genomes.txt > ftpfilepaths
cat ftpfilepaths | parallel -j 20 --verbose --progress "curl -O {}"
gunzip *gz

And here comes the issue! Out of the approximately 20,000+ protein.faa.gz files, only ca. 200-400 can be extracted properly; for the rest I get an "invalid compressed data--format violated" error.

If I make a new ftpfile containing only the "corrupted" gz files and re-download them, everything downloads again, but only another 300-400 gz files decompress successfully while the rest are still "invalid compressed data".

If I download the files that failed to decompress directly with curl, they download and decompress fine. So is the problem the use of parallel, or downloading all of them together?

I remember getting a similar error in the past but it was only for a very small portion of the data, so I am wondering what is going on?

I am also troubled by the fact that I cannot find any threads online describing a similar issue. Am I the only one getting this? Am I doing something fundamentally wrong?

Thanks

P

refseq • 1.6k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 3.2 years ago by sapuizait ▴ 10

score 2 · Answer 1 · 2021-02-27

3.2 years ago

GenoMax 142k

This is likely being caused by your use of parallel. Your IP may be getting flagged for trying to download multiple streams of data from NCBI, and/or your network connection may simply be getting saturated, leading to data corruption.

Please use a proper tool like datasets (LINK) to download the data. ~~parallel has its uses but don't use it with a shared public resource.~~
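On the data-corruption point: one way to confirm whether a given download is intact is to compare it against the md5checksums.txt file that NCBI places in each assembly directory. A minimal sketch, assuming md5sum (GNU coreutils) is available and that checksum lines look like `<md5>  ./<filename>`; verify_md5 is a hypothetical helper name, not part of any NCBI tool:

```shell
# Hypothetical helper: succeed only if FILE's MD5 matches its entry in SUMS.
# Assumption: SUMS has two whitespace-separated columns, "<md5>  ./<filename>".
verify_md5() {
    file="$1"
    sums="$2"
    name="${file##*/}"                       # strip any leading directory
    # Look up the expected checksum, accepting "./name" or plain "name".
    expected=$(awk -v n="$name" '$2 == n || $2 == "./" n { print $1; exit }' "$sums")
    actual=$(md5sum "$file" | awk '{ print $1 }')
    # Fail if the file is not listed at all, or if the checksums differ.
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Usage (both files fetched from the same assembly directory):
#   verify_md5 GCF_xxx_protein.faa.gz md5checksums.txt || echo "corrupt"
```

This costs one extra small download per assembly, so it probably makes most sense for re-checking files that already look suspicious.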

0

Thanks, will give datasets a try. However, out of curiosity: I have used parallel in this context plenty of times in the past, I mean A LOT, and I never had that problem. So, has something changed recently?

ADD REPLY • link 3.2 years ago by sapuizait ▴ 10

0

Have you tried to reduce the number of parallel jobs to see if that results in successful downloads? Looks like you are using 20 at the moment.
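A rough sketch of what a lower job count could look like, with each file checked right after download so that failures are easy to collect and retry. fetch_verified is a hypothetical helper name, and the job count of 4 is an arbitrary assumption, not a limit documented by NCBI:

```shell
# Hypothetical helper: download one URL into the current directory, then test
# the gzip stream. On any failure the partial file is removed and the URL is
# reported on stderr so it can be retried later.
fetch_verified() {
    url="$1"
    file="${url##*/}"                 # e.g. GCA_..._protein.faa.gz
    # -f: treat server errors as failures, -sS: quiet but still show errors,
    # --retry 3: let curl retry transient failures itself.
    if curl -fsS --retry 3 -o "$file" "$url" && gzip -t "$file" 2>/dev/null; then
        return 0
    fi
    rm -f "$file"
    echo "FAILED $url" >&2
    return 1
}

# Usage in bash (export -f makes the function visible to GNU parallel):
#   export -f fetch_verified
#   parallel -j 4 fetch_verified {} :::: ftpfilepaths 2> failed.log
```

Anything that ends up in failed.log can be fed through the same command again, ideally with an even lower -j.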

There could be many reasons why this is no longer working (since it did in the past). My speculative list, in no particular order:

NCBI is now enforcing bandwidth restrictions on specific IPs.
A lot of NCBI infrastructure has been moving to the cloud, so there may be a performance hit or bandwidth limit at NCBI's end. This feels unlikely, though.
Something at your local end is causing a network bandwidth issue and/or a problem with I/O writes to disk, leading to data corruption.

I don't know how often the list of assemblies changes (weekly?) but perhaps you can just download the entries that changed instead of getting the entire set each time?
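As a sketch of the "only download what changed" idea: keep the assembly_summary.txt from the previous run and compare it with a fresh copy. This assumes the same column layout used in the question (column 1 accession, column 12 assembly level, column 20 FTP path); new_complete_paths and the file names are hypothetical:

```shell
# Hypothetical helper: print the FTP path (column 20) of every "Complete Genome"
# entry whose accession (column 1) is in the NEW summary but not in the OLD one.
# Accessions include the version suffix, so updated assemblies count as new.
new_complete_paths() {
    awk -F '\t' '
        $12 != "Complete Genome" { next }              # also skips "#" header lines
        FILENAME == ARGV[1]      { seen[$1]; next }    # remember old accessions
        !($1 in seen)            { print $20 }         # new summary: unseen only
    ' "$1" "$2"
}

# Usage:
#   mv assembly_summary.txt assembly_summary.old.txt
#   wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
#   new_complete_paths assembly_summary.old.txt assembly_summary.txt > new_paths.txt
```

The output has the same form as assembly_summary_complete_genomes.txt in the question, so the rest of that pipeline can be reused on the much shorter list.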

ADD REPLY • link 3.2 years ago by GenoMax 142k

0

Fewer jobs in parallel did the trick, only 1% failed. I have started using datasets as well, thanks!

ADD REPLY • link 3.2 years ago by sapuizait ▴ 10

score 1 · Answer 2 · 2021-02-27

Genome updater:

https://github.com/pirovc/genome_updater

This command will download all complete bacterial genomes from RefSeq:

genome_updater.sh -d "refseq" -g "bacteria" -c "all" -l "Complete Genome" -f "genomic.fna.gz" -o genomes/bacteria -t 10 -u -m -a

For proteomes:

genome_updater.sh -d "refseq" -g "bacteria" -c "all" -l "Complete Genome" -f "protein.faa.gz" -o proteomes/bacteria -t 10 -u -m -a

Beware that this will take at least half a day, possibly longer. As you were told already, a larger number of threads may prompt NCBI to slow down the transfer for your IP address (5-10 works fine for me).