6 months ago
sapuizait ▴ 10

Dear all

I have been trying to download all complete bacterial genomes (specifically their protein .faa sequences) from RefSeq in order to create a DIAMOND database. However, I can only successfully download a very small portion of them!

What I do is:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt

awk -F '\t' '{if($12=="Complete Genome") print$20}' assembly_summary.txt > assembly_summary_complete_genomes.txt

awk 'BEGIN{FS=OFS="/";filesuffix="protein.faa.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}' assembly_summary_complete_genomes.txt > ftpfilepaths

cat ftpfilepaths | parallel -j 20 --verbose --progress "curl -O {}"

gunzip *gz

And here comes the issue! Out of the 20,000+ protein.faa.gz files, only about 200-400 can be extracted properly; for the rest I get an "invalid compressed data--format violated" error.

If I make a new ftpfilepaths list containing only the "corrupted" gz files and re-download them, everything downloads again, but only another 300-400 files decompress successfully while the rest are again "invalid compressed data".

If I download the files that failed to decompress with plain curl, they download and decompress fine. So is the problem the use of parallel, or the fact that I am trying to download all of them together?

I remember getting a similar error in the past but it was only for a very small portion of the data, so I am wondering what is going on?

I am also troubled by the fact that I cannot find any threads online describing a similar issue. Am I the only one getting this? Am I doing something fundamentally wrong?

thanks P

6 months ago
GenoMax 106k

This is likely being caused by your use of parallel. Your IP may be getting flagged by trying to download multiple streams of data from NCBI and/or your network connection may simply be getting saturated leading to data corruption.

Please use a proper tool like datasets (LINK) to download the data. parallel has its uses but don't use it with a shared public resource.

Thanks, I will give datasets a try. However, out of curiosity: I have used parallel in this context plenty of times in the past, I mean A LOT, and I never had this problem. So, has something changed recently?

Have you tried to reduce the number of parallel jobs to see if that results in successful downloads? Looks like you are using 20 at the moment.
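A minimal sketch of what a gentler download step could look like, assuming curl and gzip are available. The helper name fetch_verified and the retry count are made up for illustration. It downloads one URL, tests the gzip stream, and keeps the file only if the test passes:

```shell
# Download one URL, test the gzip stream, and keep the file only if it is intact.
# Usage: fetch_verified URL [MAX_TRIES]   (fetch_verified is a hypothetical helper)
fetch_verified() {
    url=$1
    tries=${2:-3}
    out=${url##*/}              # local file name = last path component of the URL
    tmp="$out.part"
    i=1
    while [ "$i" -le "$tries" ]; do
        # curl must succeed AND the downloaded stream must pass gzip's integrity test
        if curl --fail --silent --show-error -o "$tmp" "$url" \
            && gzip -t "$tmp" 2>/dev/null; then
            mv "$tmp" "$out"
            return 0
        fi
        rm -f "$tmp"
        sleep "$i"              # short, growing pause before the next attempt
        i=$((i + 1))
    done
    echo "FAILED: $url" >&2
    return 1
}
```

In bash, something like `export -f fetch_verified; parallel -j 4 fetch_verified {} < ftpfilepaths` would run it with 4 jobs instead of 20. The job count is a guess to tune, not a known NCBI limit.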

There could be many reasons why this no longer works (even though it did in the past). My speculative list, in no particular order:

1. NCBI is now enforcing bandwidth restrictions per IP address.
2. A lot of NCBI infrastructure has been moving to the cloud, so there may be a performance hit or bandwidth limit at NCBI's end. This feels unlikely, though.
3. Something at your local end is causing a network bandwidth issue and/or a problem with I/O writes to disk, leading to data corruption.

I don't know how often the list of assemblies changes (weekly?) but perhaps you can just download the entries that changed instead of getting the entire set each time?
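As a rough sketch of that idea, assuming you keep the previous assembly_summary.txt around, the function below prints only the FTP directories of complete genomes that are in the new summary but not in the old one. The function name is made up; columns 12 and 20 are the same ones used in the original awk command.

```shell
# Print FTP directory paths of "Complete Genome" entries that appear in the
# new summary file but not in the old one.
# Usage: new_complete_paths OLD_SUMMARY NEW_SUMMARY   (hypothetical helper)
new_complete_paths() {
    awk -F '\t' '
        # first file: remember every complete-genome path already downloaded
        FILENAME == ARGV[1] { if ($12 == "Complete Genome") seen[$20]; next }
        # second file: print complete-genome paths not seen before
        $12 == "Complete Genome" && !($20 in seen) { print $20 }
    ' "$1" "$2"
}
```

The output can be fed to the same awk step that builds ftpfilepaths, so only new entries get downloaded. Note this does not detect assemblies that were removed or replaced.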

Fewer parallel jobs did the trick; only about 1% failed. I have started using datasets as well, thanks!
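For the few files that still fail, a check along these lines could list them and pull the matching URLs out of ftpfilepaths for a retry. This is only a sketch; both function names are invented for illustration, and it relies on `gzip -t` to test each archive without extracting it.

```shell
# Print the names of .gz files in a directory that fail gzip's integrity test.
# Prints nothing if every archive is intact.
# Usage: list_corrupt_gz [DIR]   (hypothetical helper)
list_corrupt_gz() {
    dir=${1:-.}
    for f in "$dir"/*.gz; do
        [ -e "$f" ] || continue                  # the glob matched nothing
        gzip -t "$f" 2>/dev/null || basename "$f"
    done
}

# Print the URLs whose file name (last path component) is in the corrupt list.
# Usage: urls_to_retry URL_LIST CORRUPT_NAMES   (hypothetical helper)
urls_to_retry() {
    awk -F/ 'NR == FNR { bad[$0]; next } ($NF in bad)' "$2" "$1"
}
```

For example, `list_corrupt_gz . > corrupt.txt` followed by `urls_to_retry ftpfilepaths corrupt.txt > retry_paths` gives a smaller list to re-download. Delete the corrupt files first so curl does not leave stale copies behind.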

6 months ago
Mensur Dlakic ★ 13k

Genome updater:

genome_updater.sh -d "refseq" -g "bacteria" -c "all" -l "Complete Genome" -f "genomic.fna.gz" -o genomes/bacteria -t 10 -u -m -a


For proteomes:

genome_updater.sh -d "refseq" -g "bacteria" -c "all" -l "Complete Genome" -f "protein.faa.gz" -o proteomes/bacteria -t 10 -u -m -a


Beware that this will take at least half a day, possibly longer. As you were already told, a larger number of threads may prompt NCBI to slow down the transfer for your IP address (5-10 works fine for me).

Wow, that looks beautiful! Looking forward to trying it! Thanks!