Question: Downloading multiple species from ftp.ncbi.nih.gov using wget and wildcards
Asked 4.6 years ago by JV (Germany):

Hi,

I want to download all available genomes of multiple bacterial and archaeal genera.

Downloading the GenBank files for a single species is relatively easy (if you already know the exact folder name on the FTP server), e.g.:

wget -A .gbk ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C5_uid58741/

to get all GenBank files associated with Methanococcus maripaludis C5.

However, what is driving me crazy is trying to go through the genomes subfolders recursively using wildcards. For example, if I want to get ALL GenBank files of ALL Methanococcus species (or if I did not want to find out the exact folder names by hand), something like:

wget -A .gbk ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus*

This always gives me error messages, but I KNOW it's possible in principle. I found these instructions on GitHub for exactly the task I want, but they do not seem to work (perhaps the wget syntax has changed?):

wget -cNrv -t 45 -A *.gbk,*.fna "ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Genus_species*" -P .

Can anybody please tell me what I'm doing wrong?

 

Tags: ftp, genbank, sequence, ncbi, wget
written 4.6 years ago by JV, modified 4.4 years ago

Could you maybe try this tiny mod:

wget -cNrv -t 45 -A "*.gbk,*.fna" "ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Genus_species*" -P .
written 4.6 years ago by RamRS

Sadly that did not work either. I get a "404: Not Found" error, as with all my attempts to use wildcards in the directory names.

Does this command work for you (e.g. using "Methanococcus" as the genus name)? Could it be that it is simply a problem with my network/proxy settings?

written 4.6 years ago by JV

This exact command works for me, downloading multiple Methanococcus directories' FNA and GBK files:

wget -cNrv -t 45 -A "*.gbk,*.fna" "ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Methanococcus*" -P .

I Ctrl-C'd a few seconds into it. An ls -R reveals:

$ ls -R
ftp.ncbi.nih.gov

./ftp.ncbi.nih.gov:
genomes

./ftp.ncbi.nih.gov/genomes:
Bacteria

./ftp.ncbi.nih.gov/genomes/Bacteria:
Methanococcus_aeolicus_Nankai_3_uid58823 Methanococcus_maripaludis_C5_uid58741    Methanococcus_maripaludis_C6_uid58947    Methanococcus_maripaludis_C7_uid58847

./ftp.ncbi.nih.gov/genomes/Bacteria/Methanococcus_aeolicus_Nankai_3_uid58823:
NC_009635.fna NC_009635.gbk

./ftp.ncbi.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C5_uid58741:
NC_009135.fna NC_009135.gbk NC_009136.fna NC_009136.gbk

./ftp.ncbi.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C6_uid58947:
NC_009975.fna NC_009975.gbk

./ftp.ncbi.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C7_uid58847:
NC_009637.fna

 

The -P prefix part is kinda ignored, though, I think: I see the command creating a directory named "ftp.ncbi.nih.gov" and then subdirectories as on the server.

written 4.6 years ago by RamRS
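
If the extra "ftp.ncbi.nih.gov/genomes/Bacteria" directory levels are unwanted, wget can be told to flatten them; a minimal sketch (the -nH and --cut-dirs=2 values are an assumption matching the two path components above the species folders):

# -nH drops the host directory, --cut-dirs=2 drops genomes/Bacteria,
# so only the Methanococcus_* folders end up under the -P target
wget -cNrv -t 45 -A "*.gbk,*.fna" -nH --cut-dirs=2 "ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Methanococcus*" -P .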

Using that exact command I get this error:

"Proxy request sent, awaiting response... 404 Not Found
2014-10-24 17:07:02 ERROR 404: Not Found."

So it seems there is an issue with getting the downloads through the proxy server.

But that seems strange to me, because the download works perfectly if I use no wildcards (as in my first example).

I'll try downloading from my home computer later and then transferring the files to my workstation via USB stick.

EDIT:

Also, I get the warning:

"Warning: wildcards not supported in HTTP"

written 4.6 years ago by JV

Try using --accept-regex instead of -A. (Just a random suggestion)

Also, what is your OS and your $SHELL?

written 4.6 years ago by RamRS

I'm working on servers running Red Hat Linux. The default shell there (and the one I'm using) is Bash.

Switching to "--accept-regex" did not help, by the way.

EDIT: The same problem persists on a workstation with Ubuntu as the OS and zsh as the default shell (connected via the same proxy as the servers).

But I'm still hoping that it'll work when I try from home later...

written 4.6 years ago by JV

Well, I guess the sysadmin might have some kind of restriction on the number of files you can download with a wildcard. To test that, could you maybe try the pattern Methanococcus_aeolicus_Nankai* instead of Methanococcus*?

written 4.6 years ago by RamRS

Thanks, I'll try that when I get back to the office. But for now I can say that, from my home computer, the wget downloads work perfectly. So it's definitely the proxy settings at my institute.

written 4.6 years ago by JV
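
One way to test the proxy theory from an office machine is to tell wget to bypass the proxy for this transfer; a minimal sketch (assuming the proxy is picked up from the usual ftp_proxy/http_proxy environment variables, and that direct FTP is allowed through the firewall at all):

# skip the proxy for this single invocation
wget --no-proxy -cNrv -t 45 -A "*.gbk,*.fna" "ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Methanococcus*" -P .

# or clear the proxy variables for the current shell session
unset ftp_proxy http_proxy https_proxy
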
Answered 4.6 years ago by Carlos Borroto (Washington Metropolitan Area):

How about using rsync instead?

 

$ rsync --dry-run -avP --include "*.gbk" --include "*.fna" --include "Methanococcus*/" --exclude "*" ftp.ncbi.nih.gov::genomes/Bacteria/ /tmp

Warning Notice!

You are accessing a U.S. Government information system which includes this
computer, network, and all attached devices. This system is for
Government-authorized use only. Unauthorized use of this system may result in
disciplinary action and civil and criminal penalties. System users have no
expectation of privacy regarding any communications or data processed by this
system. At any time, the government may monitor, record, or seize any
communication or data transiting or stored on this information system.

-------------------------------------------------------------------------------

Welcome to the NCBI rsync server.


receiving file list ...
27 files to consider
./
Methanococcus_aeolicus_Nankai_3_uid58823/
Methanococcus_aeolicus_Nankai_3_uid58823/NC_009635.fna
Methanococcus_aeolicus_Nankai_3_uid58823/NC_009635.gbk
Methanococcus_maripaludis_C5_uid58741/
Methanococcus_maripaludis_C5_uid58741/NC_009135.fna
Methanococcus_maripaludis_C5_uid58741/NC_009135.gbk
Methanococcus_maripaludis_C5_uid58741/NC_009136.fna
Methanococcus_maripaludis_C5_uid58741/NC_009136.gbk
Methanococcus_maripaludis_C6_uid58947/
Methanococcus_maripaludis_C6_uid58947/NC_009975.fna
Methanococcus_maripaludis_C6_uid58947/NC_009975.gbk
Methanococcus_maripaludis_C7_uid58847/
Methanococcus_maripaludis_C7_uid58847/NC_009637.fna
Methanococcus_maripaludis_C7_uid58847/NC_009637.gbk
Methanococcus_maripaludis_S2_uid58035/
Methanococcus_maripaludis_S2_uid58035/NC_005791.fna
Methanococcus_maripaludis_S2_uid58035/NC_005791.gbk
Methanococcus_maripaludis_X1_uid70729/
Methanococcus_maripaludis_X1_uid70729/NC_015847.fna
Methanococcus_maripaludis_X1_uid70729/NC_015847.gbk
Methanococcus_vannielii_SB_uid58767/
Methanococcus_vannielii_SB_uid58767/NC_009634.fna
Methanococcus_vannielii_SB_uid58767/NC_009634.gbk
Methanococcus_voltae_A3_uid49529/
Methanococcus_voltae_A3_uid49529/NC_014222.fna
Methanococcus_voltae_A3_uid49529/NC_014222.gbk

sent 248 bytes  received 1604 bytes  3704.00 bytes/sec
total size is 60678518  speedup is 32763.78
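
Note that --dry-run only lists what would be transferred; dropping it performs the actual download with the same include/exclude filters (a sketch, writing into the current directory instead of /tmp):

$ rsync -avP --include "*.gbk" --include "*.fna" --include "Methanococcus*/" --exclude "*" ftp.ncbi.nih.gov::genomes/Bacteria/ .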

 

written 4.6 years ago by Carlos Borroto

Thanks, but it seems the same proxy problems apply here. It doesn't work from my office computer. I'll have to contact the sysadmin about this.

written 4.6 years ago by JV

Is the NCBI rsync service still alive? I can't make it work.

written 3.2 years ago by 5heikki

It is working from here. Maybe a firewall or proxy issue on your side?

written 3.2 years ago by Carlos Borroto

Can you post some exact command that works for you?

written 3.2 years ago by 5heikki

I used the exact command I posted in my answer. Can you post the error you are getting?

written 3.2 years ago by Carlos Borroto

Ah, it actually works from home so I guess it's a firewall issue. Damn it. This would be so handy ;_;

written 3.2 years ago by 5heikki
Answered 4.6 years ago by JV (Germany):

OK, I found a workaround and want to post it here in case anybody has the same problem sometime...

First, what we determined to be a likely reason for this problem (as per discussion with some of my colleagues): it seems the problem MAY be that NCBI itself restricts the number of parallel requests to their FTP server. The wildcards in my request get expanded to a large number of parallel requests. Possibly the internet connection at my institute is so fast that the parallel requests pile up quickly, while my home connection is slow, so the requests are effectively sent one after another instead of at the same time. This is not a researched explanation and not made from a fully qualified viewpoint, but since the following workaround worked for me, that is good enough for me:


Step 1:

Get the most current directory structure (WITHOUT contents) of the NCBI server.

This can be done EITHER by running

wget -r --no-remove-listing --spider ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria

(WARNING: this will take a LONG time) OR (more comfortably and faster) by logging in to the NCBI FTP server with FileZilla, going to the "/genomes/Bacteria" subfolder, marking all subfolders, right-clicking, choosing "Copy URL(s) to clipboard", and pasting the URLs into a text file.
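
If you take the wget route, the --no-remove-listing option leaves .listing files behind that can be turned into the URL list used in step 3; a minimal sketch (the path to the top-level .listing file is an assumption about where the spider run mirrors the server):

# print one FTP URL per subdirectory in the top-level listing, skipping "." and ".."
awk '/^d/ && $NF != "." && $NF != ".." {print "ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/" $NF "/"}' ftp.ncbi.nlm.nih.gov/genomes/Bacteria/.listing > urllist.txt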


Step 2:

Make a list of the genera whose GenBank files you want to download (one genus per line).


Step 3:

Iterate through your list of genera, grep each genus against your list of FTP subfolder URLs, and pass the matching URLs as arguments to wget.

Example:

cat genus.list | while read genus; do grep "$genus" urllist.txt | while read url; do wget -cNrv -t 45 -A "*.gbk" "$url" -P .; done; done

You can regularly repeat these steps in the same working directory to update your GenBank files IF new genomes from your genera of interest have been added to NCBI.
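
The same loop written out as a small script may be easier to adapt; a sketch, assuming genus.list and urllist.txt sit in the current working directory:

#!/bin/bash
# fetch the .gbk files for every genus listed in genus.list,
# using the folder URLs collected in urllist.txt
while read -r genus; do
    grep "$genus" urllist.txt | while read -r url; do
        wget -cNrv -t 45 -A "*.gbk" "$url" -P .
    done
done < genus.list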

written 4.6 years ago by JV

Looks great! A quick suggestion: Instead of the command:

wget -r --no-remove-listing --spider ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria

that recursively traverses the entire directory hierarchy, downloading all the folders in the process (-r with --spider still leads to files being downloaded), why not run this:

wget -l 2 --no-remove-listing --spider ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria

The above command does not download actual files and spiders only 2 levels deep, so it is extremely fast.

written 4.6 years ago by RamRS

Thanks, that's a good tip!

I also started using the "--ignore-directories" argument to exclude all the "wgs", "pubmed", etc. folders in the root directory of the FTP server in step 3.

This is because, even though I call wget with a very specific URL in each iteration, it still goes through ALL of the folders of the NCBI FTP server, downloads an "index.html", and immediately deletes it again. This also takes a lot of time.

Do you perhaps also have a tip on how to stop wget from doing this? (Or maybe it doesn't matter so much anymore if it only goes down two subfolders into each subdirectory tree after adding your suggestion.)

written 4.6 years ago by JV

I think the index.html serving is a server-side configuration, not sure if we can do something about it. Also, there is no --ignore-directories option. Do you mean --exclude-directories, perhaps?

written 4.6 years ago by RamRS

yes "--exclude-directories" was what I meant (Got that a bit mixed up here). 

written 4.6 years ago by JV
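
For reference, this is roughly how the corrected flag would fit into the step-3 wget call; a sketch (the excluded paths are just the top-level folders mentioned above and would need to be extended to whatever else gets traversed):

wget -cNrv -t 45 -A "*.gbk" --exclude-directories="/wgs,/pubmed" "$url" -P .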

By the way: for me, replacing "-r" with "-l 2" did not work for the download step (step 3). I thought I could make it easy for myself there... If I do that, I get only a bunch of HTML documents linking to the pages on the FTP server that contain the .gbk files (not the files themselves and also not the directory structure).

written 4.6 years ago by JV

The -l 2 was for the first step which you mentioned took a lot of time. It is a spider step that is useful to fetch URLs. The actual data download step, which is step 3 in your workflow, will need the -r option. I apologize - I should've made my earlier statement clearer.

written 4.6 years ago by RamRS
Answered 4.6 years ago by JV (Germany):

No no, your statement was clear (you meant it for the first step). I was just hoping I could apply it to the third step as well and was confused when it did not work.

written 4.6 years ago by JV

You might wanna move this as a comment reply.

written 4.6 years ago by RamRS

Also, yes. The -r is required for the actual download. The -l option was just to limit the spidering level.

written 4.6 years ago by RamRS
Answered 4.4 years ago by freedy96:

Hey try LongPathTool to solve errors in wget.

written 4.4 years ago by freedy96

Ummm, I hope you read the contents of the thread. The OP has no problem with long file names. LongPathTool is built for Windows, and that makes sense, because no UNIX-based OS suffers from "file path too long" errors. I don't see how LongPathTool can help with wget on Linux-based systems.

written 4.4 years ago by RamRS