3
Entering edit mode
8.4 years ago
JV ▴ 450

Hi,

I want to download all available genomes of multiple bacterial and archaeal genera.

e.g:

wget -A .gbk ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C5_uid58741/


to get all GenBank files associated with Methanococcus maripaludis C5.

However, what is driving me crazy is trying to go through the genome subfolders recursively using wildcards. For example, if I want to get ALL GenBank files of ALL Methanococcus species (or if I did not want to find out the exact folder names by hand), something like:

wget -A .gbk ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus*


This always gives me error messages, but I KNOW it's possible in principle. I found these instructions on GitHub for exactly the task I want, but they do not seem to work (perhaps the wget syntax has changed?):

wget -cNrv -t 45 -A *.gbk,*.fna "ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Genus_species*" -P .


Can anybody please tell me what I'm doing wrong?

ftp wget sequence NCBI genbank • 19k views
1
Entering edit mode

Could you maybe try this tiny mod:

wget -cNrv -t 45 -A "*.gbk,*.fna" "ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Genus_species*" -P .
1
Entering edit mode

Sadly, that did not work either. I get a "404: Not Found" error, as with all my attempts to use wildcards in the directory names.

Does this command work for you (e.g. using "Methanococcus" as the genus name)? Could it be that it is simply a problem with my network/proxy settings?

1
Entering edit mode

wget -cNrv -t 45 -A "*.gbk,*.fna" "ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Methanococcus*" -P .


I Ctrl-C'd a few seconds into it. An ls -R reveals:

$ ls -R ftp.ncbi.nih.gov
./ftp.ncbi.nih.gov:
genomes

./ftp.ncbi.nih.gov/genomes:
Bacteria

./ftp.ncbi.nih.gov/genomes/Bacteria:
Methanococcus_aeolicus_Nankai_3_uid58823
Methanococcus_maripaludis_C5_uid58741
Methanococcus_maripaludis_C6_uid58947
Methanococcus_maripaludis_C7_uid58847

./ftp.ncbi.nih.gov/genomes/Bacteria/Methanococcus_aeolicus_Nankai_3_uid58823:
NC_009635.fna  NC_009635.gbk

./ftp.ncbi.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C5_uid58741:
NC_009135.fna  NC_009135.gbk  NC_009136.fna  NC_009136.gbk

./ftp.ncbi.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C6_uid58947:
NC_009975.fna  NC_009975.gbk

./ftp.ncbi.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C7_uid58847:
NC_009637.fna

The prefix (-P) part seems to be ignored, though: the command creates a directory named "ftp.ncbi.nih.gov" and then subdirectories mirroring the server.

1
Entering edit mode

Using that exact command I get this error:

Proxy request sent, awaiting response... 404 Not Found
2014-10-24 17:07:02 ERROR 404: Not Found.

So it seems there is an issue with getting the downloads through the proxy server. That seems strange to me, because the download works perfectly if I use no wildcards (as in my first example). I'll try downloading from my home computer later and then transfer the files to my workstation on a USB stick.

EDIT: I also get the warning:

Warning: wildcards not supported in HTTP

1
Entering edit mode

Try using --accept-regex instead of -A (just a random suggestion). Also, what is your OS and your $SHELL?

1
Entering edit mode

I am working on servers running Red Hat Linux. The default shell (and the one I'm using) there is Bash.

Switching to --accept-regex did not help, by the way.

EDIT: The same problem persists on a workstation with Ubuntu as OS and zsh as default shell (connected via the same proxy as the servers).

But still hoping that it'll work when I try from home later...
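Given the "wildcards not supported in HTTP" warning reported above, the proxy is the prime suspect: when wget is forced through an HTTP proxy, the FTP request is rewritten as a plain HTTP one, and server-side glob expansion is simply not available there. A quick check, as a sketch (the variable names are the conventional ones; a proxy may also be configured in /etc/wgetrc or ~/.wgetrc):

```shell
# An http_proxy/ftp_proxy environment variable makes wget tunnel FTP
# requests over HTTP, and glob patterns are not supported over HTTP --
# hence the "wildcards not supported in HTTP" warning.
env | grep -i proxy || echo "no proxy variables set"

# Clear the proxy for the current shell session; this only helps if the
# network allows direct FTP connections.
unset ftp_proxy FTP_PROXY http_proxy https_proxy
```

If the variables are set, either clear them for the session as above or ask the sysadmin whether direct FTP connections are permitted.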

1
Entering edit mode

Well, I guess the sysadmin might have some kind of restriction on the number of files you can download with a wildcard. To test that, could you maybe try the pattern Methanococcus_aeolicus_Nankai* instead of Methanococcus*?

1
Entering edit mode

Thanks, I'll try that when I get back to the office. But for now I can say that, from my home computer, the downloads with wget work perfectly. So it's definitely the proxy settings at my institute.

3
Entering edit mode
8.4 years ago
Carlos Borroto ★ 2.0k

$ rsync --dry-run -avP --include "*.gbk" --include "*.fna" --include "Methanococcus*/" --exclude "*" ftp.ncbi.nih.gov::genomes/Bacteria/ /tmp

Warning Notice!
You are accessing a U.S. Government information system which includes this computer, network, and all attached devices. This system is for Government-authorized use only. Unauthorized use of this system may result in disciplinary action and civil and criminal penalties. System users have no expectation of privacy regarding any communications or data processed by this system. At any time, the government may monitor, record, or seize any communication or data transiting or stored on this information system.
-------------------------------------------------------------------------------
Welcome to the NCBI rsync server.

receiving file list ...
27 files to consider
./
Methanococcus_aeolicus_Nankai_3_uid58823/
Methanococcus_aeolicus_Nankai_3_uid58823/NC_009635.fna
Methanococcus_aeolicus_Nankai_3_uid58823/NC_009635.gbk
Methanococcus_maripaludis_C5_uid58741/
Methanococcus_maripaludis_C5_uid58741/NC_009135.fna
Methanococcus_maripaludis_C5_uid58741/NC_009135.gbk
Methanococcus_maripaludis_C5_uid58741/NC_009136.fna
Methanococcus_maripaludis_C5_uid58741/NC_009136.gbk
Methanococcus_maripaludis_C6_uid58947/
Methanococcus_maripaludis_C6_uid58947/NC_009975.fna
Methanococcus_maripaludis_C6_uid58947/NC_009975.gbk
Methanococcus_maripaludis_C7_uid58847/
Methanococcus_maripaludis_C7_uid58847/NC_009637.fna
Methanococcus_maripaludis_C7_uid58847/NC_009637.gbk
Methanococcus_maripaludis_S2_uid58035/
Methanococcus_maripaludis_S2_uid58035/NC_005791.fna
Methanococcus_maripaludis_S2_uid58035/NC_005791.gbk
Methanococcus_maripaludis_X1_uid70729/
Methanococcus_maripaludis_X1_uid70729/NC_015847.fna
Methanococcus_maripaludis_X1_uid70729/NC_015847.gbk
Methanococcus_vannielii_SB_uid58767/
Methanococcus_vannielii_SB_uid58767/NC_009634.fna
Methanococcus_vannielii_SB_uid58767/NC_009634.gbk
Methanococcus_voltae_A3_uid49529/
Methanococcus_voltae_A3_uid49529/NC_014222.fna
Methanococcus_voltae_A3_uid49529/NC_014222.gbk

sent 248 bytes  received 1604 bytes  3704.00 bytes/sec
total size is 60678518  speedup is 32763.78

0
Entering edit mode

Thanks, but it seems the same proxy problems apply here. It doesn't work from my office computer. I will have to contact the sysadmin about this.

0
Entering edit mode

Is the NCBI rsync service still alive? I can't make it work.

0
Entering edit mode

It is working from here. Maybe a firewall or proxy issue on your side?

0
Entering edit mode

Can you post an exact command that works for you?

0
Entering edit mode

I used the exact command I posted in my answer. Can you post the error you are getting?

0
Entering edit mode

Ah, it actually works from home, so I guess it's a firewall issue. Damn it. This would be so handy ;_;

1
Entering edit mode
8.4 years ago
JV ▴ 450

OK, I found a workaround and want to post it here in case anybody has the same problem sometime.

First, what we determined to be a likely reason for this problem (as per discussion with some of my colleagues): it seems the problem MAY be that NCBI itself restricts the number of parallel requests to their FTP server. The wildcards in my request get expanded to a large number of parallel requests. Possibly the internet connection at my institute is too fast, so the parallel requests sum up quickly, while my home connection is slow and the requests are seemingly sent one after another instead of at the same time. This is not a researched explanation and not made from a fully qualified viewpoint, but as the following workaround worked for me, that is fine enough for me.

Step 1: Get the most current directory structure (WITHOUT contents) of the NCBI server. This can be done EITHER by running

wget -r --no-remove-listing --spider ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria

(WARNING: this will take a LONG time) OR (more comfortably and faster) by logging in to the NCBI FTP server with FileZilla, going to the "/genomes/Bacteria" subfolder, marking all subfolders, right-clicking, choosing "Copy URLs to Clipboard" and pasting the URLs to a text file.

Step 2: Make a list of the genera you want to download the gbks from (one per line).

Step 3: Iterate through your list of genera; for each genus, grep the matching URLs from your list of FTP subfolders and give the results as arguments to wget. Example:

cat genus.list | while read genus; do grep "$genus" urllist.txt | while read url; do wget -cNrv -t 45 -A "*.gbk" "$url" -P .; done; done


You can regularly repeat these steps in the same working directory to update your GenBank files IF new genomes from your genera of interest have been added to NCBI.
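Before letting the step-3 loop loose on the real server, the URL selection can be dry-run against a tiny mock listing. Everything below is illustrative: the file names genus.list and urllist.txt follow the steps above, but the example URLs and the echo wrapper are assumptions for offline testing:

```shell
# Mock inputs mirroring steps 1 and 2 (contents are made up).
cat > urllist.txt <<'EOF'
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C5_uid58741/
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus_vannielii_SB_uid58767/
EOF
printf 'Methanococcus\n' > genus.list

# Step 3, but printing each wget command instead of executing it, so the
# URL selection can be verified before any download starts.
while read -r genus; do
    grep "$genus" urllist.txt | while read -r url; do
        echo wget -cNrv -t 45 -A '*.gbk' "$url" -P .
    done
done < genus.list > commands.txt

cat commands.txt
```

Dropping the echo (and the commands.txt redirect) turns the dry run into the actual download loop.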

1
Entering edit mode

Looks great! A quick suggestion: Instead of the command:

wget -r --no-remove-listing --spider ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria

use:

wget -l 2 --no-remove-listing --spider ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria

The above command does not download actual files and spiders only 2 levels deep, so it is extremely fast.
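One way to turn that spider run into the URL list of step 1: wget mirrors the remote directory tree locally (empty, since --spider downloads no data files), so the folder names can be globbed from disk afterwards. This is a sketch; the host-prefixed local path is wget's default output layout, and the mkdir lines merely simulate a spider run for demonstration:

```shell
# Simulate what a spider run leaves behind (two empty genome folders),
# so the URL-building loop below can be demonstrated offline.
mkdir -p ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C5_uid58741
mkdir -p ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus_vannielii_SB_uid58767

# Turn each local subfolder back into its FTP URL, one per line.
for d in ftp.ncbi.nlm.nih.gov/genomes/Bacteria/*/; do
    printf 'ftp://%s\n' "$d"
done > urllist.txt

cat urllist.txt
```

After a real spider run, the mkdir lines are unnecessary and the glob picks up every genome folder the spider saw.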

0
Entering edit mode

Thanks, that's a good tip!

I also started using the "--ignore-directories" argument to exclude all the "wgs", "pubmed" etc. folders in the root directory of the FTP server in step 3.

This is because, even though I call wget with a very specific URL in each iteration, it still goes through ALL of the folders of the NCBI FTP server, downloads an "index.html" and immediately deletes it again. This also takes a lot of time.

Do you perhaps also have a tip on how to stop wget from doing this? (Or maybe it doesn't matter so much anymore, if it only goes down two subfolders into each subdirectory tree after adding your suggestion.)

0
Entering edit mode

I think the index.html serving is a server-side configuration; I'm not sure we can do anything about it. Also, there is no --ignore-directories option. Do you mean --exclude-directories, perhaps?
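For the record, a sketch of the corrected flag spliced into the step-3 command; -X is the short form of --exclude-directories, and the excluded directory names are just the examples mentioned above:

```shell
# Hypothetical step-3 invocation; -X (--exclude-directories) keeps the
# recursive retrieval out of the listed top-level server directories.
url="ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C5_uid58741/"
cmd="wget -cNrv -t 45 -A *.gbk -X /wgs,/pubmed $url -P ."

# Printed rather than executed here; run the command itself to download.
echo "$cmd"
```

Since everything wanted here lives under /genomes, the inverse flag -I /genomes (--include-directories) may be the tighter restriction.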

0
Entering edit mode

Yes, --exclude-directories was what I meant (got that a bit mixed up here).

0
Entering edit mode

By the way: for me, replacing -r with -l 2 did not work for the download step (step 3). I thought I could make it easy for myself there... If I do that, I get only a bunch of HTML documents linking to the pages on the FTP server that contain the .gbk files (not the files themselves, and also not the directory structure).

0
Entering edit mode

The -l 2 was for the first step, which you mentioned took a lot of time. It is a spider step that is useful for fetching URLs. The actual data download step, which is step 3 in your workflow, will need the -r option. I apologize; I should have made my earlier statement clearer.

0
Entering edit mode

No no, your statement was clear (you meant it for the first step). I was just hoping I could apply it to the third step also, and was confused when it did not work.

0
Entering edit mode

Also, yes: the -r is required for the actual download. The -l option was just there to limit the spidering level.

0
Entering edit mode
8.2 years ago
freedy96 • 0

Hey, try LongPathTool to solve errors in wget.

0
Entering edit mode

Ummm, I hope you read the contents of the thread. OP has no problem with long file names. LongPathTool is built for Windows, and that makes sense, because no UNIX-based OS suffers from "file path too long" errors. I don't see how LongPathTool can help with wget on Linux-based systems.