Forum: How to download multiple .gz files from the Ensembl Protists database

2.2 years ago

rohitsatyam102 ▴ 840

Hi everyone,

I am writing this post to help anyone who is struggling to download data in batch from Ensembl. Life was easy when Ensembl had FTP links and we could append a wildcard such as *.gz to the URL to download multiple FASTA files. However, since Chrome removed support for FTP, Ensembl has been migrating all its URLs from FTP to HTTP. This is both good and bad news. Good, because you can now browse the directories in the browser, which previously gave errors with FTP links; bad, because downloading multiple files with wget or curl over HTTP is not straightforward. I spent an entire day figuring out whether I could use wget, curl or rsync to download multiple files from Ensembl. I finally found a solution, and I encourage others to extend this thread with more insights. Here I take a protist as an example:


wget -r --no-parent --no-check-certificate -nd -nc -np -e robots=off -A.gz http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-52/protists/fasta/protists_alveolata1_collection/theileria_equi_strain_wa_gca_000342415/cdna/

Explanation of the flags was obtained from Source

-r signifies that wget should recursively download data in any subdirectories it finds.
-nd saves all matching files in the current directory instead of recreating the remote directory tree. If two files have identical names, a numeric suffix is appended.
-nc does not download a file if it already exists.
-np (short for --no-parent, so the command above sets it twice) prevents files from parent directories from being downloaded.
-e robots=off tells wget to ignore the robots.txt file. If this option is left out, wget obeys the server's robots.txt, which may disallow web crawlers and so prevent the recursive download from working.
-A.gz restricts downloading to the specified file types (those with a .gz suffix in this case).
--no-check-certificate disregards the SSL certificate check. This is useful if the SSL certificate is set up incorrectly, but make sure you only do this on servers you trust.
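
For repeated use, the flags above can be wrapped in a small shell function. This is only a sketch: the function name fetch_gz, the DRY_RUN variable and the output directory argument are my own hypothetical additions, not part of wget. It uses -P (--directory-prefix) to choose where the files land, and it leaves out --no-check-certificate on the assumption that the server's certificate is valid.

```shell
# Hypothetical helper (not part of wget): wraps the flags explained above.
# Usage: fetch_gz URL [OUTPUT_DIR]
# Set DRY_RUN=1 to print the command instead of contacting the server.
fetch_gz() {
    url=$1
    outdir=${2:-.}
    # -P (--directory-prefix) sets the directory the flattened files are saved in.
    set -- wget -r -np -nd -nc -e robots=off -A '.gz' -P "$outdir" "$url"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Preview the command for the directory used in this post (nothing is downloaded).
DRY_RUN=1
fetch_gz "http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-52/protists/fasta/protists_alveolata1_collection/theileria_equi_strain_wa_gca_000342415/cdna/" cdna
```

Set DRY_RUN=0 (or unset it) to actually run the download once the printed command looks right.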

wget protist ensembl • 910 views

updated 9 weeks ago by BioinfGuru ★ 1.7k • written 2.2 years ago by rohitsatyam102 ▴ 840

Thanks for this post, big help. In my case, I wanted to select specific .gz files from an Ensembl directory. I thought I'd post a comment to highlight the additional useful option --accept-regex:

$ wget -r -nd -np -nc -e robots=off -A ".fa.gz" --accept-regex "dna\.primary_assembly\.[0-9XY]+\.fa\.gz" https://ftp.ensembl.org/pub/current/fasta/sus_scrofa/dna/

With -A ".gz", all files ending with .gz in the .../dna/ directory are downloaded (many of which I don't need).

With --accept-regex, you can download only the files you want, which in my case was the primary assembly, without masking, for all chromosomes.

I included both options in my command because without -A, an index file I don't want is still downloaded.
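
Before crawling, it can help to try the pattern on a few file names locally. The sketch below uses grep -E as a stand-in for wget's regex matching, which is an approximation (wget applies --accept-regex to the complete URL). The file names are illustrative examples in the Ensembl naming style, not a real directory listing. The pattern escapes the dots and lists the chromosome names in one character class.

```shell
# Pattern to try out; dots are escaped so they match a literal "." only.
pattern='dna\.primary_assembly\.[0-9XY]+\.fa\.gz'

# Illustrative file names only; the real directory listing may differ.
matches=$(printf '%s\n' \
    'Sus_scrofa.Sscrofa11.1.dna.primary_assembly.1.fa.gz' \
    'Sus_scrofa.Sscrofa11.1.dna.primary_assembly.X.fa.gz' \
    'Sus_scrofa.Sscrofa11.1.dna_rm.primary_assembly.1.fa.gz' \
    'Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz' \
    | grep -E "$pattern")

# Print the names the pattern would keep.
echo "$matches"
```

Only the unmasked primary assembly names for chromosomes 1 and X should be printed; the repeat-masked (dna_rm) and toplevel names are filtered out.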

9 weeks ago by BioinfGuru ★ 1.7k