Hi everyone,
I am writing this post to help anyone who is struggling to download data in batch from Ensembl. Life was easy when Ensembl had ftp links and we could append a wildcard such as *.gz to the end of the URL to download multiple FASTA files at once. However, since Chrome removed support for ftp, Ensembl is migrating all its URLs from ftp to http. This is both good and bad news. Good, because you can now browse the directories in the browser, which previously gave errors with FTP links. Bad, because downloading multiple files with curl over HTTP links is not straightforward: a wildcard like *.gz is not expanded over HTTP. I wasted an entire day trying to figure out whether I could somehow use rsync to download multiple files from Ensembl. I finally found a solution using wget, and I encourage others to extend this thread with more insights. Here I take a protist as an example:
wget -r --no-parent --no-check-certificate -nd -nc -np -e robots=off -A.gz http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-52/protists/fasta/protists_alveolata1_collection/theileria_equi_strain_wa_gca_000342415/cdna/
The explanation of the flags below was obtained from Source:
- -r signifies that wget should recursively download data in any subdirectories it finds.
- -nd saves all matching files directly into the current directory instead of recreating the remote directory tree. If two files have identical names, wget appends a numeric extension to the later one.
- -nc does not download a file if it already exists.
- -np prevents files from parent directories from being downloaded. It is the short form of --no-parent, so giving both in the command is redundant but harmless.
- -e robots=off tells wget to ignore the robots.txt file. If this option is left out, wget obeys the server's robots.txt during recursive downloads, and if that file disallows web crawlers the download will not work.
- -A.gz restricts downloading to the specified file types (with .gz suffix in this case)
- --no-check-certificate skips the SSL certificate check. This is useful if the SSL certificate is set up incorrectly, but make sure you only do this on servers you trust. It only matters for https URLs.
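If you need more than one data type for the same species, the same flags can be wrapped in a small loop. The following is only a rough sketch and not something taken from the Ensembl documentation: fetch_gz and DRY_RUN are names I made up, and the sub-directory names cds and pep are an assumption about what sits next to cdna on the server, so check the directory listing in your browser first. I also added -P so that the files from each sub-directory land in their own local folder. By default the script only prints the wget commands instead of running them.

```shell
#!/bin/sh
# Species directory from the command above, without the trailing "cdna/".
BASE_URL="http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-52/protists/fasta/protists_alveolata1_collection/theileria_equi_strain_wa_gca_000342415"

# DRY_RUN=1 (the default here) prints the commands; DRY_RUN=0 runs them.
DRY_RUN="${DRY_RUN:-1}"

# fetch_gz is a hypothetical helper: it downloads every .gz file from one
# sub-directory of BASE_URL into a local folder with the same name.
fetch_gz() {
    subdir="$1"
    url="${BASE_URL}/${subdir}/"
    if [ "$DRY_RUN" = "1" ]; then
        echo "wget -r -np -nd -nc -e robots=off -A.gz -P ${subdir} ${url}"
    else
        wget -r -np -nd -nc -e robots=off -A.gz -P "$subdir" "$url"
    fi
}

# Assumed sub-directory names; adjust them to what the server actually lists.
for subdir in cdna cds pep; do
    fetch_gz "$subdir"
done
```

Save it as, say, fetch_ensembl.sh and run it once as is to check the printed URLs, then run it as DRY_RUN=0 sh fetch_ensembl.sh to actually download.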