7 months ago

Hi everyone,

I am writing this post to help anyone who is struggling to download data in batch from Ensembl. Life was easy when Ensembl had FTP links and we could put a wildcard such as *.gz in the URL to download multiple FASTA files. However, since Chrome removed support for FTP, Ensembl is migrating all its URLs from FTP to HTTP. This is both good and bad news. Good, because you can now browse the directories in the browser, which previously gave errors with FTP links; bad, because downloading multiple files with wget or curl over HTTP is not straightforward, as wget does not expand wildcards in HTTP URLs. I wasted an entire day figuring out whether I could use wget, curl or rsync to download multiple files from Ensembl. I finally found a solution, and I encourage others to extend this thread with more insights. Here I take a protist as an example:


wget -r --no-parent --no-check-certificate -nd -nc -e robots=off -A.gz http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-52/protists/fasta/protists_alveolata1_collection/theileria_equi_strain_wa_gca_000342415/cdna/


Explanation of the flags (obtained from Source):

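• -r tells wget to download recursively, following links into any subdirectories it finds.
• --no-parent (short form -np) stops the recursion from climbing into the parent directory, so only files at or below the given URL are fetched. The trailing slash on the URL matters here: without it, wget treats the last path component as a file rather than a directory.
• -nd saves all matching files directly in the current directory instead of recreating the remote directory tree. If two files have identical names, wget normally appends a numeric suffix to the later one.
• -nc skips any file that already exists locally, so an interrupted download can be restarted without fetching everything again.
• -A.gz keeps only files whose names end in .gz; this replaces the *.gz wildcard that worked with FTP.
• -e robots=off tells wget to ignore the robots.txt file. If this option is left out, the robots.txt file tells wget that the server does not want web crawlers, and this will prevent wget from working.
• --no-check-certificate disables the SSL certificate check. It only has an effect on https:// URLs and is useful if the SSL certificate is set up incorrectly, but make sure you only do this on servers you trust.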
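If you need more than one directory, the same command can be wrapped in a small shell function and called in a loop. This is only a minimal sketch: fetch_dir, BASE and DRY_RUN are names I made up for illustration, and the pep subdirectory is an assumption (check in the browser which subdirectories your species actually has). By default it only prints the wget commands so you can check them before downloading.

```shell
#!/bin/sh
# Species directory taken from the example in this post.
BASE="http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-52/protists/fasta/protists_alveolata1_collection/theileria_equi_strain_wa_gca_000342415"

# Hypothetical helper: download every .gz file under one directory URL
# into the local directory given as the second argument.
fetch_dir() {
    url="${1%/}/"   # force exactly one trailing slash (needed for --no-parent)
    dest="$2"
    # Build the wget command as the positional parameters.
    set -- wget -r --no-parent -nd -nc -e robots=off -A.gz -P "$dest" "$url"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"   # dry run: only show the command
    else
        "$@"                 # real run: execute wget
    fi
}

# "cdna" comes from the post; "pep" is an assumed sibling directory.
for sub in cdna pep; do
    fetch_dir "$BASE/$sub" "$sub"
done
```

Run it once as is to see the commands, then run it again with DRY_RUN=0 in the environment to actually download. The -P option puts the files from each subdirectory into a local folder of the same name, so cdna and pep files do not get mixed together.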