retrieving entire genomic sequence contents of a database
2.8 years ago

Hi all, I'm trying to download all bacterial genomes from Ensembl so that I can mine them for bacteriocin gene clusters. However, I've been struggling and was hoping someone could advise. Any time I run wget on the index URL, all I get back is a file like "index.html". I've also tried things like wget ftp://ftp.bacteria.ensembl.org/species, but with no luck.

Can someone please advise on the steps I should take to pull all genomic sequences from a database from the command line via FTP, preferably in GenBank (gbk), GFF, or FASTA format?

Any help is greatly appreciated!

genomes mining database ftp ensembl • 757 views
2.8 years ago
Mensur Dlakic ★ 27k

There is a script, genome_updater.sh, that handles bulk genome downloads, but from NCBI rather than Ensembl. For example, this command will download all complete bacterial genomes from RefSeq:

genome_updater.sh -g "bacteria" -d "refseq" -l "Complete Genome" -f "genomic.fna.gz" -o "bac_refseq" -t 20

A small command-line change will let you download all GenBank genomes if you wish, and include even those (meta)genomes that may not be complete.
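As a rough illustration of that change, the sketch below prints a GenBank variant of the command above rather than running it. It assumes that -d accepts "genbank" and that leaving out -l lifts the "Complete Genome" restriction; both should be checked against the script's own help output. The function name and the output directory name are hypothetical.

```shell
# Sketch only: the RefSeq command from above, switched to GenBank.
# Assumptions: -d accepts "genbank", and leaving out -l removes the
# "Complete Genome" restriction. The output directory name is hypothetical.
DB="genbank"
OUT_DIR="bac_genbank"

# Print the command instead of running it, so it can be checked first.
genbank_cmd() {
  echo "genome_updater.sh -g \"bacteria\" -d \"$DB\" -f \"genomic.fna.gz\" -o \"$OUT_DIR\" -t 20"
}

genbank_cmd
```

Printing first is deliberate: a GenBank-wide bacterial download is very large, so it is worth reviewing the command before pasting it into a terminal.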

2.8 years ago
Ben_Ensembl ★ 2.4k

Hi sandrewsaunderson,

If you are keen to use Ensembl for this task, it's important to remember that the bacterial files are stored in collections on the FTP site, for example: http://ftp.ensemblgenomes.org/pub/bacteria/release-51/fasta/

This may be where you have encountered problems with your download.
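Because the files sit several directories below that release directory, a plain wget on the top-level URL only saves the directory listing as index.html. A minimal sketch of a recursive download is given here, under stated assumptions: the base URL is the one linked above, while the file-name pattern (*.dna.toplevel.fa.gz), the output directory, and the function name are assumptions for illustration and may need adjusting after browsing one collection by hand. The script only prints the wget command; nothing is downloaded until that command is run.

```shell
# Base URL is taken from the post above; everything else is an assumption.
BASE_URL="http://ftp.ensemblgenomes.org/pub/bacteria/release-51/fasta/"
PATTERN="*.dna.toplevel.fa.gz"     # assumed name pattern for genome FASTA files
OUT_DIR="ensembl_bacteria_fasta"   # hypothetical output directory

# Print the wget command so it can be inspected before anything is downloaded.
#   -r             recurse into the collection and species sub-directories
#   -np            never ascend to the parent directory
#   -nH            do not create a directory named after the host
#   --cut-dirs=4   drop pub/bacteria/release-51/fasta from the saved paths
#   -A             keep only files matching the pattern
#   -P             save everything under OUT_DIR
build_cmd() {
  echo "wget -r -np -nH --cut-dirs=4 -A '$PATTERN' -P '$OUT_DIR' '$BASE_URL'"
}

build_cmd
```

To start the download, copy the printed command into a terminal. Expect a very large transfer; pointing BASE_URL at a single collection directory first is a cheap way to confirm that the pattern matches the files you want.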

Best wishes

Ben, Ensembl Helpdesk
