Hi Agata,
Genomax's solution seems the most straightforward, but I thought I'd add information about how to do this with Ensembl. This isn't possible with the Ensembl REST API, so you will have to use a combination of the Perl API and curl/wget:
(1) Extracting the E. coli species
The LookUp module, queried with the parent taxon ID for E. coli (562), will return every genome under that branch. The LookUp module lives in the ensemblgenomes-api repository, so this git repo needs to be in your PERL5LIB too: https://github.com/EnsemblGenomes/ensemblgenomes-api
Documentation here: http://ensemblgenomes.org/info/access/eg_api
(2) Getting the peptide FASTA files from the FTP site:
We organise our bacteria into groups called “collections”. This is just to help us manage the data volumes, and our FTP server is organised to reflect these groupings too. With each release, we provide the file below, which lists which species are grouped into which collection:
ftp://ftp.ensemblgenomes.org/pub/bacteria/release-46/species_EnsemblBacteria.txt
You can use this file to work out the right URL for files you want from the FTP server, and use curl or wget. The code snippet attached does the lookup and download:
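use strict;
use warnings;
use Bio::EnsEMBL::LookUp;

# Build a helper to query the Ensembl public MySQL instance
my $lookup = Bio::EnsEMBL::LookUp->new();

# 562 is the taxon ID for E. coli; this returns one database adaptor per genome below it
my @dbas = @{$lookup->get_all_by_taxon_branch(562)};

foreach my $dba (@dbas) {
    my $species = $dba->species();

    # Assumes species_EnsemblBacteria.txt (see above) was saved locally as all_bacteria_in_ensembl.txt.
    # Column 13 holds the database name, from which the collection name is extracted.
    # -w avoids matching longer species names; head keeps a single result.
    my $cmd = "grep -i -w \"$species\" all_bacteria_in_ensembl.txt | cut -f 13 | sed -n 's/\\(bacteria_[0-9]*_collection\\).*/\\1/p' | head -n 1";
    my $collection_name = `$cmd`;
    chomp($collection_name);

    if ($collection_name) {
        my $ftp_pep = "ftp://ftp.ensemblgenomes.org/pub/bacteria/release-46/fasta/$collection_name/$species/pep/*.pep.all.fa.gz";
        print "Fetching PEP file for $species: $ftp_pep\n";
        # List form bypasses the shell, so the * reaches wget untouched for its own FTP globbing
        system('wget', $ftp_pep) == 0 or warn "wget failed for $species\n";
    }

    # Important to disconnect so that you do not accidentally flood the server with unused connections
    $dba->dbc()->disconnect_if_idle();
}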
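If the grep-based match turns out to be too loose for your species names, the lookup step can also be sketched as a small shell helper that compares the species column exactly. This is only a rough sketch under assumptions: the function names collection_for and pep_url are made up for illustration, the file is assumed to be tab-separated with the species name in column 2 (column 13 is the one used in the snippet above), and the URL layout is simply copied from the snippet.

```shell
#!/bin/sh
# Hypothetical helper: print the collection name for one species.
# Assumptions: tab-separated file, species name in column 2,
# database name (e.g. bacteria_N_collection_core_...) in column 13.
collection_for() {
    species=$1
    species_file=$2
    awk -F '\t' -v sp="$species" '
        $2 == sp {
            # Keep only the leading "bacteria_<number>_collection" part
            if (match($13, /^bacteria_[0-9]+_collection/)) {
                print substr($13, RSTART, RLENGTH)
                exit
            }
        }' "$species_file"
}

# Hypothetical helper: build the peptide FASTA URL pattern for one species.
# The * is left in place for wget to expand on the FTP server.
pep_url() {
    release=$1
    collection=$2
    species=$3
    printf 'ftp://ftp.ensemblgenomes.org/pub/bacteria/release-%s/fasta/%s/%s/pep/*.pep.all.fa.gz\n' \
        "$release" "$collection" "$species"
}
```

With these defined, something like wget "$(pep_url 46 "$(collection_for "$species" all_bacteria_in_ensembl.txt)" "$species")" would fetch one species, provided collection_for printed a non-empty name.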
Best wishes
Ben
Ensembl Helpdesk
How about using NCBI and the ncbi-genome-download tool by Kai Blin? Same data everywhere. As simple as