How to download FASTA protein sequences for Escherichia coli strains from Ensembl?
4.7 years ago
agata88 ▴ 870

Hi all!

I would like to download FASTA protein sequences for all Escherichia coli strains. On the Ensembl Bacteria page I see that I would need to download 2,681 files (https://bacteria.ensembl.org/info/website/ftp/index.html).

I would like to do this programmatically using the Ensembl REST API. Unfortunately, I cannot find suitable API endpoints. Can anyone suggest the best approach?

Here are the API Endpoints: http://rest.ensembl.org

Many thanks for any suggestions,

Best,

Agata

ensembl ensembl rest api bacteria sequence • 2.1k views

How about using NCBI and the ncbi-genome-download tool by Kai Blin? The data are the same everywhere.

As simple as

ncbi-genome-download --genus "Escherichia coli" bacteria
4.7 years ago
Ben Moore ★ 2.4k

Hi Agata,

Genomax's solution seems the most straightforward, but I thought I'd add information about how to do this with Ensembl. This isn't possible with the Ensembl REST API. You will have to use a combination of the Perl API and curl/wget:

(1) Extracting the E. coli species

The LookUp module, called with the parent taxon ID for E. coli, will help. The LookUp module lives in the ensemblgenomes-api repository, so this git repo needs to be in your PERL5LIB too: https://github.com/EnsemblGenomes/ensemblgenomes-api. Documentation is here: http://ensemblgenomes.org/info/access/eg_api

(2) Getting the peptide FASTA files from the FTP site:

We organise our bacteria into groups called “collections”. This is just to help us manage the data volumes. Accordingly, our FTP server is organised to reflect these groupings too. With each release, we provide a file that says which species are grouped into which collection: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-46/species_EnsemblBacteria.txt

You can use this file to work out the right URL for the files you want from the FTP server, and fetch them with curl or wget. The code snippet below does the lookup and download. It reads the species list from a local file named all_bacteria_in_ensembl.txt, presumably a saved copy of the file above:

use strict;
use warnings;
use Bio::EnsEMBL::LookUp;

# Build a helper to query the Ensembl public MySQL instance
my $lookup = Bio::EnsEMBL::LookUp->new();

# 562 is the NCBI taxon ID for Escherichia coli; fetch every genome on that branch
my @dbas = @{$lookup->get_all_by_taxon_branch(562)};

foreach my $dba (@dbas){

        my $species = $dba->species();
        # Column 13 of the species list is expected to contain the collection name;
        # head -n 1 keeps a single value if more than one row matches
        my $cmd = "grep -i \"$species\" all_bacteria_in_ensembl.txt | cut -f 13 | sed -n 's/\\(bacteria_[0-9]*_collection\\).*/\\1/p' | head -n 1";
        my $collection_name = `$cmd`;
        chomp($collection_name);
        if($collection_name){

                my $ftp_pep="ftp://ftp.ensemblgenomes.org/pub/bacteria/release-46/fasta/$collection_name/$species/pep/*.pep.all.fa.gz";
                print "Fetching PEP file for $species: $ftp_pep \n";
                # Quote the URL so the shell hands the * wildcard to wget unexpanded
                `wget "$ftp_pep"`;
        }

        $dba->dbc()->disconnect_if_idle(); # Important to disconnect so that you do not accidentally flood the server with unused connections

}
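
For readers who would rather not shell out to grep, cut and sed, the collection lookup can also be sketched in a few lines of Python. This is only an illustration and not part of the Ensembl API: the function names (collection_for, pep_url), the helper _row, and the two sample rows are hypothetical. The only things taken from the script above are that column 13 carries the collection name and the shape of the release-46 FTP URL. Unlike grep, the sketch requires a whole field to equal the species name, which avoids matching a strain whose name is a prefix of another.

```python
import re

# Same pattern the sed expression in the Perl script extracts.
COLLECTION_RE = re.compile(r"(bacteria_[0-9]+_collection)")
FTP_BASE = "ftp://ftp.ensemblgenomes.org/pub/bacteria/release-46/fasta"


def collection_for(species, table_text, db_column=13):
    """Return the collection name for `species`, or None if no row matches.

    `table_text` is the content of the tab-separated species list.
    `db_column` is 1-based, like `cut -f 13` in the original script.
    """
    wanted = species.lower()
    for line in table_text.splitlines():
        if line.startswith("#"):
            continue  # skip a header/comment line if present
        fields = line.split("\t")
        if len(fields) < db_column:
            continue  # malformed or short row
        # Require an exact (case-insensitive) match on one whole field.
        if wanted not in (field.lower() for field in fields):
            continue
        match = COLLECTION_RE.search(fields[db_column - 1])
        if match:
            return match.group(1)
    return None


def pep_url(species, collection):
    """Build the wildcard FTP URL for the peptide FASTA file of one species."""
    return f"{FTP_BASE}/{collection}/{species}/pep/*.pep.all.fa.gz"


def _row(species, core_db):
    """Make a hypothetical 14-column row; only two columns are filled in."""
    fields = ["-"] * 14
    fields[1] = species    # assumed position of the species name (illustrative)
    fields[12] = core_db   # column 13, as used by `cut -f 13`
    return "\t".join(fields)


# Hypothetical sample data; real rows come from species_EnsemblBacteria.txt.
SAMPLE = "\n".join([
    _row("escherichia_coli_example_a", "bacteria_0_collection_core_46_99_1"),
    _row("escherichia_coli_example_b", "bacteria_12_collection_core_46_99_1"),
])

if __name__ == "__main__":
    for name in ("escherichia_coli_example_a", "escherichia_coli_example_b"):
        collection = collection_for(name, SAMPLE)
        if collection:
            print(pep_url(name, collection))
```

The printed URLs can then be passed to wget or curl, one per species, in the same way the Perl script does.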

Best wishes

Ben, Ensembl Helpdesk


Thanks for the example.

It would be very nice if Ensembl could provide "all_bacteria" data dumps of FASTA in the future that can be easily grabbed by FTP. At least for all genomes, all proteins etc. Thanks.
