Question: How to download FASTA protein sequences for Escherichia coli strains from Ensembl?
agata88 (Poland) wrote 8 months ago:

Hi all!

I would like to download FASTA protein sequences for all Escherichia coli strains. On the Ensembl Bacteria page I see that I would have to download 2,681 files (https://bacteria.ensembl.org/info/website/ftp/index.html).

I would like to do this programmatically using the Ensembl REST API. Unfortunately, I cannot find suitable API endpoints. Can anyone suggest the best solution?

Here are the API Endpoints: http://rest.ensembl.org

Many thanks for any suggestions,

Best,

Agata


How about using NCBI and the ncbi-genome-download tool by Kai Blin? The data are the same everywhere.

As simple as

ncbi-genome-download --genus "Escherichia coli" bacteria
written 8 months ago by genomax
Ben_Ensembl (EMBL-EBI) wrote 8 months ago:

Hi Agata,

Genomax's solution seems the most straightforward, but I thought I'd add information about how to do this with Ensembl. This isn't possible with the Ensembl REST API. You will have to use a combination of the Perl API and curl/wget:

(1) Extracting the E. coli species

The LookUp module, given the parent taxon id for E. coli, will help here. It lives in the ensemblgenomes-api repository, so that git repo needs to be in your PERL5LIB too: https://github.com/EnsemblGenomes/ensemblgenomes-api Documentation here: http://ensemblgenomes.org/info/access/eg_api

(2) Getting the peptide FASTA files from the FTP site:

We organise our bacteria into groups called “collections”. This is just to help us manage the data volumes, and our FTP server is organised to reflect these groupings. With each release, we provide the file below, which lists which species are grouped into which collection: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-46/species_EnsemblBacteria.txt

You can use this file to work out the right URL for the files you want from the FTP server, and then fetch them with curl or wget. The code snippet below does the lookup and download:

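use strict;
use warnings;
use Bio::EnsEMBL::LookUp;

# Local species-to-collection table; assumed to be a downloaded copy of
# species_EnsemblBacteria.txt for the same release as the FTP URL below
my $species_file = 'all_bacteria_in_ensembl.txt';

# Build a helper to query the Ensembl public MySQL instance
my $lookup = Bio::EnsEMBL::LookUp->new();

# 562 is the NCBI taxon id for Escherichia coli
my @dbas = @{$lookup->get_all_by_taxon_branch(562)};

foreach my $dba (@dbas){

        my $species = $dba->species();
        # Find the row for this species (-w/-F: whole-word, literal match, so a name that is
        # a prefix of another does not match both) and extract bacteria_N_collection from column 13
        my $cmd = "grep -iwF \"$species\" $species_file | head -n 1 | cut -f 13 | sed -n 's/\\(bacteria_[0-9]*_collection\\).*/\\1/p'";
        my $collection_name = `$cmd`;
        chomp($collection_name);
        if($collection_name){

                my $ftp_pep="ftp://ftp.ensemblgenomes.org/pub/bacteria/release-46/fasta/$collection_name/$species/pep/*.pep.all.fa.gz";
                print "Fetching PEP file for $species: $ftp_pep \n";
                # List form bypasses the shell, so the * reaches wget (which handles FTP globbing) untouched
                system('wget', $ftp_pep) == 0 or warn "wget failed for $species\n";
        }

        $dba->dbc()->disconnect_if_idle(); # Important to disconnect so that you do not accidentally flood the server with unused connections

}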
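
For anyone who would rather not shell out to grep/cut/sed, the collection lookup and URL construction from step (2) could be sketched roughly as follows. This is Python rather than Perl, and only a sketch under assumptions: the function names (find_collection, pep_url) and the sample rows are made up for illustration, and the real species file has more columns than shown here. The code avoids fixed column positions by matching the species name against whole fields and searching the row for a bacteria_N_collection token.

```python
import re

# Same release and layout as the FTP URL used in the Perl script above
FTP_BASE = "ftp://ftp.ensemblgenomes.org/pub/bacteria/release-46/fasta"
COLLECTION_RE = re.compile(r"bacteria_[0-9]+_collection")


def find_collection(lines, species):
    """Return the bacteria_N_collection name for `species`, or None.

    `lines` is any iterable of tab-separated rows (for example an open
    file handle on a local copy of the species table).
    """
    wanted = species.lower()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Exact, case-insensitive field match: a species whose name is a
        # prefix of another species' name does not match the wrong row.
        if wanted not in (field.lower() for field in fields):
            continue
        for field in fields:
            match = COLLECTION_RE.search(field)
            if match:
                return match.group(0)
    return None


def pep_url(collection, species):
    """Build the FTP glob for one species' peptide FASTA file."""
    return f"{FTP_BASE}/{collection}/{species}/pep/*.pep.all.fa.gz"


# Hypothetical rows, invented for illustration only
SAMPLE = [
    "#name\tspecies\tdivision\ttaxonomy_id\tcore_db\n",
    "Escherichia coli example 2\tescherichia_coli_example_2\tEnsemblBacteria\t562\tbacteria_7_collection_core_46_99_1\n",
    "Escherichia coli example\tescherichia_coli_example\tEnsemblBacteria\t562\tbacteria_12_collection_core_46_99_1\n",
]

if __name__ == "__main__":
    name = "escherichia_coli_example"
    collection = find_collection(SAMPLE, name)
    if collection:
        print(pep_url(collection, name))
```

The resulting URL can be handed to wget as in the script above. Reading the real file would just mean passing an open file handle instead of SAMPLE.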

Best wishes

Ben Ensembl Helpdesk


Thanks for the example.

It would be very nice if Ensembl could provide "all_bacteria" FASTA data dumps in the future that can be easily grabbed by FTP, at least for all genomes, all proteins, etc. Thanks.

written 4 weeks ago by colindaven