Question

How Can I Get Protein Sequence In Fasta Format Using Taxon Id?

14.0 years ago

Luke ▴ 240

I am working with bacterial genomes and would like to perform a phylogenetic analysis using protein sequences. I have 65 organisms and 30 genes. How can I obtain 30 FASTA files, each containing the protein sequences from my 65 organisms, using a taxon ID (or another identifier) for each organism and a gene ID (e.g. "alaS" for alanyl-tRNA synthetase) as search queries? Is there a Perl script to extract those sequences? Regards, Luke

perl sequence search retrieval • 7.8k views

ADD COMMENT • link updated 14.0 years ago by Chris Fields ★ 2.2k • written 14.0 years ago by Luke ▴ 240

Ram · Answer 1 · 2010-10-22

Since you mention taxon ID, I assume that you're working with identifiers from NCBI Entrez.

To search and retrieve from Entrez using Perl, take a look at the BioPerl EUtilities Cookbook. You will have to decide which of your terms would make good search terms and which database to search. For example, a taxon ID is a "better" term than a gene symbol (alaS is a symbol, not an ID), because it is clearly defined and will return only results for that taxon. A gene symbol may be somewhat more ambiguous. The Gene database allows search by taxon ID; in the sequence databases such as Protein, a taxon ID is written as txid<ID>[Organism] rather than as a bare number, and search by organism name is also allowed and could work well.

Here's an example lifted from the cookbook and adapted to search for AlaS from Butyrivibrio proteoclasticus:

use strict;
use warnings;
use Bio::DB::EUtilities;

# set optional history queue
my $factory = Bio::DB::EUtilities->new(-eutil      => 'esearch',
                                       -email      => 'mymail@foo.bar',
                                       -db         => 'protein',
                                       -term       => 'Butyrivibrio proteoclasticus[ORGN] AND alaS[Gene/Protein Name]',
                                       -usehistory => 'y');

my $count = $factory->get_count;
# get history from queue
my $hist  = $factory->next_History || die "No history data returned\n";
print "History returned\n";
# note db carries over from above
$factory->set_parameters(-eutil   => 'efetch',
                         -rettype => 'fasta',
                         -history => $hist);

my $retry = 0;
my ($retmax, $retstart) = (500,0);

open (my $out, '>', 'seqs.fa') || die "Can't open file:$!\n";
RETRIEVE_SEQS:
while ($retstart < $count) {
    $factory->set_parameters(-retmax   => $retmax,
                             -retstart => $retstart);
    eval{
        $factory->get_Response(-cb => sub {my ($data) = @_; print $out $data} );
    };
    if ($@) {
        die "Server error: $@.  Try again later\n" if $retry == 5;
        print STDERR "Server error, redo #$retry\n";
        # increment first: "$retry++ && redo" skips the redo when $retry is 0
        $retry++;
        redo RETRIEVE_SEQS;
    }
    print "Retrieved $retstart\n";
    $retstart += $retmax;
}
close $out;

If you save that as efetch.pl and run it, it should print History returned, then Retrieved 0 (a bit misleading: it is the start offset of the batch, not the number of sequences), and write out a file, seqs.fa, with 2 AlaS protein sequences in FASTA format.

Now all you need to do is write the loop which supplies organisms and gene symbols.
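That loop could be sketched roughly as follows. This is only a sketch under assumptions: the organism and gene lists are placeholders, the helper names build_term and output_file are hypothetical, and writing one file per gene is just one way to organise the output. It only builds the search terms (in the same form as the term used in the script above) and the output file names; the esearch/efetch calls would go inside the inner loop.

```perl
use strict;
use warnings;

# Placeholder lists: replace with your own 65 organisms and 30 gene symbols.
my @organisms = ('Butyrivibrio proteoclasticus', 'Escherichia coli');
my @genes     = ('alaS', 'recA');

# Hypothetical helper: build an Entrez Protein search term in the same
# form as the one used in the script above. sprintf avoids Perl trying
# to interpolate "$organism[...]" as an array element.
sub build_term {
    my ($organism, $gene) = @_;
    return sprintf '%s[ORGN] AND %s[Gene/Protein Name]', $organism, $gene;
}

# Hypothetical helper: one FASTA file per gene, e.g. "alaS.fa".
sub output_file {
    my ($gene) = @_;
    return "$gene.fa";
}

for my $gene (@genes) {
    my $file = output_file($gene);
    for my $organism (@organisms) {
        my $term = build_term($organism, $gene);
        # Here you would run esearch/efetch with -term => $term and
        # append the returned FASTA text to $file.
        print "$file\t$term\n";
    }
}
```

With 65 organisms and 30 genes this makes 1950 searches, so it is worth pausing between requests and checking for searches that return nothing.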

As mentioned, the Gene database allows search by taxon ID: if you went that route, you'd need to figure out how to get to a protein entry (or how to return a CDS which you can then translate).

Ram · Answer 2 · 2010-10-22

Perhaps BioMart is a good solution for you. Have a look at martview, choose the 'Bacterial Mart' database and a bacterial dataset. In 'Attributes' select 'Sequences' and under 'Header Information' select only 'Associated Gene Name'. For the gene name you have to select a filter. Click on 'Filters', then 'GENE:', and in the select-box of 'ID list limit' pick the item 'Associated Gene Name(s)'. Copy & paste 'alaS' into the box. Now hit 'Results' to query the database.

For the Perl: when displaying the result, click on the 'Perl' button and a pop-up will show that gives you the Perl code for the query that you have just run. It will look something like this:

$query->setDataset("bac_20340_gene");
$query->addFilter("external_gene_id", ["alaS"]);
$query->addAttribute("peptide");
$query->addAttribute("external_gene_id");
$query->formatter("FASTA");

If the BioMart data is sufficient for you, you could then go ahead and write a loop around that code that automatically queries the protein sequences for each gene you are interested in.

I admit that finding out the correct dataset to choose might be difficult using this solution.

score 4 · Answer 3 · 2010-10-22

I know everyone speaks in code here--and you asked if you could do it with Perl, but I'd get that out of UniProt with the end-user interface. That way I can assess if I really am seeing the symbol I intend, and get a sense of the overview of the results.

For example, I just did this query: alas AND gene:alas AND taxonomy:"Vibrio [662]" and am looking at a list of results, including several strains of cholera (I was thinking about this because of the Haiti news today, so I just used it as an example). Do you want multiple strains? Or are you going to pick one representative? I think when people go right to downloads they don't always get a full picture of what's available and there may be additional things you'd want to decide on and refine your query. Same thing when I put in Shigella--multiple species + strains.

A simple query for "alas" could give you the whole species set that you could look through and click checkboxes. But you can also refine with a second taxon query box.

From this you could then check the proteins you want, choose "retrieve" at the bottom, and download as fasta.

Or you can script stuff as the folks above indicate.

score 3 · Answer 4 · 2010-10-22

14.0 years ago

Jan Kosinski ★ 1.6k

Do you have an "NCBI gene ID" instead of a "gene symbol"? A gene ID is numeric and unequivocally identifies your gene in the NCBI database; a "gene symbol" does not necessarily do so.

For this kind of NCBI query you can try the Entrez Programming Utilities: http://eutils.ncbi.nlm.nih.gov/

But I would be careful about trying to get a protein entry from only a gene symbol such as "alaS" and an organism ID (what if there is more than one gene with that symbol?).

ADD COMMENT • link 14.0 years ago by Jan Kosinski ★ 1.6k

Yes, a gene ID is better than a symbol. If more than one gene has the symbol, then you'll get more than one sequence returned. Of course, there may be more than one copy of the gene in the organism. The example that I posted returns 2 protein sequences, both AlaS.
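One way to spot such cases after downloading is to count the records per organism in the FASTA output. A minimal sketch, assuming NCBI-style header lines that end with the organism name in square brackets; the function name count_by_organism and the sample records are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count FASTA records per organism, assuming each
# header line ends with the organism name in square brackets.
sub count_by_organism {
    my ($fasta) = @_;
    my %count;
    for my $line (split /\n/, $fasta) {
        next unless $line =~ /^>/;                  # header lines only
        my ($org) = $line =~ /\[([^\[\]]+)\]\s*$/;  # trailing [Organism]
        $org = 'unknown' unless defined $org;
        $count{$org}++;
    }
    return \%count;
}

# Made-up records, for illustration only.
my $fasta = join "\n",
    '>id1 alanyl-tRNA synthetase [Organism one]',
    'MSKST',
    '>id2 alanyl-tRNA synthetase [Organism one]',
    'MTKSA',
    '>id3 alanyl-tRNA synthetase [Organism two]',
    'MSKTT';

my $count = count_by_organism($fasta);
for my $org (sort keys %$count) {
    my $note = $count->{$org} > 1 ? '  <-- more than one sequence' : '';
    print "$org\t$count->{$org}$note\n";
}
```

Organisms flagged this way need a manual decision (true paralogues, several strains, or a mis-annotated symbol) before the sequences go into an alignment.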

ADD REPLY • link 14.0 years ago by Neilfws 49k

Ram · Answer 5 · 2010-10-22

Here's an example using Bio::DB::EUtilities code that queries NCBI Protein Clusters, finds the linked protein sequences, then returns the output to STDOUT in FASTA format (straight dump from the returned text stream, not via Bio::SeqIO).

#!/usr/bin/perl -w

use strict;
use warnings;
use Bio::DB::EUtilities;

my $term = "Bacteria[ORGN] AND alaS[gene]";

my $eutil = Bio::DB::EUtilities->new(-eutil      => 'esearch',
                                     -db         => 'proteinclusters',
                                     -email      => 'myfoo@bar.com',
                                     -term       => $term,
                                     -usehistory => 'y');

my $hist = $eutil->next_History || die "No history returned";

$eutil->reset_parameters(-eutil           => 'elink',
                         -db              => 'protein',
                         -email           => 'myfoo@bar.com',
                         -correspondence  => 1,
                         -dbfrom          => 'proteinclusters',
                         -history         => $hist);

my @protein_ids = $eutil->get_ids;

$eutil->reset_parameters(-eutil      => 'efetch',
                         -db         => 'protein',
                         -rettype    => 'fasta',
                         -email      => 'myfoo@bar.com',
                         -id         => \@protein_ids);

print $eutil->get_Response->content;