Question: How Can I Get Protein Sequence In Fasta Format Using Taxon Id?
6
gravatar for Luke
10.3 years ago by
Luke240
Turin, Italy
Luke240 wrote:

I am working with bacterial genomes. I would perform a phylogenetic analysis using protein sequences. I have 65 organisms and 30 genes. How can I obtain 30 fasta format files, each with my 65 organisms protein sequences, using taxon id (or another identifier) for each organism and gene id (i.e. "alaS" for alanyl-trna synthetase) as search queries? Is there any perl script to extract those sequences? How can I do that? Regards, Luke

perl sequence search retrieval • 6.0k views
ADD COMMENTlink modified 10.3 years ago by Chris Fields2.1k • written 10.3 years ago by Luke240
8
gravatar for Neilfws
10.3 years ago by
Neilfws49k
Sydney, Australia
Neilfws49k wrote:

Since you mention taxon ID, I assume that you're working with identifiers from NCBI Entrez.

To search and retrieve from Entrez using Perl, take a look at the Bioperl EUtilities Cookbook. You will have to decide which of the terms that you have would make good search terms and in addition, which database to search. For example, taxon ID is a "better" term than gene symbol (alaS is a symbol, not an ID), because it is clearly defined and will return only results for that taxon. Gene symbol may be somewhat more ambiguous. However, not all databases allow taxon ID as a search term: for example the Gene database will, but the Protein database will not. In the latter case, search by organism name is allowed and could work well.

Here's an example lifted from the cookbook and adapted to search for AlaS from Butyrivibrio proteoclasticus:

use strict;
use Bio::DB::EUtilities;

# set optional history queue
my $factory = Bio::DB::EUtilities->new(-eutil      => 'esearch',
                                       -email      => 'mymail@foo.bar',
                                       -db         => 'protein',
                                       -term       => 'Butyrivibrio proteoclasticus[ORGN] AND alaS[Gene/Protein Name]',
                                       -usehistory => 'y');

my $count = $factory->get_count;
# get history from queue
my $hist  = $factory->next_History || die 'No history data returned\n';
print "History returned\n";
# note db carries over from above
$factory->set_parameters(-eutil   => 'efetch',
                         -rettype => 'fasta',
                         -history => $hist);

my $retry = 0;
my ($retmax, $retstart) = (500,0);

open (my $out, '>', 'seqs.fa') || die "Can't open file:$!\n";
RETRIEVE_SEQS:
while ($retstart < $count) {
    $factory->set_parameters(-retmax   => $retmax,
                             -retstart => $retstart);
    eval{
        $factory->get_Response(-cb => sub {my ($data) = @_; print $out $data} );
    };
    if ($@) {
        die "Server error: $@.  Try again later\n" if $retry == 5;
        print STDERR "Server error, redo #$retry\n";
        $retry++ && redo RETRIEVE_SEQS;
    }
    print "Retrieved $retstart\n";
    $retstart += $retmax;
}
close $out;

If you save that as efetch.pl and run it, it should print Retrieved 0 (a bit misleading!) and write out a file, seqs.fa, with 2 AlaS protein sequences in FASTA format.

Now all you need to do is write the loop which supplies organisms and gene symbols.

As mentioned, the Gene database allows search by taxon ID: if you went that route, you'd need to figure out how to get to a protein entry (or how to return a CDS which you can then translate).

ADD COMMENTlink modified 17 months ago by _r_am32k • written 10.3 years ago by Neilfws49k

thank you neil! it's very useful, but just other questions: -have I to write two different loops to obtain 1)one fasta file for the 1st gene (i.e "alaS.fa") with all organisms sequences (65); 2)repeat this process for all genes? -in the case an organism has not alaS gene (example!), does this script return an error?

ADD REPLYlink written 10.3 years ago by Luke240

I don't think this is great code - it's copied more or less as-is from the website. You could use the get_count method to check if the search generated results and then not run efetch if there were none. Looping should be easy: you just have put all of the above ()except the 'use' lines) in the loop and pass organism name (or taxid) + gene symbol as parameters to $factory. I'd just play with it and see what happens, best way to learn the code.

ADD REPLYlink written 10.3 years ago by Neilfws49k

I don't think this is great code - it's copied more or less as-is from the website. You could use the get_count method to check if the search generated results and then not run efetch if there were none. Looping should be easy: you just have put all of the above (except the 'use' lines) in the loop and pass organism name (or taxid) + gene symbol as parameters to $factory. I'd just play with it and see what happens, best way to learn the code.

ADD REPLYlink written 10.3 years ago by Neilfws49k

You can assume the annotation for the genes for all your genomes of interest are correct, and that a search query would give you what you want. I have found that assumption tends to be erroneous. Annotation pipelines may be conservative and not assign a function to predicted CDS w/o more evidence, or may not assign any function (tag them all hypothetical). Or (even worse) mis-assign the function you are seeking to the wrong gene. This also may not catch things you might want, such as paralogous sequences (ex. argS1, argS2).

Have you looked at ortholog databases? OrthoDB? InPararanoid?

ADD REPLYlink written 10.3 years ago by Chris Fields2.1k

Or (if you like NCBI and Bio::DB::EUtilities), search their Protein Clusters?

Also, just noticed that should be 'alaS', not 'argS' (sorry, happened to work in the arginine pathway in my youth.

ADD REPLYlink written 10.3 years ago by Chris Fields2.1k
4
gravatar for Joachim
10.3 years ago by
Joachim2.9k
San Francisco, California
Joachim2.9k wrote:

Perhaps BioMart is a good solution for you. Have a look at martview, choose the 'Bacterial Mart' database and some bacteria dataset. In 'Attributes' select 'Sequences' and under under 'Header Information' select only 'Associated Gene Name'. For the gene name you have to select a filter. Click on 'Filters', then 'GENE:', and pick in the select-box of 'ID list limit' the item 'Associated Gene Name(s)'. Copy & paste 'alaS' into the box. Now hit 'Results' to query the database.

For the Perl: when displaying the result, click on the 'Perl' button and a pop-up will show that gives you the Perl code for the query that you have just run. It will look something like this:

$query->setDataset("bac_20340_gene");
$query->addFilter("external_gene_id", ["alaS"]);
$query->addAttribute("peptide");
$query->addAttribute("external_gene_id");
$query->formatter("FASTA");

If the BioMart data is sufficient for you, you could then go ahead and write a loop around that code that queries the genes for protein sequences that you are interested in automatically.

I admit that finding out the correct dataset to choose might be difficult using this solution.

ADD COMMENTlink modified 17 months ago by _r_am32k • written 10.3 years ago by Joachim2.9k

Thanks Joachim! BioMart is very interesting. Unfortunately I can't find all organisms I need. Do you know a similar tool in NCBI?

ADD REPLYlink written 10.3 years ago by Luke240
4
gravatar for Mary
10.3 years ago by
Mary11k
Boston MA area
Mary11k wrote:

I know everyone speaks in code here--and you asked if you could do it with Perl, but I'd get that out of UniProt with the end-user interface. That way I can assess if I really am seeing the symbol I intend, and get a sense of the overview of the results.

For example, I just did this query: alas AND gene:alas AND taxonomy:"Vibrio [662]" and am looking at a list of results, including several strains of cholera (I was thinking about this because of the Haiti news today, so I just used it as an example). Do you want multiple strains? Or are you going to pick one representative? I think when people go right to downloads they don't always get a full picture of what's available and there may be additional things you'd want to decide on and refine your query. Same thing when I put in Shigella--multiple species + strains.

A simple query for "alas" could give you the whole species set that you could look through and click checkboxes. But you can also refine with a second taxon query box.

From this you could then check the proteins you want, choose "retrieve" at the bottom, and download as fasta.

Or you can script stuff as the folks above indicate.

ADD COMMENTlink written 10.3 years ago by Mary11k

Thanks Mary! Yes, I want multiple strains. It's a good solution!

ADD REPLYlink written 10.3 years ago by Luke240
3
gravatar for Jan Kosinski
10.3 years ago by
Jan Kosinski1.6k
Jan Kosinski1.6k wrote:

Do you have "NCBI gene ID" instead of "gene symbol"? Gene ID is numeric and unequivocally identifies your gene in NCBI database, "gene symbol" not necessarily...

For querying NCBI for such stuff as yours you can try Entrez Programming Utilities: http://eutils.ncbi.nlm.nih.gov/

But I would be careful with trying to get protein entry from only a gene symbol "alaS" and organism id (what if there are more than one genes with that symbol?).

ADD COMMENTlink written 10.3 years ago by Jan Kosinski1.6k

Yes, a gene ID is better than a symbol. If more than one gene has the symbol, then you'll get more than one sequence returned. Of course, there may be more than one copy of the gene in the organism. The example that I posted returns 2 protein sequences, both AlaS.

ADD REPLYlink written 10.3 years ago by Neilfws49k
1
gravatar for Chris Fields
10.3 years ago by
Chris Fields2.1k
University of Illinois Urbana-Champaign
Chris Fields2.1k wrote:

Here's an example using Bio::DB::EUtilities code that queries NCBI Protein Clusters, finds the linked protein sequences, then returns the output to STDOUT in FASTA format (straight dump from the returned text stream, not via Bio::SeqIO).

#!/usr/bin/perl -w

use strict;
use warnings;
use Bio::DB::EUtilities;

my $term = "Bacteria[ORGN] AND alaS[gene]";

my $eutil = Bio::DB::EUtilities->new(-eutil      => 'esearch',
                                     -db         => 'proteinclusters',
                                     -email      => 'myfoo@bar.com',
                                     -term       => $term,
                                     -usehistory => 'y');

my $hist = $eutil->next_History || die "No history returned";

$eutil->reset_parameters(-eutil           => 'elink',
                         -db              => 'protein',
                         -email           => 'myfoo@bar.com',
                         -correspondence  => 1,
                         -dbfrom          => 'proteinclusters',
                         -history         => $hist);

my @protein_ids = $eutil->get_ids;

$eutil->reset_parameters(-eutil      => 'efetch',
                         -db         => 'protein',
                         -rettype    => 'fasta',
                         -email      => 'myfoo@bar.com',
                         -id         => \@protein_ids);

print $eutil->get_Response->content;
ADD COMMENTlink modified 17 months ago by _r_am32k • written 10.3 years ago by Chris Fields2.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 971 users visited in the last hour
_