Since you mention taxon ID, I assume that you're working with identifiers from NCBI Entrez.
To search and retrieve from Entrez using Perl, take a look at the Bioperl EUtilities Cookbook. You will have to decide which of the terms that you have would make good search terms and in addition, which database to search. For example, taxon ID is a "better" term than gene symbol (alaS
is a symbol, not an ID), because it is clearly defined and will return only results for that taxon. Gene symbol may be somewhat more ambiguous. However, not all databases allow taxon ID as a search term: for example the Gene database will, but the Protein database will not. In the latter case, search by organism name is allowed and could work well.
Here's an example lifted from the cookbook and adapted to search for AlaS from Butyrivibrio proteoclasticus:
use strict;
use Bio::DB::EUtilities;
# set optional history queue
my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch',
-email => 'mymail@foo.bar',
-db => 'protein',
-term => 'Butyrivibrio proteoclasticus[ORGN] AND alaS[Gene/Protein Name]',
-usehistory => 'y');
my $count = $factory->get_count;
# get history from queue
my $hist = $factory->next_History || die 'No history data returned\n';
print "History returned\n";
# note db carries over from above
$factory->set_parameters(-eutil => 'efetch',
-rettype => 'fasta',
-history => $hist);
my $retry = 0;
my ($retmax, $retstart) = (500,0);
open (my $out, '>', 'seqs.fa') || die "Can't open file:$!\n";
RETRIEVE_SEQS:
while ($retstart < $count) {
$factory->set_parameters(-retmax => $retmax,
-retstart => $retstart);
eval{
$factory->get_Response(-cb => sub {my ($data) = @_; print $out $data} );
};
if ($@) {
die "Server error: $@. Try again later\n" if $retry == 5;
print STDERR "Server error, redo #$retry\n";
$retry++ && redo RETRIEVE_SEQS;
}
print "Retrieved $retstart\n";
$retstart += $retmax;
}
close $out;
If you save that as efetch.pl
and run it, it should print Retrieved 0
(a bit misleading!) and write out a file, seqs.fa
, with 2 AlaS protein sequences in FASTA format.
Now all you need to do is write the loop which supplies organisms and gene symbols.
As mentioned, the Gene database allows search by taxon ID: if you went that route, you'd need to figure out how to get to a protein entry (or how to return a CDS which you can then translate).
thank you neil! it's very useful, but just other questions: -have I to write two different loops to obtain 1)one fasta file for the 1st gene (i.e "alaS.fa") with all organisms sequences (65); 2)repeat this process for all genes? -in the case an organism has not alaS gene (example!), does this script return an error?
I don't think this is great code - it's copied more or less as-is from the website. You could use the get_count method to check if the search generated results and then not run efetch if there were none. Looping should be easy: you just have put all of the above ()except the 'use' lines) in the loop and pass organism name (or taxid) + gene symbol as parameters to $factory. I'd just play with it and see what happens, best way to learn the code.
I don't think this is great code - it's copied more or less as-is from the website. You could use the get_count method to check if the search generated results and then not run efetch if there were none. Looping should be easy: you just have put all of the above (except the 'use' lines) in the loop and pass organism name (or taxid) + gene symbol as parameters to $factory. I'd just play with it and see what happens, best way to learn the code.
You can assume the annotation for the genes for all your genomes of interest are correct, and that a search query would give you what you want. I have found that assumption tends to be erroneous. Annotation pipelines may be conservative and not assign a function to predicted CDS w/o more evidence, or may not assign any function (tag them all hypothetical). Or (even worse) mis-assign the function you are seeking to the wrong gene. This also may not catch things you might want, such as paralogous sequences (ex. argS1, argS2).
Have you looked at ortholog databases? OrthoDB? InPararanoid?
Or (if you like NCBI and Bio::DB::EUtilities), search their Protein Clusters?
Also, just noticed that should be 'alaS', not 'argS' (sorry, happened to work in the arginine pathway in my youth.