Question: Get all bacterial protein sequences from Genbank but non redundant
gravatar for mschmid
2.6 years ago by
mschmid110 wrote:

I would like to download all bacterial proteins from Genbank, but non redundant. Meaning that the entries which are exactly (!) identical get removed.

What is the most efficient way to do this?

I could use taxonomy browser and then get GIs of all bacterial products. But then it is redundant. The other problem is that I can not download the file with GIs since it seems to be too big. If I retrieve NCBI protein clusters I get entries removed which are not exactly the same, right?

Is it possible to get the faa sequences from all bacterial sequences in the nr blast db as a fasta via blastdb_aliastool?

Or do I have to use eutils?

protein bacteria ncbi • 930 views
ADD COMMENTlink modified 2.5 years ago • written 2.6 years ago by mschmid110
gravatar for mschmid
2.5 years ago by
mschmid110 wrote:

I used this script. There will be some redundancy. But you can remove this later. Also, the reply from NCBI's server can have some error messages containted in the output. Filter for that!

#!/usr/bin/perl -w

use strict;
use warnings;
use LWP::Simple;
use warnings FATAL => 'all';

my $utils = "";
my $db = ask_user("Database", "nuccore|nucest|protein|pubmed");
my $query = ask_user("Query", "Entrez query");
my $report = ask_user("Report", "fasta|genbank|abstract|acc");
my $esearch = "$utils/esearch.fcgi?" . "db=$db&usehistory=y&term=";
my $esearch_result = get($esearch . $query);
$esearch_result =~ m|.*<Count>(\d+).*<RetMax>(\d+).*<RetStart>(\d+).*<QueryKey>(\d+).*<WebEnv>(\S+)</WebEnv>.*|s;
print STDERR $esearch_result . "\n";
my $Count = $1;
my $retmax=1000;
my $retstart=$3;
my $QueryKey = $4;
my $WebEnv = $5;
my $efetch_result;
my $copy;
my $countSeqs;
my $expected;
print STDERR "Count = $Count; Retmax = $retmax; Retstart = $retstart; QueryKey = $QueryKey; WebEnv = $WebEnv\n";

while ($retstart<$Count)
  my $efetch = "$utils/efetch.fcgi?" . "rettype=$report&retmode=text&retstart=$retstart&retmax=$retmax&" . "db=$db&query_key=$QueryKey&WebEnv=$WebEnv";
  print STDERR "RETRIEVING $retmax results starting at result $retstart\n";
  my $myresult = eval {
    $efetch_result = get($efetch);
    if ($report eq 'fasta') { $countSeqs= $copy =~ tr/\>//; }
    elsif ($report eq 'genbank') { $countSeqs = $copy =~ tr/\/\///; }
    elsif ($report eq 'acc') { $countSeqs = $copy =~ tr/\n//; }
    if ($retstart>$Count-$retmax) { $expected=$Count-$retstart; }
    print STDERR "PERL EVAL ERROR...TRYING AGAIN:" . "\n" . $@ . "\n";
    if ($countSeqs>=($expected-1000)) { print "$efetch_result"; $retstart+=$retmax; }
    else { print STDERR "ERROR...TRYING AGAIN ($countSeqs / $expected)\n"; }

sub ask_user { print STDERR "$_[0] [$_[1]]: ";

my $rc = <>;
chomp $rc;
if($rc eq "")
{ die "Error: Empty field: $_[0]\n"; }
  return $rc;
ADD COMMENTlink written 2.5 years ago by mschmid110
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1127 users visited in the last hour