Question

Get all bacterial protein sequences from Genbank but non redundant

0

Entering edit mode

8.1 years ago

mschmid ▴ 180

I would like to download all bacterial proteins from Genbank, but non redundant. Meaning that the entries which are exactly (!) identical get removed.

What is the most efficient way to do this?

I could use taxonomy browser and then get GIs of all bacterial products. But then it is redundant. The other problem is that I can not download the file with GIs since it seems to be too big. If I retrieve NCBI protein clusters I get entries removed which are not exactly the same, right?

Is it possible to get the faa sequences from all bacterial sequences in the nr blast db as a fasta via blastdb_aliastool?

Or do I have to use eutils?

ncbi bacteria protein • 1.9k views

ADD COMMENT • link 8.1 years ago by mschmid ▴ 180

score 1 · Answer 1 · 2017-05-30

I used this script. There will be some redundancy. But you can remove this later. Also, the reply from NCBI's server can have some error messages containted in the output. Filter for that!

#!/usr/bin/perl -w

use strict;
use warnings;
use LWP::Simple;
use warnings FATAL => 'all';

my $utils = "https://www.ncbi.nlm.nih.gov/entrez/eutils";
my $db = ask_user("Database", "nuccore|nucest|protein|pubmed");
my $query = ask_user("Query", "Entrez query");
my $report = ask_user("Report", "fasta|genbank|abstract|acc");
my $esearch = "$utils/esearch.fcgi?" . "db=$db&usehistory=y&term=";
my $esearch_result = get($esearch . $query);
$esearch_result =~ m|.*<Count>(\d+).*<RetMax>(\d+).*<RetStart>(\d+).*<QueryKey>(\d+).*<WebEnv>(\S+)</WebEnv>.*|s;
print STDERR $esearch_result . "\n";
my $Count = $1;
my $retmax=1000;
my $retstart=$3;
my $QueryKey = $4;
my $WebEnv = $5;
my $efetch_result;
my $copy;
my $countSeqs;
my $expected;
print STDERR "Count = $Count; Retmax = $retmax; Retstart = $retstart; QueryKey = $QueryKey; WebEnv = $WebEnv\n";

while ($retstart<$Count)
{
  my $efetch = "$utils/efetch.fcgi?" . "rettype=$report&retmode=text&retstart=$retstart&retmax=$retmax&" . "db=$db&query_key=$QueryKey&WebEnv=$WebEnv";
  print STDERR "RETRIEVING $retmax results starting at result $retstart\n";
  my $myresult = eval {
    $efetch_result = get($efetch);
    $copy=$efetch_result;
    $countSeqs=0;
    if ($report eq 'fasta') { $countSeqs= $copy =~ tr/\>//; }
    elsif ($report eq 'genbank') { $countSeqs = $copy =~ tr/\/\///; }
    elsif ($report eq 'acc') { $countSeqs = $copy =~ tr/\n//; }
    $expected=$retmax;
    if ($retstart>$Count-$retmax) { $expected=$Count-$retstart; }
  };
  if($@)
  {
    print STDERR "PERL EVAL ERROR...TRYING AGAIN:" . "\n" . $@ . "\n";
  }
  else
  {
    if ($countSeqs>=($expected-1000)) { print "$efetch_result"; $retstart+=$retmax; }
    else { print STDERR "ERROR...TRYING AGAIN ($countSeqs / $expected)\n"; }
  }
}

sub ask_user { print STDERR "$_[0] [$_[1]]: ";

my $rc = <>;
chomp $rc;
if($rc eq "")
{ die "Error: Empty field: $_[0]\n"; }
  return $rc;
}