Question: Get all bacterial protein sequences from Genbank but non redundant
gravatar for mschmid
22 months ago by
mschmid90 wrote:

I would like to download all bacterial proteins from Genbank, but non redundant. Meaning that the entries which are exactly (!) identical get removed.

What is the most efficient way to do this?

I could use taxonomy browser and then get GIs of all bacterial products. But then it is redundant. The other problem is that I can not download the file with GIs since it seems to be too big. If I retrieve NCBI protein clusters I get entries removed which are not exactly the same, right?

Is it possible to get the faa sequences from all bacterial sequences in the nr blast db as a fasta via blastdb_aliastool?

Or do I have to use eutils?

protein bacteria ncbi • 739 views
ADD COMMENTlink modified 22 months ago • written 22 months ago by mschmid90
gravatar for mschmid
22 months ago by
mschmid90 wrote:

I used this script. There will be some redundancy. But you can remove this later. Also, the reply from NCBI's server can have some error messages containted in the output. Filter for that!

#!/usr/bin/perl -w

use strict;
use warnings;
use LWP::Simple;
use warnings FATAL => 'all';

my $utils = "";
my $db = ask_user("Database", "nuccore|nucest|protein|pubmed");
my $query = ask_user("Query", "Entrez query");
my $report = ask_user("Report", "fasta|genbank|abstract|acc");
my $esearch = "$utils/esearch.fcgi?" . "db=$db&usehistory=y&term=";
my $esearch_result = get($esearch . $query);
$esearch_result =~ m|.*<Count>(\d+).*<RetMax>(\d+).*<RetStart>(\d+).*<QueryKey>(\d+).*<WebEnv>(\S+)</WebEnv>.*|s;
print STDERR $esearch_result . "\n";
my $Count = $1;
my $retmax=1000;
my $retstart=$3;
my $QueryKey = $4;
my $WebEnv = $5;
my $efetch_result;
my $copy;
my $countSeqs;
my $expected;
print STDERR "Count = $Count; Retmax = $retmax; Retstart = $retstart; QueryKey = $QueryKey; WebEnv = $WebEnv\n";

while ($retstart<$Count)
  my $efetch = "$utils/efetch.fcgi?" . "rettype=$report&retmode=text&retstart=$retstart&retmax=$retmax&" . "db=$db&query_key=$QueryKey&WebEnv=$WebEnv";
  print STDERR "RETRIEVING $retmax results starting at result $retstart\n";
  my $myresult = eval {
    $efetch_result = get($efetch);
    if ($report eq 'fasta') { $countSeqs= $copy =~ tr/\>//; }
    elsif ($report eq 'genbank') { $countSeqs = $copy =~ tr/\/\///; }
    elsif ($report eq 'acc') { $countSeqs = $copy =~ tr/\n//; }
    if ($retstart>$Count-$retmax) { $expected=$Count-$retstart; }
    print STDERR "PERL EVAL ERROR...TRYING AGAIN:" . "\n" . $@ . "\n";
    if ($countSeqs>=($expected-1000)) { print "$efetch_result"; $retstart+=$retmax; }
    else { print STDERR "ERROR...TRYING AGAIN ($countSeqs / $expected)\n"; }

sub ask_user { print STDERR "$_[0] [$_[1]]: ";

my $rc = <>;
chomp $rc;
if($rc eq "")
{ die "Error: Empty field: $_[0]\n"; }
  return $rc;
ADD COMMENTlink written 22 months ago by mschmid90
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1124 users visited in the last hour