Question

NCBI Protein GI to Genome Accession

4

10.1 years ago

Sej Modha 5.3k

I have a list of protein GIs and would like to get the accession number of the genome (DBSOURCE in the GenBank file) for each one using E-utilities.

What would be the easiest way to get genome accession numbers for a list of protein GIs?

sequence eutils ncbi • 9.5k views

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 10.1 years ago by Sej Modha 5.3k

0

5.2 years ago

josev.die ▴ 70

Using R:

# Dependencies
library(rentrez) 

# NCBI Protein GI to Genome Accession
gi = 817524604
gi_link = entrez_link(dbfrom = 'protein', db = 'nuccore', id = gi)
nucc_id = gi_link$links$protein_nuccore
nucc = entrez_summary(db = 'nuccore', id = nucc_id)
nucc$caption

ADD COMMENT • link 5.2 years ago by josev.die ▴ 70

score 6 · Accepted Answer · 2016-04-29

6

9.2 years ago

Sej Modha 5.3k

An alternative to the Perl E-utilities is to use the Unix command-line E-utilities (Entrez Direct). The following one-liner does the job! Note that the command works with accession numbers as well as GIs as the -id parameter of elink.

elink -db protein -id 817524604 -target nuccore|efetch -format acc
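For anyone who would rather not install extra tools, the same two steps (elink, then efetch) can be sketched in Python against the public E-utilities HTTP endpoints. This is only a minimal sketch: the function names are made up for illustration, it filters on the protein_nuccore link name (the same one the R answer reads), and it assumes the usual eLinkResult XML layout. It sends one efetch request per protein, so for long lists you would want to pause between calls to stay within NCBI's request limits.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def elink_url(protein_ids):
    """Build an elink URL; repeating id= keeps each protein's links separate."""
    params = [("dbfrom", "protein"), ("db", "nuccore")]
    params += [("id", str(pid)) for pid in protein_ids]
    return EUTILS + "/elink.fcgi?" + urllib.parse.urlencode(params)


def efetch_acc_url(nuccore_ids):
    """Build an efetch URL that returns accessions as plain text."""
    params = {"db": "nuccore", "id": ",".join(nuccore_ids), "rettype": "acc"}
    return EUTILS + "/efetch.fcgi?" + urllib.parse.urlencode(params)


def parse_elink(xml_text):
    """Return {submitted protein ID: [linked nuccore IDs]} from elink XML."""
    links = {}
    for linkset in ET.fromstring(xml_text).iter("LinkSet"):
        submitted = linkset.findtext("IdList/Id")
        ids = []
        for linkset_db in linkset.findall("LinkSetDb"):
            # Keep only the direct protein -> nucleotide link (assumed link name)
            if linkset_db.findtext("LinkName") == "protein_nuccore":
                ids.extend(e.text for e in linkset_db.findall("Link/Id"))
        links[submitted] = ids
    return links


def fetch(url):
    """GET a URL and return the response body as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode()


def protein_to_genome_accessions(protein_ids):
    """Map each protein ID to the accession(s) of its linked nucleotide record(s)."""
    result = {}
    linked = parse_elink(fetch(elink_url(protein_ids)))
    for protein_id, nuc_ids in linked.items():
        result[protein_id] = fetch(efetch_acc_url(nuc_ids)).split() if nuc_ids else []
    return result
```

Calling protein_to_genome_accessions([817524604]) should return a dictionary keyed by the protein ID, with a list of nucleotide accessions as the value (it needs network access, so I have not shown output here).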

ADD COMMENT • link 9.2 years ago by Sej Modha 5.3k

Ram · Accepted Answer · 2015-06-22

5

10.1 years ago

GenoMax 152k

For future reference:

There is a useful file at NCBI that maps between various accession numbers and gene names: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz

This file is automatically regenerated every day and should contain the latest information about GeneIDs and accession numbers.
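As a rough sketch of how the file could be used for this question, the Python below scans it once and collects the genomic accession for each protein GI of interest. It assumes the file is tab-delimited, that its first line is a header starting with # and naming the columns protein_gi and genomic_nucleotide_accession.version, and that missing values are written as "-"; check the header of your copy first. The function name is hypothetical.

```python
import gzip


def genome_accessions_for_gis(path, protein_gis):
    """Scan gene2accession once; return {protein GI: set of genomic accessions}."""
    wanted = {str(gi) for gi in protein_gis}
    found = {gi: set() for gi in wanted}
    # Read the gzipped download directly, or an already uncompressed copy
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # Locate columns by name (assumed header names) rather than by position
        header = handle.readline().rstrip("\n").lstrip("#").split("\t")
        gi_col = header.index("protein_gi")
        acc_col = header.index("genomic_nucleotide_accession.version")
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            gi, acc = fields[gi_col], fields[acc_col]
            # "-" is assumed to mark a missing value
            if gi in wanted and acc != "-":
                found[gi].add(acc)
    return found
```

A single pass like this avoids loading the whole file into memory, which matters given its size. GIs with no genomic accession in the file come back with an empty set.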

ADD COMMENT • link 10.1 years ago by GenoMax 152k

0

Just wanted to say thanks for this; it saved me a massive amount of time, genomax2!

I threw together a Makefile and a small Python script to convert the 4.8 GB uncompressed file into a Kyoto Cabinet database, in case anyone wants it.

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.6 years ago by rasche.eric ▴ 70

Ram · Accepted Answer · 2015-06-22

Here is a Perl script to do this. It takes a file with one protein GI per line as its first argument. The script is explained in detail here

#!/usr/bin/perl

use strict;
use warnings;
use Bio::DB::EUtilities;

# Input: a file with one protein GI per line, e.g. 817524604
my $infile = $ARGV[0] or die "Usage: $0 <file_of_protein_GIs>\n";

open(my $in, '<', $infile) or die "can't open $infile: $!\n";

while (my $line = <$in>)
{
  chomp($line);
  next unless length $line;    # skip blank lines
  my @ids = ($line);

  # elink: map the protein GI to its linked nucleotide record(s)
  my $factory = Bio::DB::EUtilities->new(-eutil          => 'elink',
                                         -email          => 'mymail@foo.bar',
                                         -db             => 'nucleotide',
                                         -dbfrom         => 'protein',
                                         -correspondence => 1,
                                         -id             => \@ids);

  # iterate through the LinkSet objects
  while (my $ds = $factory->next_LinkSet)
  {
    my $protid = join(',', $ds->get_submitted_ids);
    print "Protein ID:" . $protid . "\t";
    my $nucid = join(',', $ds->get_ids);
    print "Nuc ID:" . $nucid . "\t";

    # efetch: retrieve the accession(s) for the linked nucleotide ID(s)
    my $fetcher = Bio::DB::EUtilities->new(-eutil   => 'efetch',
                                           -db      => 'nucleotide',
                                           -id      => $nucid,
                                           -email   => 'mymail@foo.bar',
                                           -rettype => 'acc');
    my @accs = split(m{\n}, $fetcher->get_Response->content);
    print "Genome Accession: " . join(',', @accs), "\n";
  }
}

close($in);