Question: How Can I Programmatically Retrieve The Genbank Records With Accession Numbers In The Form Jn######?
2
8.1 years ago by
Jason Ebaugh60 wrote:

I am trying to use NCBI's E-utilities to retrieve sequences from GenBank. All the accession numbers look like this: JN###### (example: JN556047)

I am using Ebot to generate scripts to pull the records. The E-utilities server does not accept UIDs in the form JN######. I tried dropping the "JN"; that "worked" from the server's point of view, but it did not return the correct records.

When I go to the web interface for the "Nucleotide" database, it accepts JN###### UIDs without any problem. However, the web interface will only retrieve 100 records at a time, and I have 1700 to get.

How can I retrieve the GenBank records with accession numbers in the form JN######?

8
7.8 years ago by
San Francisco
raunakms1.1k wrote:

Here is a Perl script that uses the BioPerl module "Bio::DB::GenBank". All the accession numbers must be listed in the file accnumber.txt, separated by commas or placed on separate lines. The file accnumber.txt must be in the directory from which the script is run (normally the same directory as the Perl script). After successful execution, the script generates a file sequence_download.fa containing the sequences in FASTA format. If you want to retrieve other data from GenBank, you can tweak the code by looking up the same module, "Bio::DB::GenBank".

#!/usr/bin/perl

use strict;
use warnings;

use Bio::DB::GenBank;

my $input_file  = 'accnumber.txt';
my $output_file = 'sequence_download.fa';

open(my $in,  '<', $input_file)  or die "Cannot open $input_file: $!";
open(my $out, '>', $output_file) or die "Cannot open $output_file: $!";

# One database handle is enough for all requests.
my $db_obj = Bio::DB::GenBank->new;

while (my $line = <$in>)
{
    chomp $line;

    # Accessions may be comma-separated or one per line.
    foreach my $acc (split /,/, $line)
    {
        $acc =~ s/\s//g;

        # Skip empty fields instead of aborting the whole run.
        next if $acc eq '';

        # get_Seq_by_acc may throw or return undef for an unknown accession.
        my $seq_obj = eval { $db_obj->get_Seq_by_acc($acc) };

        unless ($seq_obj)
        {
            warn "Could not retrieve: $acc\n";
            next;
        }

        print $out ">$acc\n";
        print $out $seq_obj->seq, "\n";

        print "Sequence Downloaded:\t$acc\n";
    }
}

close $out;
close $in;
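
For anyone who wants to check the input before sending requests, here is a minimal Ruby sketch (not part of the Perl script above) of the input handling described in this answer: accessions separated by commas or newlines. The function names and the pattern are my own assumptions for illustration; the pattern only encodes the two-letters-plus-six-digits shape from the question (JN######), so adjust it if your accessions look different.

```ruby
# Split raw text on commas and whitespace, dropping blanks and duplicates.
# (Hypothetical helper name, not from any library.)
def parse_accessions(text)
  text.split(/[,\s]+/).reject(&:empty?).uniq
end

# Separate accessions into [well_formed, malformed].
# The default pattern is an assumption based on the JN###### shape in the question.
def partition_accessions(accessions, pattern = /\A[A-Z]{2}\d{6}\z/)
  accessions.partition { |acc| acc.match?(pattern) }
end

# Example use (reads a local file, no network access):
# good, bad = partition_accessions(parse_accessions(File.read('accnumber.txt')))
# warn "Skipping malformed IDs: #{bad.join(', ')}" unless bad.empty?
```

Catching an ID that lost its "JN" prefix at this stage is cheaper than discovering later that the server returned a different record.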
3
8.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum124k wrote:

You can get a list of all the accession numbers (ACNs) from ftp://ftp.ncbi.nih.gov/genbank/livelists/

You can then retrieve the ACN/version/GI entries for the JN accessions using the following command:

$ curl -s "ftp://ftp.ncbi.nih.gov/genbank/livelists/GbAccList.1023.2011.gz" |\
   gunzip -c | egrep '^JN'

Then retrieve each sequence using NCBI EFetch: http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetch_help.html
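
As a rough sketch of the EFetch step, the Ruby below builds one request URL per batch of accessions and appends the returned GenBank flat files to a single output file. It uses only the standard library. The function names, the batch size of 100, the half-second pause and the output file name are assumptions of mine, not anything prescribed by NCBI or by this answer; the endpoint and the db/id/rettype/retmode parameters are the standard EFetch ones.

```ruby
require 'net/http'
require 'uri'

# Build one EFetch URI per batch of accessions.
# batch_size = 100 is an assumed, conservative number of IDs per request.
def efetch_uris(accessions, batch_size = 100)
  accessions.each_slice(batch_size).map do |batch|
    uri = URI('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi')
    uri.query = URI.encode_www_form(
      'db'      => 'nuccore',        # nucleotide database
      'id'      => batch.join(','),  # accessions such as JN556047
      'rettype' => 'gb',             # GenBank flat file
      'retmode' => 'text'
    )
    uri
  end
end

# Download every batch and write the records to out_path.
# Failed batches are reported on stderr and skipped.
def download_records(accessions, out_path)
  File.open(out_path, 'w') do |out|
    efetch_uris(accessions).each do |uri|
      response = Net::HTTP.get_response(uri)
      if response.is_a?(Net::HTTPSuccess)
        out.write(response.body)
      else
        warn "Request failed (#{response.code}): #{uri}"
      end
      sleep 0.5 # assumed pause, to be polite to the server
    end
  end
end

# Example (makes network requests; file names are placeholders):
# accessions = File.read('accnumber.txt').split(/[,\s]+/).reject(&:empty?)
# download_records(accessions, 'records.gb')
```

With 1700 accessions and batches of 100 this comes to 17 requests, which avoids the 100-record limit of the web interface mentioned in the question.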


Individual LiveLists records, and small sets of them, can be retrieved using EMBL-EBI's dbfetch and WSDbfetch services. For example: http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=livelists&id=JN556047&style=raw

(Reply written 7.8 years ago by Hamish)
3
8.1 years ago by
Sydney, Australia
Neilfws48k wrote:

The EUtils methods in BioRuby have no problem fetching JN* accessions:

#!/usr/bin/ruby
require 'rubygems'  # ruby 1.8
require 'bio'

Bio::NCBI.default_email = "me@me.com"
gb = Bio::NCBI::REST::EFetch.nucleotide("JN556047")

puts gb
# showing first few lines only
# LOCUS       JN556047                 164 bp    DNA     linear   INV 19-OCT-2011
# DEFINITION  Apis cerana isolate A0101 cGMP-dependent protein kinase foraging
#             (For) gene, exon 3 and partial cds.
# ACCESSION   JN556047
# VERSION     JN556047.1  GI:351634776
# KEYWORDS    .
# SOURCE      Apis cerana (Asiatic honeybee)
#   ORGANISM  Apis cerana
#             Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota;
#             Neoptera; Endopterygota; Hymenoptera; Apocrita; Aculeata; Apoidea;
#             Apidae; Apis.
# ....
0
7.8 years ago by
Queen Mary University London
Yannick Wurm2.3k wrote:

Here's a variant on what Neil replied, inspired by this question.

#!/usr/bin/env ruby
require 'rubygems'
require 'bio'

Bio::NCBI.default_email = "xxxxx@qmul.ac.uk"
ncbi        = Bio::NCBI::REST.new
# A string range enumerates every accession from JP773711 to JP820231.
sequenceIDs = ("JP773711".."JP820231").to_a
sequences   = ncbi.efetch(sequenceIDs,
                          {"db"=>"nuccore",
                           "rettype"=>"fasta",
                           "retmax"=> 10000000})

# NCBI returns a single big string with records separated by two newlines; we just want one.
sequences.gsub!("\n\n", "\n")

File.open('Nylanderia_pubens_ests.fasta', 'w') {|f| f.write(sequences +"\n") }
puts "done."
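
Since the original question mentions getting the wrong records back, it may be worth checking the downloaded FASTA against the list of requested IDs. The sketch below is one hedged way to do that with plain Ruby. The function names are hypothetical, and it assumes the accession appears in the header line either directly after ">" or as a "|"-separated field, optionally with a ".version" suffix; if your headers look different, adjust the splitting.

```ruby
require 'set'

# Collect every whitespace- or pipe-separated token found in FASTA header
# lines, with any trailing ".version" suffix removed (JP773711.1 -> JP773711).
def accessions_in_fasta(fasta_text)
  found = Set.new
  fasta_text.each_line do |line|
    next unless line.start_with?('>')
    line.chomp.split(/[>|\s]+/).each do |token|
      found << token.sub(/\.\d+\z/, '') unless token.empty?
    end
  end
  found
end

# Return the requested accessions that never show up in a header.
def missing_accessions(fasta_text, requested)
  found = accessions_in_fasta(fasta_text)
  requested.reject { |acc| found.include?(acc) }
end

# Example use with the variables from the script above:
# missing = missing_accessions(sequences, sequenceIDs)
# warn "Not returned: #{missing.join(', ')}" unless missing.empty?
```

An empty result means every requested accession was found in some header; anything else is a list of IDs to retry or investigate.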
(Modified 11 weeks ago by RamRS)
0
7.8 years ago by
Priyabrata70 wrote:

You can use the E-utilities at NCBI (http://www.ncbi.nlm.nih.gov/books/NBK25500/). They can be called from any programming language to search for and retrieve records.
