Question

Entrez E-Utils Script Returns Different Numbers Of Sequences Each Rn

1

Entering edit mode

12.8 years ago

Uncle Gabby ▴ 20

Hello there,

I am trying to retrieve all sequences for a particular taxon from the NCBI. I found a script in a guide published by the NCBI on using entrez e-utilities: http://www.ncbi.nlm.nih.gov/books/NBK25498/ (application #3). I will paste the script below.

When I run the script I expect to get >50,000 chimp sequences, but I don't. In fact, each time I run it I get a different number of returned sequences. The same occurs when I try and get other taxa. Any ideas on why this script returns a different value each time? Is there another way to retrieve all sequences from a particular group?

I would appreciate any help anyone can offer. Thank you!

use LWP::Simple;
$query = 'chimpanzee[orgn]+AND+biomol+mrna[prop]';


    #assemble the esearch URL
    $base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
    $url = $base . "esearch.fcgi?db=nucleotide&term=$query&usehistory=y";


    #post the esearch URL
    $output = get($url);


    #parse WebEnv, QueryKey and Count (# records retrieved)
    $web = $1 if ($output =~ /<WebEnv>(\S+)<\/WebEnv>/);
    $key = $1 if ($output =~ /<QueryKey>(\d+)<\/QueryKey>/);
    $count = $1 if ($output =~ /<Count>(\d+)<\/Count>/);


    #open output file for writing
    open(OUT, ">chimp.fna") || die "Can't open file!\n";


    #retrieve data in batches of 500
    $retmax = 500;
    for ($retstart = 0; $retstart < $count; $retstart += $retmax) {
            $efetch_url = $base ."efetch.fcgi?db=nucleotide&WebEnv=$web";
            $efetch_url .= "&query_key=$key&retstart=$retstart";
            $efetch_url .= "&retmax=$retmax&rettype=fasta&retmode=text";
            $efetch_out = get($efetch_url);
            print OUT "$efetch_out";
    }
    close OUT;

entrez script retrieval ncbi • 4.0k views

ADD COMMENT • link updated 11.4 years ago by Giovanni M Dall'Olio 28k • written 12.8 years ago by Uncle Gabby ▴ 20

score 4 · Answer 1 · 2012-10-04

The examples on that site are a bit overcomplicated, they involve history etc, not sure why. You might want to check out How can I Bulk retrieve refseq fasta files based on organism name using EUtilities?

I also rewrote your code into Python (with BioPython) to see just how it all works on large scale. Occasionally you may get an error from the web, set the skip to the last displayed chunk number to continue. At the end just do a cat *.fa > full.fa

from itertools import count
from Bio import Entrez

Entrez.email = "hello@example.com"

# search Entrez
handle = Entrez.esearch(db="nucleotide", term="chimpanzee[orgn] AND biomol mrna[prop]", retmax=100000)
record = Entrez.read(handle)
ids = list(record["IdList"])

print '*** found %s records' % len(ids)

# this is the step size
step  = 500

# change this to skip up to this step
skip  = 0

chunks=[ ids[x:x+step] for x in xrange(0, len(ids), step)]

# this to skip on error
chunks = chunks[skip:]

# fetch in chunks and place into file
for index, chunk in zip(count(skip), chunks):
    out = file("data-%02d.fa" % index, 'wt')
    print "*** chunk %s" % index
    stream = Entrez.efetch(db="nucleotide", id=chunk, rettype="fasta", retmode="text")
    out.write(stream.read())

score 2 · Answer 2 · 2014-02-28

My guess is that your Internet connection falls during the download, so you get incomplete files, and the NCBI utilities do not give any warning about it.

This is probably because you are downloading a lot of sequences, without waiting between queries, so the NCBI server gets saturated and cancels some of the downloads. One thing that you can do is to add a sleep() command between queries, to wait at least five or ten seconds before each download. If you read the NCBI E-Utilities terms of services, they actually ask you to put a sleep command between queries.

p.s. you can also have a look at the Ncbi Releases Entrez Direct, The Entrez Utilities On The Unix Command Line released earlier this year, they should be a bit easier to use (but still require to wait some seconds between queries).