Question: Entrez E-Utils Script Returns Different Numbers Of Sequences Each Rn
1
gravatar for Uncle Gabby
6.5 years ago by
Uncle Gabby20
Uncle Gabby20 wrote:

Hello there,

I am trying to retrieve all sequences for a particular taxon from the NCBI. I found a script in a guide published by the NCBI on using entrez e-utilities: http://www.ncbi.nlm.nih.gov/books/NBK25498/ (application #3). I will paste the script below.

When I run the script I expect to get >50,000 chimp sequences, but I don't. In fact, each time I run it I get a different number of returned sequences. The same occurs when I try and get other taxa. Any ideas on why this script returns a different value each time? Is there another way to retrieve all sequences from a particular group?

I would appreciate any help anyone can offer. Thank you!

use LWP::Simple;
$query = 'chimpanzee[orgn]+AND+biomol+mrna[prop]';


    #assemble the esearch URL
    $base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
    $url = $base . "esearch.fcgi?db=nucleotide&term=$query&usehistory=y";


    #post the esearch URL
    $output = get($url);


    #parse WebEnv, QueryKey and Count (# records retrieved)
    $web = $1 if ($output =~ /<WebEnv>(\S+)<\/WebEnv>/);
    $key = $1 if ($output =~ /<QueryKey>(\d+)<\/QueryKey>/);
    $count = $1 if ($output =~ /<Count>(\d+)<\/Count>/);


    #open output file for writing
    open(OUT, ">chimp.fna") || die "Can't open file!\n";


    #retrieve data in batches of 500
    $retmax = 500;
    for ($retstart = 0; $retstart < $count; $retstart += $retmax) {
            $efetch_url = $base ."efetch.fcgi?db=nucleotide&WebEnv=$web";
            $efetch_url .= "&query_key=$key&retstart=$retstart";
            $efetch_url .= "&retmax=$retmax&rettype=fasta&retmode=text";
            $efetch_out = get($efetch_url);
            print OUT "$efetch_out";
    }
    close OUT;
ncbi entrez script retrieval • 2.4k views
ADD COMMENTlink modified 5.1 years ago by Giovanni M Dall'Olio26k • written 6.5 years ago by Uncle Gabby20
4
gravatar for Istvan Albert
6.5 years ago by
Istvan Albert ♦♦ 80k
University Park, USA
Istvan Albert ♦♦ 80k wrote:

The examples on that site are a bit overcomplicated, they involve history etc, not sure why. You might want to check out How can I Bulk retrieve refseq fasta files based on organism name using EUtilities?

I also rewrote your code into Python (with BioPython) to see just how it all works on large scale. Occasionally you may get an error from the web, set the skip to the last displayed chunk number to continue. At the end just do a cat *.fa > full.fa

from itertools import count
from Bio import Entrez

Entrez.email = "hello@example.com"

# search Entrez
handle = Entrez.esearch(db="nucleotide", term="chimpanzee[orgn] AND biomol mrna[prop]", retmax=100000)
record = Entrez.read(handle)
ids = list(record["IdList"])

print '*** found %s records' % len(ids)

# this is the step size
step  = 500

# change this to skip up to this step
skip  = 0

chunks=[ ids[x:x+step] for x in xrange(0, len(ids), step)]

# this to skip on error
chunks = chunks[skip:]

# fetch in chunks and place into file
for index, chunk in zip(count(skip), chunks):
    out = file("data-%02d.fa" % index, 'wt')
    print "*** chunk %s" % index
    stream = Entrez.efetch(db="nucleotide", id=chunk, rettype="fasta", retmode="text")
    out.write(stream.read())
ADD COMMENTlink modified 6.5 years ago • written 6.5 years ago by Istvan Albert ♦♦ 80k
2
gravatar for Giovanni M Dall'Olio
5.1 years ago by
London, UK
Giovanni M Dall'Olio26k wrote:

My guess is that your Internet connection falls during the download, so you get incomplete files, and the NCBI utilities do not give any warning about it.

This is probably because you are downloading a lot of sequences, without waiting between queries, so the NCBI server gets saturated and cancels some of the downloads. One thing that you can do is to add a sleep() command between queries, to wait at least five or ten seconds before each download. If you read the NCBI E-Utilities terms of services, they actually ask you to put a sleep command between queries.

p.s. you can also have a look at the Ncbi Releases Entrez Direct, The Entrez Utilities On The Unix Command Line released earlier this year, they should be a bit easier to use (but still require to wait some seconds between queries).

ADD COMMENTlink modified 5.1 years ago • written 5.1 years ago by Giovanni M Dall'Olio26k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 933 users visited in the last hour