Question: Problem when downloading a large number of sequences from GenBank
Nick wrote (7.5 years ago, Spain):

I am trying to download all viral sequences (incl. partial genomes) from GenBank. The query:

http://www.ncbi.nlm.nih.gov/nuccore?term=viridae[Organism]

retrieves 1,337,194 sequences. When I try to download the result set as a FASTA file, I get files of various sizes, from 2 MB to 100 MB, but in all cases containing only a fraction of the 1.3 million sequences. The largest file contains 62K sequences, only about 5% of the total number in the result set. It seems that the download file is arbitrarily truncated. Do you know of any way to download all the sequences in the result set?

Tags: genbank, file, download

Did you try iterating over the result-set IDs and fetching them one by one?

— Jeremy Leipzig, 7.5 years ago

I am not sure NCBI would tolerate 1.3 million individual requests.

— Nick, 7.5 years ago
Leszek wrote (7.5 years ago, IIMCB, Poland):

Use EPost if you are familiar with Python. Have a look here for instructions.

EDIT: Here is code that will do the job. On my computer it fetches 1,000 sequences every 20-30 seconds.

#!/usr/bin/env python
"""Fetch all entries for viridae from NCBI."""

import sys
from Bio import Entrez
from datetime import datetime

Entrez.email = "your_email@gmail.com"
query = "viridae[Organism]"
db = "nuccore"
retmax = 10**9
rettype = 'fasta'  # record format
retmode = 'text'   # plain-text encoding
batchSize = 1000

# get the list of GIs for the given query
sys.stderr.write("Getting list of GIs for term=%s ...\n" % query)
handle = Entrez.esearch(db=db, term=query, retmax=retmax)
giList = Entrez.read(handle)['IdList']

# print info about the number of entries
sys.stderr.write("Downloading %s entries from NCBI %s database in batches of %s entries...\n" % (len(giList), db, batchSize))

# post the query to the NCBI history server
search_handle = Entrez.epost(db, id=",".join(giList))
search_results = Entrez.read(search_handle)
webenv, query_key = search_results["WebEnv"], search_results["QueryKey"]

# fetch all results in batches of batchSize entries
for start in range(0, len(giList), batchSize):
    # print progress info
    tnow = datetime.now()
    sys.stderr.write("\t%s\t%s / %s\n" % (tnow.ctime(), start, len(giList)))
    handle = Entrez.efetch(db=db, retmode=retmode, rettype=rettype,
                           retstart=start, retmax=batchSize,
                           webenv=webenv, query_key=query_key)
    sys.stdout.write(handle.read())
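The sequences go to stdout and the progress messages to stderr, so save the script (say, as fetch_viridae.py; the name is arbitrary) and redirect stdout to a file, e.g. python fetch_viridae.py > viridae.fasta.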

I have tried this code and it is giving me an Internal Server Error.

Getting list of GIs for term=Metazoa[Organism] ...
Downloading 22343398 entries from NCBI protein database in batches of 1000 entries...
Traceback (most recent call last):
  File "biopython.py", line 26, in <module>
    search_handle     = Entrez.epost( db, id=",".join( giList ) )
  File "/usr/lib/python3.6/site-packages/Bio/Entrez/__init__.py", line 129, in epost
    return _open(cgi, variables, post=True)
  File "/usr/lib/python3.6/site-packages/Bio/Entrez/__init__.py", line 526, in _open
    raise exception
  File "/usr/lib/python3.6/site-packages/Bio/Entrez/__init__.py", line 522, in _open
    handle = _urlopen(cgi, data=_as_bytes(options))
  File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/usr/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 500: Internal Server Error

How do I resolve it?

— AsoInfo, 2.5 years ago
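A plausible cause: Metazoa[Organism] matches over 22 million records, and Entrez.epost submits every ID in a single HTTP POST, which the server may reject with a 500 error at that size. One work-around, sketched below, is to post the IDs in smaller chunks; it assumes giList, db and batchSize are defined as in Leszek's script above, and the chunk size is a guess to tune:

# Sketch: post the GI list to the history server in chunks rather than
# all at once, then fetch each chunk in batches. Assumes giList, db and
# batchSize are already defined as in the script above.
import sys
from Bio import Entrez

CHUNK = 10000  # IDs per epost call; an assumption, lower it if 500 errors persist

for i in range(0, len(giList), CHUNK):
    chunk = giList[i:i + CHUNK]
    posted = Entrez.read(Entrez.epost(db, id=",".join(chunk)))
    webenv, query_key = posted["WebEnv"], posted["QueryKey"]
    for start in range(0, len(chunk), batchSize):
        handle = Entrez.efetch(db=db, rettype="fasta", retmode="text",
                               retstart=start, retmax=batchSize,
                               webenv=webenv, query_key=query_key)
        sys.stdout.write(handle.read())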
sarahhunter wrote (6.3 years ago, Cambridge, UK):

You can do this fairly simply using ENA's advanced search facility as follows:

  1. Go to the ENA homepage
  2. Click on "Advanced search" to the right of the box labelled "text search" (should take you here)
  3. Select the "Taxon" radio button. This will display a new section on the page, near the bottom, which allows you to narrow down your query. In your case, enter "viridae" next to "Taxon name = " and select the suggestion from the drop down. It should auto enter "tax_eq(10239)" in the query box at the top.
  4. Click on "Search"
  5. A results page will be displayed for each of the divisions of the nucleotide archive. I presume you want the original sequences, so click on the hyperlinked number in the column "Taxon & descendants > Entries" that corresponds with the row "Sequence (Release)" (currently just over 1.5 million entries)
  6. This will display a paginated list of all the ~1.5 million entries currently in ENA that are from that taxon (e.g. this page)
  7. To download a fasta file of all the sequences, click on the "FASTA" link at the right hand side, top of the list, next to Download.

Hope this helps. I think it's preferable to do it this way, rather than scripting against a resource to get things one by one. I'm pretty sure it should be possible to do this at NCBI too, but unfortunately I don't use that resource. If you have any feedback about the usefulness/user-friendliness of this feature, please let the ENA team know.
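ENA's pages have changed since this answer was written, but the same data can also be fetched programmatically. A minimal sketch in Python 3, assuming the current ENA Browser API endpoint (https://www.ebi.ac.uk/ena/browser/api/fasta/) still accepts comma-separated accessions; the example accessions are placeholders, so verify against ENA's documentation:

# Minimal sketch (Python 3): fetch FASTA records for a few ENA accessions.
# The endpoint and the example accessions are assumptions; check the
# current ENA Browser API documentation before relying on this.
from urllib.request import urlopen

accessions = ["AB000001", "AB000002"]  # hypothetical example accessions
url = "https://www.ebi.ac.uk/ena/browser/api/fasta/" + ",".join(accessions)
with urlopen(url) as response, open("sequences.fasta", "wb") as out:
    out.write(response.read())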

Nick wrote (7.5 years ago, Spain):

Thanks, Leszek. The example you point to includes a list of the GIs. I am not sure this would work with a list containing 1.3 million GIs.


It will. I got several million proteins this way.

— Leszek, 7.5 years ago

Hi nikolay12. This is not an answer to your original question and should not be posted as such. Please use the 'comment' button below Leszek's answer to add this comment. (This answer will be deleted)

— Eric Normandeau, 6.3 years ago
luca wrote (2.9 years ago, Bermuda Institute of Ocean Sciences):

Hi everyone! I was trying to do something similar to Nick: I want to download GenBank results for 12S eukaryote sequences. The normal website download does not work (I tried several times but can get only 66% of the sequences), so I wanted to try the script Leszek posted. This is my code, slightly adjusted from Leszek's:

#!/usr/bin/env python

import sys
from Bio import Entrez
from datetime import datetime

faa_output = "12S_eukaryotesGB.txt"
output_handle = open(faa_output, "w")
Entrez.email = "email@blabla.mail"
query = '12S[All Fields] AND ("Eukaryota"[Organism] OR eukaryota[All Fields])'
db = "nucleotide"
retmax = 10**9
retmode = 'text'
rettype = 'gb'
batchSize = 1000

# get the list of GIs for the given query
sys.stderr.write("Getting list of GIs for term=%s ...\n" % query)
handle = Entrez.esearch(db=db, term=query, retmax=retmax)
giList = Entrez.read(handle)['IdList']

# print info about the number of entries
sys.stderr.write("Downloading %s entries from NCBI %s database in batches of %s entries...\n" % (len(giList), db, batchSize))

# post the query to the NCBI history server
search_handle = Entrez.epost(db, id=",".join(giList))
search_results = Entrez.read(search_handle)
webenv, query_key = search_results["WebEnv"], search_results["QueryKey"]

# fetch all results in batches of batchSize entries, writing to the output file
for start in range(0, len(giList), batchSize):
    tnow = datetime.now()
    sys.stderr.write("\t%s\t%s / %s\n" % (tnow.ctime(), start, len(giList)))
    handle = Entrez.efetch(db=db, retmode=retmode, rettype=rettype,
                           retstart=start, retmax=batchSize,
                           webenv=webenv, query_key=query_key)
    output_handle.write(handle.read())
output_handle.close()
print("Saved")

The script works great, but after retrieving 4000/186000 results I get this error, which I do not understand:

Traceback (most recent call last):
  File "script_epost_fetch_sequences.py", line 41, in <module>
    output_handle.write(handle.read())
  File "/usr/lib64/python2.7/socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib64/python2.7/httplib.py", line 578, in read
    return self._read_chunked(amt)
  File "/usr/lib64/python2.7/httplib.py", line 632, in _read_chunked
    raise IncompleteRead(''.join(value))
httplib.IncompleteRead: IncompleteRead(2450 bytes read)

Can anyone help me with this error?


Could be that your connection timed out. When I've tried to download large data sets from NCBI via efetch, it often doesn't complete (presumably because NCBI gets so busy). This post seems similar: HTTP Error 502-Biopython-Entrez Files

— Tonor, 2.9 years ago

Thanks Tonor! If that is the case, how can I prevent the connection from timing out? If I divide the entire giList into smaller lists and retrieve them one by one into separate files, do you think that would do the job and solve the time-out problem?

— luca, 2.9 years ago
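Splitting the list into smaller chunks may help, but the simplest guard against dropped connections is to retry a failed batch. A sketch of a retry wrapper around the efetch call, assuming Entrez and the variables from the script above are in scope (Python 2 imports to match the traceback; on Python 3 the exception lives in http.client):

# Sketch: retry a batch when the connection drops mid-transfer.
# Assumes Entrez, db, retmode, rettype, batchSize, webenv and query_key
# are defined as in the script above. httplib is the Python 2 name;
# use http.client.IncompleteRead on Python 3.
import time
from httplib import IncompleteRead

def fetch_batch(start, retries=3, delay=10):
    for attempt in range(retries):
        try:
            handle = Entrez.efetch(db=db, retmode=retmode, rettype=rettype,
                                   retstart=start, retmax=batchSize,
                                   webenv=webenv, query_key=query_key)
            return handle.read()
        except (IncompleteRead, IOError):
            time.sleep(delay * (attempt + 1))  # back off before retrying
    raise RuntimeError("batch at %s failed after %s retries" % (start, retries))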

Did the code work for you? For me, the code is unable to retrieve sequences after 3000 accession IDs.

— AsoInfo, 2.5 years ago