Question: Problem when downloading a large number of sequences from GenBank
Nick wrote (7.5 years ago, Spain):

I am trying to download all viral sequences (incl. partial genomes) from GenBank. The query:

http://www.ncbi.nlm.nih.gov/nuccore?term=viridae[Organism]

retrieves 1,337,194 sequences. When I try to download the result set as a FASTA file, I get files of various sizes, from 2 MB to 100 MB, but in all cases containing only a fraction of the 1.3 million sequences. The largest file contains 62K sequences, only about 5% of the total number in the result set. It seems that the download file is arbitrarily truncated. Do you know of any way to download all the sequences in the result set?

Tags: genbank, file, download

Did you try iterating over the result-set IDs and fetching them one by one?

— Jeremy Leipzig, 7.5 years ago

I am not sure NCBI would tolerate 1.3 million individual requests.

— Nick, 7.5 years ago
Leszek wrote (7.5 years ago, IIMCB, Poland):

Use EPost if you are familiar with Python. Have a look here for instructions.

EDIT: Here is code that will do the job. On my computer it fetches 1,000 sequences every 20-30 seconds.

#!/usr/bin/env python
"""Fetch all entries for viridae from NCBI."""

import sys
from Bio import Entrez
from datetime import datetime

Entrez.email = "your_email@gmail.com"
query = "viridae[Organism]"
db = "nuccore"
retmax = 10**9
rettype = 'fasta'  # record format
retmode = 'text'   # plain-text encoding
batchSize = 1000

# get the list of GIs for the given query
sys.stderr.write("Getting list of GIs for term=%s ...\n" % query)
handle = Entrez.esearch(db=db, term=query, retmax=retmax)
giList = Entrez.read(handle)['IdList']

# print info about the number of entries
sys.stderr.write("Downloading %s entries from NCBI %s database in batches of %s entries...\n" % (len(giList), db, batchSize))

# post the query to the NCBI history server
search_handle = Entrez.epost(db, id=",".join(giList))
search_results = Entrez.read(search_handle)
webenv, query_key = search_results["WebEnv"], search_results["QueryKey"]

# fetch all results in batches of batchSize entries
for start in range(0, len(giList), batchSize):
    # print progress info
    tnow = datetime.now()
    sys.stderr.write("\t%s\t%s / %s\n" % (tnow.ctime(), start, len(giList)))
    handle = Entrez.efetch(db=db, retmode=retmode, rettype=rettype,
                           retstart=start, retmax=batchSize,
                           webenv=webenv, query_key=query_key)
    sys.stdout.write(handle.read())
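The sequences go to stdout and the progress messages to stderr, so save the script (say, as fetch_viridae.py; the name is arbitrary) and redirect stdout to a file, e.g. python fetch_viridae.py > viridae.fasta.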

I have tried this code and it is giving me an Internal Server Error.

Getting list of GIs for term=Metazoa[Organism] ...
Downloading 22343398 entries from NCBI protein database in batches of 1000 entries...
Traceback (most recent call last):
  File "biopython.py", line 26, in <module>
    search_handle     = Entrez.epost( db, id=",".join( giList ) )
  File "/usr/lib/python3.6/site-packages/Bio/Entrez/__init__.py", line 129, in epost
    return _open(cgi, variables, post=True)
  File "/usr/lib/python3.6/site-packages/Bio/Entrez/__init__.py", line 526, in _open
    raise exception
  File "/usr/lib/python3.6/site-packages/Bio/Entrez/__init__.py", line 522, in _open
    handle = _urlopen(cgi, data=_as_bytes(options))
  File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/usr/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 500: Internal Server Error

How do I resolve it?

— AsoInfo, 2.5 years ago
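A plausible cause: Metazoa[Organism] matches over 22 million records, and Entrez.epost submits every ID in a single HTTP POST, which the server may reject with a 500 error at that size. One work-around, sketched below, is to post the IDs in smaller chunks; it assumes giList, db and batchSize are defined as in Leszek's script above, and the chunk size is a guess to tune:

# Sketch: post the GI list to the history server in chunks rather than
# all at once, then fetch each chunk in batches. Assumes giList, db and
# batchSize are already defined as in the script above.
import sys
from Bio import Entrez

CHUNK = 10000  # IDs per epost call; an assumption, lower it if 500 errors persist

for i in range(0, len(giList), CHUNK):
    chunk = giList[i:i + CHUNK]
    posted = Entrez.read(Entrez.epost(db, id=",".join(chunk)))
    webenv, query_key = posted["WebEnv"], posted["QueryKey"]
    for start in range(0, len(chunk), batchSize):
        handle = Entrez.efetch(db=db, rettype="fasta", retmode="text",
                               retstart=start, retmax=batchSize,
                               webenv=webenv, query_key=query_key)
        sys.stdout.write(handle.read())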
sarahhunter wrote (6.3 years ago, Cambridge, UK):

You can do this fairly simply using ENA's advanced search facility as follows:

  1. Go to the ENA homepage
  2. Click on "Advanced search" to the right of the box labelled "text search" (should take you here)
  3. Select the "Taxon" radio button. This will display a new section on the page, near the bottom, which allows you to narrow down your query. In your case, enter "viridae" next to "Taxon name = " and select the suggestion from the drop down. It should auto enter "tax_eq(10239)" in the query box at the top.
  4. Click on "Search"
  5. A results page will be displayed for each of the divisions of the nucleotide archive. I presume you want the original sequences, so click on the hyperlinked number in the column "Taxon & descendants > Entries" that corresponds with the row "Sequence (Release)" (currently just over 1.5 million entries)
  6. This will display a paginated list of all the ~1.5 million entries currently in ENA that are from that taxon (e.g. this page)
  7. To download a fasta file of all the sequences, click on the "FASTA" link at the right hand side, top of the list, next to Download.

Hope this helps. I think it's preferable to do it this way, rather than scripting against a resource to get things one by one. I'm pretty sure it should be possible to do this at NCBI too, but unfortunately I don't use that resource. If you have any feedback about the usefulness/user-friendliness of this feature, please let the ENA team know.
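ENA's pages have changed since this answer was written, but the same data can also be fetched programmatically. A minimal sketch in Python 3, assuming the current ENA Browser API endpoint (https://www.ebi.ac.uk/ena/browser/api/fasta/) still accepts comma-separated accessions; the example accessions are placeholders, so verify against ENA's documentation:

# Minimal sketch (Python 3): fetch FASTA records for a few ENA accessions.
# The endpoint and the example accessions are assumptions; check the
# current ENA Browser API documentation before relying on this.
from urllib.request import urlopen

accessions = ["AB000001", "AB000002"]  # hypothetical example accessions
url = "https://www.ebi.ac.uk/ena/browser/api/fasta/" + ",".join(accessions)
with urlopen(url) as response, open("sequences.fasta", "wb") as out:
    out.write(response.read())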

Nick wrote (7.5 years ago, Spain):

Thanks, Leszek. The example you point to includes a list of the GIs. I am not sure this would work with a list containing 1.3 million GIs.


It will. I got several million proteins this way.

— Leszek, 7.5 years ago

Hi nikolay12. This is not an answer to your original question and should not be posted as such. Please use the 'comment' button below Leszek's answer to add this comment. (This answer will be deleted)

— Eric Normandeau, 6.3 years ago
luca wrote (2.9 years ago, Bermuda Institute of Ocean Sciences):

Hi everyone! I was trying to do something similar to Nick: I want to download GenBank results for 12S eukaryote sequences. The normal website download does not work (I tried several times but can get only 66% of the sequences), so I wanted to try the script Leszek posted. This is my code, slightly adjusted from Leszek's:

#!/usr/bin/env python

import sys
from Bio import Entrez
from datetime import datetime

faa_output = "12S_eukaryotesGB.txt"
output_handle = open(faa_output, "w")
Entrez.email = "email@blabla.mail"
query = '12S[All Fields] AND ("Eukaryota"[Organism] OR eukaryota[All Fields])'
db = "nucleotide"
retmax = 10**9
retmode = 'text'
rettype = 'gb'
batchSize = 1000

# get the list of GIs for the given query
sys.stderr.write("Getting list of GIs for term=%s ...\n" % query)
handle = Entrez.esearch(db=db, term=query, retmax=retmax)
giList = Entrez.read(handle)['IdList']

# print info about the number of entries
sys.stderr.write("Downloading %s entries from NCBI %s database in batches of %s entries...\n" % (len(giList), db, batchSize))

# post the query to the NCBI history server
search_handle = Entrez.epost(db, id=",".join(giList))
search_results = Entrez.read(search_handle)
webenv, query_key = search_results["WebEnv"], search_results["QueryKey"]

# fetch all results in batches of batchSize entries, writing to the output file
for start in range(0, len(giList), batchSize):
    tnow = datetime.now()
    sys.stderr.write("\t%s\t%s / %s\n" % (tnow.ctime(), start, len(giList)))
    handle = Entrez.efetch(db=db, retmode=retmode, rettype=rettype,
                           retstart=start, retmax=batchSize,
                           webenv=webenv, query_key=query_key)
    output_handle.write(handle.read())
output_handle.close()
print("Saved")

The script works great, but after retrieving 4000/186000 results I get this error, which I do not understand:

Traceback (most recent call last):
  File "script_epost_fetch_sequences.py", line 41, in <module>
    output_handle.write(handle.read())
  File "/usr/lib64/python2.7/socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib64/python2.7/httplib.py", line 578, in read
    return self._read_chunked(amt)
  File "/usr/lib64/python2.7/httplib.py", line 632, in _read_chunked
    raise IncompleteRead(''.join(value))
httplib.IncompleteRead: IncompleteRead(2450 bytes read)

Can anyone help me with this error?


Could be that your connection timed out. When I've tried to download large data sets from NCBI via efetch, it often doesn't complete (presumably because NCBI gets so busy). This post seems similar: HTTP Error 502-Biopython-Entrez Files

— Tonor, 2.9 years ago

Thanks Tonor! If that is the case, how can I prevent the connection from timing out? If I divide the entire giList into smaller lists and retrieve them one by one into separate files, do you think that would do the job and solve the time-out problem?

— luca, 2.9 years ago
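Splitting the list into smaller chunks may help, but the simplest guard against dropped connections is to retry a failed batch. A sketch of a retry wrapper around the efetch call, assuming Entrez and the variables from the script above are in scope (Python 2 imports to match the traceback; on Python 3 the exception lives in http.client):

# Sketch: retry a batch when the connection drops mid-transfer.
# Assumes Entrez, db, retmode, rettype, batchSize, webenv and query_key
# are defined as in the script above. httplib is the Python 2 name;
# use http.client.IncompleteRead on Python 3.
import time
from httplib import IncompleteRead

def fetch_batch(start, retries=3, delay=10):
    for attempt in range(retries):
        try:
            handle = Entrez.efetch(db=db, retmode=retmode, rettype=rettype,
                                   retstart=start, retmax=batchSize,
                                   webenv=webenv, query_key=query_key)
            return handle.read()
        except (IncompleteRead, IOError):
            time.sleep(delay * (attempt + 1))  # back off before retrying
    raise RuntimeError("batch at %s failed after %s retries" % (start, retries))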

Did the code work for you? For me, the code is unable to retrieve sequences after 3000 accession IDs.

— AsoInfo, 2.5 years ago