Problem When Downloading Large Number Of Sequences From Genbank
12.0 years ago
Nick ▴ 290

I am trying to download all viral sequences (incl. partial genomes) from GenBank. The query:

http://www.ncbi.nlm.nih.gov/nuccore?term=viridae[Organism]

retrieves 1,337,194 sequences. When I try to download the result set as a FASTA file, I get files of varying size, from 2 MB to 100 MB, but in every case they contain only a fraction of the 1.3 million sequences. The largest file contains 62K sequences, only 5% of the result set. It seems the download file is arbitrarily truncated. Do you know of any way to download all the sequences in the result set?

genbank • 8.5k views
Did you try iterating over the result-set IDs and fetching them one by one?
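
For concreteness, a rough Biopython sketch of that one-request-per-ID approach (the email and retmax values are placeholders):

import sys
from Bio import Entrez

Entrez.email = "your_email@example.com"   # placeholder

# fetch the matching IDs, then retrieve each record with its own request
handle = Entrez.esearch(db="nuccore", term="viridae[Organism]", retmax=100)
for seq_id in Entrez.read(handle)["IdList"]:
    fetch = Entrez.efetch(db="nuccore", id=seq_id, rettype="fasta", retmode="text")
    sys.stdout.write(fetch.read())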

I am not sure NCBI would tolerate 1.3 million individual requests.

12.0 years ago
Leszek 4.2k

Use ePost if you are familiar with Python. Have a look here for instructions.

EDIT: Here is code that will do the job. On my computer it fetches 1k sequences every 20-30 seconds.

#!/usr/bin/env python
"""Fetch all entries for viridae from NCBI."""

import sys
from Bio import Entrez
from datetime import datetime

Entrez.email = "your_email@gmail.com"
query = "viridae[Organism]"
db = "nuccore"
retmax = 10**9
rettype = 'fasta'   # rettype='fasta' with retmode='text' is the combination NCBI expects
retmode = 'text'
batchSize = 1000

# get the list of GIs matching the query
sys.stderr.write("Getting list of GIs for term=%s ...\n" % query)
handle = Entrez.esearch(db=db, term=query, retmax=retmax)
giList = Entrez.read(handle)['IdList']

# report how many entries will be downloaded
sys.stderr.write("Downloading %s entries from NCBI %s database in batches of %s entries...\n" % (len(giList), db, batchSize))

# post the GI list to the NCBI history server
search_handle = Entrez.epost(db, id=",".join(giList))
search_results = Entrez.read(search_handle)
webenv, query_key = search_results["WebEnv"], search_results["QueryKey"]

# fetch all results in batches of batchSize entries
for start in range(0, len(giList), batchSize):
    # log a timestamp and progress to stderr
    tnow = datetime.now()
    sys.stderr.write("\t%s\t%s / %s\n" % (datetime.ctime(tnow), start, len(giList)))
    handle = Entrez.efetch(db=db, retmode=retmode, rettype=rettype, retstart=start,
                           retmax=batchSize, webenv=webenv, query_key=query_key)
    sys.stdout.write(handle.read())
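
The script writes the sequences to stdout and its progress messages to stderr, so assuming you save it as, say, fetch_viridae.py (the name is arbitrary), you can redirect the output into a file:

python fetch_viridae.py > viridae.fasta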
I have tried this code and it is giving me an Internal Server Error.

Getting list of GIs for term=Metazoa[Organism] ...
Downloading 22343398 entries from NCBI protein database in batches of 1000 entries...
Traceback (most recent call last):
  File "biopython.py", line 26, in <module>
    search_handle     = Entrez.epost( db, id=",".join( giList ) )
  File "/usr/lib/python3.6/site-packages/Bio/Entrez/__init__.py", line 129, in epost
    return _open(cgi, variables, post=True)
  File "/usr/lib/python3.6/site-packages/Bio/Entrez/__init__.py", line 526, in _open
    raise exception
  File "/usr/lib/python3.6/site-packages/Bio/Entrez/__init__.py", line 522, in _open
    handle = _urlopen(cgi, data=_as_bytes(options))
  File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/usr/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 500: Internal Server Error

How can I resolve this?
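
One likely cause is the sheer size of that single epost request: posting 22 million IDs in one call is far more than the server will accept. A common workaround, sketched below with an arbitrary chunk size, is to post and fetch the IDs in chunks (this is a guess at a fix, not a tested solution):

import sys
from Bio import Entrez

Entrez.email = "your_email@example.com"   # placeholder
db = "protein"
CHUNK = 10000   # arbitrary; shrink it if the server still returns 500

handle = Entrez.esearch(db=db, term="Metazoa[Organism]", retmax=10**9)
giList = Entrez.read(handle)["IdList"]

for i in range(0, len(giList), CHUNK):
    # post one manageable slice of the ID list, then fetch it back
    post = Entrez.read(Entrez.epost(db, id=",".join(giList[i:i + CHUNK])))
    fetch = Entrez.efetch(db=db, rettype="fasta", retmode="text",
                          webenv=post["WebEnv"], query_key=post["QueryKey"])
    sys.stdout.write(fetch.read())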

10.8 years ago
sarahhunter ▴ 600

You can do this fairly simply using ENA's advanced search facility as follows:

  1. Go to the ENA homepage
  2. Click on "Advanced search" to the right of the box labelled "text search" (should take you here)
  3. Select the "Taxon" radio button. This will display a new section near the bottom of the page that allows you to narrow down your query. In your case, enter "viridae" next to "Taxon name =" and select the suggestion from the drop-down. It should auto-enter "tax_eq(10239)" in the query box at the top.
  4. Click on "Search"
  5. A results page will be displayed for each of the divisions of the nucleotide archive. I presume you want the original sequences, so click on the hyperlinked number in the "Taxon & descendants > Entries" column that corresponds to the row "Sequence (Release)" (currently just over 1.5 million entries)
  6. This will display a paginated list of all the ~1.5 million entries currently in ENA that are from that taxon (e.g. this page)
  7. To download a FASTA file of all the sequences, click on the "FASTA" link at the top right of the list, next to "Download".

Hope this helps - I think it's preferable to do it this way rather than scripting against a resource to get things one by one. I'm pretty sure it should be possible to do this at NCBI too, but unfortunately I don't use that resource. If you have any feedback about the usefulness/user-friendliness of this feature, please let the ENA team know.
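
For anyone who would rather script this than click through the browser, a rough sketch against ENA's REST interface is below. The endpoint URLs and parameters are assumptions based on ENA's current portal/browser APIs (which postdate this answer), so check them against the ENA documentation before relying on this:

import requests

PORTAL = "https://www.ebi.ac.uk/ena/portal/api/search"    # assumed endpoint
BROWSER = "https://www.ebi.ac.uk/ena/browser/api/fasta/"  # assumed endpoint

# 1. list all sequence accessions under taxon 10239 and its descendants
params = {"result": "sequence", "query": "tax_tree(10239)",
          "fields": "accession", "format": "tsv", "limit": 0}
rows = requests.get(PORTAL, params=params).text.splitlines()[1:]  # drop header
accessions = [row.split("\t")[0] for row in rows]

# 2. download FASTA in batches of comma-separated accessions
with open("viridae_ena.fasta", "w") as out:
    for i in range(0, len(accessions), 100):   # batch size is arbitrary
        out.write(requests.get(BROWSER + ",".join(accessions[i:i + 100])).text)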

12.0 years ago
Nick ▴ 290

Thanks, Leszek. The example you point to includes a list of the GIs. I am not sure this would work with a list containing 1.3 million GIs.

It will. I retrieved several million proteins this way.

Hi nikolay12. This is not an answer to your original question and should not be posted as such. Please use the 'comment' button below the answer of Leszek to add this comment. (This answer will be deleted)

7.4 years ago
luca • 0

Hi everyone! I was trying to do something similar to Nick. I want to download GenBank results for eukaryote 12S sequences. Since the normal website download does not work (I tried several times but could only get 66% of the sequences), I wanted to try the script Leszek posted. This is my code, slightly adjusted from Leszek's:

#!/usr/bin/env python

import sys
from Bio import Entrez
from datetime import datetime

faa_output = "12S_eukaryotesGB.txt"
output_handle = open(faa_output, "w")
Entrez.email = "email@blabla.mail"
query = '12S[All Fields] AND ("Eukaryota"[Organism] OR eukaryota[All Fields])'
db = "nucleotide"
retmax = 10**9
retmode = 'text'
rettype = 'gb'
batchSize = 1000

# get the list of GIs matching the query
sys.stderr.write("Getting list of GIs for term=%s ...\n" % query)
handle = Entrez.esearch(db=db, term=query, retmax=retmax)
giList = Entrez.read(handle)['IdList']

# report how many entries will be downloaded
sys.stderr.write("Downloading %s entries from NCBI %s database in batches of %s entries...\n" % (len(giList), db, batchSize))

# post the GI list to the NCBI history server
search_handle = Entrez.epost(db, id=",".join(giList))
search_results = Entrez.read(search_handle)
webenv, query_key = search_results["WebEnv"], search_results["QueryKey"]

# fetch all results in batches of batchSize entries, writing GenBank records to the file
for start in range(0, len(giList), batchSize):
    tnow = datetime.now()
    sys.stderr.write("\t%s\t%s / %s\n" % (datetime.ctime(tnow), start, len(giList)))
    handle = Entrez.efetch(db=db, retmode=retmode, rettype=rettype, retstart=start,
                           retmax=batchSize, webenv=webenv, query_key=query_key)
    output_handle.write(handle.read())
output_handle.close()
print("Saved")

The script works great but after retrieving 4000/186000 results I get this error, which I do not understand:

Traceback (most recent call last):
  File "script_epost_fetch_sequences.py", line 41, in <module>
    output_handle.write(handle.read())
  File "/usr/lib64/python2.7/socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib64/python2.7/httplib.py", line 578, in read
    return self._read_chunked(amt)
  File "/usr/lib64/python2.7/httplib.py", line 632, in _read_chunked
    raise IncompleteRead(''.join(value))
httplib.IncompleteRead: IncompleteRead(2450 bytes read)

Can anyone help me with this error?

ADD COMMENT
0
Entering edit mode

Could be that your connection timed out. When I've tried to download large data sets from NCBI via efetch, it often doesn't complete (presumably because NCBI gets so busy). This post seems similar: HTTP Error 502-Biopython-Entrez Files
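
A common mitigation is to retry a failed batch instead of letting one dropped connection kill the whole run. A sketch only, meant to be dropped into the script above (it reuses the db, retmode, rettype, batchSize, webenv and query_key variables defined there):

import time
try:
    from http.client import IncompleteRead   # Python 3
except ImportError:
    from httplib import IncompleteRead       # Python 2

def fetch_batch(start, retries=3):
    """Fetch one efetch batch, retrying on truncated reads or HTTP errors."""
    for attempt in range(retries):
        try:
            handle = Entrez.efetch(db=db, retmode=retmode, rettype=rettype,
                                   retstart=start, retmax=batchSize,
                                   webenv=webenv, query_key=query_key)
            return handle.read()
        except (IncompleteRead, IOError):
            time.sleep(5 * (attempt + 1))   # back off before retrying
    raise RuntimeError("batch at %s failed %s times" % (start, retries))

The download loop would then call output_handle.write(fetch_batch(start)) instead of reading the efetch handle directly.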

Thanks Tonor! If that is the case, how can I prevent the connection from timing out? If I divide the entire giList into smaller lists and retrieve them one by one into separate files, do you think that would solve the time-out problem?

Did the code work for you? For me, it stops retrieving sequences after 3000 accession IDs.
