Question: Biopython HTTPError when Fetching more than 1400 Entries from NCBI
8 months ago by
tobias.kraft-blank0 wrote:

Hello,

The premise: I have a lot of GenBank IDs (about 4 million) that I need to check for the "root" organism. So far I have tried this with the E-utilities through Biopython: I first upload my IDs to the NCBI servers with EPost, then try to retrieve the smallest possible XML file for each ID with EFetch and parse it.

The problem: I can only fetch about 1400 IDs (XML records) from the server. If I try to fetch more, the server does not respond. Is there a way to fix this? Is the capacity of the history server limited to 1400 IDs per session?

In the code below, id_list is just a long list of IDs for testing (about 37k) and count is the length of that list.

My Code:

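from Bio import Entrez

def biopython_epost(id_list):
    Entrez.email = ""
    # Upload the IDs to the NCBI history server
    e_post = Entrez.epost(db="nuccore", id=",".join(id_list))
    search_results = Entrez.read(e_post)
    webenv = search_results["WebEnv"]
    query_key = search_results["QueryKey"]
    handle_ids = {"WebEnv": webenv, "QueryKey": query_key}
    return handle_ids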

def biopython_efetch(handle_ids, count):
    webEnv = handle_ids["WebEnv"]
    queryKey = handle_ids["QueryKey"]
    Entrez.email = ""
    Entrez.api_key = "myAPIKEY"
    batch_size = 200
    yeast_hits = {}
    for start in range(0, count, batch_size):
        print("Going to download record %i to %i" % (start+1, count))
        end = min(count, start+batch_size)
        # Fetch one batch from the history server (rettype is not preserved in this post)
        fetch_handle = Entrez.efetch(db="nucleotide",
                                     retmax = batch_size,
                                     retstart = start,
                                     webenv = webEnv,
                                     query_key = queryKey)
        fetch_records = Entrez.parse(fetch_handle)
        for record in fetch_records:
            # Count hits by the first four words of the record title
            temp = record['Title'].split(' ')[0:4]
            yeast_info = ' '.join(temp)
            yeast_hits[yeast_info] = yeast_hits.get(yeast_info, 0) + 1
    return yeast_hits


Going to download record 1 to 37755
Going to download record 201 to 37755
Going to download record 401 to 37755
Going to download record 601 to 37755
Going to download record 801 to 37755
Going to download record 1001 to 37755
Going to download record 1201 to 37755
Going to download record 1401 to 37755
HTTPError: HTTP Error 400: Bad Request

Have you signed up for an NCBI API key, and are you using it? When doing a large search like this, you should build in a delay so NCBI does not think you are spamming their servers.
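A minimal sketch of such a delay, in plain Python. The `Throttle` class and its parameters are hypothetical names made up for this illustration, not part of Biopython. NCBI's published limits are 3 requests per second without an API key and 10 with one, and Bio.Entrez should already space out its own requests, so an explicit delay like this mainly matters if you want to stay well below those limits.

```python
import time


class Throttle:
    """Make successive wait() calls at least `min_interval` seconds apart.

    Hypothetical helper for illustration; `clock` and `sleep` are injectable
    so the behaviour can be checked without really waiting.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon after the previous request: pause for the rest
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Typical use would be to create one `Throttle(0.5)` (an arbitrary, conservative interval) and call `throttle.wait()` right before each request.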

written 8 months ago by genomax91k

Yes, I just removed the API key from my post, that's all. I signed up and I send the key with each request.

written 8 months ago by tobias.kraft-blank0

In my experience, you don't need EPost. Just join, e.g., 100 IDs and pass them to EFetch. I wouldn't send many more IDs than that at one time.

For example I'm using this construction to fetch fasta sequences without problem:

with Entrez.efetch(db='nucleotide', id=','.join(aclist), rettype='fasta', retmode='text') as h:

To my knowledge, there are some limits for NCBI databases - check out:

There may be errors (you are downloading data over the internet), so you need to handle them.
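A rough sketch of the batching and error handling described above, assuming batches of 100 IDs and a simple retry with a growing pause. The helper names `chunked` and `fetch_with_retry`, the retry count and the delay are assumptions for illustration, not Biopython API. The actual request is passed in as a function, so the retry logic can be tried out without touching the network.

```python
import time
from urllib.error import HTTPError


def chunked(ids, size=100):
    """Yield successive slices of `ids` with at most `size` items each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_with_retry(fetch, batch, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(batch); on HTTPError, pause and retry up to `retries` times.

    The pause grows linearly (delay, 2 * delay, ...). The last failure is
    re-raised so the caller can log or skip the batch.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(batch)
        except HTTPError:
            if attempt == retries:
                raise
            sleep(delay * attempt)


# Possible use with the efetch call shown above (not executed here):
#
# def fetch(batch):
#     with Entrez.efetch(db='nucleotide', id=','.join(batch),
#                        rettype='fasta', retmode='text') as h:
#         return h.read()
#
# for batch in chunked(aclist, 100):
#     data = fetch_with_retry(fetch, batch)
```

Retrying only helps with transient failures; a request that is itself malformed will keep failing, which is why the last error is re-raised instead of being swallowed.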

modified 8 months ago • written 8 months ago by massa.kassa.sc3na270

Thank you for that link! It is still strange: according to it, the limit should be at least 5k, yet I can only get through about 1400. I tried catching HTTP errors, but to no avail; it keeps failing at 1400. Error 400 is server-side anyway, so all I can do is wait and retry. It's just that everywhere I look in the documentation, it says you should absolutely use EPost for large datasets and many requests. EFetch can fetch 200 IDs at once, but to get about 4 million IDs in total... well, I've got the weekend ahead, and I hope they won't ban me.

written 8 months ago by tobias.kraft-blank0