Maximum returned records using epost?
9 weeks ago
theclubstyle ▴ 40

Hi all,

I'm putting something together to return metadata for entries in the nucleotide database, using E-utilities via Biopython, with the following function:

from Bio import Entrez
import json

# (Entrez.email should also be set before making requests; NCBI asks for a contact address)

def fetch_nucleotide_info(query_list):
    # Convert query list to comma-sep string
    query_str = ','.join(query_list)
    print(query_list)

    # Submit list to epost
    post_handle = Entrez.epost(db="nucleotide", id=query_str)
    post_results = Entrez.read(post_handle)
    post_handle.close()

    webenv = post_results["WebEnv"]
    query_key = post_results["QueryKey"]

    # Get summaries from esummary
    summary_handle = Entrez.esummary(db="nucleotide", webenv=webenv, query_key=query_key, version="2.0", retmode="json")
    summary_data = json.load(summary_handle)
    uids = summary_data['result']['uids']
    for uid in uids:
        notGI = summary_data['result'][uid].get('caption','')
        description = summary_data['result'][uid].get('title','')
        created = summary_data['result'][uid].get('createdate','')
        updated = summary_data['result'][uid].get('updatedate','')
        subtype = summary_data['result'][uid].get('subtype','')
        subname = summary_data['result'][uid].get('subname','')

The problem I'm having is that the NCBI server is intermittently giving an error:

  File "/Users/runner/miniforge3/conda-bld/python-split_1703348537777/work/Modules/pyexpat.c", line 461, in EndElement
  File "/Users/me/opt/miniconda3/lib/python3.9/site-packages/Bio/Entrez/Parser.py", line 790, in endErrorElementHandler
    raise RuntimeError(data)
RuntimeError: Some IDs have invalid value and were omitted. Maximum ID value 18446744073709551615

This happens especially when the list to look up is larger than around 100 entries, but the exact same command often works without error a few seconds later, so it seems to be an issue with load on the NCBI request servers. Ideally this will be a scheduled process, so it should run without error and without much interaction. The best workaround I can think of is to randomly subsample large lists to fewer than 100 entries, but that's not ideal. Is there a more robust way of using E-utilities to search larger query lists?

E-utilities biopython
9 weeks ago

https://www.ncbi.nlm.nih.gov/books/NBK25499/

Increasing retmax allows more of the retrieved UIDs to be included in the XML output, up to a maximum of 10,000 records.
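
For reference, Biopython forwards extra keyword arguments to the ESummary CGI, so retmax can be set directly on the Entrez.esummary call. A minimal sketch, assuming the default XML output and the webenv / query_key from the epost step in the question (the email address is a placeholder):

from Bio import Entrez

Entrez.email = "you@example.com"  # placeholder; NCBI asks for a contact address

# retmax is passed straight through to ESummary, allowing up to 10,000
# records per request for the XML output.
summary_handle = Entrez.esummary(
    db="nucleotide",
    webenv=webenv,        # from the earlier Entrez.epost call
    query_key=query_key,
    retmax=10000,
)
summary_records = Entrez.read(summary_handle)
summary_handle.close()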


So I tried increasing retmax to 10,000 and the error persisted, BUT it turns out that retmax for JSON output (which is what the code requests) is limited to 500, which largely explains this. Setting retmax to 500 stops the error but also limits the output (as expected).

The version 2.0 XML output has a few known issues, which makes the JSON slightly more straightforward to parse with existing modules. So it can be solved completely, but it needs a bit more work.

But thanks anyway, Pierre.
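
For completeness, one way the 500-record JSON cap could in principle be worked around is to page through the history results with retstart. This is just a rough sketch, not the route taken in the update below, and it assumes webenv and query_key from the epost step (the email address is a placeholder):

import json
from Bio import Entrez

Entrez.email = "you@example.com"  # placeholder contact address

page_size = 500  # JSON output appears to be capped at 500 records per request
records = {}
retstart = 0
while True:
    handle = Entrez.esummary(
        db="nucleotide",
        webenv=webenv,        # from the Entrez.epost step
        query_key=query_key,
        version="2.0",
        retmode="json",
        retmax=page_size,
        retstart=retstart,
    )
    result = json.load(handle)["result"]
    handle.close()
    uids = result.get("uids", [])
    if not uids:
        break
    for uid in uids:
        records[uid] = result[uid]
    retstart += page_size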

5 weeks ago
theclubstyle ▴ 40

An update, if anyone is interested: I came up with a solution which is a little clunky, but (a) it works and (b) it's the only way I can think of to get around the problem.

Outside of the function, take the list of IDs and split it into chunks of 500 (the maximum allowed for JSON records). For each chunk, open a while loop and attempt to run the chunk several times, either continuing to the next chunk (if successful) or retrying (if not). Import sleep to leave a few seconds between attempts so as not to overload the web server, as shown below. So far the maximum number of retries observed is 3 before continuing, and it's not consistent since we're dealing with server load. But it completes the job :)

import time

# Split the ID list into chunks of n (max allowed by the JSON output in e-utils)
chunk_size = 500
# Number of times to retry the NCBI retrieval per chunk
max_attempts = 5

chunks = [parsed_IDs[i:i + chunk_size] for i in range(0, len(parsed_IDs), chunk_size)]
total_chunks = len(chunks)

# Run each chunk of 500 entries, sleeping between requests so as not to
# overload the NCBI servers; retry a chunk if the request fails.
for chunk_index, chunk in enumerate(chunks):
    attempt_count = 0
    while attempt_count < max_attempts:
        try:
            print(f"Processing chunk {chunk_index + 1} of {total_chunks}:")
            fetch_nucleotide_info(chunk, outfile)
            time.sleep(10)  # pause between successful requests
        except Exception:
            attempt_count += 1
            time.sleep(5)   # short pause before retrying the same chunk
            continue
        break
    else:
        # All attempts failed for this chunk; note it and move on
        print(f"Giving up on chunk {chunk_index + 1} after {max_attempts} attempts")
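
One extra tweak worth mentioning: an NCBI API key raises the E-utilities rate limit from 3 to 10 requests per second, which should make the intermittent server errors less frequent. Biopython sends the key automatically once it is set on the module (the email address and key below are placeholders):

from Bio import Entrez

Entrez.email = "you@example.com"      # placeholder contact address
Entrez.api_key = "YOUR_NCBI_API_KEY"  # placeholder; key from your NCBI account settings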