Biopython HTTPError when Fetching more than 1400 Entries from NCBI

Hello. The premise: I have a lot of GenBank IDs (about 4 million) which I need to check for the "root" organism. The way I have tried this so far is by using the EUtils with Biopython: first uploading my IDs to the NCBI servers with EPost, then trying to retrieve the smallest possible XML file for each ID with EFetch and parsing it. The problem now is that I can only fetch about 1400 IDs (XMLs) from the server. If I try to fetch more, the server does not respond. Is there a way to fix this? Is the capability of the history server limited to 1400 IDs per session? (id_list is just a long list with IDs for testing, about 37k; count is the length of the list.)

My Code:

from Bio import Entrez

def biopython_epost(id_list):
    Entrez.email = "myMail@tum.de"
    # Upload the IDs to the Entrez history server
    post_handle = Entrez.epost(db="nuccore", id=",".join(id_list))
    search_results = Entrez.read(post_handle)  # parse the EPost response
    post_handle.close()
    webenv = search_results["WebEnv"]
    query_key = search_results["QueryKey"]
    handle_ids = {"WebEnv": webenv, "QueryKey": query_key}
    return handle_ids

def biopython_efetch(handle_ids, count):
    webenv = handle_ids["WebEnv"]
    query_key = handle_ids["QueryKey"]
    Entrez.email = "mymail@tum.de"
    Entrez.api_key = "myAPIKEY"
    batch_size = 200
    yeast_hits = {}
    for start in range(0, count, batch_size):
        end = min(count, start + batch_size)  # last record in this batch
        print("Going to download record %i to %i" % (start + 1, end))
        fetch_handle = Entrez.efetch(db="nucleotide",
                                     rettype="docsum",
                                     retmode="xml",
                                     retmax=batch_size,
                                     retstart=start,
                                     query_key=query_key,
                                     webenv=webenv)
        fetch_records = Entrez.parse(fetch_handle)
        for record in fetch_records:
            # Use the first four words of the title as the organism key
            yeast_info = ' '.join(record['Title'].split(' ')[0:4])
            yeast_hits[yeast_info] = yeast_hits.get(yeast_info, 0) + 1
        fetch_handle.close()
    return yeast_hits


Result:

Going to download record 1 to 37755
HTTPError: HTTP Error 400: Bad Request

Biopython EUtils NCBI

Have you signed up for (and are you using) an NCBI API key? When doing a large download like this you should build in a delay so NCBI does not think you are spamming their server.
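
A minimal sketch of that suggestion, as a hypothetical helper whose parameters mirror biopython_efetch from the question (the sleep keeps consecutive requests under NCBI's limit of 3 per second, or 10 per second with an API key):

import time
from Bio import Entrez

def fetch_in_batches(count, query_key, webenv, batch_size=200):
    # Fetch docsums from the history server in batches, pausing
    # between requests so NCBI does not see a burst of traffic.
    for start in range(0, count, batch_size):
        handle = Entrez.efetch(db="nucleotide", rettype="docsum",
                               retmode="xml", retmax=batch_size,
                               retstart=start, query_key=query_key,
                               webenv=webenv)
        # ... process the batch as in the question ...
        handle.close()
        time.sleep(0.4)  # stay under 3 requests/second

Note that recent Biopython versions already throttle Entrez requests internally to respect these limits, so the explicit sleep is a belt-and-braces measure.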


Yes, I just deleted the API key from my post, that's all; I did sign up, and I pass the key with the request.


From my experience, you don't need epost. Just join, e.g., 100 IDs and pass them to efetch directly. I wouldn't pass many more IDs at one time.

For example I'm using this construction to fetch fasta sequences without problem:

with Entrez.efetch(db='nucleotide', id=','.join(aclist), rettype='fasta', retmode='text') as h:
    ....
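
Spelled out for a long list, that might look like the sketch below; chunks is a hypothetical helper, id_list is the list from the question, and the email address is a placeholder:

from Bio import Entrez

Entrez.email = "myMail@tum.de"  # placeholder address

def chunks(lst, n):
    # Yield successive n-sized slices of lst
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

for group in chunks(id_list, 100):
    with Entrez.efetch(db='nucleotide', id=','.join(group),
                       rettype='fasta', retmode='text') as h:
        print(h.read(), end='')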


To my knowledge, there are some limits for NCBI databases - check out: https://eutils.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html

There may be errors (you are downloading data over the internet), so you need to handle them.
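
For instance, one way to retry a failed batch (a sketch; fetch_with_retry is a hypothetical helper and the back-off times are arbitrary):

import time
from urllib.error import HTTPError
from Bio import Entrez

def fetch_with_retry(ids, tries=3):
    # Retry a failed EFetch a few times before giving up;
    # transient errors are common with large downloads.
    for attempt in range(tries):
        try:
            with Entrez.efetch(db='nucleotide', id=','.join(ids),
                               rettype='fasta', retmode='text') as h:
                return h.read()
        except HTTPError as err:
            print("Attempt %i failed: %s" % (attempt + 1, err))
            time.sleep(5 * (attempt + 1))  # back off before retrying
    raise RuntimeError("EFetch kept failing for this batch of IDs")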


Thank you for that link! Still strange that I can only submit about 1400, since it seems 5k should be the minimum. I tried catching HTTP errors, but to no avail; it keeps failing at 1400. Error 400 is server-side anyway, so all I can do is wait and retry. It's just that everywhere I look in the documentation, it is mentioned that you should absolutely use EPost for large datasets and many requests. EFetch can fetch 200 IDs at once, but to get about 4 million IDs in total... well, I've got the weekend ahead of me; I hope they won't ban me.


Hi, did you ever find a solution to this problem? I'm having the same issue and can't find any clear answer. Thank you so much!