Hi, I'm working on a project that involves downloading over a million SARS-CoV-2 sequences from NCBI. As this will eventually be an open-source project, I'm trying to script as many steps as I can for repeatability. Currently, I'm stuck trying to use Biopython's Entrez tools (ESearch, EPost, and EFetch) to download complete sequences in FASTA format.
My code so far is as follows (parts of it adapted from this Stack Overflow answer):
from urllib.error import HTTPError
from Bio import Entrez
import time

Entrez.api_key = "<censored>"
Entrez.email = "<censored>"

db = "nuccore"
query = "txid2697049[organism:exp] AND biomol_genomic[prop] AND viruses[filter] AND 'USA'[Text Word] AND 'complete sequence'[Text Word]"

handle = Entrez.esearch(db=db, term=query)
record = Entrez.read(handle)
count = int(record['Count'])

handle = Entrez.esearch(db=db, term=query, retmax=count, usehistory="y")
record = Entrez.read(handle)
id_list = record['IdList']
webenv = record['WebEnv']

batch_size = 3
for start in range(0, count, batch_size):
    end = min(count, start+batch_size)
    print("Going to post accession numbers %i to %i" % (start+1, end))
    attempt = 0
    success = False
    while attempt < 3 and not success:
        attempt += 1
        post_xml = Entrez.epost(db, webenv=webenv, id=",".join(id_list))
        success = True
    search_results = Entrez.read(post_xml)
    webenv = search_results["WebEnv"]
    query_key = search_results["QueryKey"]

batch_size = 2
out_handle = open("sarscov2.txt", "w")
for start in range(0, count, batch_size):
    end = min(count, start+batch_size)
    print("Going to download record %i to %i" % (start+1, end))
    attempt = 0
    success = False
    while attempt < 3 and not success:
        attempt += 1
        try:
            fetch_handle = Entrez.efetch(db=db, rettype="fasta", retstart=start, retmax=batch_size, webenv=webenv, query_key=query_key)
            success = True
            time.sleep(10)
        except HTTPError as err:
            if 500 <= err.code <= 599:
                print("Received error from server %s" % err)
                print("Attempt %i of 3" % attempt)
                time.sleep(15)
            else:
                raise
    data = fetch_handle.read()
    fetch_handle.close()
    out_handle.write(data)
out_handle.close()
I'm repeatedly getting an "HTTP 504: Gateway Timeout" error when trying to run the epost line. I think this is because I'm sending too many requests, but I'm not sure how to go about fixing this. Could anyone point me in the right direction? Thank you!
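For reference, the epost call in the first loop joins the whole id_list on every iteration rather than the slice from start to end, so each request carries every ID at once. Below is a minimal sketch of posting in slices instead. The helper names id_batches, post_in_batches and entrez_post are hypothetical (they are not part of Biopython), and the sketch assumes that posting each slice separately is acceptable for the later efetch step; it has not been run against NCBI.

```python
from typing import Callable, Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def id_batches(ids: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(ids), batch_size):
        yield list(ids[start:start + batch_size])


def post_in_batches(ids: Sequence[str], batch_size: int,
                    post: Callable[[List[str]], T]) -> List[T]:
    """Call `post` once per slice and collect whatever it returns."""
    return [post(batch) for batch in id_batches(ids, batch_size)]


def entrez_post(batch: List[str], db: str = "nuccore") -> Tuple[str, str]:
    """Hypothetical wrapper: upload one slice of IDs with Bio.Entrez.epost.

    Assumes Entrez.email (and optionally Entrez.api_key) is already set.
    Returns the (WebEnv, QueryKey) pair reported for this slice.
    """
    # Imported lazily so the pure helpers above work without Biopython.
    from Bio import Entrez

    handle = Entrez.epost(db, id=",".join(batch))
    result = Entrez.read(handle)
    return result["WebEnv"], result["QueryKey"]
```

With these in place, something like post_in_batches(id_list, 500, entrez_post) would send 500 IDs per request and return one (WebEnv, QueryKey) pair per slice. The batch size of 500 is only a placeholder, not a documented limit.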