Esearch, Epost, and Efetch for Large Datasets in Biopython
12 months ago
Salem • 0

Hi, I'm working on a project that involves downloading over a million SARS-CoV-2 sequences from NCBI. As this will eventually be an open-source project, I'm trying to script as many steps as I can for reproducibility. Currently, I'm stuck trying to use Biopython's Entrez tools (Esearch, Epost, and Efetch) to download complete sequences in FASTA format.

My code so far is as follows (parts with help from this Stack Overflow answer):

from urllib.error import HTTPError
from Bio import Entrez
import time

Entrez.api_key = "<censored>"
Entrez.email = "<censored>"

db = "nuccore"
query = "txid2697049[organism:exp] AND biomol_genomic[prop] AND viruses[filter] AND 'USA'[Text Word] AND 'complete sequence'[Text Word]"

handle = Entrez.esearch(db=db, term=query)
record = Entrez.read(handle)

count = int(record['Count'])

handle = Entrez.esearch(db=db, term=query, retmax=count, usehistory="y")
record = Entrez.read(handle)

id_list = record['IdList']
webenv = record['WebEnv']

batch_size = 3
for start in range(0, count, batch_size):
    end = min(count, start+batch_size)
    print("Going to post accession numbers %i to %i" % (start+1, end))
    attempt = 0
    success = False
    while attempt < 3 and not success:
        attempt += 1
        post_xml = Entrez.epost(db, webenv=webenv, id=",".join(id_list))
        success = True
    search_results = Entrez.read(post_xml)


webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]


batch_size = 2
out_handle = open("sarscov2.txt", "w")
for start in range(0, count, batch_size):
    end = min(count, start+batch_size)
    print("Going to download record %i to %i" % (start+1, end))
    attempt = 0
    success = False
    while attempt < 3 and not success:
        attempt += 1
        try:
            fetch_handle = Entrez.efetch(db=db, rettype="fasta",
                                         retstart=start, retmax=batch_size,
                                         webenv=webenv, query_key=query_key)
            success = True
            time.sleep(10)
        except HTTPError as err:
            if 500 <= err.code <= 599:
                print("Received error from server %s" % err)
                print("Attempt %i of 3" % attempt)
                time.sleep(15)
            else:
                raise
    data = fetch_handle.read()
    fetch_handle.close()
    out_handle.write(data)
out_handle.close()

I'm repeatedly getting an HTTP 504: Gateway Timeout error when running the epost line. I think this is because I'm sending too many requests, but I'm not sure how to fix it. Could anyone point me in the right direction? Thank you!

eutils biopython entrez • 675 views

Instead of this, your best option would be to use NCBI Datasets: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/how-tos/virus/get-sars2-genomes/

Filter anything you need locally.
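
If you do want to stay with E-utilities, note two things about the posted code: the epost call joins the entire id_list on every loop iteration rather than one batch, and an esearch made with usehistory="y" already returns a WebEnv and QueryKey, so the epost step may not be needed at all. Below is a minimal sketch along those lines, not a tested solution for a million records. The helper names (batch_ranges, fetch_with_history), the default batch size of 200, and the retry wait are assumptions chosen for illustration; set Entrez.email and Entrez.api_key before calling it, as in the question.

```python
import time
from urllib.error import HTTPError


def batch_ranges(count, batch_size):
    """Yield (start, end) index pairs that cover range(count) in batches."""
    for start in range(0, count, batch_size):
        yield start, min(count, start + batch_size)


def fetch_with_history(db, query, out_path, batch_size=200, max_attempts=3):
    """Hypothetical helper: esearch once with history, then efetch in batches."""
    # Imported here so batch_ranges can be used without Biopython installed.
    from Bio import Entrez

    # usehistory="y" keeps the result set on the NCBI history server.
    # retmax=0 skips downloading the ID list, which is not needed here.
    handle = Entrez.esearch(db=db, term=query, usehistory="y", retmax=0)
    record = Entrez.read(handle)
    handle.close()

    count = int(record["Count"])
    webenv = record["WebEnv"]
    query_key = record["QueryKey"]

    with open(out_path, "w") as out_handle:
        for start, end in batch_ranges(count, batch_size):
            print("Going to download record %i to %i" % (start + 1, end))
            for attempt in range(1, max_attempts + 1):
                try:
                    fetch_handle = Entrez.efetch(
                        db=db, rettype="fasta", retmode="text",
                        retstart=start, retmax=batch_size,
                        webenv=webenv, query_key=query_key,
                    )
                    data = fetch_handle.read()
                    fetch_handle.close()
                    out_handle.write(data)
                    break
                except HTTPError as err:
                    # Retry only on server-side errors; re-raise otherwise
                    # or once the attempts are used up.
                    if not 500 <= err.code <= 599 or attempt == max_attempts:
                        raise
                    print("Received error from server %s" % err)
                    time.sleep(15)  # assumed wait before retrying
```

Calling fetch_with_history("nuccore", query, "sarscov2.txt") with the query string from the question would write the FASTA records to one file. For a result set this large, the Datasets route above is still likely to be the more practical one.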
