Question: Retrieving FASTA Sequences From NCBI Using Biopython
Ankur wrote (6.0 years ago):

I have a file with GI numbers and would like to get the corresponding FASTA sequences from NCBI.

from Bio import Entrez
import time
Entrez.email ="eigtw59tyjrt403@gmail.com"
f = open("C:\\bioinformatics\\gilist.txt")
for line in iter(f):
    handle = Entrez.efetch(db="nucleotide", id=line, retmode="xml")
    records = Entrez.read(handle)
    print ">GI "+line.rstrip()+" "+records[0]["GBSeq_primary-accession"]+" "+records[0]["GBSeq_definition"]+"\n"+records[0]["GBSeq_sequence"]
    time.sleep(1) # to make sure not many requests go per second to ncbi
f.close()

This script runs fine but I suddenly get this error message after a few sequences.

Traceback (most recent call last):
  File "C:/Users/Ankur/PycharmProjects/ncbiseq/getncbiSeq.py", line 7, in <module>
    handle = Entrez.efetch(db="nucleotide", id=line, retmode="xml")
  File "C:\Python27\lib\site-packages\Bio\Entrez\__init__.py", line 139, in efetch
    return _open(cgi, variables)
  File "C:\Python27\lib\site-packages\Bio\Entrez\__init__.py", line 455, in _open
    raise exception
urllib2.HTTPError: HTTP Error 500: Internal Server Error

Of course I can use http://www.ncbi.nlm.nih.gov/sites/batchentrez but I am trying to create a pipeline and would like something automated.

How can I prevent NCBI from "kicking me out"?

ncbi python biopython entrez • 16k views
Jeroen Van Goey (Ghent, Belgium) wrote (6.0 years ago):


Are you following the Entrez Usage Guidelines? Specifically:

  • For any series of more than 100 requests, do this at weekends or outside USA peak times. This is up to you to obey.
  • Make no more than three requests every second (relaxed from at most one request every three seconds in early 2009). This is automatically enforced by Biopython.
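Biopython enforces the rate limit for you, but if part of a pipeline talks to NCBI outside Biopython, a minimal client-side throttle could look like the sketch below. The `Throttle` class name and `max_per_sec` parameter are just illustrative, not anything from Biopython:

```python
import time


class Throttle(object):
    """Sleep just enough to keep at most max_per_sec calls per second."""

    def __init__(self, max_per_sec=3):
        self.min_interval = 1.0 / max_per_sec
        self.last = 0.0

    def wait(self):
        # If the previous call was too recent, sleep for the remainder
        # of the minimum interval before letting the next request go out.
        delta = time.time() - self.last
        if delta < self.min_interval:
            time.sleep(self.min_interval - delta)
        self.last = time.time()


throttle = Throttle(max_per_sec=3)
# Before each Entrez request:
#     throttle.wait()
#     handle = Entrez.efetch(...)
```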

Also, read the paragraph about Minimizing the Number of Requests:

If a task requires searching for and/or downloading a large number of records, it is much more efficient to use the Entrez History to upload and/or retrieve these records in batches rather than using separate requests for each record. [...] Many thousands of IDs can be uploaded using a single EPost request, and several hundred records can be downloaded using one EFetch request.

Also some unsolicited Python advice:

Use the with statement to open a file; then you don't have to explicitly close it. And a file object is already an iterable, so you don't need to wrap it in iter():

with open("C:\\bioinformatics\\gilist.txt") as f:
    for line in f:
        # do your stuff
Hayssam (France) wrote (6.0 years ago):


I second Jeroen's suggestions; here's some code to get you started with the batch mode:


from Bio import Entrez
Entrez.email = "eigtw59tyjrt403@gmail.com"
# Upload the whole list of IDs beforehand with a single EPost request
gis = [166706892, 431822405, 431822402]
request = Entrez.epost("nucleotide", id=",".join(map(str, gis)))
result = Entrez.read(request)
webEnv = result["WebEnv"]
queryKey = result["QueryKey"]
# A single EFetch request then retrieves all the posted records
handle = Entrez.efetch(db="nucleotide", retmode="xml", webenv=webEnv, query_key=queryKey)
for r in Entrez.parse(handle):
    # Grab the GI from the record's list of other sequence IDs
    try:
        gi = int([x for x in r['GBSeq_other-seqids'] if "gi" in x][0].split("|")[1])
    except (IndexError, ValueError):
        gi = None
    print ">GI " + str(gi) + " " + r["GBSeq_primary-accession"] + " " + r["GBSeq_definition"] + "\n" + r["GBSeq_sequence"][0:20]


Note that there might be a limit on the number of sequences you can retrieve in one batch. You can work around this by splitting your input set of IDs into subsets of e.g. 200 IDs.
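The splitting itself is a few lines. A sketch of a batching helper (the name `chunks` is just illustrative):

```python
def chunks(ids, size=200):
    """Yield successive sublists of `ids` with at most `size` elements each."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]


# Post and fetch each batch separately, e.g.:
# for batch in chunks(gis, 200):
#     request = Entrez.epost("nucleotide", id=",".join(map(str, batch)))
#     ...
```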

Martombo (Seville, ES) wrote (6.0 years ago):


My suggestion is to handle the exception directly and wait longer if you get "kicked out":


from Bio import Entrez
from urllib2 import HTTPError
import time
Entrez.email ="eigtw59tyjrt403@gmail.com"
f = open("C:\\bioinformatics\\gilist.txt")
for line in f:
    try:
        handle = Entrez.efetch(db="nucleotide", id=line, retmode="xml")
    except HTTPError:
        time.sleep(20)
        handle = Entrez.efetch(db="nucleotide", id=line, retmode="xml")
    records = Entrez.read(handle)
    print ">GI "+line.rstrip()+" "+records[0]["GBSeq_primary-accession"]+" "+records[0]["GBSeq_definition"]+"\n"+records[0]["GBSeq_sequence"]
    time.sleep(1) # to make sure not many requests go per second to ncbi
f.close()

I tried with 20 seconds and it's working for me. If such a long wait is not suitable for you, you can try to minimize it.
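This retry-once idea can be generalized into a small helper that doubles the wait on each failed attempt. Everything here (`fetch_with_retry`, `base_delay`, the `errors` parameter) is illustrative, not part of Biopython:

```python
import time


def fetch_with_retry(fetch, retries=3, base_delay=20, errors=(Exception,)):
    """Call fetch(); on a listed error, sleep and retry with a doubled delay.

    Re-raises the error once all attempts are exhausted.
    """
    delay = base_delay
    for attempt in range(retries):
        try:
            return fetch()
        except errors:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
            delay *= 2


# Usage in the loop above (Python 2 / Biopython context):
# from urllib2 import HTTPError
# handle = fetch_with_retry(
#     lambda: Entrez.efetch(db="nucleotide", id=line, retmode="xml"),
#     errors=(HTTPError,))
```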
