Biopython - AttributeError when downloading from GenBank
3.5 years ago
Wilber0x ▴ 50

I have a list of around 200 plastid genomes that I want to download from GenBank using Biopython and then combine into a single .gb file.

Here is the code I am using to do this:

from Bio import Entrez

Entrez.email = "you@example.com"  # NCBI requires a contact address

out_handle = open(filename, "w")
for genbank_id in genbankIDs:
    net_handle = Entrez.efetch(
        db="nucleotide", id=genbank_id, rettype="gbwithparts", retmode="text"
    )
    out_handle.write(net_handle.read())
    net_handle.close()  # close each response handle inside the loop
out_handle.close()
print("Saved")

where genbankIDs is the list of accession numbers of the sequences I want to download from GenBank.

However, this only works for the first 20 accessions. I get this error message:

Traceback (most recent call last):
  File "fetchFromGenbank.py", line 25, in <module>
    db="nucleotide", id=genbankIDs[i], rettype="gbwithparts", retmode="text"
  File "/opt/anaconda2/lib/python2.7/site-packages/Bio/Entrez/__init__.py", line 195, in efetch
    return _open(cgi, variables, post=post)
  File "/opt/anaconda2/lib/python2.7/site-packages/Bio/Entrez/__init__.py", line 564, in _open
    and exception.status // 100 == 4:
AttributeError: 'HTTPError' object has no attribute 'status'

How can I solve this? Is it NCBI rate-limiting or timing out, or a problem caused by my using Python 2.7 rather than a more recent version?

biopython software error

Wilber0x : Biostars' built-in SPAM protection (we need to have this in place, sorry) does not allow HTTP links in titles (I think it was interpreting "HTTPError" in your post title as a link). I have edited that out, so hopefully your post will not be automatically flagged/deleted as SPAM now. I have also reinstated your account, so you should be able to respond.


Thank you for your help

3.5 years ago
GenoMax 141k

Have you signed up for an NCBI API key? If not, you should do that first. Since you are doing this via a script, build in a delay between your queries to ensure you don't get flagged by the NCBI server for sending too many queries in a short time.
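A minimal sketch of that pattern. The `fetch_one` callable and the `delay` value are illustrative stand-ins, not part of the original script; in practice `fetch_one` would wrap the `Entrez.efetch` call, with `Entrez.api_key` set beforehand:

```python
import time

def fetch_with_delay(accessions, fetch_one, delay=1.0):
    """Fetch each accession in turn, sleeping between requests so the
    script stays under NCBI's rate limit (10 requests/second with an
    API key, 3 requests/second without one)."""
    records = []
    for i, acc in enumerate(accessions):
        records.append(fetch_one(acc))
        if i < len(accessions) - 1:
            time.sleep(delay)  # pause between queries, not after the last one
    return records
```

For example, `fetch_one = lambda acc: Entrez.efetch(db="nucleotide", id=acc, rettype="gbwithparts", retmode="text").read()`.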


I have built in a time delay following your suggestion, so only one query is submitted per second. This improves the number of sequences downloaded to 140, but I still get the same error message.


Can you try increasing the delay? Since you are requesting gbwithparts, the server has to assemble the full record to return the right sections. I would say try one query every 15 or 30 seconds (or longer).


I have tried increasing the delay to 60s, and got the same number downloaded as when I used a 1.5s delay. Perhaps I should just do it in batches of 140 rather than all at once.
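The batching idea can be sketched as below. `Entrez.efetch` accepts a comma-separated list of IDs, so each batch could in principle go out in a single request; the batch size and function name here are illustrative:

```python
def batches(ids, size):
    """Split the accession list into consecutive chunks of at most `size`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

# Each batch could then be fetched with one request, e.g. (sketch only):
# Entrez.efetch(db="nucleotide", id=",".join(batch),
#               rettype="gbwithparts", retmode="text")
```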

3.5 years ago
Shred ★ 1.4k

Use an Entrez API key and a time.sleep() delay in your script. Usually 10 seconds between requests is enough.


I used an Entrez API key, and have incorporated delays into my script. Whether the delay is 1.5s or 60s between requests, I still get the same error after 150 plastid genomes.


Identify whether a specific accession is causing the problem and remove it.
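A sketch of how one might pinpoint the offending accession. The `fetch` argument stands in for the `Entrez.efetch` call and the function name is illustrative; the idea is simply to try each accession on its own and record the ones that raise HTTPError:

```python
from urllib.error import HTTPError

def find_bad_accessions(accessions, fetch):
    """Try each accession individually and collect the ones whose
    fetch raises HTTPError, so they can be inspected or removed."""
    bad = []
    for acc in accessions:
        try:
            fetch(acc)
        except HTTPError as err:
            bad.append((acc, err.code))
    return bad
```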


This was the problem, thanks for the help!
