Hi everyone,
I have a file with about 77,000 3'-UTR regions, and I used Entrez.efetch to get the sequence of each region. I find the speed slow (about 0.5 sec per sequence).
My code looks like this:
from Bio import Entrez, SeqIO
from Bio.SeqRecord import SeqRecord
Entrez.email = "A.N.Other@example.com"    # set once, outside the loop

# utr3hg19.txt contains all human 3'-UTR coordinates, one 3'-UTR per line;
# tab-separated columns 2,3,4,5 hold gi, strand, start, end
with open("utr3hg19.txt") as f:
    data = f.readlines()

with open("utr3.fasta", "w") as out:      # sequences will be written into this file
    for line in data[1:]:                 # skip the header line
        temp = line.rstrip("\n").split("\t")
        handle = Entrez.efetch(db="nucleotide",
                               id=temp[1],
                               rettype="fasta",
                               strand=temp[2],
                               seq_start=int(temp[3]),  # start is column 4
                               seq_stop=int(temp[4]))   # end is column 5
        record = SeqIO.read(handle, "fasta")
        handle.close()
        r = SeqRecord(record.seq, id=line.strip(), description="")
        SeqIO.write(r, out, "fasta")
Is this due to my bad coding, or is it a network problem? BTW, I run the code on a 12-core Linux server.
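For what it's worth, one likely factor: Bio.Entrez deliberately throttles requests to at most three per second to respect NCBI's usage policy, so roughly a third of a second of each fetch is enforced waiting rather than your code. A minimal sketch, assuming you register your own NCBI API key (the key string here is a placeholder) to raise that limit, plus a small hypothetical helper in case your strand column uses '+'/'-' (efetch expects the integer codes 1/2):

```python
# Assumption: Biopython >= 1.70, which supports Entrez.api_key.
# from Bio import Entrez
# Entrez.email = "A.N.Other@example.com"
# Entrez.api_key = "MY_NCBI_API_KEY"   # placeholder key; lifts the limit to ~10 requests/s

def map_strand(symbol):
    """Translate a strand column value to the 1 (plus) / 2 (minus) codes efetch expects."""
    return {"+": 1, "-": 2, "1": 1, "2": 2}[str(symbol).strip()]
```

Even with an API key, 77,000 single-sequence round trips are dominated by network latency, so extracting the regions from a locally stored genome will always be much faster.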
+1 and also see this post: Batch Fetching Fasta Sequences From Bed File
Thanks a lot, Peter. I used to search locally. Yesterday I suddenly wondered whether this could be done with Biopython from NCBI, in case I meet a species whose genome is not stored locally (I'm too lazy to download the genome :( )
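For anyone landing here later: when the genome is available locally as a FASTA file, slicing it directly avoids the network entirely. A minimal sketch, assuming 1-based inclusive coordinates and '+'/'-' strands (the file and function names are made up for illustration):

```python
from Bio import SeqIO

def extract_region(genome, chrom, start, end, strand):
    """Return the [start, end] region (1-based, inclusive) from an indexed FASTA,
    reverse-complemented when strand is '-'."""
    seq = genome[chrom].seq[start - 1:end]   # shift to 0-based half-open slicing
    return str(seq.reverse_complement() if strand == "-" else seq)

# Hypothetical usage: index the genome once, then slice each 3'-UTR locally.
# genome = SeqIO.index("hg19.fa", "fasta")
# utr = extract_region(genome, "chr1", 1000, 1200, "-")
```

SeqIO.index only parses one record into memory at a time, so a full genome FASTA is fine; build the index once and reuse it across all 77,000 regions rather than reopening it per call.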