Using Biopython for BLAST Queries is very slow
9.9 years ago
ncowen • 0

I'm new to Biopython, and to programming in general, but I am trying to write a small script that submits a large number of RNA sequences as BLAST queries. Right now I'm using Biopython's qblast, but at certain times of day a single query takes 7-8 minutes. Is there a better way to accomplish this, other than running BLAST locally? I've been told that we would like to avoid that as much as possible.

My code currently looks something like this:

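import time
from Bio.Blast import NCBIWWW

# `sequences` is assumed to be an iterable of query sequence strings.
for sequence in sequences:
    serverWasDown = False
    # Keep retrying the same query until the server answers.
    while True:
        try:
            resultHandle = NCBIWWW.qblast("blastn", "nr", sequence)
            if serverWasDown:
                print("Server is up and running again.")
            break
        except Exception:
            print("Server connection lost, waiting 10 seconds to try again. Please make sure the computer has a working network connection.")
            serverWasDown = True
            time.sleep(10)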
9.9 years ago

A web service may throttle requests or go down at random. You should find a local installation of BLAST and submit batch queries. Even with NCBIWWW, you should be able to submit one batch query more easily than a thousand individual ones.

You said you don't want to run a local BLAST database, but it's really quite easy on Linux, and you get to customize the reference archives. You could run it against your own reference genome to get identical coordinate systems.
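As a rough illustration of how little is involved, here is a minimal sketch that only builds the two BLAST+ command lines (makeblastdb to index a FASTA file, then blastn to search it). The file names, the database name, and the helper function are hypothetical placeholders, not something from this thread, and nothing is executed unless you uncomment the subprocess call.

```python
import shlex


def local_blast_commands(reference_fasta, query_fasta,
                         db_name="my_reference", out_xml="results.xml"):
    """Return the makeblastdb and blastn command lines as argument lists.

    All names here are illustrative assumptions; adjust them to your data.
    """
    # Index a nucleotide FASTA file as a local BLAST database.
    make_db = ["makeblastdb", "-in", reference_fasta,
               "-dbtype", "nucl", "-out", db_name]
    # Search every sequence in the query file against that database,
    # writing XML output (-outfmt 5).
    search = ["blastn", "-query", query_fasta, "-db", db_name,
              "-outfmt", "5", "-out", out_xml]
    return make_db, search


if __name__ == "__main__":
    for cmd in local_blast_commands("reference_genome.fasta", "queries.fasta"):
        print(shlex.join(cmd))
        # To actually run it (requires BLAST+ on your PATH):
        # import subprocess; subprocess.run(cmd, check=True)
```

Because the query file can hold many sequences, one blastn call handles the whole batch.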


I've been thinking that there must be a way to submit a batch query over NCBIWWW but I just can't find out exactly how to do that. I don't suppose you or anyone else has any ideas?


The NCBIWWW docs at biopython.org show a qblast function with a query_file parameter, but it runs as an HTTP GET with no provision for uploading the file. It looks broken, unless the URL encoder has some hidden voodoo. So try passing the query as multi-FASTA text instead:

>1
SEQUENCE1
>2
SEQUENCE2

Beware that packing the sequences into a GET request might cap out at under a kilobyte.
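A minimal sketch of the multi-FASTA idea, assuming the sequences are already in a Python list. The helper names and the batch size are made up for illustration, and whether the server accepts a given batch size is untested here, so start small.

```python
def to_multifasta(sequences, start=1):
    """Join sequences into multi-FASTA text with numeric headers (>1, >2, ...)."""
    return "\n".join(
        f">{i}\n{seq}" for i, seq in enumerate(sequences, start=start)
    )


def batches(sequences, batch_size):
    """Yield multi-FASTA strings holding at most batch_size sequences each.

    Header numbers continue across batches so hits can be traced back
    to the position of the sequence in the original list.
    """
    for offset in range(0, len(sequences), batch_size):
        chunk = sequences[offset:offset + batch_size]
        yield to_multifasta(chunk, start=offset + 1)


if __name__ == "__main__":
    example = ["ACGUACGU", "GGCCAAUU", "UUAACCGG"]
    for text in batches(example, batch_size=2):
        print(text)
        print("---")
        # Each string could then be passed as the sequence argument, e.g.
        # NCBIWWW.qblast("blastn", "nr", text), instead of one call per sequence.
```

Keeping batches small also makes a failed request cheaper to retry.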


Unfortunately, I haven't been able to get that parameter to work (broken, as you say), so I am instead setting up the nt and nr databases locally as you suggested. I really appreciate the help!

9.9 years ago
Peter 6.0k

"Is there a better way to accomplish this, other than running BLAST locally?"

Not really. Either run BLAST locally (ideally on a cluster, depending on what you mean by a 'large number' of queries), on someone else's system, or at the NCBI.

As an alternative to using QBLAST via Biopython, you could use the standalone BLAST+ tools with the -remote option to send the queries to the NCBI. The overall speed is likely much the same, but hopefully the BLAST+ tools would handle most of the network errors?
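For reference, a sketch of what the -remote route might look like when driven from Python. The file names and the two helper functions are hypothetical, and the choice of the nt database is an assumption (it is the usual nucleotide database for blastn); the command is only built here, not run.

```python
import subprocess


def remote_blast_command(query_fasta, db="nt", out_xml="remote_results.xml"):
    """Build a blastn command that sends the search to the NCBI via -remote.

    The defaults are illustrative assumptions, not values from this thread.
    """
    return ["blastn", "-query", query_fasta, "-db", db,
            "-remote", "-outfmt", "5", "-out", out_xml]


def run_remote_blast(query_fasta, **kwargs):
    """Run the remote search; raises CalledProcessError if blastn fails."""
    cmd = remote_blast_command(query_fasta, **kwargs)
    subprocess.run(cmd, check=True)  # requires BLAST+ on your PATH
    return cmd


if __name__ == "__main__":
    print(" ".join(remote_blast_command("queries.fasta")))
```

The query file can contain many sequences, so this also covers the batch case without any HTTP handling on your side.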

I would recommend using BLAST+ on your own computer/cluster, which would mean downloading the NR database.


I really appreciate your help - I too came across the BLAST+ -remote option, and as you said, it seemed to have similar speeds. I am instead setting up the database locally, as has been suggested. Thank you for taking the time to answer my question!
