How do I effectively automate BLAST searches without using a local database?
3.1 years ago
geosmin ▴ 20

I have snippets of protein sequences and I need to find out to which accession numbers of the nr database they belong.

So far I tried to automate this process in the following way:

  • accessing the NCBI web server directly via the NCBIWWW module of Biopython's Bio.Blast package
  • running the BLAST+ program via the NcbiblastpCommandline wrapper in the Bio.Blast.Applications module of Biopython with the -remote argument

But both ways basically take forever. Do any of you have an idea how I can automate this without having to download the nr database of NCBI? Or is this really the only way?

  • How many snippets do you have?
  • how long is "forever"?
  • does each snippet take "forever"?
  • what would you consider "fast enough"?
  • might changing parameters alter performance?
  • must you really search all of nr? what is your real application?

I have around 600 snippets for a phylogenetic analysis of cyanobacterial proteins. Right now I'm running my code and so far it's taking 30 min and longer to create a single file. However, the first file was created pretty fast. It seems like a single run is relatively fast, but as soon as I chain many searches in a loop each of them takes really long.

This is my code

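from Bio.Blast.Applications import NcbiblastpCommandline


def blast_remote(query_path, output):
    """Run one remote blastp search against nr and write <output>.xml."""
    blastp_cline = NcbiblastpCommandline(
        db="nr",
        evalue=0.05,
        gapextend=1,
        gapopen=11,
        out=output + ".xml",
        outfmt=5,  # XML output
        parse_deflines=True,
        query=query_path,
        remote=True,  # search on the NCBI servers instead of a local database
        word_size=6,
        max_target_seqs=2,
    )
    stdout, stderr = blastp_cline()


# blast_results is defined elsewhere (iterrows() suggests a pandas DataFrame
# with "Organism" and "Sequence" columns).
for count, (index, value) in enumerate(blast_results.iterrows(), start=1):
    # Each iteration overwrites query.fsa with a single sequence.
    with open("query.fsa", "w") as file:
        file.write(">" + value["Organism"] + "\n" + value["Sequence"] + "\n")
    blast_remote("query.fsa", str(count))
    print(count, "files written.")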

What parameters do you think could be changed to improve performance?

I guess otherwise I could just try downloading a subset of the nr database, though I'll have to figure out how.


Be sure not to overload the NCBI servers with your requests. As Joe also pointed out, if you submit too many concurrent requests you might get blacklisted by NCBI. You can avoid this to some extent by registering yourself at NCBI, but even then you won't be allowed to submit many requests.


The only thing you can do is try to adjust parameters so that your searches complete more quickly. Your E-value threshold is fairly permissive for one thing (the BLAST+ default is actually 10, but a stricter cutoff such as 1E-6 is common when only strong hits are needed), and you probably don't need to specify the word size or alignment parameters unless you have a very specific reason.

I think the NCBI polling rate is 5 queries per second or something even for guests, so you could parallelise your code to run up to 5 concurrent searches which will bring your overall run time down a fair bit but only to a point.
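
To make the parallelisation idea concrete, here is a minimal sketch of running searches with a hard cap on concurrency. The names `run_bounded` and `search` are hypothetical, and the default of 3 workers is an arbitrary assumption, not a documented NCBI limit; check NCBI's current usage guidelines before choosing a value.

```python
from concurrent.futures import ThreadPoolExecutor


def run_bounded(search, queries, max_workers=3):
    """Apply `search` to every query with at most `max_workers` running at once.

    `search` is any callable that performs one search (hypothetical here);
    results come back in the same order as `queries`.
    """
    # max_workers=3 is an assumed, conservative cap, not an NCBI-published figure.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(search, queries))
```

With the code from the question, `search` could be a small function that calls `blast_remote` on one query file. Each concurrent search would need its own query file rather than a shared query.fsa.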


I am unfamiliar with the Python wrapper for the BLAST+ command line. Regardless...

I question your use of a loop and expect you will get better performance overall by removing it.

If I were calling BLAST+ with -remote from the command line, I would typically not call it once per input sequence but rather pass all input sequences in a single multi-FASTA file and expect combined results in a single output file. You might expect improved performance because a large portion of BLAST runtime is typically spent loading the database index into RAM, and running many separate queries risks (probably) reloading the same index each time, which is a lot of repeated I/O. I say probably because the OS may cache it, though I expect NCBI's servers are not caching nr (I expect this is untenable anyway given nr's growth; caching is really only relevant to smaller BLAST databases).

Another related consideration is covered at https://www.ncbi.nlm.nih.gov/books/NBK279668/#usermanual.Concatenation_of_queries
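
As a rough illustration of the single multi-FASTA approach, the sketch below writes all snippets to one file that can then be passed to a single search. The function name `write_multifasta` and the `q<N>_` header prefix are my own assumptions; the prefix is only there to keep query IDs unique when several snippets come from the same organism.

```python
def write_multifasta(records, path):
    """Write (identifier, sequence) pairs to one FASTA file; return the record count."""
    count = 0
    with open(path, "w") as handle:
        for count, (identifier, sequence) in enumerate(records, start=1):
            # BLAST takes the query ID up to the first whitespace, so spaces
            # are replaced and a running number keeps every ID unique.
            header = f"q{count}_{identifier.strip().replace(' ', '_')}"
            handle.write(f">{header}\n{sequence.strip()}\n")
    return count
```

For a DataFrame like the one in the question, something along the lines of `write_multifasta(zip(blast_results["Organism"], blast_results["Sequence"]), "queries.fsa")` should produce one file for all ~600 snippets.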

I'm not sure where it was originally published, but BLAST (Basic Local Alignment Search Tool) Chapter 12, Hardware and Software Optimizations, has some good tips along these lines.

On a side note, I question your choice to return XML results. I recommend you look into outfmt=6 and its variants, which return tabular data that is easily parsed and, in my experience, contains everything needed about the HSP results for most applications.

Finally, on another related note, now that I know your application, I might question your choice of database to search. You might still run a -remote search but learn how to limit it by taxonomy, see "Limiting a Search by taxonomy" (note: I don't know whether the Python wrapper exposes this functionality; just use command-line BLAST if not).
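
To show how little code tabular output needs, here is a small sketch that reads default `-outfmt 6` lines and keeps the best-scoring hit per query. It assumes the twelve default columns (no custom field list was requested); the name `best_hits` is hypothetical.

```python
import csv

# Default column order of -outfmt 6 when no custom fields are requested.
DEFAULT_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(lines):
    """Return {query ID: hit} keeping the highest bit score for each query."""
    best = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines (as in -outfmt 7)
        hit = dict(zip(DEFAULT_FIELDS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best
```

Passing an open file handle works, e.g. `best_hits(open("hits.tsv"))`; the subject accession for each query is then in the `sseqid` field.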
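
For remote searches, one way to restrict by taxonomy is the `-entrez_query` option of command-line BLAST+, which only applies together with `-remote`. The sketch below just builds the argument list; the function name `remote_blastp_args`, the file names, and the 1e-6 default cutoff are assumptions for illustration.

```python
def remote_blastp_args(query_path, out_path, entrez_query, evalue=1e-6):
    """Build the argument list for one remote blastp run limited by an Entrez query."""
    return [
        "blastp",
        "-db", "nr",
        "-remote",                       # run on the NCBI servers
        "-query", query_path,            # may be a multi-FASTA file
        "-entrez_query", entrez_query,   # restricts the searched subset of nr
        "-evalue", str(evalue),          # assumed cutoff, adjust as needed
        "-outfmt", "6",                  # tabular output
        "-out", out_path,
    ]
```

The list can be handed to `subprocess.run(..., check=True)`. For cyanobacteria the query string might look like `"txid1117[Organism:exp]"`, assuming 1117 is the right taxonomy ID; verify it in NCBI Taxonomy before relying on it.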


Blast2GO may solve this issue.


As Lieven pointed out - there is nothing else you can really do.

By registering for an account I think you can increase the polling rate, but NCBI will set limits on the number of queries that you can send per unit time so that their network is not being hammered. No API/remote implementation will overcome the limits NCBI set.

3.1 years ago

If -remote is not helping, then there is nothing more you can do to speed it up, I'm afraid. The only alternative then is indeed to download the whole database and run the BLAST searches locally.


It is possible to limit the subset of nr searched, in a way relevant to the OP's application, by limiting the search by taxonomy.
