Question

Looking For Faster Blastp-Like Program?

3

11.2 years ago

Leszek 4.2k

I am running a web service where users can query a database of 10M+ proteins by sequence similarity. However, BLAST is too slow for this (several minutes per query).
Can you recommend faster alternatives? BLAT is much faster, but loading all the proteins for every query is not efficient...
Or maybe some blastp tweaking would help?
I can sacrifice sensitivity, as I'm only looking for very similar matches (>90% identity). It would be great if I could easily retrieve protein sequences from the database (like fastacmd in BLAST), so I don't have to store each sequence twice.
Note that I'm bound to 1 CPU. Surprisingly, increasing the word size (-W 7) didn't improve blastp performance.

UPDATE
In the end, I came up with my own solution, based on k-mers stored in MySQL and BLATing only a subset of the proteins. For a single query against a database of 13M sequences, it finds similar sequences in seconds (I haven't tested this properly, but matches above 50% are captured easily). In contrast, BLASTp takes several minutes (12-15 min), and other solutions like LAST or Vmatch didn't go below 1 min.
Let me know if anyone is interested. It's still quite simplistic, but someone may benefit :)
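
For readers who want to try the idea, here is a minimal sketch of a k-mer prefilter of the kind described in the update. It is not the original implementation: Python's built-in sqlite3 stands in for MySQL, the table names, the k-mer length (K = 5) and the min_shared threshold are all assumptions made for illustration, and the final BLAT step on the selected subset is left out.

```python
import sqlite3
from collections import Counter

K = 5  # k-mer length; an assumed value, not taken from the post


def kmers(seq, k=K):
    """Return the set of distinct k-mers in a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(conn, proteins, k=K):
    """Store each sequence once, plus a k-mer -> protein id lookup table."""
    # Hypothetical schema: 'protein' holds the sequences, 'kmer' the index.
    conn.execute("CREATE TABLE protein (id TEXT PRIMARY KEY, seq TEXT)")
    conn.execute("CREATE TABLE kmer (kmer TEXT, protein_id TEXT)")
    for pid, seq in proteins.items():
        conn.execute("INSERT INTO protein VALUES (?, ?)", (pid, seq))
        conn.executemany(
            "INSERT INTO kmer VALUES (?, ?)",
            [(km, pid) for km in kmers(seq, k)],
        )
    conn.execute("CREATE INDEX kmer_idx ON kmer (kmer)")
    conn.commit()


def candidates(conn, query, k=K, min_shared=2, limit=100):
    """Rank proteins by the number of distinct k-mers shared with the query."""
    counts = Counter()
    for km in kmers(query, k):
        rows = conn.execute(
            "SELECT protein_id FROM kmer WHERE kmer = ?", (km,)
        )
        counts.update(pid for (pid,) in rows)
    hits = [(pid, n) for pid, n in counts.most_common() if n >= min_shared]
    return hits[:limit]


def fetch_sequences(conn, ids):
    """Pull candidate sequences back out, e.g. to write a FASTA subset."""
    found = {}
    for pid in ids:
        row = conn.execute(
            "SELECT seq FROM protein WHERE id = ?", (pid,)
        ).fetchone()
        if row is not None:
            found[pid] = row[0]
    return found


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_index(conn, {
        "p1": "MKTAYIAKQRQISFVKSHFSRQ",
        "p2": "MKTAYIAKQRQISFVKSHFSRA",
        "p3": "GGGGGSSSSSPPPPPWWWWW",
    })
    hits = candidates(conn, "MKTAYIAKQRQISFVKSHFSRQ")
    print(hits)
    print(fetch_sequences(conn, [pid for pid, _ in hits]))
```

Only the proteins returned by candidates() would then be passed to the slower aligner. Keeping the sequences in the same store as the k-mer table also covers the wish to avoid storing every sequence twice.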

protein webservice similarity sequence • 5.2k views

ADD COMMENT • link 10.4 years ago by Leszek 4.2k

0

Why do you need to load all proteins every time with BLAT? Since you're running a webservice, why not run a BLAT server?

ADD REPLY • link 11.2 years ago by fo3c ▴ 450

0

Is it possible to run a BLAT server for proteins?

ADD REPLY • link 11.2 years ago by Leszek 4.2k

0

I think so. Isn't that how the UCSC blat works? http://genome.ucsc.edu/FAQ/FAQblat.html#blat5

ADD REPLY • link 11.2 years ago by fo3c ▴ 450

0

Yep, but that's for DNA, not for protein... I cannot run a server for amino acid sequences :/

ADD REPLY • link 11.2 years ago by Leszek 4.2k

0

You can blat amino acid sequences with the same ease on UCSC. Hence my belief that it is possible to run a protein blat server.

ADD REPLY • link 11.2 years ago by fo3c ▴ 450

0

Then I would appreciate it if you could suggest how to do it. I have tried gfServer (BLAT34) but cannot make it work with proteins, as it requires .2bit (handles only DNA) or .nib (one sequence per file).

ADD REPLY • link 11.2 years ago by Leszek 4.2k

score 1 · Answer 1 · 2013-02-20

1

11.2 years ago

Leszek 4.2k

USEARCH is 20-250x faster than blastp. It's promising, but I would need to get a licence for the x64 version (4 GB limit in the free version :/).
lastal is extremely fast (>10x faster than USEARCH, but much less sensitive). I still have to compile all the proteins, though, to check performance on the entire db.

ADD COMMENT • link 11.2 years ago by Leszek 4.2k

0

What is the memory usage like? When I experimented with UCLUST I found the memory usage to be prohibitive for any moderate-large analysis, but I can't remember how USEARCH is on memory.

ADD REPLY • link 11.2 years ago by SES 8.6k

0

The protein FASTA file is 5.3 GB (13M+ sequences). usearch -search_local fails after loading around 66% of the proteins. Unfortunately, memory is hard-limited to 4 GB in the free 32-bit USEARCH licence.

ADD REPLY • link 11.2 years ago by Leszek 4.2k

0

That is unfortunate. What about Vmatch? It uses a persistent index and is much more memory efficient, and is very fast.

ADD REPLY • link 11.2 years ago by SES 8.6k

score 0 · Answer 2 · 2013-02-20

Do you have to use the full database for every query? If it is possible to select a subset of the database based on the query, that may be a simple improvement. Other than splitting the input or the database, I think your options are limited, since you are bound to 1 CPU. You could try GPU-BLAST if you have GPUs available, but that may not be appropriate in your situation. In my experience, GPU-BLAST does not offer a great increase in performance compared to using multiple CPUs, so I think the best thing to do would be to rethink how you perform the search or figure out how to run the jobs on multiple CPUs.

score 0 · Answer 3 · 2013-02-20

0

11.2 years ago

Sebastian Kurscheid ▴ 300

HMMER - http://hmmer.janelia.org/

ADD COMMENT • link 11.2 years ago by Sebastian Kurscheid ▴ 300

0

This is for searching sequences against a (database of) profile HMMs, or vice versa, and while more sensitive, it will likely be slower than blastp.

ADD REPLY • link 11.2 years ago by SES 8.6k

0

jackhmmer is included for searching protein databases with protein sequences. In my experience it is at least as fast as blastp, and for me personally it was easier to configure to run under OpenMPI. I'm not sure how easy it is to deploy in a web server environment.