I am running a web service where users can query a database of 10M+ proteins by sequence similarity. However, BLAST is too slow (several minutes per query).
Can you recommend some faster alternatives? BLAT is much faster, but loading all proteins for every query is inefficient...
Or maybe some blastp tweaking?
I can sacrifice sensitivity, as I'm only looking for very similar matches (>90% identity). It would be great if I could easily retrieve protein sequences from the db, so I don't have to store each sequence twice (like fastacmd in BLAST).
Note that I'm limited to 1 CPU. Surprisingly, increasing the word size (-W 7) didn't improve blastp performance.
In the end, I came up with my own solution: k-mers are stored in MySQL, and only the subset of proteins sharing k-mers with the query is passed to BLAT. It can find similar sequences (I haven't tested this properly, but >50% are captured easily) in a database of 13M sequences for a single query in seconds. In contrast, blastp takes several minutes (12-15 min), and other solutions like LAST or Vmatch didn't go below 1 min.
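To illustrate the idea, here is a minimal sketch of a k-mer prefilter. It is not my actual code: an in-memory dict stands in for the MySQL k-mer table, and the names `K`, `min_shared` and `limit`, as well as their values, are made up for the example. It only shows how a k-mer lookup can shrink the database to a short list of candidates before running an aligner on them.

```python
from collections import Counter, defaultdict

K = 5  # hypothetical k-mer length, chosen only for this example


def kmers(seq, k=K):
    """Return the set of distinct k-mers in a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(proteins, k=K):
    """Map each k-mer to the set of protein IDs that contain it.

    In the real setup this mapping would live in a database table;
    a dict is used here purely for illustration.
    """
    index = defaultdict(set)
    for pid, seq in proteins.items():
        for km in kmers(seq, k):
            index[km].add(pid)
    return index


def candidates(query, index, k=K, min_shared=0.3, limit=100):
    """Return IDs of proteins sharing at least `min_shared` of the
    query's distinct k-mers, ordered by number of shared k-mers."""
    q = kmers(query, k)
    if not q:
        return []
    hits = Counter()
    for km in q:
        for pid in index.get(km, ()):
            hits[pid] += 1
    cutoff = min_shared * len(q)
    ranked = [pid for pid, n in hits.most_common() if n >= cutoff]
    return ranked[:limit]


# Toy database: P2 differs from P1 only in the last residue, P3 is unrelated.
proteins = {
    "P1": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ",
    "P2": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVA",
    "P3": "MSDNELLWQPGHHAACCYYPLKWTTRRVVNNEEDDQQGGSSTTLLIIFFMMWWKK",
}
index = build_index(proteins)
query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"
shortlist = candidates(query, index)
print(shortlist)
```

Only the shortlisted sequences would then be written to a small FASTA file and aligned against the query, so the aligner never has to load the whole database.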
Let me know if anyone is interested in it. It's still quite simplistic, but someone may benefit :)