I have a large number of short patterns (5-10 "letters" or amino acids in this context). I want to search for these in a large database of "texts" (proteins). Are there any existing packages which provide efficient algorithms for this? Ideally they would be
- Handle 140k patterns and 280k texts
- Able to handle ambiguous residues such as 'J' for [IL] (this can be faked by duplicating some patterns)
- python is preferred (I'm already using BioPython, but it is missing this afaik)
I'm currently using a naive O(n^2) search, but it's taking too long. I think this problem is generally solved using keyword trees (Aho-Corasick algorithm), but I'd rather not implement that myself unless I have to.
I should have specified that for output I need both the text and the pattern which matched it. This prevents some of the otherwise great implementations (eg fgrep, ahocorasick) from solving my problem.
There seems to be a python implementation of the Aho-Corasick algorithm or esmre. Have you tried it? (I discovered this algorithm just now, interesting!)
@dariober Thanks! I'll try those out. Apparently I should google my problems after I write a question, as I didn't think about Aho-Corasick until I had drafted the question. You should add an answer so I can accept it (if it works).
This algorithm is also implemented in GNU fgrep btw.