I have a large number of short patterns (5-10 "letters" or amino acids in this context). I want to search for these in a large database of "texts" (proteins). Are there any existing packages which provide efficient algorithms for this? Ideally they would be
- Handle 140k patterns and 280k texts
- Able to handle ambiguous residues such as 'J' for [IL] (this can be faked by duplicating some patterns)
- python is preferred (I'm already using BioPython, but it is missing this afaik)
I'm currently using a naive O(n^2) search, but it's taking too long. I think this problem is generally solved using keyword trees (Aho-Corasick algorithm), but I'd rather not implement that myself unless I have to.
I should have specified that for output I need both the text and the pattern which matched it. This prevents some of the otherwise great implementations (eg fgrep, ahocorasick) from solving my problem.