Question

Fast search for many patterns

2

Entering edit mode

10.5 years ago

sbliven ▴ 30

I have a large number of short patterns (5-10 "letters" or amino acids in this context). I want to search for these in a large database of "texts" (proteins). Are there any existing packages which provide efficient algorithms for this? Ideally they would be

Handle 140k patterns and 280k texts
Able to handle ambiguous residues such as 'J' for [IL] (this can be faked by duplicating some patterns)
python is preferred (I'm already using BioPython, but it is missing this afaik)

I'm currently using a naive O(n^2) search, but it's taking too long. I think this problem is generally solved using keyword trees (Aho-Corasick algorithm), but I'd rather not implement that myself unless I have to.

Update

I should have specified that for output I need both the text and the pattern which matched it. This prevents some of the otherwise great implementations (eg fgrep, ahocorasick) from solving my problem.

algorithm sequence patterns searching python • 5.0k views

ADD COMMENT • link 10.4 years ago by sbliven ▴ 30

0

Entering edit mode

There seems to be a python implementation of the Aho-Corasick algorithm or esmre. Have you tried it? (I discovered this algorithm just now, interesting!)

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.5 years ago by dariober 15k

0

Entering edit mode

@dariober Thanks! I'll try those out. Apparently I should google my problems after I write a question, as I didn't think about Aho-Corasick until I had drafted the question. You should add an answer so I can accept it (if it works).

ADD REPLY • link 10.5 years ago by sbliven ▴ 30

0

Entering edit mode

This algorithm is also implemented in GNU fgrep btw.

ADD REPLY • link 10.5 years ago by Michael 55k

2

Entering edit mode

10.5 years ago

Dan D 7.4k

As Michael Dondrup suggested, a quick solution might be to use fgrep.

Put your patterns in a text file, one per line.

Construct your fgrep command like this to search a large group of text files:

fgrep -f patterns_file.txt /path/to/texts

These options will probably be of interest

-r recurse into subdirectories
-l print names only of files with matches
-H print the file name along with the match
--include/--exclude use this to filter out files you don't want to search

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 10.5 years ago by Dan D 7.4k

0

Entering edit mode

also check the -F flag if your search patterns are fixed. This will speed up the processes a lot!

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.5 years ago by Phil S. ▴ 700

0

Entering edit mode

In my experience, f/grepping large files against large files is extremely slow way to do things, even if you parallelize with gnu parallel or something similar. If you're working in bash, there's usually a orders of magnitude faster solution that involves sort and comm..

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.5 years ago by 5heikki 11k

0

Entering edit mode

Excellent suggestion, and very fast! It takes less than a second on my system. However, I need to know which patterns match to each text (nearly everything matches some pattern). This does not seem to be an option with fgrep.

ADD REPLY • link 10.5 years ago by sbliven ▴ 30

1

Entering edit mode

10.5 years ago

brentp 24k

Maybe look into BKTree's, which work for approximate string matching. There are a number of python implementations.

http://en.wikipedia.org/wiki/BK-tree

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 10.5 years ago by brentp 24k

0

Entering edit mode

10.5 years ago

Phil S. ▴ 700

You have to decide if it is fast enough or not (I scan sequencing data for adapters and barcodes with it) but you might wanna check out this:

awk 'NR==FNR{strings[$0]; next} {for (string in strings) if ( (idx = index($0,string)) > 0 ) print string, FNR, idx }' file1 file2

What it does is it builds an index of strings given in file1 (1 string per line ) and checks for those strings in file2. Reporting is on std.out

something like this:

STRING:LINE_NUMBER:POSITION
.

Maybe this helps.

Best Phil

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 10.5 years ago by Phil S. ▴ 700

0

Entering edit mode

10.5 years ago

pld 5.1k

Cant you just use python's subprocess module to make calls to fgrep? This is generally what I end up doing anytime a system command doesn't have enough functionality.

You could make a dictionary and key on patterns and then store text returned from fgrep patterns for each key as values under that key.

ADD COMMENT • link 10.5 years ago by pld 5.1k

Ram · Accepted Answer · 2014-06-06

I ended up using the esmre library, as suggested by dariober in the comments. It dropped my run time from hours (with a brute force search) to seconds.

Full code is on github, but the important parts are quite simple:

import esmre

vocab = esmre.Index()
for word in patterns:
  vocab.enter(word,word)
for text in texts:
  matches = vocab.query(text)
  for match in matches:
    print "Match for %s in %s" % (match, text)