Fast search for many patterns
5
2
Entering edit mode
8.7 years ago
sbliven ▴ 30

I have a large number of short patterns (5-10 "letters", amino acids in this context) and want to search for them in a large database of "texts" (proteins). Are there any existing packages which provide efficient algorithms for this? Ideally the package would:

• Handle 140k patterns and 280k texts
• Handle ambiguous residues such as 'J' for [IL] (this can be faked by duplicating some patterns)
• Be usable from Python (I'm already using Biopython, but it is missing this as far as I know)

I'm currently using a naive search over every pattern/text pair, but it's taking too long. I think this problem is generally solved with keyword trees (the Aho-Corasick algorithm), but I'd rather not implement that myself unless I have to.
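For context, the naive approach amounts to substring-checking every pattern against every text; the sequences below are placeholders, not the real data:

```python
# Brute force: substring-check every (pattern, text) pair.
# With ~140k patterns and ~280k texts, this nested loop is what
# becomes too slow in practice.
patterns = ["MKV", "LLS", "GGA"]        # placeholder motifs
texts = ["MKVLLSGG", "GGAMKV", "PPPP"]  # placeholder sequences

matches = [(p, t) for t in texts for p in patterns if p in t]
# -> [('MKV', 'MKVLLSGG'), ('LLS', 'MKVLLSGG'), ('MKV', 'GGAMKV'), ('GGA', 'GGAMKV')]
```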

Update

I should have specified that for output I need both the text and the pattern that matched it. This rules out some otherwise great suggestions (e.g. fgrep, ahocorasick).

algorithm sequence patterns searching python • 4.0k views
0
Entering edit mode

There seems to be a Python implementation of the Aho-Corasick algorithm, and also esmre. Have you tried them? (I discovered this algorithm just now; interesting!)
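For anyone else meeting Aho-Corasick for the first time: the keyword-tree idea fits in a few dozen lines of plain Python. This is an illustration of the algorithm, not the code of either linked package:

```python
from collections import deque

def build_automaton(patterns):
    # Each node: outgoing edges, failure link, patterns ending here.
    trie = [{"next": {}, "fail": 0, "out": []}]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in trie[node]["next"]:
                trie.append({"next": {}, "fail": 0, "out": []})
                trie[node]["next"][ch] = len(trie) - 1
            node = trie[node]["next"][ch]
        trie[node]["out"].append(pat)
    # Breadth-first pass to set failure links (longest proper
    # suffix of the current path that is also a path in the trie).
    queue = deque(trie[0]["next"].values())
    while queue:
        node = queue.popleft()
        for ch, child in trie[node]["next"].items():
            queue.append(child)
            f = trie[node]["fail"]
            while f and ch not in trie[f]["next"]:
                f = trie[f]["fail"]
            trie[child]["fail"] = trie[f]["next"].get(ch, 0)
            # Inherit matches that end at the failure target.
            trie[child]["out"] += trie[trie[child]["fail"]]["out"]
    return trie

def search(trie, text):
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in trie[node]["next"]:
            node = trie[node]["fail"]
        node = trie[node]["next"].get(ch, 0)
        for pat in trie[node]["out"]:
            hits.append((pat, i - len(pat) + 1))  # (pattern, start index)
    return hits

trie = build_automaton(["he", "she", "his", "hers"])
hits = search(trie, "ushers")  # -> [('she', 1), ('he', 2), ('hers', 2)]
```

Build time is linear in the total pattern length, and each text is scanned once regardless of how many patterns there are, which is what makes this so much faster than the pairwise scan.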

0
Entering edit mode

@dariober Thanks! I'll try those out. Apparently I should google my problems after I write a question, as I didn't think about Aho-Corasick until I had drafted the question. You should add an answer so I can accept it (if it works).

0
Entering edit mode

This algorithm is also implemented in GNU fgrep btw.

1
Entering edit mode
8.7 years ago
sbliven ▴ 30

I ended up using the esmre library, as suggested by dariober in the comments. It dropped my run time from hours (with a brute force search) to seconds.

Full code is on github, but the important parts are quite simple:

import esmre

vocab = esmre.Index()
for word in patterns:
    vocab.enter(word, word)  # associate each pattern with itself as the payload
for text in texts:
    for match in vocab.query(text):
        print("Match for %s in %s" % (match, text))

2
Entering edit mode
8.7 years ago
Dan D 7.4k

As Michael Dondrup suggested, a quick solution might be to use fgrep.

Put your patterns in a text file, one per line.

Construct your fgrep command like this to search a large group of text files:

fgrep -f patterns_file.txt /path/to/texts


These options will probably be of interest:

• -r recurse into subdirectories
• -l print names only of files with matches
• -H print the file name along with the match
• --include/--exclude use this to filter out files you don't want to search
0
Entering edit mode

Also check the -F flag (fgrep is equivalent to grep -F) if your search patterns are fixed strings. This will speed up the process a lot!

0
Entering edit mode

In my experience, f/grepping large files against large files is an extremely slow way to do things, even if you parallelize with GNU parallel or something similar. If you're working in bash, there's usually an orders-of-magnitude faster solution involving sort and comm.

0
Entering edit mode

Excellent suggestion, and very fast! It takes less than a second on my system. However, I need to know which patterns match each text (nearly everything matches some pattern), and that does not seem to be an option with fgrep.

1
Entering edit mode
8.7 years ago
brentp 24k

Maybe look into BK-trees, which support approximate string matching. There are a number of Python implementations.

http://en.wikipedia.org/wiki/BK-tree
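A minimal BK-tree can be sketched in pure Python. This is illustrative only; the class and function names are made up here, not any particular package's API:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, distance=levenshtein):
        self.distance = distance
        self.root = None  # each node is (word, {distance: child_node})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d in node[1]:
                node = node[1][d]      # descend along the equal-distance edge
            else:
                node[1][d] = (word, {})
                return

    def query(self, word, max_dist):
        # Triangle inequality prunes children outside [d - max_dist, d + max_dist].
        results, stack = [], [self.root] if self.root else []
        while stack:
            node = stack.pop()
            d = self.distance(word, node[0])
            if d <= max_dist:
                results.append((d, node[0]))
            for child_d, child in node[1].items():
                if d - max_dist <= child_d <= d + max_dist:
                    stack.append(child)
        return results

tree = BKTree()
for w in ["book", "books", "cake", "boo"]:
    tree.add(w)
tree.query("bo", 1)  # -> [(1, 'boo')]
```

Querying with max_dist=0 gives exact matches only; raising it tolerates that many edits, which is how the tree handles approximate matching.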

0
Entering edit mode
8.7 years ago
Phil S. ▴ 700

You have to decide whether it is fast enough for your data (I scan sequencing data for adapters and barcodes with it), but you might want to check out this:

awk 'NR==FNR{strings[$0]; next} {for (string in strings) if ( (idx = index($0,string)) > 0 ) print string, FNR, idx }' file1 file2


It builds a set of the strings given in file1 (one string per line) and checks for those strings in each line of file2. Matches are reported on stdout, one per line, in the form:

STRING LINE_NUMBER POSITION

Maybe this helps.

Best Phil

0
Entering edit mode
8.7 years ago
pld 5.0k

Can't you just use Python's subprocess module to make calls to fgrep? This is generally what I end up doing any time a system command doesn't have enough functionality.

You could make a dictionary keyed on the patterns, and store the text returned by fgrep for each pattern as the values under that key.
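A sketch of that approach, shelling out to grep -F (which is what fgrep is shorthand for) once per pattern; the function name and file layout are hypothetical:

```python
import subprocess

def grep_matches(patterns, texts_path):
    """Map each pattern to the list of lines in texts_path containing it."""
    hits = {}
    for pat in patterns:
        # -F: treat the pattern as a fixed string, -h: omit file-name prefixes.
        # grep exits with 1 when nothing matches, so don't treat 1 as an error.
        proc = subprocess.run(
            ["grep", "-F", "-h", pat, texts_path],
            capture_output=True, text=True)
        if proc.returncode == 0:
            hits[pat] = proc.stdout.splitlines()
    return hits
```

Note that this launches one grep process per pattern, so it is unlikely to scale to the 140k patterns in the question, but it keeps everything in Python with no extra dependencies.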