Fast search for many patterns
5
2
Entering edit mode
9.9 years ago
sbliven ▴ 30

I have a large number of short patterns (5-10 "letters" or amino acids in this context). I want to search for these in a large database of "texts" (proteins). Are there any existing packages which provide efficient algorithms for this? Ideally they would be

  • Handle 140k patterns and 280k texts
  • Able to handle ambiguous residues such as 'J' for [IL] (this can be faked by duplicating some patterns)
  • python is preferred (I'm already using BioPython, but it is missing this afaik)

I'm currently using a naive O(n^2) search, but it's taking too long. I think this problem is generally solved using keyword trees (Aho-Corasick algorithm), but I'd rather not implement that myself unless I have to.

Update

I should have specified that for output I need both the text and the pattern which matched it. This prevents some of the otherwise great implementations (eg fgrep, ahocorasick) from solving my problem.

algorithm sequence patterns searching python • 4.7k views
ADD COMMENT
0
Entering edit mode

There seems to be a python implementation of the Aho-Corasick algorithm or esmre. Have you tried it? (I discovered this algorithm just now, interesting!)

ADD REPLY
0
Entering edit mode

@dariober Thanks! I'll try those out. Apparently I should google my problems after I write a question, as I didn't think about Aho-Corasick until I had drafted the question. You should add an answer so I can accept it (if it works).

ADD REPLY
0
Entering edit mode

This algorithm is also implemented in GNU fgrep btw.

ADD REPLY
1
Entering edit mode
9.9 years ago
sbliven ▴ 30

I ended up using the esmre library, as suggested by dariober in the comments. It dropped my run time from hours (with a brute force search) to seconds.

Full code is on github, but the important parts are quite simple:

import esmre

vocab = esmre.Index()
for word in patterns:
  vocab.enter(word,word)
for text in texts:
  matches = vocab.query(text)
  for match in matches:
    print "Match for %s in %s" % (match, text)
ADD COMMENT
2
Entering edit mode
9.9 years ago
Dan D 7.4k

As Michael Dondrup suggested, a quick solution might be to use fgrep.

Put your patterns in a text file, one per line.

Construct your fgrep command like this to search a large group of text files:

fgrep -f patterns_file.txt /path/to/texts

These options will probably be of interest

  • -r recurse into subdirectories
  • -l print names only of files with matches
  • -H print the file name along with the match
  • --include/--exclude use this to filter out files you don't want to search
ADD COMMENT
0
Entering edit mode

also check the -F flag if your search patterns are fixed. This will speed up the processes a lot!

ADD REPLY
0
Entering edit mode

In my experience, f/grepping large files against large files is extremely slow way to do things, even if you parallelize with gnu parallel or something similar. If you're working in bash, there's usually a orders of magnitude faster solution that involves sort and comm..

ADD REPLY
0
Entering edit mode

Excellent suggestion, and very fast! It takes less than a second on my system. However, I need to know which patterns match to each text (nearly everything matches some pattern). This does not seem to be an option with fgrep.

ADD REPLY
1
Entering edit mode
9.9 years ago
brentp 24k

Maybe look into BKTree's, which work for approximate string matching. There are a number of python implementations.

http://en.wikipedia.org/wiki/BK-tree

ADD COMMENT
0
Entering edit mode
9.9 years ago
Phil S. ▴ 700

You have to decide if it is fast enough or not (I scan sequencing data for adapters and barcodes with it) but you might wanna check out this:

awk 'NR==FNR{strings[$0]; next} {for (string in strings) if ( (idx = index($0,string)) > 0 ) print string, FNR, idx }' file1 file2

What it does is it builds an index of strings given in file1 (1 string per line ) and checks for those strings in file2. Reporting is on std.out

something like this:

STRING:LINE_NUMBER:POSITION
.

Maybe this helps.

Best Phil

ADD COMMENT
0
Entering edit mode
9.9 years ago
pld 5.1k

Cant you just use python's subprocess module to make calls to fgrep? This is generally what I end up doing anytime a system command doesn't have enough functionality.

You could make a dictionary and key on patterns and then store text returned from fgrep patterns for each key as values under that key.

ADD COMMENT

Login before adding your answer.

Traffic: 1590 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6