Efficient implementation of Fuzzy set operations on sets of sequences?
1
1
Entering edit mode
8.1 years ago
m.aljaff90 ▴ 10

Hi! In need of some assistance.

Let A and B be two sets of sequences where all are, say 30bp long. So my problem is that not only do I want i) the intersection of A and B, i.e. all sequences which are both in A and also in B. Let's call the this set C. Additionally, I aslo require ii) FOR EACH found sequence S in C, I would also like all the sequences in (A OR B) which have a Levenshtein distance less than x from S.

Do you possible know of an efficient way to do this? Mind you that in my case, A and B are huge (>10 m). So far I've approached this problem by creating two tables in MYSQL (one for A and B) and doing some set operations there. However, the "fuzzy/mismatch" aspect of the problem makes it run very very very slow. I really would appreciate any directions.

Thanks in advance.

fuzzy mismatch • 1.4k views
ADD COMMENT
2
Entering edit mode
8.1 years ago
Michael 54k

What about using agrep (bitap algorithm) for the fuzzy matching step?

If the search is eventually repeated, an indexing based method like suffix arrays might become an advantage, but I am not sure if they would be applicable to your case. You have many small text bits to search, while indexing is designed for a large text to be searched by a shorter pattern.

ADD COMMENT

Login before adding your answer.

Traffic: 3212 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6