Question

Edit distance under updates

2

Entering edit mode

9.4 years ago

Chao Xu ▴ 20

Consider I have two strings A and B. I can compute EditDistance(A,B) in O(nm) time where n is the length of A and m is the length of B. Here edit distance is defined with some positive real weights matrix between pairs of characters and gap characters.

Now, assume I make a sequence of updates to A and B, such that each update is either add new character or delete a character in either end of the strings. I want to be able to answer EditDistance(A,B) every time I make an update.

For example, we might start with string abcde and acd, and we do the following operations (for simplicity, here we Levenshtein distance for edit distance, but we can use weight matrix instead...)

EditDistance(abcde,acd) [beginning] output 2 (delete b, delete e)
EditDistance(bcde,acd) [Removed the first character from the first string] output 2 (replace b with a, delete e)
EditDistance(bcde,acdg) [Added g to the end of the second string] output 2 (replace b with a, replace e with g)
EditDistance(xbcde,acdg) [Added x to the beginning of the first string] output 3 (delete x, replace b with a, replace e with g)
...

Naively, I can recompute the entire table whenever I make an update to either of the string. But I'm hoping for something that takes O(n+m) amortized time per update. (where n and m are the size of the string at that time of the update).

Basically what I want is, assume I already computed some edit distance, then someone tells me "the data I sent you is corrupted, there is another 'c' in the beginning of the first string", then I don't have to recompute the entire edit distance from scratch.

It is known that if we want some encoding of the complete edit distance table, then this paper gives an O(c(m+n)) time algorithm per update, where c is the maximum integer weight(it doesn't work for general real weights). In the paper they also pointed to an algorithm that takes O(n log m) time per update(which works for general real weights).

alignment • 1.7k views

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.4 years ago by Chao Xu ▴ 20

1

Entering edit mode

You might have better luck with this on stack overflow. While you might eventually end up with the same answer from both communities, I have a feeling that the more CS-centric community over there will give it to you faster.

ADD REPLY • link 9.4 years ago by Devon Ryan 104k

0

Entering edit mode

I'm amazed how the question walks the thin line between pure CS and bioinformatics!

ADD REPLY • link 9.4 years ago by Ram 43k

0

Entering edit mode

Could you maybe calculate the difference in Hamming distances? It would be O(mn) though.

ADD REPLY • link 9.4 years ago by Ram 43k

Ram · Answer 1 · 2014-11-16

0

Entering edit mode

9.4 years ago

Michael 54k

If the only location where the two strings can change are the ends, the maximum number of comparisons per update of each string pair is 2, +2 is also the maximum difference in edit distance (two ends to compare, both different, both equal, one equal one different) unless you compare strings of different length, but this case is not covered by your definition of edit distance. If you want to cover this case you need to introduce 'gap costs' for the ends. In my opinion, edit distance without gap costs doesn't make much sense in a biological context.

I think that is obvious from the problem, or do I misunderstand your question? Runtime complexity is O(n) for n sequences for each update.

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.4 years ago by Michael 54k

0

Entering edit mode

I updated the question to make it more clear. The edit distance is different from my update to the strings.

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.4 years ago by Chao Xu ▴ 20