Chao Xu20 wrote:

Suppose I have two strings A and B. I can compute EditDistance(A, B) in O(nm) time, where n is the length of A and m is the length of B. Here the edit distance is defined by a matrix of positive real weights over pairs of characters and the gap character.

Now assume I make a sequence of updates to A and B, where each update either adds a new character to, or deletes a character from, either end of either string. I want to be able to report EditDistance(A, B) after every update.

For example, we might start with the strings abcde and acd and perform the following operations (for simplicity we use Levenshtein distance here, but a weight matrix could be used instead):

EditDistance(abcde,acd) [beginning] output 2     (delete b, delete e)

EditDistance(bcde,acd) [Removed the first character from the first string] output 2  (replace b with a, delete e)

EditDistance(bcde,acdg) [Added g to the end of the second string] output 2   (replace b with a, replace e with g)

EditDistance(xbcde,acdg) [Added x to the beginning of the first string] output 3  (delete x, replace b with a, replace e with g)

...
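For reference, the outputs above agree with the textbook O(nm) dynamic program. A minimal sketch with a pluggable cost model (the `sub`/`gap` callables are illustrative names of mine, not from the question):

```python
def edit_distance(a, b, sub, gap):
    """Classic O(n*m) DP. sub(x, y) is the substitution cost
    (0 when x == y for Levenshtein); gap(c) is the cost of
    inserting or deleting character c."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + gap(a[i - 1])
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + gap(b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + gap(a[i - 1]),               # delete a[i-1]
                d[i][j - 1] + gap(b[j - 1]),               # insert b[j-1]
                d[i - 1][j - 1] + sub(a[i - 1], b[j - 1])  # substitute
            )
    return d[n][m]

lev = lambda x, y: 0 if x == y else 1
print(edit_distance("abcde", "acd", lev, lambda c: 1))  # prints 2
```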

Naively, I can recompute the entire table whenever I make an update to either string. But I'm hoping for something that takes O(n+m) amortized time per update (where n and m are the lengths of the strings at the time of the update).
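One observation worth making explicit: updates at the *right* end of either string are already easy, because they only add or remove the last row or column of the DP table, costing O(n+m) per update. The hard case is the front ends, where the whole table shifts. A sketch of the easy direction, assuming unit Levenshtein costs (the class and method names are mine):

```python
class IncrementalEditDistance:
    """Maintains the full DP table. Appends at the right end of either
    string cost O(n+m) by extending one row/column; deleting the last
    character of the first string is O(1). Front-end updates (the hard
    case in the question) would still need a different idea."""

    def __init__(self, a, b):
        self.a, self.b = list(a), list(b)
        n, m = len(self.a), len(self.b)
        self.d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            self.d[i][0] = i
        for j in range(m + 1):
            self.d[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                self._cell(i, j)

    def _cell(self, i, j):
        cost = 0 if self.a[i - 1] == self.b[j - 1] else 1
        self.d[i][j] = min(self.d[i - 1][j] + 1,
                           self.d[i][j - 1] + 1,
                           self.d[i - 1][j - 1] + cost)

    def append_a(self, ch):            # O(m): fill one new row
        self.a.append(ch)
        i = len(self.a)
        self.d.append([i] + [0] * len(self.b))
        for j in range(1, len(self.b) + 1):
            self._cell(i, j)

    def append_b(self, ch):            # O(n): fill one new column
        self.b.append(ch)
        j = len(self.b)
        self.d[0].append(j)
        for i in range(1, len(self.a) + 1):
            self.d[i].append(0)
            self._cell(i, j)

    def pop_a(self):                   # O(1): drop the last row
        self.a.pop()
        self.d.pop()

    def distance(self):
        return self.d[len(self.a)][len(self.b)]
```

Replaying part of the example above: starting from (bcde, acd) the distance is 2, and after `append_b("g")` it is still 2, matching the hand computation.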

Basically, what I want is this: assume I have already computed some edit distance, and then someone tells me "the data I sent you is corrupted; there is another 'c' at the beginning of the first string". I shouldn't have to recompute the entire edit distance from scratch.

It is known that if we want some encoding of the complete edit distance table, this paper gives an O(c(m+n))-time algorithm per update, where c is the maximum integer weight (it doesn't work for general real weights). The paper also points to an algorithm that takes O(n log m) time per update and does work for general real weights.

modified 6.2 years ago • written 6.2 years ago by Chao Xu20

You might have better luck with this on stack overflow. While you might eventually end up with the same answer from both communities, I have a feeling that the more CS-centric community over there will give it to you faster.

I'm amazed how the question walks the thin line between pure CS and bioinformatics!

Could you maybe calculate the difference in Hamming distances? It would be O(mn) though.

Michael Dondrup48k wrote:

If the only locations where the two strings can change are the ends, then the maximum number of comparisons per update of each string pair is 2, and 2 is also the maximum change in edit distance (two ends to compare: both different, both equal, or one equal and one different), unless you compare strings of different lengths; but that case is not covered by your definition of edit distance. If you want to cover it, you need to introduce gap costs for the ends. In my opinion, edit distance without gap costs doesn't make much sense in a biological context.

I think that is obvious from the problem, or do I misunderstand your question? The runtime complexity is O(n) for n sequences per update.

I updated the question to make it clearer. The edit operations counted by the edit distance are different from my updates to the strings.