Consider I have two strings A and B. I can compute EditDistance(A,B) in O(nm) time where n is the length of A and m is the length of B. Here edit distance is defined with some positive real weights matrix between pairs of characters and gap characters.
Now, assume I make a sequence of updates to A and B, such that each update is either add new character or delete a character in either end of the strings. I want to be able to answer EditDistance(A,B) every time I make an update.
For example, we might start with string abcde
and acd
, and we do the following operations (for simplicity, here we Levenshtein distance for edit distance, but we can use weight matrix instead...)
- EditDistance(abcde,acd) [beginning] output 2 (delete b, delete e)
- EditDistance(bcde,acd) [Removed the first character from the first string] output 2 (replace b with a, delete e)
- EditDistance(bcde,acdg) [Added g to the end of the second string] output 2 (replace b with a, replace e with g)
- EditDistance(xbcde,acdg) [Added x to the beginning of the first string] output 3 (delete x, replace b with a, replace e with g)
- ...
Naively, I can recompute the entire table whenever I make an update to either of the string. But I'm hoping for something that takes O(n+m) amortized time per update. (where n and m are the size of the string at that time of the update).
Basically what I want is, assume I already computed some edit distance, then someone tells me "the data I sent you is corrupted, there is another 'c' in the beginning of the first string", then I don't have to recompute the entire edit distance from scratch.
It is known that if we want some encoding of the complete edit distance table, then this paper gives an O(c(m+n)) time algorithm per update, where c is the maximum integer weight(it doesn't work for general real weights). In the paper they also pointed to an algorithm that takes O(n log m) time per update(which works for general real weights).
You might have better luck with this on stack overflow. While you might eventually end up with the same answer from both communities, I have a feeling that the more CS-centric community over there will give it to you faster.
I'm amazed how the question walks the thin line between pure CS and bioinformatics!
Could you maybe calculate the difference in Hamming distances? It would be O(mn) though.