Approximate String Matching

Let P be a pattern string and T a text string over the same alphabet. The edit distance between P and T is the smallest number of changes sufficient to transform a substring of T into P, where the changes may be:

Substitution -- two corresponding characters may differ: KAT -> CAT
Insertion -- add a character to T that is in P: CT -> CAT
Deletion -- delete from T a character that is not in P: CAAT -> CAT

Consider the following recurrence relation.

Let D[i,j] be the minimum number of differences between P1, P2, P3, . . . Pi and the segment of T ending at j. D[i,j] is the minimum of three possible way to extend smaller strings:

If (Pi = Tj), then D[i-1, j-1], else D[i-1, j-1] + 1. We either match or substitute the i-th and j-th characters, depending upon whether they do or do not match.
D[i-1,j] + 1. There is an extra character in the pattern to account for, so we do not advance the text pointer and pay the cost of an insertion.
D[i,j+1] + 1. There is an extra character in the text to remove, so we do not advance the pattern pointer and pay the cost of a deletion.

Boundary conditions

D[0,i] will correspond to the cost of matching the first i characters of the text with none of the pattern. This value depends upon what you want to compute. For matching the entire pattern against the entire text, this means that we must delete the first i characters of the text, so D[0,i] = i to pay the cost of the deletions. If we want to find where the pattern occurs in a long text, it should not cost more if the matched pattern starts far into the text than it is near the front. Therefore, the starting cost should be equal for all positions, D[0,i] = 0. In both cases, D[i,0] = i, the cost of deleting the first i characters of the pattern.

b c d e f f g h i x k l
0 1 2 3 4 5 6 7 8 9 10 11 12

a 1   1 2 3 4 5 6 7 8 9 10 11 12
b 2   1 2 3 4 5 6 7 8 9 10 11 12
c 3   2 1 2 3 4 5 6 7 8 9 10 11
d 4   3 2 1 2 3 4 5 6 7 8 9 10
e 5   4 3 2 1 2 3 4 5 6 7 8 9
f 6   5 4 3 2 1 2 3 4 5 6 7 8
g 7   6 5 4 3 2 2 2 3 4 5 6 7
h 8   7 6 5 4 3 3 3 2 3 4 5 6
i 9   8 7 6 5 4 4 4 3 2 3 4 5
j 10 9 8 7 6 5 5 5 4 3 3 4 5
k 11 10 9 8 7 6 6 6 5 4 4 3 4
l 12 11 10 9 8 7 7 7 6 5 5 4 3

Figure Example of dynamic programming matrix for edit distance computation, the optimal alignment path is in bold. P = abcdefghijkl, T = bcdeffghixkl

Algorithm

The suitable algorithm is the dynamic programming, the matrix D is n by m, where n = #(P) and m = #(T).

editdistance(P,T)
for i = 0 to n do D[i,0] = i
for i = 0 to m do D[0,i] = i
for i = 1 to n do
for j = 1 to m do
D[i,j] = min( D[i-1,j-1] + matchcost(p_i, t_j), D[i-1,j] + 1, D[i,j-1] + 1)

It requires constant time to fill each cell, the total time is O(mn).

The construction the actual alignment can be made from the pattern/text and the table without auxiliary storage by walking upwards and backwards through the matrix. We can reconstruct the choice given the costs of its three neighbors and the corresponding characters. This phase takes O(n+m) time.

Notice that it is uncessary to keep all O(mn) cells to compute the cost of an alignment.