Approximate String Matching

Let P be a pattern string and T a text string over the same alphabet.  The edit distance between P and T is the smallest number of changes sufficient to transform a substring of T into P, where the changes may be:

Consider the following recurrence relation.

Let D[i,j] be the minimum number of differences between P1, P2, P3, . . . Pi and the segment of T ending at j.  D[i,j] is the minimum of three possible way to extend smaller strings:

Boundary conditions

D[0,i] will correspond to the cost of matching the first i characters of the text with none of the pattern.  This value depends upon what you want to compute.  For matching the entire pattern against the entire text, this means that we must delete the first i characters of the text, so D[0,i] = i to pay the cost of the deletions.  If we want to find where the pattern occurs in a long text, it should not cost more if the matched pattern starts far into the text than it is near the front.  Therefore, the starting cost should be equal for all positions, D[0,i] = 0.  In both cases, D[i,0] = i, the cost of deleting the first i characters of the pattern.

      b  c  d  e  f  f  g  h  i  x  k  l
  0   1  2  3  4  5  6  7  8  9 10 11 12

a 1   1  2  3  4  5  6  7  8  9 10 11 12
b 2   1  2  3  4  5  6  7  8  9 10 11 12
c 3   2  1  2  3  4  5  6  7  8  9 10 11
d 4   3  2  1  2  3  4  5  6  7  8  9 10
e 5   4  3  2  1  2  3  4  5  6  7  8  9
f 6   5  4  3  2  1  2  3  4  5  6  7  8
g 7   6  5  4  3  2  2  2  3  4  5  6  7
h 8   7  6  5  4  3  3  3  2  3  4  5  6
i 9   8  7  6  5  4  4  4  3  2  3  4  5
j 10  9  8  7  6  5  5  5  4  3  3  4  5
k 11 10  9  8  7  6  6  6  5  4  4  3  4
l 12 11 10  9  8  7  7  7  6  5  5  4  3

Figure  Example of dynamic programming matrix for edit distance computation, the optimal alignment path is in bold.  P = abcdefghijkl,  T = bcdeffghixkl

Algorithm

The suitable algorithm is the dynamic programming, the matrix D is n by m, where n = #(P) and m = #(T).

editdistance(P,T)
  for i = 0 to n do D[i,0] = i
  for i = 0 to m do D[0,i] = i
  for i = 1 to n do
    for j = 1 to m do
      D[i,j] = min( D[i-1,j-1] + matchcost(p_i, t_j), D[i-1,j] + 1, D[i,j-1] + 1)

It requires constant time to fill each cell, the total time is O(mn).

The construction the actual alignment can be made from the pattern/text and the table without auxiliary storage by walking upwards and backwards through the matrix.  We can reconstruct the choice given the costs of its three neighbors and the corresponding characters.  This phase takes O(n+m) time.

Notice that it is uncessary to keep all O(mn) cells to compute the cost of an alignment.