Approximate String Matching
Let P be a pattern string and T a text string over the same alphabet.
The edit distance between P and T is the smallest number of changes sufficient
to transform a substring of T into P, where the changes may be:
-
Substitution -- two corresponding characters may differ: KAT -> CAT
-
Insertion -- add a character to T that is in P: CT -> CAT
-
Deletion -- delete from T a character that is not in P: CAAT -> CAT
Consider the following recurrence relation.
Let D[i,j] be the minimum number of differences between P1, P2, P3, . .
. Pi and the segment of T ending at j. D[i,j] is the minimum of three
possible way to extend smaller strings:
-
If (Pi = Tj), then D[i-1, j-1], else D[i-1, j-1] + 1. We either match
or substitute the i-th and j-th characters, depending upon whether they
do or do not match.
-
D[i-1,j] + 1. There is an extra character in the pattern to account
for, so we do not advance the text pointer and pay the cost of an
insertion.
-
D[i,j+1] + 1. There is an extra character in the text to remove,
so we do not advance the pattern pointer and pay the cost of a deletion.
Boundary conditions
D[0,i] will correspond to the cost of matching the first i characters of
the text with none of the pattern. This value depends upon what you
want to compute. For matching the entire pattern against the entire
text, this means that we must delete the first i characters of the text,
so D[0,i] = i to pay the cost of the deletions. If we want to find
where the pattern occurs in a long text, it should not cost more if the
matched pattern starts far into the text than it is near the front.
Therefore, the starting cost should be equal for all positions, D[0,i]
= 0. In both cases, D[i,0] = i, the cost of deleting the first i
characters of the pattern.
b c d e f
f g h i x k l
0 1 2 3 4 5 6
7 8 9 10 11 12
a 1 1 2 3 4 5
6 7 8 9 10 11 12
b 2 1 2 3 4 5
6 7 8 9 10 11 12
c 3 2 1 2 3 4
5 6 7 8 9 10 11
d 4 3 2 1 2 3
4 5 6 7 8 9 10
e 5 4 3 2 1 2
3 4 5 6 7 8 9
f 6 5 4 3 2 1
2 3 4 5 6 7 8
g 7 6 5 4 3 2 2
2 3 4 5 6 7
h 8 7 6 5 4 3 3
3 2 3 4 5 6
i 9 8 7 6 5 4 4
4 3 2 3 4 5
j 10 9 8 7 6 5 5 5
4 3 3 4 5
k 11 10 9 8 7 6 6 6 5
4 4 3 4
l 12 11 10 9 8 7 7 7 6
5 5 4 3
Figure Example of dynamic programming matrix for edit distance
computation, the optimal alignment path is in bold. P = abcdefghijkl,
T = bcdeffghixkl
Algorithm
The suitable algorithm is the dynamic programming, the matrix D is n by
m, where n = #(P) and m = #(T).
editdistance(P,T)
for i = 0 to n do D[i,0] = i
for i = 0 to m do D[0,i] = i
for i = 1 to n do
for j = 1 to m do
D[i,j] = min( D[i-1,j-1] + matchcost(p_i,
t_j), D[i-1,j] + 1, D[i,j-1] + 1)
It requires constant time to fill each cell, the total time is O(mn).
The construction the actual alignment can be made from the pattern/text
and the table without auxiliary storage by walking upwards and backwards
through the matrix. We can reconstruct the choice given the costs
of its three neighbors and the corresponding characters. This phase
takes O(n+m) time.
Notice that it is uncessary to keep all O(mn) cells to compute the cost
of an alignment.