Sequence Alignment
Define similarity between
two strings
1970 Needleman and Wunsch (Journal of Molecular Biology, 48
(1970):443-453) defined "edit distance"
X is x1, x2, ... xm Y is y1, y2,
... yn
the sets {1, 2,.. m} and {1, 2, .. n} representing positions in X and Y.
A matching M is an "alignment"
if (i, j), (i', j') is-in M and i < i' then j <
j' (no crossing pairs).
Similarity is based on the optimal
alignment between X and Y. Let M be a given alignment of
X and Y.
- let del > 0 be a gap penalty for each position of X or Y that
is not matched in M.
- for each pair of letter p, q, there is a mismatch cost of
alpha(p,q) for lining up p and q.
- cost is the sum of its gap and mismatch costs.
We want to compute an alignment of minimum cost.
Algorithm
Observe
- in the optimal alignment M, either (m, n) is-in M or it is
not. (either the last symbols in the two strings are matched or
they are not)
- let M be an alignment of X and Y. If (m, n) is-not-in M,
then either the m-th position of X or the n-th position of Y is not
matched in M.
We can write a recurrence from the above observations:
in an optimal alignment M, one of the following is true:
- (m , n) is-in M; or
- m-th position of X is not matched; or
- n-th position of Y is not matched.
Let OPT(i, j) denote the minimum cost of an alignment.
for i >=1, j >= 1
OPT(i,
j) = min{ alpha(x_i, y_j) + OPT{i-1, j-1), del + OPT(i-1, j), del +
OPT(i, j-1) } ... (1)
the algorithm to computer the value of an optimal alignment (source code in Som)
align
X Y
OPT[0..m, 0..n]
OPT[i, 0] = i*del for each
i
OPT[0, j] = j*del for each
j
for i = 1.. m
for j = 1.. n
use the recurrence (1) to compute OPT[i, j]
return OPT[m, n]
The running time is O(mn) and the space is also O(mn) (the size
of the array OPT[.]).
To construct the alignment itself, one can trace back from the final
value in OPT[.].
//
reconstruct the alignment from OPT[.]
to trace i j | x y d v =
v = OPT[index i j]
print i space
print j nl
if or (i==0) (j==0) break
d = (alpha i j) +
OPT[index (i-1) (j-1)]
x = del + OPT[index (i-1)
j]
y = del + OPT[index i
(j-1)]
if v ==
d trace (i-1) (j-1)
else if v == x trace
(i-1) j
else if v == y trace
i (j-1)
For a long sequence (such as in Genome alignment which the length may
be 100,000), O(mn) is not practical. There is a space efficient algorithm to
compute the value of optimal
alignment (but there is no way to reconstruct the
alignment). The key idea is that the array OPT[.] does not have
to be m x n as (1) needs only the information from the current
column and the previous column. Therefore we can "collapse"
OPT[.] into a linear space of two columns m x 2, B[0..m, 0..1].
//
space efficient alignment
// B[0..M, 0..1] collapse column
of OPT[.]
// B[i,0] =
OPT[i,j-1] B[i,1] = OPT[i,j]
to align2 | i j =
B = array ((M+1)*2)
for i 1 M
B[index i 0] =
i*del
for j 1 N
B[index 0 j] =
j*del
for i 1 M
B[index i j] =
min ((alpha i j) + B[index (i-1) 0])
(del + B[index (i-1) 1])
(del + B[index i 0])
for i 1 M
B[index i 0] = B[index i 1]
The running is O(mn) but the space is O(m).
Sequence Alignment in linear space
To reconstruct the alignment in linear space, an ingenious algorithm
exploited divide-and-conquer design, this alorithm is invented by
Hirschberg (D. Hirschberg, A linear space algorithm for computing
maximal common subsequences, Communication of the ACM 18(1975):341-343.)
We use f(i, j) to denote the length of the shortest path from (0, 0) to
(i, j) in OPT[.]. We need to define a backward formulation of the
dynamic program g(i, j) to be the length of the shortest path from (i,
j) to (m, n) by building it in reverse of f(i, j) starting from g(m, n)
= 0.
g(i,j)
= min{ alpha(x_i+1,y_i+1) + g(i+1,j+1), del + g(i,j+1), del +
g(i+1,j) } ... (2)
Now the value of the optimal alignment passing through (i, j) is
f(i, j) + g(i, j). The algorithm that follows uses the fact that
an alignment can be splited along one string (Y), the subproblem can be
computed using space efficient alignment, f(i, n/2) + g(i, n/2).
We need to solve subproblems recursively and reuse the space.
There is only one call at a time so the total space is O(m+n).
DC-align
X Y
P is a global list
m is length of X
n is length of Y
if m <= 2 or n <= 2
align X Y
space-efficient-align X
Y[1..n/2]
backward-space-efficient-align X Y[n/2+1..n]
let q be the index
minimizing f(q,n/2) + g(q,n/2)
add (q,n/2) to P
DC-align X[1..q] Y[1..n/2]
DC-align Y[q+1..m]
Y[n/2+1..n]
return P
The running time of DC-align is O(mn).
Tools
You need to use the sequence align and
som-executable package to try the examples
End