Sequence Alignment


Define similarity between two strings 
1970  Needleman and Wunsch (Journal of Molecular Biology, 48 (1970):443-453) defined "edit distance"
X   is  x1, x2, ... xm   Y is  y1, y2, ... yn
the sets {1, 2,.. m} and {1, 2, .. n} representing positions in X and Y.
A matching M is an "alignment" if  (i, j), (i', j') is-in M and i < i'  then j < j'  (no crossing pairs).

Similarity is based on the optimal alignment between X and Y.  Let M be a given alignment of X and Y.
  1. let del > 0 be a gap penalty for each position of X or Y that is not matched in M.
  2. for each pair of letter p, q, there is a mismatch cost of alpha(p,q) for lining up p and q.
  3. cost is the sum of its gap and mismatch costs.
We want to compute an alignment of minimum cost.

Algorithm

Observe
  1. in the optimal alignment M, either (m, n) is-in M or it is not.  (either the last symbols in the two strings are matched or they are not)
  2. let M be an alignment of X and Y.  If (m, n) is-not-in M, then either the m-th position of X or the n-th position of Y is not matched in M.
We can write a recurrence from the above observations:

in an optimal alignment M, one of the following is true:
  1. (m , n) is-in M; or
  2. m-th position of X is not matched; or
  3. n-th position of Y is not matched.
Let OPT(i, j) denote the minimum cost of an alignment.

for i >=1, j >= 1

OPT(i, j) = min{ alpha(x_i, y_j) + OPT{i-1, j-1), del + OPT(i-1, j), del + OPT(i, j-1) }   ... (1)

the algorithm to computer the value of an optimal alignment  (source code in Som)

align X Y
  OPT[0..m, 0..n]
  OPT[i, 0] = i*del for each i
  OPT[0, j] = j*del for each j
  for i = 1.. m
    for j = 1.. n
       use the recurrence (1) to compute OPT[i, j]
return OPT[m, n]

The running time is O(mn) and the space is also O(mn)  (the size of the array OPT[.]).
To construct the alignment itself, one can trace back from the final value in OPT[.].

// reconstruct the alignment from OPT[.]
to trace i j | x y d v =
  v = OPT[index i j]
  print i space
  print j nl
  if or (i==0) (j==0) break
  d = (alpha i j) + OPT[index (i-1) (j-1)]
  x = del + OPT[index (i-1) j]
  y = del + OPT[index i (j-1)]
  if v == d       trace (i-1) (j-1)
  else if v == x  trace (i-1) j
  else if v == y  trace i (j-1)

For a long sequence (such as in Genome alignment which the length may be 100,000), O(mn) is not practical.  There is a space efficient algorithm to compute the value of optimal alignment (but there is no way to reconstruct the alignment).  The key idea is that the array OPT[.] does not have to be m x n  as (1) needs only the information from the current column and the previous column.  Therefore we can "collapse" OPT[.] into a linear space of two columns m x 2, B[0..m, 0..1].   

// space efficient alignment
// B[0..M, 0..1] collapse column of OPT[.]
// B[i,0] = OPT[i,j-1]   B[i,1] = OPT[i,j]
to align2 | i j =
  B = array ((M+1)*2)
  for i 1 M
    B[index i 0] = i*del
  for j 1 N
    B[index 0 j] = j*del
    for i 1 M
      B[index i j] =
      min ((alpha i j) + B[index (i-1) 0])
      (del + B[index (i-1) 1])
      (del + B[index i 0])
    for i 1 M
      B[index i 0] = B[index i 1]

The running is O(mn) but the space is O(m).

Sequence Alignment in linear space

To reconstruct the alignment in linear space, an ingenious algorithm exploited divide-and-conquer design, this alorithm is invented by Hirschberg (D. Hirschberg, A linear space algorithm for computing maximal common subsequences, Communication of the ACM 18(1975):341-343.)
 
We use f(i, j) to denote the length of the shortest path from (0, 0) to (i, j) in OPT[.]. We need to define a backward formulation of the dynamic program g(i, j) to be the length of the shortest path from (i, j) to (m, n) by building it in reverse of f(i, j) starting from g(m, n) = 0.

g(i,j) = min{ alpha(x_i+1,y_i+1) + g(i+1,j+1),  del + g(i,j+1), del + g(i+1,j) }   ... (2)

Now the value of the optimal alignment passing through (i, j) is  f(i, j) + g(i, j).  The algorithm that follows uses the fact that an alignment can be splited along one string (Y), the subproblem can be computed using space efficient alignment, f(i, n/2) + g(i, n/2).  We need to solve subproblems recursively and reuse the space.  There is only one call at a time so the total space is O(m+n).

DC-align X Y
  P is a global list
  m is length of X
  n is length of Y
  if m <= 2 or n <= 2
    align X Y
  space-efficient-align X Y[1..n/2]
  backward-space-efficient-align X Y[n/2+1..n]
  let q be the index minimizing f(q,n/2) + g(q,n/2)
  add (q,n/2) to P
  DC-align X[1..q] Y[1..n/2]
  DC-align Y[q+1..m] Y[n/2+1..n]
return P

The running time of DC-align is O(mn).

Tools

You need to use the sequence align and som-executable package to try the examples

End