Universität Bielefeld - Technische Fakultät - AG Praktische Informatik - FSPM - Strukturbildungsprozesse

Divide-and-Conquer Multiple Sequence Alignment

The DCA Algorithm
DCA home

This page gives a brief overview of the mathematics and algorithmics behind DCA. More thorough descriptions can be found in the journal articles on the DCA references page.

Basic Algorithm

The general idea of DCA is rather simple: Each sequence is cut in two behind a suitable cut position somewhere close to its midpoint. This way, the problem of aligning one family of (long) sequences is divided into the two problems of aligning two families of (shorter) sequences, the prefix and the suffix sequences. This procedure is re-iterated until the sequences are sufficiently short - say, shorter than a pregiven stop size L which is a parameter of DCA - so that they can be aligned optimally by MSA. Finally, the resulting short alignments are concatenated, yielding a multiple alignment of the original sequences. The following figure sketches this general procedure.


 
Figure 1: The DCA principle.
\begin{figure}\centering\includegraphics[width=5in]{fig/divide.eps}
\end{figure}

Computing Cut Positions

Of course, the main difficulty with this approach is how to identify those cut-position combinations which lead to an optimal or - at least - close to optimal concatenated alignment. Here, a heuristic based on so-called additional-cost matrices which are used for quantifying the compatibility of cut positions in distinct sequences proved to be successful.

More precisely, given a sequence $\ensuremath{\mathbf{s}} = s_1s_2\ldots s_n$ of length n and a cut position $c\:(0\leq c\leq n)$, we denote by $\ensuremath{\mathbf{s}} (\leq c)$ the prefix sequence $s_1s_2\ldots s_c$ and by $\ensuremath{\mathbf{s}} (>c)$ the suffix sequence $s_{c+1}s_{c+2}\ldots s_{n}$, and we use the dynamic programming procedure to compute, for all pairs of sequences $(\ensuremath{\mathbf{s}} _p,\ensuremath{\mathbf{s}} _q)$and for all cut positions cp of $\ensuremath{\mathbf{s}} _p$ and cq of $\ensuremath{\mathbf{s}} _q$, the additional cost $C_{\ensuremath{\mathbf{s}} _p,\ensuremath{\mathbf{s}} _q}[c_p,c_q]$defined by \begin{displaymath}C_{\ensuremath{\mathbf{s}} _p,\ensuremath{\mathbf{s}} _q}[c_p...
...w_{opt}(\ensuremath{\mathbf{s}} _p,\ensuremath{\mathbf{s}} _q)
\end{displaymath}which quantifies the additional charge imposed by forcing the alignment of $\ensuremath{\mathbf{s}} _p$ and $\ensuremath{\mathbf{s}} _q$ to optimally align the two prefix sequences $\ensuremath{\mathbf{s}} _p(\leq c_p)$ and $\ensuremath{\mathbf{s}} _q(\leq c_q)$ as well as the two suffix sequences $\ensuremath{\mathbf{s}} _p(> c_p)$ and $\ensuremath{\mathbf{s}} _q(> c_q)$, rather than aligning $\ensuremath{\mathbf{s}} _p$and $\ensuremath{\mathbf{s}} _q$ optimally. The following figure illustrates this definition.


 
Figure 2: Definition of the additional cost.
\begin{figure}\centering\includegraphics[width=5in]{fig/triangle.eps}
\end{figure}

The calculation of the matrices $C_{\ensuremath{\mathbf{s}} _p,\ensuremath{\mathbf{s}} _q}$ is performed in time $O(\vert\ensuremath{\mathbf{s}} _p\vert \cdot \vert\ensuremath{\mathbf{s}} _q\vert)$by computing forward and reverse matrices in a similar way as described by Vingron & Argos, 1991. Note that there exists, for every fixed cut position $\hat c_p$ of $\ensuremath{\mathbf{s}} _p$, at least one cut position $c_q(\hat c_p)$ of $\ensuremath{\mathbf{s}} _q$with $C_{\ensuremath{\mathbf{s}} _p,\ensuremath{\mathbf{s}} _q}[\hat c_p,c_q(\hat c_p)] = 0$ which can be computed easily from any optimal pairwise alignment of $\ensuremath{\mathbf{s}} _p$ and $\ensuremath{\mathbf{s}} _q$. The problem multiple alignments have to face, is that given a cut position $\hat c_r$ of $\ensuremath{\mathbf{s}} _r$, the optimal cut position $c_q\left(c_p(\hat c_r)\right)$ of $\ensuremath{\mathbf{s}} _q$relative to the cut position $c_p(\hat c_r)$ of $\ensuremath{\mathbf{s}} _p$ might not coincide with the optimal cut position $c_q\left(\hat c_r\right)$ of $\ensuremath{\mathbf{s}} _q$ relative to the cut position $\hat c_r$ of $\ensuremath{\mathbf{s}} _r$. In other words, pairwise optimal alignments may be incompatible with one another - much in analogy to frustrated systems considered in statistical physics.

To search for good k -tuples of cut positions, we therefore define the multiple additional charge $C(c_1,\ldots,c_k)$imposed by slicing the sequences at any given k -tuple of cut positions $(c_1,\ldots,c_k)$ as a weighted sum of secondary charges over all projections (cp,cq) , that is, we put

\begin{displaymath}C(c_1,c_2,\ldots,c_k)\;:=\;
\sum_{1\leq p<q\leq k}\alpha_{p,q...
...nsuremath{\mathbf{s}} _p,\ensuremath{\mathbf{s}} _q}[c_p,c_q],
\end{displaymath}where the $\alpha_{p,q}$ are the sequence-dependent weight factors defined in context of weighted sum-of-pairs alignment score.

Our proposition is now that using those k -tuples as the preferred cut-position combinations that minimize - for a given fixed cut position $\hat c_p$ of, say, the longest sequence $\ensuremath{\mathbf{s}} _p$which we choose in the middle of that sequence - the value $C(c_1,\ldots,c_{p-1},\hat c_p,c_{p+1},\ldots,c_k)$ over all cut positions $c_1,\ldots,$ $c_{p-1},c_{p+1},\ldots,c_k$ of $\ensuremath{\mathbf{s}} _1,\ldots,\ensuremath{\mathbf{s}} _{p-1},\ensuremath{\mathbf{s}} _{p+1},\ldots,\ensuremath{\mathbf{s}} _k$, respectively, will result in very good, if not optimal multiple alignments because, in this way, the mutual frustration is distributed as fairly as possible.

Summary

In conclusion, reiterating this process until all sequences in any given subsequence family have a length not exceeding the stop size L , DCA simply performs the following general procedure:


\begin{tabular}{llll}
\multicolumn{4}{l}{{\bf Algorithm} {\it DCA}$\,(\,\ensurem...
...ensuremath{\mathbf{s}} _2,
\ldots,\ensuremath{\mathbf{s}} _k)$ .}
\end{tabular}

The subroutine calc-cut computes a C -optimal k -tuple of cut positions as described above (for details regarding suitable branch-and-bound approaches towards that optimization problem, see Stoye et al., 1997; Perrey & Stoye, 1996; Brinkmann et al., 1997).



J. Stoye, V. Moulton