Universität Bielefeld - Technische Fakultät - AG Praktische Informatik - FSPM - Strukturbildungsprozesse

Divide-and-Conquer Multiple Sequence Alignment

Simultaneous Multiple Sequence Alignment
DCA home

This page gives some background on the problem of finding sum-of-pairs multiple sequence alignments, which are approximated by DCA. The text is from the definition section of Stoye, 1997. For further reference on alignments and similar concepts, see e.g. (Sankoff & Kruskal, 1983; Doolittle, 1990; Waterman, 1995; Morgenstern et al., 1997).

Sequence Alignments

Suppose that we are given a family $S = (\ensuremath{\mathbf{s}} _1,\ldots,\ensuremath{\mathbf{s}} _k)$of k sequences:


\begin{eqnarray*}\ensuremath{\mathbf{s}} _1 & = & s_{11} \, s_{12} \ldots s_{1n_...
...ensuremath{\mathbf{s}} _k & = & s_{k1} \, s_{k2} \ldots s_{kn_k}
\end{eqnarray*}

of various lengths n1 to nk , where each sequence entry sij represents a letter from a given finite alphabet ${\cal A}$. An alignment of the sequences S is a matrix $M = (m_{ij})_{1\leq i\leq k, 1\leq j\leq N}$ where

We denote the set of all alignments of S by MS (see also Morgenstern et al., 1997 for a more general definition of alignment).

The Sum-of-Pairs Alignment Problem

Next, assume that for each pair of rows $\ensuremath{\mathbf{m}} _p$ and $\ensuremath{\mathbf{m}} _q$ in an alignment, a (non-negative) score function $w(\ensuremath{\mathbf{m}} _p,\ensuremath{\mathbf{m}} _q)$ has been defined which measures the quality of the alignment of $\ensuremath{\mathbf{s}} _p$ and $\ensuremath{\mathbf{s}} _q$defined by these two rows (upon elimination of common gaps), and denote by $w_{opt}(\ensuremath{\mathbf{s}} _p,\ensuremath{\mathbf{s}} _q)$ the minimum of $w(\ensuremath{\mathbf{m}} _p,\ensuremath{\mathbf{m}} _q)$, taken over all alignments $M \in M_{\{\ensuremath{\mathbf{s}} _p,\ensuremath{\mathbf{s}} _q\}}$(which we suppose to vanish if and only if $\ensuremath{\mathbf{s}} _p$ coincides with $\ensuremath{\mathbf{s}} _q$). See Waterman, 1995 for a thorough discussion of score functions for pairwise alignment and note that, as an alternative to minimizing a dissimilarity score, one may also aim to maximise an appropriately defined similarity score.

The weighted sum-of-pairs score (Altschul & Lipman, 1989) for an alignment $M\in M_S$ relative to a given family of (generally non-negative) weight parameters $\alpha_{p,q}\:(1\le p<q \le k)$ is now defined by \begin{displaymath}w(M) := \sum_{1\leq p<q\leq k} \alpha_{p,q}\cdot w(\ensuremath{\mathbf{m}} _p,\ensuremath{\mathbf{m}} _q).
\end{displaymath}The multiple alignment problem then is to find matrices $M\in M_S$ whose weighted sum of pairs score w(M) is minimal.

Sequence Weights

The logic for introducing the weight parameters $\alpha_{p,q}$ (from which a simple version is implemented in DCA, see below) is the following: In general, any set of related biological sequences contains some sequences which are more closely related to one another than to the remaining ones, and highlighting their similarity, at least to some extent, might often be more important than forcing them to independently conform to the patterns of the other sequences. On the other hand, as almost any sample of sequences is biased in one way or the other (even, most probably, the sample provided by Nature itself), a perhaps over-represented subset of highly homologous sequences in a data set should not be allowed through its sheer size to force all the other sequences to conform to its patterns. Both goals, highlighting similarity between closely related sequences and discounting over-representation of certain subclasses of sequences can (hopefully) be achieved by choosing appropriate weight factors.

In our context, it seems biologically plausible to give higher weights to the more similar pairs of sequences, as having them aligned optimally should be more important than aligning two fairly unrelated sequences optimally on the expense of worsening a good alignment of closely related sequences. Therefore, our internal sequence weights are computed by the following formula:

\begin{eqnarray*}\alpha_{p,q}
& := &
\left\{\begin{array}{ll} 1 & \mathtt{if}\ \...
...e}}%
{\mathit{maxscore}} & \mathtt{otherwise}
\end{array}\right.
\end{eqnarray*}

where $\mathit{minscore}\! :=\!
\min_{1\leq i<j\leq k}\{\overline{w_{opt}}(\ensuremath{\mathbf{s}} _i,\ensuremath{\mathbf{s}} _j)\}$, $\mathit{maxscore}\! :=\!
\max_{1\leq i<j\leq k} \{\overline{w_{opt}}(\ensuremath{\mathbf{s}} _i,\ensuremath{\mathbf{s}} _j)\}$, and $\lambda$ is a (user-specifiable) parameter, the weight intensity. The weight intensity may be set to any value between $\lambda = 1$(maximum weighting: smallest weights for (p,q) with $\overline{w_{opt}}(\ensuremath{\mathbf{s}} _p,\ensuremath{\mathbf{s}} _q)=\mathit{maxscore}$, highest weights $\alpha_{p,q}=1$for (p,q) with $\overline{w_{opt}}(\ensuremath{\mathbf{s}} _p,\ensuremath{\mathbf{s}} _q)=\mathit{minscore}$) and $\lambda = 0$(no weighting: $\alpha_{p,q}=1$ for all $1\leq p<q\leq k$).

Dynamic Programming

It is well known that the multiple sequence alignment problem can be solved optimally by dynamic programming with a running time proportional to $(2^k - 1) \cdot \prod_{i=1}{k} n_i$, searching for a shortest alignment path in a directed graph with $\prod_{i=1}^k n_i$ vertices. Faster variants and a number of speed-ups have therefore been proposed: the most successful concepts are implemented in the alignment tool MSA (see the MSA home page) which is based on a branch-and-bound technique developed by Carrillo & Lipman (1988), efficiently implemented by Gupta et al., 1995.

However, due to the fact than the multiple sequence alignment problem is NP-complete (Wang and Jiang, 1994), any attempt of developing a fast algorithm to compute optimal multiple sequence alignments is expected to fail.

That is why most of the practical alignment programs are based on heuristics producing near-optimal alignments. The fastest, so-called iterative alignment methods, progressively align pairs of sequences. In contrast - although heuristically as well - DCA aligns the sequences simultaneously.



J. Stoye, V. Moulton