Universität Bielefeld - Technische Fakultät - AG Praktische Informatik - FSPM - Strukturbildungsprozesse
Divide-and-Conquer Multiple Sequence AlignmentSimultaneous Multiple Sequence Alignment |
|
|---|
This page gives some background on the problem of finding sum-of-pairs multiple sequence alignments, which are approximated by DCA. The text is from the definition section of Stoye, 1997. For further reference on alignments and similar concepts, see e.g. (Sankoff & Kruskal, 1983; Doolittle, 1990; Waterman, 1995; Morgenstern et al., 1997).
Suppose that we are given a family
of k sequences:
of various lengths n1 to nk ,
where each sequence entry sij represents a letter from a
given finite alphabet
.
An alignment
of the sequences S is a matrix
where
Next, assume that for each pair of rows
and
in an
alignment,
a (non-negative) score function
has been defined
which measures the quality of the alignment of
and
defined by these two rows (upon elimination of common gaps),
and denote by
the minimum of
,
taken over all alignments
(which we suppose to vanish if and only if
coincides
with
).
See Waterman, 1995
for a thorough discussion of score functions for
pairwise alignment and note that, as an alternative to minimizing a
dissimilarity score, one may also aim to maximise an appropriately
defined similarity score.
The weighted sum-of-pairs score
(Altschul & Lipman, 1989)
for an alignment
relative to a given family of (generally non-negative) weight parameters
is now defined by
The multiple alignment problem then is to
find matrices
whose weighted sum of pairs score
w(M) is minimal.
The logic for introducing the weight parameters
(from which a simple version is implemented in DCA, see below)
is the following: In general,
any set of related biological sequences
contains some sequences which are
more closely related to one another than to the
remaining ones, and highlighting their similarity, at least to some
extent, might often be more
important than forcing them to independently
conform to the patterns of the other sequences.
On the other hand, as almost any sample of sequences is biased in one
way or the other (even, most probably, the sample provided by Nature itself),
a perhaps over-represented subset of highly homologous
sequences in a data set should not be allowed through its sheer
size to force all the other sequences to conform to its
patterns. Both goals, highlighting similarity between closely related
sequences and discounting over-representation of certain subclasses of
sequences can (hopefully) be achieved by choosing appropriate
weight factors.
In our context, it seems biologically plausible to give higher weights
to the more similar pairs of sequences, as having them aligned
optimally should be more important than aligning two fairly
unrelated sequences optimally on the expense of worsening
a good alignment of closely related sequences.
Therefore, our internal sequence weights are computed by the
following formula:
where
,
,
and
is a (user-specifiable) parameter, the weight intensity.
The weight intensity may be set to any value between
(maximum weighting: smallest weights
for (p,q) with
,
highest weights
for (p,q) with
)
and
(no weighting:
for all
).
It is well known that the multiple sequence alignment problem
can be solved optimally by dynamic programming
with a running time proportional to
,
searching for a shortest alignment path in a directed graph
with
vertices.
Faster variants and a number of speed-ups have therefore been proposed:
the most successful concepts are implemented in the alignment tool
MSA (see the
MSA home page)
which is based on a branch-and-bound technique developed by
Carrillo & Lipman (1988),
efficiently implemented by
Gupta et al., 1995.
However, due to the fact than the multiple sequence alignment problem is NP-complete (Wang and Jiang, 1994), any attempt of developing a fast algorithm to compute optimal multiple sequence alignments is expected to fail.
That is why most of the practical alignment programs are based on heuristics producing near-optimal alignments. The fastest, so-called iterative alignment methods, progressively align pairs of sequences. In contrast - although heuristically as well - DCA aligns the sequences simultaneously.