Universität Bielefeld - Technische Fakultät - AG Praktische Informatik - FSPM - Strukturbildungsprozesse
Divide-and-Conquer Multiple Sequence AlignmentThe DCA Algorithm |
|
|---|
This page gives a brief overview of the mathematics and algorithmics behind DCA. More thorough descriptions can be found in the journal articles on the DCA references page.
The general idea of DCA is rather simple: Each sequence is cut in two behind a suitable cut position somewhere close to its midpoint. This way, the problem of aligning one family of (long) sequences is divided into the two problems of aligning two families of (shorter) sequences, the prefix and the suffix sequences. This procedure is re-iterated until the sequences are sufficiently short - say, shorter than a pregiven stop size L which is a parameter of DCA - so that they can be aligned optimally by MSA. Finally, the resulting short alignments are concatenated, yielding a multiple alignment of the original sequences. The following figure sketches this general procedure.
Of course, the main difficulty with this approach is how to identify those cut-position combinations which lead to an optimal or - at least - close to optimal concatenated alignment. Here, a heuristic based on so-called additional-cost matrices which are used for quantifying the compatibility of cut positions in distinct sequences proved to be successful.
More precisely,
given a sequence
of length n and a
cut position
,
we denote by
the
prefix sequence
and by
the suffix
sequence
,
and
we use the dynamic programming procedure to compute, for
all pairs of sequences
and for all cut positions cp of
and
cq of
,
the additional cost
defined by
which quantifies the additional charge imposed by forcing the alignment
of
and
to optimally align the two prefix sequences
and
as well as the two suffix sequences
and
,
rather than aligning
and
optimally.
The following figure illustrates this definition.
The calculation of the matrices
is performed
in time
by computing forward and reverse matrices
in a similar way as described by
Vingron & Argos, 1991.
Note that there exists, for every fixed cut position
of
,
at least one cut position
of
with
which can be
computed easily from any optimal pairwise alignment of
and
.
The problem multiple alignments have to face, is that
given a cut position
of
,
the optimal cut position
of
relative to the cut position
of
might not coincide with the optimal cut position
of
relative to the cut position
of
.
In other words, pairwise optimal alignments may be
incompatible with one another - much in analogy to frustrated systems
considered in statistical physics.
To search for good k -tuples of cut positions,
we therefore define the multiple additional charge
imposed by
slicing the sequences at any given k -tuple of cut positions
as a weighted sum of secondary charges over all projections
(cp,cq) , that is, we put
where the
are the sequence-dependent weight factors
defined in context of
weighted sum-of-pairs alignment score.
Our proposition is now that using those k -tuples as the
preferred cut-position combinations
that minimize - for a given fixed cut position
of, say, the
longest sequence
which we choose in the middle of that sequence - the
value
over all cut positions
of
,
respectively,
will result in very good, if not optimal multiple alignments because,
in this way, the mutual frustration is distributed as fairly as
possible.
In conclusion, reiterating this process until all sequences in any given subsequence family have a length not exceeding the stop size L , DCA simply performs the following general procedure:
The subroutine calc-cut computes a C -optimal k -tuple of cut positions as described above (for details regarding suitable branch-and-bound approaches towards that optimization problem, see Stoye et al., 1997; Perrey & Stoye, 1996; Brinkmann et al., 1997).