Universität Bielefeld - Technische Fakultät - AG Praktische Informatik - FSPM - Strukturbildungsprozesse

Divide-and-Conquer Multiple Sequence Alignment

Simultaneous versus Iterative Multiple Alignments
DCA home

This page is hypertext adaptation of a section from Perrey et al., 1997. For other surveys on multiple sequence alignment on the Web, see e.g. that of Georg Füllen.

Iterative Alignment Methods

Iterative alignment methods (such as Clustal W, see also BCM Search Launcher: Multiple Sequence Alignments) are very fast, and they can align, in principle, any number of sequences. Over the years, they have successfully incorporated growing biological knowledge regarding e.g. the use of appropriate substitution matrices, region-dependent gap-penalties and weighting schemes (see for instance Thompson et al., 1994). The resulting alignments are quite acceptable for families of moderately diverged sequences. Yet, when it comes to align a family of highly diverged sequences, they easily run into local, but not necessarily global optima, - a risk which is of course inherent in any hill-climbing bottom-up method.

In addition, while a simultaneous alignment procedure (such as MSA or DCA) tries to optimize some well-defined score function for multiple sequence alignment (e.g. the weighted sum-of-pairs score, see also Altschul & Lipman, 1989), iterative methods often do not even accept such a function (which might be tested for its biological significance and improved) as a standard of truth.

A further problem of iterative alignment procedures is that their results depend crucially on the precalculated order in which the sequences are processed , while simultaneous alignment procedures take all sequences into account simultaneously and therefore do not depend on any precalculated grouping of the sequences in question.

Most importantly, iterative methods perform a series of pairwise alignments whereby pairs of more closely related sequences - or pairs of already aligned subfamilies of sequences - are considered first. This can go wrong for the following reasons:

1.
There is often more than one optimal alignment for any given pair of sequences - or pair of already aligned subfamilies - so that an arbitrary decision has to be made for choosing one of those for further consideration (at least, unless additional effort is spent to find out which of those might be biologically the most plausible one - or to keep track of all of them).
2.
Optimal alignments are often highly sensitive with respect to parameter changes, in particular, when more distantly related sequences (or subfamilies) are to be aligned. Consequently, the set of optimal pairwise alignments one has to choose from may depend strongly on the parameters used in the calculation.
3.
Once all members of some closely related subfamily have been considered, the alignment of this subfamily is locked so that any more distantly related sequences can never have an influence on this sub-alignment. Yet, it was noticed that reopening such a locked alignment for performing pairwise alignments of two subfamilies or simultaneous three-way alignments often improves the overall alignment. However, even these procedures are still based on locking some sub-alignments.

Simultaneous Alignment Methods

Simultaneous alignment procedures offer the following advantages over iterative ones:

1.
Because such procedures align a family of sequences in one step, the ``multiple optima problem'' does not present itself at all in the construction process. Of course, there can also be more than one ``final'' optimal alignment of the whole sequence family. However, at this final stage, that might actually present quite interesting and valuable information, e.g. for figuring out ``uncertain alignment sites''.
2.
More often than not, a simultaneous alignment procedure should be more robust against parameter changes than procedures based on pairwise alignments, particularly for data sets consisting of a comparatively large number of sequences.
3.
Already D.J. Lipman et al. (Lipman et al., 1989) observed that the quality of a simultaneous alignment of a sequence family (quality in terms of the number of alignment sites which agree with an alignment based on the structure of the molecules involved) generally increases with the number of members in that family. And they observed that that quality can even be increased by including one or two ``outgroup sequences'' - that is, more distantly related sequences - in a simultaneous alignment procedure, a fact which almost by definition cannot hold for iterative procedures.

The Problem of Choosing Alignment Parameters

A further problem in biological sequence alignment is that given any two sequences, their biologically most plausible pairwise alignment often is not optimal but just suboptimal, and sometimes remains suboptimal even after trying carefully and specifically to optimize the employed set of alignment parameters with regard to the sequences in question. The reason for this is that the mutation process in different areas of a biomolecule can hardly be modelled by one and the same model, and in particular not by a model which presupposes the validity of some kind of a general stochastic Fermat Principle for such a process. For instance, it is well-known that the residues in biomolecular sequences are often correlated and do not evolve by some purely stochastic mutation process, independently of the remaining residues. Yet, because these correlations are very difficult to model correctly and, provided even that that could be achieved, are almost impossible to handle computationally, almost all alignment procedures (including MSA and DCA, so far) employ a single substitution cost matrix across the entire sequence. And even more problems arise when it comes to model appropriately the probabilities for the occurrence of insertions and deletions.

Yet, it is virtually impossible to handle all possibly relevant suboptimal alignments in a computationally acceptable way because, unfortunately, the number of alignments to be taken into consideration generally grows enormously upon even the slightest relaxation of the optimality criterion (Waterman, 1983). Therefore, iterative methods can go wrong in

Instead, a simultaneous alignment procedure forces every pair of sequences into a certain suboptimal pairwise alignment, depending on all other sequences of the family to be aligned, because it attempts to optimize an overall score function. As it happens, these suboptimal pairwise alignments often are the biologically (more) plausible ones, because they are calculated relative to the other sequences of the family under consideration (see the example below).

It was also noticed that phylogenetic trees calculated from sequence alignments produced by iterative methods had a bias towards the tree that these methods used just for processing the sequences (Lake, 1991; Thorne & Kishino, 1992). In contrast, an alignment procedure which takes all the sequences to be aligned into account simultaneously, produces alignments for subsequent phylogenetic tree reconstruction which are not biased by a previously chosen ``guide tree''.



J. Stoye, V. Moulton