Universität Bielefeld - Technische Fakultät - AG Praktische Informatik - FSPM - Strukturbildungsprozesse
Divide-and-Conquer Multiple Sequence AlignmentSimultaneous versus Iterative Multiple Alignments |
|
|---|
This page is hypertext adaptation of a section from Perrey et al., 1997. For other surveys on multiple sequence alignment on the Web, see e.g. that of Georg Füllen.
Iterative alignment methods (such as Clustal W, see also BCM Search Launcher: Multiple Sequence Alignments) are very fast, and they can align, in principle, any number of sequences. Over the years, they have successfully incorporated growing biological knowledge regarding e.g. the use of appropriate substitution matrices, region-dependent gap-penalties and weighting schemes (see for instance Thompson et al., 1994). The resulting alignments are quite acceptable for families of moderately diverged sequences. Yet, when it comes to align a family of highly diverged sequences, they easily run into local, but not necessarily global optima, - a risk which is of course inherent in any hill-climbing bottom-up method.
In addition, while a simultaneous alignment procedure (such as MSA or DCA) tries to optimize some well-defined score function for multiple sequence alignment (e.g. the weighted sum-of-pairs score, see also Altschul & Lipman, 1989), iterative methods often do not even accept such a function (which might be tested for its biological significance and improved) as a standard of truth.
A further problem of iterative alignment procedures is that their results depend crucially on the precalculated order in which the sequences are processed , while simultaneous alignment procedures take all sequences into account simultaneously and therefore do not depend on any precalculated grouping of the sequences in question.
Most importantly, iterative methods perform a series of pairwise alignments whereby pairs of more closely related sequences - or pairs of already aligned subfamilies of sequences - are considered first. This can go wrong for the following reasons:
Simultaneous alignment procedures offer the following advantages over iterative ones:
A further problem in biological sequence alignment is that given any two sequences, their biologically most plausible pairwise alignment often is not optimal but just suboptimal, and sometimes remains suboptimal even after trying carefully and specifically to optimize the employed set of alignment parameters with regard to the sequences in question. The reason for this is that the mutation process in different areas of a biomolecule can hardly be modelled by one and the same model, and in particular not by a model which presupposes the validity of some kind of a general stochastic Fermat Principle for such a process. For instance, it is well-known that the residues in biomolecular sequences are often correlated and do not evolve by some purely stochastic mutation process, independently of the remaining residues. Yet, because these correlations are very difficult to model correctly and, provided even that that could be achieved, are almost impossible to handle computationally, almost all alignment procedures (including MSA and DCA, so far) employ a single substitution cost matrix across the entire sequence. And even more problems arise when it comes to model appropriately the probabilities for the occurrence of insertions and deletions.
Yet, it is virtually impossible to handle all possibly relevant suboptimal alignments in a computationally acceptable way because, unfortunately, the number of alignments to be taken into consideration generally grows enormously upon even the slightest relaxation of the optimality criterion (Waterman, 1983). Therefore, iterative methods can go wrong in
It was also noticed that phylogenetic trees calculated from sequence alignments produced by iterative methods had a bias towards the tree that these methods used just for processing the sequences (Lake, 1991; Thorne & Kishino, 1992). In contrast, an alignment procedure which takes all the sequences to be aligned into account simultaneously, produces alignments for subsequent phylogenetic tree reconstruction which are not biased by a previously chosen ``guide tree''.