PLNT4610/PLNT7690 Bioinformatics
Lecture 6, part 2 of 2

3. Constructing  multiple alignments

a. Dynamic Programming

An optimal multiple alignment depends on all possible pairwise comparisons of all amino acids in each protein at each position.  It is possible to extend the dynamic programming algorithm for pairwise comparisons from the special case of k=2 sequences to the more general case of k=n sequences.

Scoring must account for those comparisons. Consider the alignment

MQPILLL
MLR-LL-
MK-ILLL
MPPVLIL


The sum-of-pairs (SP) function scores each position in the protein, that is, each column, as the sum of the pairwise scores. For k sequences, there are k(k-1)/2 unique pairwise comparisons, excluding self comparisons. For example, in column 4, the score would be

SP-score(I,-,I,V) =  p(I,-) + p(I,I) + p(I,V) + p(-,I) + p(-,V) + p(I,V)

where p(a,b) is the pairwise score of two amino acids. This function is valid regardless of the order of sequences in the alignment.
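
As a minimal sketch of the sum-of-pairs calculation, the Python below scores one column of the example alignment. The scoring values in pair_score (match +1, mismatch -1, residue against gap -2) are illustrative assumptions, not values from the lecture; a real program would look up p(a,b) in a substitution matrix. Only p(-,-) = 0 follows the definition used in these notes.

```python
from itertools import combinations

GAP = "-"

def pair_score(a, b):
    """Toy pairwise score p(a, b). All values except p(-,-) are assumptions."""
    if a == GAP and b == GAP:
        return 0                    # two gaps: p(-,-) = 0 by definition
    if a == GAP or b == GAP:
        return -2                   # residue against a gap (assumed penalty)
    return 1 if a == b else -1      # assumed match / mismatch scores

def sp_column_score(column):
    """Sum p(a, b) over all k(k-1)/2 unordered pairs of characters in a column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

def sp_alignment_score(rows):
    """Sum the column scores of an alignment given as equal-length strings."""
    return sum(sp_column_score(col) for col in zip(*rows))

alignment = ["MQPILLL", "MLR-LL-", "MK-ILLL", "MPPVLIL"]
column_4 = [row[3] for row in alignment]    # the fourth column: I, -, I, V
print(column_4, sp_column_score(column_4))
print(sp_alignment_score(alignment))
```

Because combinations() enumerates unordered pairs, shuffling the rows of the alignment changes the order of the terms but not their sum.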

Unlike pairwise alignments, it is legitimate in multiple alignments to have a match between two gap characters. By definition, p(-,-) = 0. Although one might be tempted to give a gap a highly negative score, two sequences that match at a gap both have a deletion at that position that is not shared by the other sequences. Matching gaps should therefore get a higher score than, for example, a mismatch between two very different amino acids, such as Proline and Lysine.

Generalizing the pairwise dynamic programming algorithm to k sequences is, in essence, doing an alignment in k dimensions. For example, an optimal alignment between k=3 sequences could be visualized in 3 dimensions:

Figure 2 from Zucker [http://www.genetics.wustl.edu/bio5495/1999-course/lecture.7/]

The problem with the dynamic programming algorithm is size and speed. To start with, where k sequences of length n are compared, the storage required for the alignment matrix is O(n^k). For 20 sequences of 500 amino acids each, it would require 500^20 (about 10^54) units of storage (e.g. bytes).

Recall that in a pairwise comparison, for each cell in the 2-dimensional array the score was dependent on the 3 preceding adjacent cells. In a k-dimensional array, there are 2^k - 1 preceding cells. (This makes sense: for k=2, i.e. a pairwise comparison, 2^2 - 1 = 3.) Finally, for each column, there are k(k-1)/2 pairwise comparisons.

In summary, the time required for a truly exhaustive multiple alignment, using the most straightforward approach, is O(k^2 2^k n^k).
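
To get a feel for these numbers, the following Python sketch multiplies out the three factors discussed above. The function names are hypothetical, and the product is only a rough operation count under the assumptions of the straightforward approach, not a measured running time.

```python
def pairs_per_column(k):
    """Number of unique pairwise comparisons in one column: k(k-1)/2."""
    return k * (k - 1) // 2

def predecessor_cells(k):
    """Number of preceding adjacent cells in a k-dimensional array: 2^k - 1."""
    return 2 ** k - 1

def matrix_cells(n, k):
    """Number of cells in the alignment matrix for k sequences of length n: n^k."""
    return n ** k

def exhaustive_steps(n, k):
    """Rough step count: every cell looks at every predecessor and scores every pair."""
    return pairs_per_column(k) * predecessor_cells(k) * matrix_cells(n, k)

for k in (2, 3, 5, 20):
    print(k, predecessor_cells(k), f"{matrix_cells(500, k):.2e}",
          f"{exhaustive_steps(500, k):.2e}")
```

Python integers have arbitrary precision, so 500^20 is computed exactly before being formatted for printing.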

Figure from Zucker [http://www.genetics.wustl.edu/bio5495/1999-course/lecture.7/]

Although some methods have introduced great efficiencies that bring global dynamic programming to a level that is practical for a handful of sequences, more efficient approximate methods are needed for typical alignment problems.

Probably the single most important speedup for dynamic programming methods is that of Carrillo and Lipman [Carrillo and Lipman (1988) SIAM J. Appl. Math. 48:1073-1082, Altschul and Lipman (1989) SIAM J. Appl. Math. 49:197-209]. Even so, this method only scales to about 8 sequences, so it is not usually practical.

Optimal multiple alignment is an example of an NP-complete problem. NP-complete problems are problems for which no polynomial-time solution is known. A polynomial-time algorithm requires O(n^c) steps, where c is a constant. In contrast, the known exact algorithms for NP-complete problems require on the order of O(n^x) steps, where the exponent x grows with the size of the dataset.


b. Heuristic alignment methods

It is often the case that good approximations to an otherwise intractable problem can be found by heuristic methods. Heuristic methods take a "learn as you go" strategy, in which the algorithm solves small parts of the problem at a time, and gradually converges on a solution.

Heuristic methods are not guaranteed to find the optimal solution, but can sometimes be shown to produce a solution that is close to optimal. Heuristics are often very sensitive to the starting conditions, which are often arbitrary. That is, different starting conditions may yield different solutions.

1. CLUSTAL, TCOFFEE - global alignment using Neighbor-Joining Trees

(Note: TCOFFEE should be considered the successor to the CLUSTAL family of programs. It is generally advisable to use TCOFFEE, rather than the older CLUSTAL.)

Hierarchical clustering methods work by aligning small sets of closely-related sequences into sub-alignments, and then aligning the alignments, until all sequences are represented in an alignment.

Hierarchical methods work on the hypothesis that sequences in an alignment will reflect their evolutionary history. That is, if one were to go from one sequence to the next most-closely related one, one would visit all nodes on the phylogenetic tree describing how these sequences evolved.

As illustrated, the sequences to be aligned are the end nodes, the "leaves" of the tree.

Algorithm:
    Calculate distances between all possible pairs of sequences
    Construct a Neighbor-Joining tree from the pairwise distances
    while not all nodes on the tree have been visited
        align each pair of sequences or profiles at the terminal nodes
        replace the aligned sequences with a profile representing the
            alignment of all sequences below that node
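
The order in which the loop above merges sequences and profiles can be sketched in Python. This sketch makes several assumptions: the guide tree is supplied ready-made as nested 2-tuples of sequence names (it does not construct a Neighbor-Joining tree), and no actual alignment is performed. It only records which groups of sequences would be aligned with each other at each step; the names merge_order and guide_tree are hypothetical.

```python
def merge_order(tree, steps=None):
    """Post-order walk of a guide tree given as nested 2-tuples of names.

    Returns the names of the leaves below `tree` and the list of merge steps,
    each step being a pair (left group, right group) that would be aligned.
    """
    if steps is None:
        steps = []
    if isinstance(tree, str):           # a leaf: a single sequence
        return [tree], steps
    left, right = tree
    left_members, _ = merge_order(left, steps)
    right_members, _ = merge_order(right, steps)
    # Both subtrees are finished, so their profiles can now be aligned.
    steps.append((tuple(left_members), tuple(right_members)))
    return left_members + right_members, steps

# Hypothetical guide tree for five sequences named A to E.
guide_tree = ((("A", "B"), "C"), ("D", "E"))
members, steps = merge_order(guide_tree)
for left_group, right_group in steps:
    print("align", left_group, "with", right_group)
```

Each step joins two groups that were completed earlier, so the groups grow until the final step, at the root, covers every sequence.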

In effect, this algorithm keeps going deeper and deeper into the tree, clustering larger and larger groups of sequences. Clusters (profiles) are merged, until the root of the tree is reached.

A subtle feature that distinguishes tree alignments from star alignments is that distances between nodes (i.e. between sequences) are calculated, rather than similarities. Distances are the amount subtracted from a score, due to amino acid substitutions, when one amino acid sequence is transformed into another.
 
from G. Fuellen, Multiple Alignment. Complexity International 4, 1997. URL http://journal-ci.csse.monash.edu.au/ci/vol04/mulali/mulali.html.

Again, as in the star alignment: once a gap, always a gap.

The efficiency of a tree alignment is comparable to that of a star alignment.

2. DIALIGN-T - recursive alignment of high-scoring local diagonals

Amarendran R Subramanian , Jan Weyer-Menkhoff , Michael Kaufmann  and Burkhard Morgenstern. DIALIGN-T: An improved algorithm for segment-based multiple sequence alignment. BMC Bioinformatics 2005, 6:66    doi:10.1186/1471-2105-6-66

Morgenstern B (1999) DIALIGN2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15:211-218.

Most of the commonly-used multiple sequence alignment programs work in a fashion similar to CLUSTAL and TCOFFEE, by building the alignment based on a guide tree. There are several problems with this approach:

1. The guide tree strongly influences the final alignment.
The final alignment is highly dependent on the order in which sequences are added to the alignment. The sequences added first have a strong influence on the final alignment, whereas the sequences added last have less of an influence. The order of addition is based on a crude Neighbor Joining tree. Normally, pairwise distances for an NJ tree would be calculated from a multiple alignment. Since the alignment has not yet been done, pairwise distances are based on individual pairwise alignments. Thus, the least robust tree construction method guides the construction of the alignment.

2. The guide tree approach assumes that all parts of all sequences are consistent with a global alignment.
A global multiple alignment, optimal across all sequences, assumes that every position (column) in an alignment represents a set of homologous sites, present in all sequences. This assumption is seldom true in gene families which have had a chance to undergo insertions, deletions and rearrangements over time.

To address these issues, Morgenstern and colleagues have devised a multiple alignment algorithm which aligns segments of sequence for which strong evidence for homology exists. This method does not require that each segment be present in all sequences contributing to the alignment.

DIALIGN constructs multiple alignments from gap-free pairs of sequence segments, often referred to as "diagonals". DIALIGN conceptualizes a multiple alignment as an alignment of diagonals taken from individual pairwise alignments. Global alignments align positions, considering each position in the alignment in isolation from all other positions. DIALIGN aligns diagonals.

In order to be included in an alignment, two diagonals from a given pairwise alignment must be consistent. "A collection of diagonals is called consistent if there exists an alignment such that all segment pairs are matched."
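
For the simplest case, diagonals between just two sequences, consistency can be checked directly. The Python below is a sketch under that assumption only; it is not DIALIGN's own procedure, and consistency across three or more sequences needs more bookkeeping. The Diagonal type and function names are hypothetical. Two diagonals are treated as consistent if they lie on the same offset, or if one ends before the other begins in both sequences.

```python
from itertools import combinations
from typing import NamedTuple

class Diagonal(NamedTuple):
    """A gap-free segment pair: `length` residues starting at position
    start_a in sequence A matched to those starting at start_b in sequence B."""
    start_a: int
    start_b: int
    length: int

def pair_consistent(d1, d2):
    """True if both diagonals can be part of one alignment of A and B."""
    # Segments on the same offset never contradict each other.
    if d1.start_a - d1.start_b == d2.start_a - d2.start_b:
        return True
    # Otherwise the earlier one must end before the later one starts
    # in BOTH sequences (no overlap and no crossing).
    first, second = sorted([d1, d2])
    return (first.start_a + first.length <= second.start_a
            and first.start_b + first.length <= second.start_b)

def all_consistent(diagonals):
    """For two sequences, a set is consistent if every pair of diagonals is."""
    return all(pair_consistent(x, y) for x, y in combinations(diagonals, 2))

d1 = Diagonal(0, 0, 5)
d2 = Diagonal(7, 9, 4)
d3 = Diagonal(3, 10, 4)      # starts before d2 in A but after d2 in B: crosses d2
print(all_consistent([d1, d2]), all_consistent([d1, d2, d3]))
```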

In Figure 1a, a pairwise alignment between two sequences contains two high-scoring diagonals.

a)
As shown in b, it is possible to break a single diagonal such as f into several higher-scoring diagonals. Note that only high-scoring regions of the original diagonal are included in the smaller diagonals.

b)
Figure 1c shows that better small diagonals might be obtained from the same sequences by shifting (ie. equivalent to inserting gaps) off of the main diagonal.

c)
Based on Figure 1 from Morgenstern B (1999) DIALIGN2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15:211-218.

DIALIGN typically requires three iteration steps to construct an alignment. In the first iteration, all possible pairwise alignments are done on the sequence set, and diagonals are scored. Scores include both a raw score for the complete diagonal and an overlap weight, which reflects the extent to which one diagonal overlaps another. A high overlap weight favors diagonals occurring in more than two sequences. For the three sequences shown in the first iteration step below, four diagonals are initially included in the alignment set M1 = {D1, D2, D3, D4}. The set of diagonals {D1, D2, D4} is consistent, because all three can be included in an alignment spanning the three sequences. D3 is not consistent with this alignment, so it is discarded from the set of diagonals, leaving M1 = {D1, D2, D4}.


"In the second iteration step, those parts of the sequences that are not yet directly aligned are realigned."
A new diagonal, D5, is discovered, which replaces {D2,D4}. Thus alignment set M2 = {D1,D5}.


A third iteration step searches for new diagonals that have not already been discovered. In the third iteration, if no new diagonals are discovered, the alignment is considered to be complete. Typically, DIALIGN only requires three iteration steps.

In essence, DIALIGN can be thought of as implementing dynamic programming on diagonals, rather than on individual amino acid or nucleotide positions. It is therefore quite fast.

The time complexity of DIALIGN is O(N^4 x L^2), where N is the number of sequences, and L is the maximum length of the sequences.

One of the advantages of DIALIGN cited by the authors is that, unlike other methods, there are few parameters to adjust. In particular, DIALIGN does not explicitly consider gaps, and therefore does not need to assign gap penalties. This is an advantage, because there is no rigorous model for how to calculate gap penalties. With other alignment methods, the choice of begin_gap and extend_gap penalties is largely an arbitrary judgement.
