| PLNT4610/PLNT7690
Bioinformatics Lecture 6, part 2 of 2 |
Scoring must account for those comparisons. Consider the alignment
MQPILLL
MLR-LL-
MK-ILLL
MPPVLIL
The sum-of-pairs (SP)
function scores each position in the protein,
that is, each column, as the sum of the pairwise scores. For k
sequences,
there are k(k-1)/2 unique pairwise comparisons, excluding self
comparisons.
For example, in column 4, the score would be
SP-score(I,-,I,V) = p(I,-) + p(I,I) + p(I,V) + p(-,I) + p(-,V) + p(I,V)
where p(a,b) is the pairwise score of two amino acids. This function is valid regardless of the order of sequences in the alignment.
Unlike pairwise alignments, it is legitimate in multiple alignments to have a match between two gap characters. By definition, p(-,-) = 0. Although one might be tempted to score a gap as highly negative, two sequences that match at a gap both have a deletion at that position that is not shared by the other sequences. Matching gaps should therefore get a higher score than, for example, a mismatch between two very different amino acids, such as proline and lysine.
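The sum-of-pairs calculation can be sketched in a few lines of Python. This is only an illustration: the function names and the numeric scores (+1 match, -1 mismatch, -2 residue against gap) are assumptions made for the example; a real program would take p(a,b) from a substitution matrix. Only p(-,-) = 0 comes from the definition above.

```python
from itertools import combinations

GAP = "-"

def pair_score(a, b):
    """Toy pairwise score p(a, b); the numeric values are illustrative assumptions."""
    if a == GAP and b == GAP:
        return 0                      # p(-,-) = 0 by definition
    if a == GAP or b == GAP:
        return -2                     # assumed residue-versus-gap penalty
    return 1 if a == b else -1        # assumed match / mismatch scores

def sp_column(column):
    """Sum p(a, b) over all k(k-1)/2 unordered pairs in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

def sp_alignment(rows):
    """SP score of a whole alignment: the sum of its column scores."""
    return sum(sp_column(col) for col in zip(*rows))

alignment = ["MQPILLL", "MLR-LL-", "MK-ILLL", "MPPVLIL"]
column4 = [row[3] for row in alignment]   # ['I', '-', 'I', 'V']
print(column4, sp_column(column4))
print("whole alignment:", sp_alignment(alignment))
```

With these assumed scores, column 4 evaluates to -7. Because every unordered pair is counted exactly once, shuffling the rows of the alignment does not change the result.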
Generalizing the pairwise dynamic programming algorithm to k sequences is, in essence, doing an alignment in k dimensions. For example, an optimal alignment between k=3 sequences could be visualized in 3 dimensions:

Figure 2 from Zucker [http://www.genetics.wustl.edu/bio5495/1999-course/lecture.7/]
The problem with the dynamic programming algorithm is size and speed. To start with, where k sequences of length n are compared, the storage required for the alignment matrix is $O(n^k)$. For 20 sequences of 500 amino acids each, it would require $500^{20}$ units of storage (e.g. bytes).
Recall that in a pairwise comparison, for each cell in the 2-dimensional array the score was dependent on the 3 preceding adjacent cells. In a k-dimensional array, there are $2^k - 1$ preceding cells. (This makes sense: for k=2, i.e. a pairwise comparison, $2^2 - 1 = 3$.) Finally, for each column, there are k(k-1)/2 pairwise comparisons.
In summary, the time required for a truly exhaustive multiple alignment, using the most straightforward approach, is $O(k^2 2^k n^k)$.
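To get a feel for these numbers, the three factors can be multiplied out directly. The sketch below is plain arithmetic on the formulas above; the function names are made up for this example, and the "time" figure is a count of elementary pairwise scorings, not seconds.

```python
def storage_cells(n, k):
    """Cells in a k-dimensional matrix for k sequences of length n."""
    return n ** k

def predecessors(k):
    """Preceding adjacent cells that each cell depends on."""
    return 2 ** k - 1

def pairs_per_column(k):
    """Unique pairwise comparisons in one alignment column."""
    return k * (k - 1) // 2

def time_estimate(n, k):
    """Rough operation count: pairs x predecessors x cells."""
    return pairs_per_column(k) * predecessors(k) * storage_cells(n, k)

for k in (2, 3, 5, 20):
    print(k, f"{storage_cells(500, k):.3e}", f"{time_estimate(500, k):.3e}")
```

For k=20 and n=500 the matrix alone has roughly 9.5 x 10^53 cells, which is why exhaustive dynamic programming is out of reach for all but a few sequences.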
Figure from Zucker [http://www.genetics.wustl.edu/bio5495/1999-course/lecture.7/]
|
Although some methods have introduced great efficiencies that bring global dynamic programming to a level that is practical for a handful of sequences, more efficient approximate methods are needed for typical alignment problems. Probably the single most important speedup method for dynamic programming is the one described in [Carrillo and Lipman (1988) SIAM J. Appl. Math. 48:1073-1082, Altschul and Lipman (1989) SIAM J. Appl. Math. 49:197-209]. This method only scales to about 8 sequences, so it is not usually practical. Optimal multiple alignment is an example of an NP-complete problem. NP-complete problems are problems for which no polynomial-time solution is known. A polynomial-time algorithm requires $O(n^c)$ steps, where c is a constant. In contrast, the known exact algorithms for an NP-complete problem require on the order of $O(n^x)$ steps, where x grows with the size of the dataset. |
![]() |
As illustrated, the
sequences to be aligned are the end nodes, the "leaves"
of the tree.
| Algorithm:
Calculate distances between all possible pairs of sequences
Construct a Neighbor-Joining tree from pairwise distances
while not all nodes on the tree have been visited
    align each pair of sequences or profiles at the terminal nodes
    replace aligned sequences with a profile representing the alignment of all sequences below that node |
In effect,
this algorithm keeps going deeper and deeper into the tree, clustering
larger and larger groups of sequences. Clusters (profiles) are merged,
until the root of the tree is reached.
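The order in which clusters are merged can be sketched as a bottom-up walk over the guide tree. The example below only records which groups would be aligned at each step; it does not build the tree or align anything. The nested-tuple tree, its topology and the sequence names A to E are all hypothetical.

```python
def merge_order(tree):
    """Return (cluster, merges) for a guide tree given as nested 2-tuples of names."""
    if isinstance(tree, str):                 # a leaf: a single sequence
        return (tree,), []
    left, left_merges = merge_order(tree[0])
    right, right_merges = merge_order(tree[1])
    # Children are merged before their parent, so clusters grow toward the root.
    return left + right, left_merges + right_merges + [(left, right)]

guide_tree = (("A", "B"), ("C", ("D", "E")))  # hypothetical topology
cluster, merges = merge_order(guide_tree)
for left, right in merges:
    print("align", left, "with", right)
```

In a real implementation each recorded step would be a sequence-to-sequence, sequence-to-profile or profile-to-profile alignment, and the tuple of names would be replaced by the resulting profile.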
A subtle feature that distinguishes tree alignments from star alignments
is that distances between nodes (i.e. between sequences) are calculated,
rather than similarities. A distance is the amount subtracted from a
score, due to amino acid substitutions, when one amino acid sequence is
transformed into another.
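As a concrete, deliberately simple stand-in for such a distance, the sketch below uses the fraction of differing residues between two aligned sequences (often called the p-distance). This is an assumption made for illustration, not necessarily the measure any particular program uses, and the function names are hypothetical. Columns where either sequence has a gap are skipped.

```python
from itertools import combinations

def p_distance(row_a, row_b, gap="-"):
    """Fraction of differing residues over columns where neither row has a gap."""
    compared = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not compared:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in compared) / len(compared)

def distance_table(named_rows):
    """Distances for every unordered pair of named, already aligned sequences."""
    return {
        (name_a, name_b): p_distance(row_a, row_b)
        for (name_a, row_a), (name_b, row_b) in combinations(named_rows.items(), 2)
    }

rows = {"s1": "MQPILLL", "s2": "MLR-LL-", "s3": "MK-ILLL", "s4": "MPPVLIL"}
for pair, d in distance_table(rows).items():
    print(pair, round(d, 3))
```

Identical sequences get a distance of 0, and the value grows as substitutions accumulate, which is the opposite direction from a similarity score.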

from
G. Fuellen, Multiple Alignment. Complexity International
4, 1997.
URL http://journal-ci.csse.monash.edu.au/ci/vol04/mulali/mulali.html.
Again, as in the star alignment, once a gap always a gap.
The efficiency of a tree
alignment is comparable to that of a star alignment.
Morgenstern
B (1999) DIALIGN2: improvement of the segment-to-segment approach to
multiple sequence alignment. Bioinformatics
15:211-218.
Most of the commonly-used multiple sequence alignment programs work in a fashion similar to CLUSTAL and TCOFFEE, by building the alignment based on a guide tree. There are several problems with this approach:
1.
The guide tree strongly
influences the final alignment.
The final alignment is highly dependent on the order in which sequences
are added to the alignment. The sequences added first have a strong
influence on the final alignment, whereas the sequences added last have
less of an influence. The order of addition is based on a crude
Neighbor Joining tree. Normally, pairwise distances for an NJ tree
would be calculated from a multiple alignment. Since the alignment has
not yet been done, pairwise distances are based on individual pairwise
alignments. Thus, the least robust tree construction method guides the
construction of the alignment.
2.
The guide tree approach
assumes that all parts of all sequences are consistent with a global
alignment.
A global multiple alignment, optimal across all sequences, assumes that
every position (column) in an alignment represents a set of homologous
sites, present in all sequences. This assumption is seldom true in gene
families which have had a chance to undergo insertions, deletions and
rearrangements over time.
To address these
issues, Morgenstern and colleagues have devised a multiple alignment
algorithm which aligns segments of sequence for which strong evidence
for homology exists. This method does not require that each segment be
present in all sequences contributing to the alignment.
DIALIGN constructs
multiple alignments from gap-free pairs of sequence segments, often
referred to as "diagonals". DIALIGN conceptualizes a multiple
alignment as being an alignment of diagonals taken from individual
pairwise alignments. Global alignments align positions, considering each
position in the alignment in isolation from all other positions.
DIALIGN aligns diagonals.
In order to be
included in an alignment, two diagonals from a given pairwise alignment
must be consistent. "A
collection of diagonals is called consistent if
there exists an alignment such that all segment pairs are matched."
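For two diagonals taken from the same pair of sequences, the consistency condition reduces to a simple geometric test, sketched below. The Diagonal record (0-based start in each sequence plus a length) and the function name are assumptions made for this example; checking consistency across many sequences at once is more involved and is not shown.

```python
from collections import namedtuple

# A gap-free segment pair: start in sequence 1, start in sequence 2, length.
Diagonal = namedtuple("Diagonal", "start1 start2 length")

def consistent(d, e):
    """True if one pairwise alignment can contain both segment pairs."""
    if d.start1 - d.start2 == e.start1 - e.start2:
        # Same diagonal of the comparison matrix: the matches never conflict.
        return True
    d_before_e = (d.start1 + d.length <= e.start1 and
                  d.start2 + d.length <= e.start2)
    e_before_d = (e.start1 + e.length <= d.start1 and
                  e.start2 + e.length <= d.start2)
    # Otherwise one must end before the other begins in BOTH sequences.
    return d_before_e or e_before_d

a = Diagonal(0, 0, 3)
b = Diagonal(5, 6, 4)   # downstream of a in both sequences
c = Diagonal(5, 1, 3)   # downstream in sequence 1 but overlaps a in sequence 2
print(consistent(a, b), consistent(a, c))
```

Here a and b can coexist, while a and c cannot: c would force residues of sequence 2 that are already matched by a to be matched a second time, further along sequence 1.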
| In
Figure 1a, a pairwise
alignment between two sequences contains two high-scoring diagonals. |
![]() a) |
| As
shown in b, it is possible to
break a single diagonal such as f into several higher-scoring
diagonals. Note that only high-scoring regions of the original diagonal
are included in the smaller diagonals. |
![]() b) |
| Figure
1c shows that better
small diagonals might be obtained from the same sequences by shifting
(ie. equivalent to inserting gaps) off of the main diagonal. |
![]() c) Based on Figure 1 from Morgenstern B (1999) DIALIGN2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15:211-218. |

"In the second iteration step, those parts of the sequences that are
not yet directly aligned are realigned."
A new diagonal, D5, is discovered, which replaces {D2,D4}. Thus
alignment set M2 = {D1,D5}.

A third
iteration step searches for new diagonals that have not already been
discovered. In the third iteration, if no new diagonals are discovered,
the alignment is considered to be complete. Typically, DIALIGN only
requires three iteration steps.
In essence,
DIALIGN can be thought of as implementing dynamic programming on
diagonals, rather than on individual amino acid or nucleotide
positions. It is therefore quite fast.
The time complexity of DIALIGN is $O(N^4 \times L^2)$, where N is the number of sequences, and L is the maximum length of the sequences.
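For a single pair of sequences, "dynamic programming on diagonals" can be pictured as choosing the highest-weight chain of diagonals that follow one another in both sequences. The sketch below is a simplified illustration of that idea, not DIALIGN's actual implementation: the (start1, start2, length, weight) tuples, the positive weights and the function name are all hypothetical.

```python
def best_chain(diagonals):
    """Highest-weight chain of (start1, start2, length, weight) diagonals.

    A diagonal may follow another only if it starts after the other ends
    in both sequences. Weights are assumed to be positive.
    """
    # Sort by end position in sequence 1 so predecessors come first.
    ds = sorted(diagonals, key=lambda d: (d[0] + d[2], d[1] + d[2]))
    best = []   # best[k] = (total weight, chain) for chains ending at ds[k]
    for k, (i, j, length, weight) in enumerate(ds):
        top_weight, top_chain = 0.0, []
        for m in range(k):
            pi, pj, plen, _ = ds[m]
            fits_before = pi + plen <= i and pj + plen <= j
            if fits_before and best[m][0] > top_weight:
                top_weight, top_chain = best[m]
        best.append((top_weight + weight, top_chain + [ds[k]]))
    return max(best, key=lambda entry: entry[0], default=(0.0, []))

candidates = [(0, 0, 3, 5.0), (2, 4, 3, 4.0), (4, 4, 3, 6.0), (8, 8, 2, 1.0)]
total, chain = best_chain(candidates)
print(total, chain)
```

The table has one entry per diagonal rather than one per pair of residues, which is where the speed advantage over position-by-position dynamic programming comes from in this simplified picture.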
One of the
advantages of DIALIGN cited by the authors is that unlike other
methods, there are few parameters to adjust. In particular, DIALIGN
does not explicitly consider gaps, and therefore does not need to
assign gap penalties. This is an advantage, because there is no
rigorous model for how to calculate gap penalties. With other alignment
methods, the choice of begin_gap and extend_gap penalties is largely an
arbitrary judgement.