| PLNT4610/PLNT7690
Bioinformatics Lecture 7, part 2 of 4 |
Figure: All 15 tree topologies for 5 species, redrawn from Felsenstein [http://www.cs.washington.edu/education/courses/590bi/98wi/ppt15/sld011.htm].
Therefore, unless a tree includes only a small number of sequences, methods that avoid evaluating obviously suboptimal trees must be used to reduce the total number of trees considered. There are two main categories of phylogeny methods: distance methods and character methods. In distance methods, the first step is to calculate a matrix of all pairwise differences between a set of sequences. The tree is then constructed to minimize the total distance when all branch lengths are added together. Distance methods do not attempt to reconstruct the internal (ancestral) nodes of the tree, and are therefore not strictly modeled on evolution.
Character methods attempt to reconstruct ancestral nodes of trees in order to fit the tree to an evolutionary model. They therefore use more of the information in the data, at the expense of longer execution time. Character methods include parsimony and maximum likelihood methods.
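To make concrete how quickly the number of candidate trees grows, the short sketch below counts unrooted, strictly bifurcating topologies using the standard double-factorial formula (2n-5)!!. The function name is an illustrative choice, and the formula assumes unrooted trees with no multifurcations.

```python
def count_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted, strictly bifurcating topologies for n_taxa labelled tips.

    Uses (2n-5)!! = 1 * 3 * 5 * ... * (2n-5).
    """
    if n_taxa < 3:
        raise ValueError("at least 3 taxa are needed for an unrooted tree")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (the factor 1 is implicit).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (3, 4, 5, 10, 20):
        print(f"{n:>2} taxa: {count_unrooted_trees(n)} unrooted topologies")
```

For 5 species this gives 15, matching the figure above. For 10 species it already exceeds two million, which is why exhaustive evaluation is practical only for very small data sets.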
The simplest scoring method is that of Jukes and Cantor, in which all possible nucleotide substitutions are weighted equally. The 2-parameter method of Kimura assigns different weights to transitions and transversions. Typically, transversions are weighted as contributing twice the distance of transitions, since transitions occur more frequently. Finally, maximum likelihood assigns distances based on the Kimura formulae, but weighted according to the probability of each possible substitution, as determined from nucleotide frequencies.
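As a rough illustration of the first two models, the sketch below uses the standard closed-form textbook expressions for the Jukes-Cantor and Kimura 2-parameter distances. These formulas are not given in the text above. The function names and the two toy sequences are hypothetical, and the code assumes gap-free alignments containing only A, C, G and T. Programs such as PHYLIP may estimate these distances differently (for example with a fixed transition/transversion ratio), so their output need not match these expressions.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_differences(seq1, seq2):
    """Return (transitions, transversions, length) for two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b:
            continue
        # A transition stays within purines or within pyrimidines.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions, len(seq1)


def jukes_cantor(seq1, seq2):
    """Jukes-Cantor distance: d = -(3/4) ln(1 - (4/3) p)."""
    ts, tv, n = count_differences(seq1, seq2)
    p = (ts + tv) / n
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(seq1, seq2):
    """Kimura 2-parameter distance: d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q))."""
    ts, tv, n = count_differences(seq1, seq2)
    P, Q = ts / n, tv / n
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


if __name__ == "__main__":
    s1 = "ACGTACGTAC"  # hypothetical toy sequences
    s2 = "ACATACGTCC"  # one transition (G>A), one transversion (A>C)
    print("Jukes-Cantor:", round(jukes_cantor(s1, s2), 4))
    print("Kimura 2P:   ", round(kimura_2p(s1, s2), 4))
```

Both functions raise a math domain error when the sequences are so divergent that the logarithm's argument is not positive. A production implementation would need to handle that case, along with gaps and ambiguity codes.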
Protein substitution scores are calculated in a comparable fashion. One common method is to use Dayhoff's PAM 001 matrix to score distances. (One PAM unit is defined as the amount of sequence divergence corresponding to a 1% amino acid replacement rate.) Alternatively, Kimura's protein distance metric simply uses the observed fraction of differing amino acids between two proteins to approximate a PAM distance:
$$D = -\ln\left(1 - p - 0.2\,p^{2}\right)$$

where $p$ is the fraction of amino acids that differ between two sequences.
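The Kimura protein distance above is simple enough to compute directly. The following is a minimal sketch: the function name and the toy peptide sequences are hypothetical, and skipping columns that contain a gap character is an assumption rather than something stated in the lecture.

```python
import math


def kimura_protein_distance(seq1, seq2, gap="-"):
    """Kimura protein distance D = -ln(1 - p - 0.2 p^2) for two aligned proteins."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == gap or b == gap:
            continue  # assumption: columns with a gap are ignored
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    p = differing / compared
    argument = 1 - p - 0.2 * p * p
    if argument <= 0:
        return math.inf  # sequences too divergent for the formula to apply
    return -math.log(argument)


if __name__ == "__main__":
    # Hypothetical toy peptides differing at 2 of 10 positions (p = 0.2).
    print(round(kimura_protein_distance("MKTAYIAKQR", "MKSAYLAKQR"), 4))
```

The argument of the logarithm reaches zero when p is about 0.85, so the sketch returns infinity beyond that point instead of raising an error.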
Using the appropriate scoring methods, all pairwise distances between sequences are calculated. For example, the PHYLIP documentation gives the example of a set of 5 short aligned DNA sequences (a dot indicates the same nucleotide as in Alpha):

    Alpha    AACGTGGCCACAT
    Beta     ..G..C......C
    Gamma    C.GT.C......A
    Delta    G.GA.TT..G.C.
    Epsilon  G.GA.CT..G.CC

The corresponding distance matrix using the Kimura 2-parameter model is:

| | Alpha | Beta | Gamma | Delta | Epsilon |
|---|---|---|---|---|---|
| Alpha | | 0.2997 | 0.7820 | 1.1716 | 1.4617 |
| Beta | | | 0.3219 | 0.8997 | 0.5653 |
| Gamma | | | | 1.4481 | 1.0726 |
| Delta | | | | | 0.1679 |
| Epsilon | | | | | |
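The first step, reading the alignment and counting pairwise differences, can be sketched as follows. The code expands the dot notation of the five PHYLIP example sequences and reports the raw number of differing sites for each pair. The function and variable names are illustrative. The counts are uncorrected, so they are smaller than the Kimura 2-parameter values in the matrix, which correct for multiple substitutions at the same site.

```python
from itertools import combinations

# Alignment from the example above: Alpha is the reference,
# and "." means "same nucleotide as Alpha at this column".
REFERENCE_NAME = "Alpha"
REFERENCE_SEQ = "AACGTGGCCACAT"
DOTTED_ROWS = {
    "Beta": "..G..C......C",
    "Gamma": "C.GT.C......A",
    "Delta": "G.GA.TT..G.C.",
    "Epsilon": "G.GA.CT..G.CC",
}


def expand_dots(reference, row):
    """Replace each '.' in row with the reference base at that column."""
    if len(reference) != len(row):
        raise ValueError("row and reference must have the same length")
    return "".join(ref if base == "." else base for ref, base in zip(reference, row))


def mismatches(seq1, seq2):
    """Number of aligned positions at which the two sequences differ."""
    return sum(a != b for a, b in zip(seq1, seq2))


sequences = {REFERENCE_NAME: REFERENCE_SEQ}
for name, row in DOTTED_ROWS.items():
    sequences[name] = expand_dots(REFERENCE_SEQ, row)

if __name__ == "__main__":
    for name, seq in sequences.items():
        print(f"{name:<9}{seq}")
    for a, b in combinations(sequences, 2):
        n = mismatches(sequences[a], sequences[b])
        print(f"{a}-{b}: {n} of {len(REFERENCE_SEQ)} sites differ")
```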

Consider three sequences A, B and C with the following pairwise distances:

| | B | C |
|---|---|---|
| A | 24 | 28 |
| B | | 32 |
Simultaneous linear equations can be used to calculate the branch lengths x, y and z leading to A, B and C, respectively:

    A to B: x + y = 24
    A to C: x + z = 28
    B to C: y + z = 32

Thus, with 3 equations and 3 unknowns, we can calculate that x = 10, y = 14 and z = 18.
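The three equations can be solved in closed form: adding two of them and subtracting the third isolates one unknown, for example x = (AB + AC - BC) / 2. A small sketch follows, with a hypothetical function name.

```python
def three_point_branch_lengths(d_ab, d_ac, d_bc):
    """Solve x + y = d_ab, x + z = d_ac, y + z = d_bc for (x, y, z).

    x, y and z are the branches leading to A, B and C from their shared node.
    """
    x = (d_ab + d_ac - d_bc) / 2
    y = (d_ab + d_bc - d_ac) / 2
    z = (d_ac + d_bc - d_ab) / 2
    return x, y, z


if __name__ == "__main__":
    x, y, z = three_point_branch_lengths(24, 28, 32)
    print(f"x = {x}, y = {y}, z = {z}")  # x = 10.0, y = 14.0, z = 18.0
```

With real data, one of these expressions can come out negative when the distances are far from additive. The sketch does not guard against that.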
Addition of branches is iterative. Branches are added until all sequences are included in the tree.
Advantages
Fitch and Margoliash showed that different sets of internal branch lengths could be obtained by considering alternative trees in which one or more branches were moved to different parts of the tree. Consider a distance matrix for four sequences with pairwise distances $D_{ij}$.
Figure: The Neighbor-Joining tree for these sequences.

If we recalculate the pairwise distances $d_{ij}$ from the tree, they differ from the original distances $D_{ij}$. The least squares method of Fitch and Margoliash tries different tree topologies, swapping branches among closely related sequences and recalculating the distances. Each tree considered generates a different matrix of distances $d_{ij}$. The best tree is defined as the tree that minimizes the sum of squared differences between the original distances $D_{ij}$ and the tree distances $d_{ij}$.
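In the form usually given for the Fitch-Margoliash criterion (for example, in the PHYLIP FITCH documentation), each squared difference is weighted by the inverse square of the observed distance, so the quantity minimized is the sum over all pairs of $(D_{ij} - d_{ij})^2 / D_{ij}^2$. The sketch below computes that score for one candidate tree. The function name, the `power` parameter and the "fitted" distances are illustrative assumptions; only the observed A, B, C distances come from the example earlier in these notes.

```python
def weighted_least_squares(observed, fitted, power=2.0):
    """Sum over pairs of (D_ij - d_ij)^2 / D_ij^power.

    observed: dict mapping a pair of names to the original distance D_ij
    fitted:   dict mapping the same pairs to the tree-derived distance d_ij
    power=2 gives the Fitch-Margoliash weighting; power=0 is unweighted.
    """
    if observed.keys() != fitted.keys():
        raise ValueError("both matrices must cover the same pairs")
    return sum(
        (observed[pair] - fitted[pair]) ** 2 / observed[pair] ** power
        for pair in observed
    )


if __name__ == "__main__":
    observed = {("A", "B"): 24.0, ("A", "C"): 28.0, ("B", "C"): 32.0}
    # Hypothetical distances read back from two candidate trees.
    exact_fit = dict(observed)
    poor_fit = {("A", "B"): 25.0, ("A", "C"): 28.0, ("B", "C"): 30.0}
    print("exact fit score:", weighted_least_squares(observed, exact_fit))
    print("poor fit score: ", weighted_least_squares(observed, poor_fit))
```

A tree search would evaluate this score for each rearranged topology, after re-estimating branch lengths, and keep the tree with the lowest value.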