PLNT4610/PLNT7690 Bioinformatics
Lecture 7, part 2 of 4

B. Methods for tree building

Before discussing common tree building methods, it is instructive to consider the possibility of simply building all possible trees and choosing the best one. For n sequences, the number of possible unrooted, bifurcating trees is (2n-5)!! = 1 x 3 x 5 x ... x (2n-5), as given in the table. For large n, this function grows faster than n!.
 
# of sequences (n)    # of trees
3                     1
4                     3
5                     15
6                     105
7                     945
8                     10,395
9                     135,135
10                    2,027,025
50                    2.8 x 10^74
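The counts in the table can be reproduced with a few lines of Python. This is a minimal sketch, assuming the table counts unrooted, fully bifurcating trees, for which the count is the product of the odd numbers up to 2n-5. The function name `count_unrooted_trees` is made up for illustration.

```python
def count_unrooted_trees(n):
    """Number of unrooted bifurcating trees for n labelled sequences.

    Assumes n >= 3; the count is (2n-5)!! = 1 * 3 * 5 * ... * (2n-5).
    """
    if n < 3:
        raise ValueError("need at least 3 sequences")
    total = 1
    for k in range(3, 2 * n - 4, 2):  # odd numbers 3, 5, ..., 2n-5
        total *= k
    return total


for n in (3, 4, 5, 6, 7, 8, 9, 10, 50):
    print(n, count_unrooted_trees(n))
```

Each added sequence can be attached to any existing branch, which is why the count for n sequences is the count for n-1 multiplied by 2n-5.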

Figure: All 15 tree topologies for 5 species, redrawn from Felsenstein [http://www.cs.washington.edu/education/courses/590bi/98wi/ppt15/sld011.htm ].

Therefore, unless a tree includes only a small number of sequences, the search must avoid obviously suboptimal trees in order to reduce the total number of trees considered. There are two main categories of phylogeny methods: distance methods and character methods. In distance methods, the first step is to calculate a matrix of all pairwise distances between the sequences. Next, the tree is constructed to minimize the total length obtained when all branches are added together. Distance methods do not attempt to reconstruct the sequences at the internal nodes of the tree, and therefore are not strictly modeled on evolution.

Character methods attempt to reconstruct ancestral nodes of trees in order to fit the tree to an evolutionary model. They therefore use more of the information in the data, at the expense of longer execution time. Character methods include parsimony and maximum likelihood methods.

1. Distance matrix methods (MD)

Calculation of distance matrices

In general, DNA distance matrices are calculated such that each mismatch between two sequences adds to the distance, and each identity subtracts from the distance. Scoring matrices include values for all possible substitutions.
 
General substitution matrix

     A               C               G               T
A    -(a1+a2+a3)     a1              a2              a3
C    a4              -(a4+a5+a6)     a5              a6
G    a7              a8              -(a7+a8+a9)     a9
T    a10             a11             a12             -(a10+a11+a12)

The simplest scoring method is that of Jukes and Cantor, in which all possible nucleotide substitutions are of equal value. The 2-parameter method of Kimura assigns different weights to transitions and transversions. Typically, transversions are weighted as contributing twice the distance of transitions, since transitions occur more frequently.  Finally, maximum likelihood assigns distances based on the Kimura formulae, but weighted according to the probabilities of each possible substitution, as determined from nucleotide frequencies.

Protein substitution scores are calculated in a comparable fashion. One common method is to use Dayhoff's PAM 001 matrix to score distances. (One PAM unit is defined as the amount of sequence divergence corresponding to a 1% amino acid replacement rate.) Alternatively, Kimura's protein distance metric simply uses the observed fraction of amino acid positions that differ between two proteins to approximate a PAM distance:

D = -ln(1 - p - 0.2 p^2)

where p is the fraction of amino acids that differ between the two sequences.
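As an illustration, Kimura's formula can be applied directly to a pair of aligned protein sequences. The sketch below assumes the two sequences are already aligned, of equal length and free of gaps. The function name and the two short peptides are hypothetical.

```python
import math


def kimura_protein_distance(seq_a, seq_b):
    """Kimura's distance D = -ln(1 - p - 0.2 p^2) for two aligned proteins.

    Assumes equal-length, gap-free sequences (a simplification).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # p: fraction of positions at which the amino acids differ
    p = sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)
    inside = 1 - p - 0.2 * p * p
    if inside <= 0:
        # The logarithm is undefined once p exceeds about 0.85.
        raise ValueError("sequences too divergent for this formula")
    return -math.log(inside)


# Hypothetical 10-residue peptides differing at one position (p = 0.1).
print(kimura_protein_distance("MKTAYIAKQR", "MKTAYIAKQL"))
```

Note that D is slightly larger than p, reflecting the correction for multiple replacements at the same site.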

Using the appropriate scoring methods, all pairwise distances between sequences are calculated.
For example, the PHYLIP documentation gives a set of 5 short aligned DNA sequences, in which a dot indicates the same base as in the first sequence:

Alpha        AACGTGGCCACAT
Beta         ..G..C......C
Gamma        C.GT.C......A
Delta        G.GA.TT..G.C.
Epsilon      G.GA.CT..G.CC
The corresponding distance matrix using the Kimura 2-parameter model is

          Alpha   Beta    Gamma   Delta   Epsilon
Alpha             0.2997  0.7820  1.1716  1.4617
Beta                      0.3219  0.8997  0.5653
Gamma                             1.4481  1.0726
Delta                                     0.1679
Epsilon
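To make the scoring step concrete, the sketch below counts transitions and transversions between two of the aligned sequences and applies the textbook closed-form Jukes-Cantor and Kimura 2-parameter formulas. These closed forms are an assumption on our part: PHYLIP estimates its distances differently (with a fixed transition/transversion ratio), so the numbers need not match the matrix above. All function names are illustrative.

```python
import math

# Alignment from the PHYLIP example; "." means "same base as in Alpha".
ALIGNMENT = {
    "Alpha":   "AACGTGGCCACAT",
    "Beta":    "..G..C......C",
    "Gamma":   "C.GT.C......A",
    "Delta":   "G.GA.TT..G.C.",
    "Epsilon": "G.GA.CT..G.CC",
}
PURINES = {"A", "G"}


def expand(seq, reference):
    """Replace each '.' with the base at the same position in the reference."""
    return "".join(r if s == "." else s for s, r in zip(seq, reference))


def count_differences(a, b):
    """Return (transitions, transversions, sites) for two aligned sequences."""
    ts = tv = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            ts += 1  # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            tv += 1  # purine <-> pyrimidine
    return ts, tv, len(a)


def jukes_cantor(a, b):
    """Jukes-Cantor distance: every substitution is weighted equally."""
    ts, tv, n = count_differences(a, b)
    p = (ts + tv) / n
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(a, b):
    """Kimura 2-parameter distance: transitions (P) and transversions (Q)."""
    ts, tv, n = count_differences(a, b)
    P, Q = ts / n, tv / n
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


seqs = {name: expand(s, ALIGNMENT["Alpha"]) for name, s in ALIGNMENT.items()}
print(count_differences(seqs["Alpha"], seqs["Beta"]))
print(jukes_cantor(seqs["Alpha"], seqs["Beta"]))
print(kimura_2p(seqs["Alpha"], seqs["Beta"]))
```

Alpha and Beta differ by one transition and two transversions over 13 sites, and for this pair both formulas give roughly 0.276.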





The Neighbor-Joining method (NJ)

The Neighbor-Joining method is one of the simplest distance methods. It begins by choosing the two most closely related sequences, and then adding the next most distant sequence as a third branch to the tree. Fitch and Margoliash give a simple example for a tree with 3 sequences A, B and C, in which x, y and z are the lengths of the branches leading from the central node to A, B and C respectively:

   


     B    C
A    24   28
B         32

Simultaneous linear equations can be used to calculate the branch lengths:

A to B:  x + y = 24
A to C:  x + z = 28
B to C:  y + z = 32
Thus with 3 equations and 3 unknowns we can calculate that x = 10,  y = 14 and z = 18.
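The three simultaneous equations have a simple closed-form solution, sketched below. The function name `three_point_branch_lengths` is illustrative only.

```python
def three_point_branch_lengths(d_ab, d_ac, d_bc):
    """Solve x + y = d_ab, x + z = d_ac, y + z = d_bc for (x, y, z)."""
    x = (d_ab + d_ac - d_bc) / 2  # branch leading to A
    y = (d_ab + d_bc - d_ac) / 2  # branch leading to B
    z = (d_ac + d_bc - d_ab) / 2  # branch leading to C
    return x, y, z


print(three_point_branch_lengths(24, 28, 32))
```

Each branch length is half of (the two distances involving that sequence, minus the distance between the other two), which follows from adding two of the equations and subtracting the third.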

Addition of branches is iterative. Branches are added until all sequences are included in the tree.

Advantages

The Fitch/Margoliash Least Squares Method

The Neighbor-Joining method only attempts to build one tree. However, the raw pairwise distances may not always be perfectly additive. The ideal example shown above was internally consistent: the sum of the 3 simultaneous equations (i.e. 2 x the sum of the branch lengths) was precisely equal to the sum of the pairwise distances. This will not always be true for real data, in part because of undetected homoplasy.

Fitch and Margoliash showed that different sets of internal branch lengths could be obtained by considering alternative trees in which one or more branches are moved to different parts of the tree. Consider a distance matrix for four sequences with pairwise distances Dij:
 

Observed distances Dij

     A      B      C      D
A    0      0.16   0.38   1.18
B    0.16   0      0.49   0.93
C    0.38   0.49   0      0.91
D    1.18   0.93   0.91   0

The Neighbor-Joining tree for these sequences pairs A with B and C with D.

If we recalculate the pairwise distances dij from the branch lengths of this tree, they differ from the original distances Dij, as the table of recomputed distances below shows.
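This recalculation can be sketched in Python. The code below builds a Neighbor-Joining tree from the observed distances and then sums branch lengths along the path between each pair of sequences. It uses the standard Saitou and Nei formulation (pairs are chosen by a rate-corrected criterion Q, not by the smallest raw distance), which is an assumption about how the tree in these notes was built. It also assumes at least 3 sequences. Function and node names are illustrative.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Build an unrooted NJ tree; return edges as (node, node, branch_length).

    dist maps frozenset({a, b}) to the distance between a and b.
    """
    d = dict(dist)
    nodes = list(names)
    edges = []
    while len(nodes) > 3:
        n = len(nodes)
        # Net divergence of each node: sum of its distances to all others.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Join the pair minimising Q(i, j) = (n - 2) d(i, j) - r(i) - r(j).
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        u = f"({i},{j})"  # label for the new internal node
        d_ij = d[frozenset((i, j))]
        len_i = d_ij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        edges += [(i, u, len_i), (j, u, d_ij - len_i)]
        # Distance from the new node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - d_ij) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    # Three nodes remain: join them at one central node (three-point formula).
    a, b, c = nodes
    d_ab, d_ac, d_bc = (d[frozenset(p)] for p in ((a, b), (a, c), (b, c)))
    edges += [(a, "centre", (d_ab + d_ac - d_bc) / 2),
              (b, "centre", (d_ab + d_bc - d_ac) / 2),
              (c, "centre", (d_ac + d_bc - d_ab) / 2)]
    return edges


def tree_distance(edges, start, goal):
    """Sum the branch lengths on the unique path between two nodes."""
    adjacent = {}
    for x, y, w in edges:
        adjacent.setdefault(x, []).append((y, w))
        adjacent.setdefault(y, []).append((x, w))
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, total = stack.pop()
        if node == goal:
            return total
        for nxt, w in adjacent[node]:
            if nxt != parent:
                stack.append((nxt, node, total + w))
    raise ValueError("nodes are not connected")


observed = {frozenset(pair): value for pair, value in {
    ("A", "B"): 0.16, ("A", "C"): 0.38, ("A", "D"): 1.18,
    ("B", "C"): 0.49, ("B", "D"): 0.93, ("C", "D"): 0.91}.items()}

edges = neighbor_joining("ABCD", observed)
for a, b in combinations("ABCD", 2):
    print(a, b, round(tree_distance(edges, a, b), 2))
```

On this matrix the path lengths agree with the recomputed distances tabulated in this section: the A-B and C-D distances are reproduced exactly, while the other four each shift by 0.09.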

The least squares method of Fitch and Margoliash tries different tree topologies, swapping branches among closely related sequences, and recalculating the distances. For each tree considered, a different matrix of distances (dij) will be generated. The best tree is defined as that tree which minimizes the weighted sum of squared differences between the observed and recomputed distances:

sum over all pairs i < j of (Dij - dij)^2 / Dij^2

Distances recomputed from the tree dij

     A      B      C      D
A    0      0.16   0.47   1.09
B    0.16   0      0.40   1.02
C    0.47   0.40   0      0.91
D    1.09   1.02   0.91   0
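The fit between the two matrices can be scored numerically. The sketch below assumes the criterion usually attributed to Fitch and Margoliash, the sum of (Dij - dij)^2 / Dij^2 over all pairs. The exponent is exposed as a hypothetical parameter `power` so that the unweighted sum of squares (`power=0`) can be compared.

```python
# Observed distances Dij and distances dij recomputed from the tree,
# copied from the two tables in this section.
observed = {("A", "B"): 0.16, ("A", "C"): 0.38, ("A", "D"): 1.18,
            ("B", "C"): 0.49, ("B", "D"): 0.93, ("C", "D"): 0.91}
recomputed = {("A", "B"): 0.16, ("A", "C"): 0.47, ("A", "D"): 1.09,
              ("B", "C"): 0.40, ("B", "D"): 1.02, ("C", "D"): 0.91}


def least_squares_score(observed, fitted, power=2):
    """Sum over pairs of (D - d)^2 / D^power.

    power=2 is the weighting assumed here for Fitch and Margoliash;
    power=0 gives the plain, unweighted sum of squares.
    """
    return sum((observed[pair] - fitted[pair]) ** 2 / observed[pair] ** power
               for pair in observed)


print(least_squares_score(observed, recomputed))           # weighted
print(least_squares_score(observed, recomputed, power=0))  # unweighted
```

A tree search would compute this score for each candidate topology and keep the tree with the smallest value. Dividing by Dij^2 makes a given absolute error count for more when the observed distance is small.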
Advantages Disadvantages