Principles of Microbial Diversity. James W. Brown
7. Although RNA three-dimensional structures are scarce, there are hundreds of protein three-dimensional structures, determined by X-ray crystallography. Can you imagine a way to use these structures, analogous to the use of RNA secondary structures, to align protein sequences more meaningfully?
4
Constructing a Phylogenetic Tree
In chapter 3, we covered the first three steps of a phylogenetic analysis, leaving the final step toward which the others build. The steps in a phylogenetic analysis are as follows:
1. Decide which gene and species to analyze (small-subunit ribosomal RNA [SSU rRNA])
2. Determine the gene sequences (polymerase chain reaction [PCR] and DNA sequencing, database “mining”)
3. Identify homologous residues (sequence alignment)
4. Perform the phylogenetic analysis
The most common type of phylogenetic analysis is tree construction. A tree is nothing more than a graph representing the similarity relationships between the sequences in an alignment. We go through the process in detail to show that tree construction is not rocket science but a series of straightforward mathematical transformations of sequence data.
There are several methods for building trees. In this chapter, we cover the neighbor-joining method in some detail as an example, because it is conceptually straightforward and commonly used. In the next chapter, we briefly cover some other approaches.
Tree construction: the neighbor-joining method
Tree construction starts with an alignment. Neighbor joining is a distance matrix method, meaning that the alignment is first reduced to a table of evolutionary distances, a distance matrix. The distance matrix cannot be generated directly from the alignment, however, because actual evolutionary distance cannot be directly measured. Instead, the alignment is reduced to a table of observed (measurable) similarity, the similarity matrix. The distance matrix is calculated from the similarity matrix, and then the tree is generated from the distance matrix.
Generating a similarity matrix
The similarity matrix is just a table of fractional similarities. Take, for example, this alignment of six sequences with 20 positions.
Just count the fraction of identical bases in every pair of sequences in the alignment.
The similarity values for all pairs of sequences are calculated in the same way and assembled into a table:
In this example, sequences A and B are 0.90 (90%) similar, A and C are 0.75 similar, B and C are 0.75 similar, and so forth. Note that values on the diagonal (A:A, B:B, …) do not need to be calculated; they are always 1. Likewise, there is no reason to calculate both above and below the diagonal; the value for X:Y is the same as that for Y:X, so the second calculation would be redundant.
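The counting step can be sketched in a few lines of Python. This is a minimal illustration, not the book's worked example: the three sequences below are hypothetical, constructed only so that they reproduce the A:B, A:C, and B:C values quoted above, and the function names are our own. Only the pairs above the diagonal are computed, for the reason just given.

```python
from itertools import combinations

def fractional_similarity(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences are identical."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / len(seq_a)

def similarity_matrix(alignment):
    """Return {(name1, name2): similarity} for each unordered pair of sequences."""
    return {
        (n1, n2): fractional_similarity(alignment[n1], alignment[n2])
        for n1, n2 in combinations(alignment, 2)
    }

# Hypothetical 20-position alignment (NOT the alignment from the book).
toy_alignment = {
    "A": "AUGGCUACGUAGCUAGCUAG",
    "B": "AUGGCUACGUAGCUAGCUCC",
    "C": "AUGGCAACGUUGCAAGGUCG",
}

sims = similarity_matrix(toy_alignment)
for (n1, n2), value in sims.items():
    print(f"{n1}:{n2} {value:.2f}")
```

A dictionary keyed by sequence pairs stands in for the triangular table; a real analysis would also have to decide how to treat gaps and ambiguous bases, which this sketch ignores.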
Converting a similarity matrix into an evolutionary distance matrix
The next step is to estimate evolutionary distances from the sequence similarities. You might think that the distance would just be 1 − similarity (i.e., “difference”), and you would be right except that the number of differences you count between any two sequences misses some of the changes that have probably occurred between them. More than one evolutionary change at a single position (e.g., A to G to U, or A to G in one sequence and the same A to U in another) counts as only one difference between the two sequences, and in the case of reversion or convergence it counts as no change at all (e.g., A to G to A, or A to G in one organism and the same A to G in another). As a result, the observed difference between two sequences underestimates the evolutionary distance that separates them.
One common way to estimate evolutionary distances from similarity is the Jukes and Cantor method, which uses the following equation:
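In its standard form for nucleotide sequences, the Jukes and Cantor correction can be written as below. The symbols are assumptions chosen here for illustration: S is the observed fractional similarity from the similarity matrix, and D is the estimated evolutionary distance.

```latex
D = -\frac{3}{4}\,\ln\!\left(1 - \frac{4}{3}\,(1 - S)\right)
```

For S = 0.90, for instance, the argument of the logarithm is 1 − (4/3)(0.10) ≈ 0.867, which gives D ≈ 0.107.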
As shown graphically in Fig. 4.1, similarity and distance are very closely related initially (e.g., 0.90 similarity ≈ 0.10 distance), but the curve levels off at 0.25 similarity, where evolutionary distance is infinite. This makes sense; for two sequences that are very similar, the probable frequency of more than one change at a single site is low, requiring only a small correction, whereas two sequences that have changed beyond all recognition (infinite evolutionary distance) are still approximately 25% similar, simply because there are only four bases and so approximately one position in four will match entirely by chance.
Figure 4.1 The Jukes and Cantor equation plotted as observed sequence similarity (from the similarity matrix) versus estimated evolutionary distance. doi:10.1128/9781555818517.ch4.f4.1
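The shape of this curve can be checked numerically. The following is a minimal Python sketch, assuming the standard Jukes and Cantor correction for nucleotide sequences; the function name and the choice to return infinity at or below 25% similarity are our own.

```python
import math

def jukes_cantor_distance(similarity):
    """Estimated evolutionary distance for an observed fractional similarity."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be between 0 and 1")
    difference = 1.0 - similarity
    argument = 1.0 - (4.0 / 3.0) * difference
    if argument <= 0.0:
        # At or below 25% similarity the estimate is unbounded (assumed convention).
        return math.inf
    return -0.75 * math.log(argument)

# Small corrections at high similarity, very large ones as similarity nears 0.25.
for s in (0.90, 0.75, 0.50, 0.30):
    print(f"similarity {s:.2f} -> distance {jukes_cantor_distance(s):.2f}")
```

Under this assumption, 0.90 similarity corresponds to a distance of about 0.11 and 0.75 similarity to about 0.30. Applying the function to every value in a similarity matrix yields the corresponding distance matrix.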
To convert a similarity matrix to a distance matrix, just convert each value in the similarity matrix to evolutionary distance using either the graph or the equation. In our example:
Generating a tree from a distance matrix
In the neighbor-joining method, the structure of the tree is determined first and then the branch lengths are fit to this skeleton.
Solving the tree structure
The tree starts out with a single internal node and a branch out to each sequence: an n-pointed star, where n is the number of sequences in the alignment. The pair of sequences with the smallest evolutionary distance separating them is joined onto a single branch (i.e., the neighbors are joined, hence the name of the method), and then the process is repeated after merging these two sequences in the distance matrix by averaging their distances from every other sequence in the matrix.
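The joining procedure just described can be sketched in Python. This follows the simplified description above (join the closest pair, then average their distances to everything else); the full neighbor-joining algorithm of Saitou and Nei chooses pairs using a criterion that also accounts for each sequence's average distance to all the others, which this sketch does not attempt. The five-sequence distance matrix is hypothetical, and the function name and nested-tuple output are our own choices.

```python
def join_closest_pairs(distances):
    """Resolve a star tree by repeatedly joining the closest pair of nodes.

    `distances` maps (name1, name2) -> distance for every pair of sequences.
    Returns the three groups that meet at the last remaining internal node,
    each group a sequence name or a nested tuple of joined neighbors.
    """
    dist = {frozenset(pair): d for pair, d in distances.items()}
    nodes = set().union(*dist)
    # An unrooted tree is fully resolved once only three groups remain.
    while len(nodes) > 3:
        closest = min(dist, key=dist.get)
        x, y = sorted(closest, key=str)
        merged = (x, y)
        others = nodes - {x, y}
        # Distance from the merged pair to each other node: the average of
        # the two original distances, as in the description above.
        for other in others:
            dist[frozenset((merged, other))] = (
                dist[frozenset((x, other))] + dist[frozenset((y, other))]
            ) / 2
        # Discard all entries that still refer to x or y individually.
        dist = {p: d for p, d in dist.items() if x not in p and y not in p}
        nodes = others | {merged}
    return tuple(sorted(nodes, key=str))

# Hypothetical distance matrix (NOT the book's), upper triangle only.
toy_distances = {
    ("A", "B"): 0.11, ("A", "C"): 0.30, ("A", "D"): 0.60, ("A", "E"): 0.65,
    ("B", "C"): 0.32, ("B", "D"): 0.62, ("B", "E"): 0.63,
    ("C", "D"): 0.58, ("C", "E"): 0.61,
    ("D", "E"): 0.20,
}

tree = join_closest_pairs(toy_distances)
print(tree)
```

With these made-up values, A and B are joined first, then D and E, leaving C on its own branch. Only the structure is resolved here; branch lengths are a separate step.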
Using our distance matrix, the tree starts out like this (remember that we are sorting out the structure of the tree, not yet the branch lengths).
The closest neighbors in the distance matrix are A and B (0.11 evolutionary distance), so these branches are joined:
The