Principles of Microbial Diversity. James W. Brown
Molecular phylogenetic analysis is the use of macromolecular structure (usually nucleotide or amino acid sequences) to reconstruct the phylogenetic relationships between organisms. The extent of difference between homologous DNA, RNA, or protein sequences in different organisms is used as a measure of how much these organisms have diverged from one another in evolutionary history.
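As a rough illustration of "extent of difference," the simplest measure is the fraction of aligned positions at which two homologous sequences differ (often called the p-distance). The sketch below assumes the sequences are already aligned to the same length and that gaps are written as "-"; the function name and the three short RNA fragments are invented for illustration and do not come from any real organism.

```python
def p_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Fraction of differing positions between two aligned sequences.

    Columns in which either sequence has a gap are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # ignore gapped columns
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return differing / compared


# Hypothetical aligned fragments, invented for illustration only.
seq_x = "ACGUACGUAC"
seq_y = "ACGUACGAAC"  # differs from seq_x at one position
seq_z = "ACCUAGGAAC"  # differs from seq_x at three positions

print(p_distance(seq_x, seq_y))
print(p_distance(seq_x, seq_z))
```

On this measure, seq_y would be judged a closer relative of seq_x than seq_z is. Real analyses use much longer sequences and usually correct the raw fraction for multiple changes at the same site.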
The typical scenario in which a phylogenetic analysis is needed is the characterization of a novel organism: its phylogenetic placement (phylotype) is determined in order to make predictions about its unknown properties. The organism might be a clinical isolate of a potential pathogen, an organism that carries out useful biochemistry, an organism that seems to be abundant in an interesting environment, or anything else of interest.
The process of molecular phylogenetic analysis can be divided into four critical parts, each of which, of course, also has various subparts:
1. Decide which organisms and sequences to use in the analysis
2. Obtain the required sequence experimentally or from databases
3. Assemble these sequences in a multiple-sequence alignment
4. Use this alignment to generate phylogenetic trees
In this chapter, we walk through the first three steps of this process.
Deciding which organisms and sequences to use in the analysis
The sequences of genes, RNAs, or proteins contain two very different kinds of information: structural/functional information and historical information. Think of it this way: any particular amino acid in a specific protein is what it is (say, for example, an alanine) in part because it facilitates the formation of the correct structure and function of the protein. This is structural/functional information. But usually there are a number of alternatives that might function just as well. The reason it is what it is, and not any of these alternatives, is that it was inherited from a successful ancestor. This is historical information. Comparisons among an aligned collection of homologous sequences can be used to sort out both the structure of the functional molecule (especially for RNAs) and the historical relationships among the sequences, which are represented as a phylogenetic tree.
Phylogenetic trees are usually generated by using alignments of single genes, RNAs, or proteins, but no such sequence is either ideal or universally useful for the generation of informative phylogenetic trees. This being said, some sequences do carry more phylogenetic information than others; these sequences can be called “molecular clocks.”
Features required of a good molecular clock
Clock-like behavior
The sequences of genes, RNAs, and proteins change over time. If this change is entirely random (within the constraint of the structure and function of the molecule; i.e., by genetic drift), the amount of divergence between any particular sequence in two organisms should be a measure of how long ago these organisms diverged from their common ancestor. If this is true, these sequences can be said to exhibit clock-like behavior (Fig. 3.1).
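The clock idea can be written as a one-line calculation. The sketch below assumes a strict clock in which each lineage accumulates changes at the same constant rate, so the divergence between two sequences is d = 2rt (the factor of 2 is there because both lineages have been changing since they split). The function name, the rate, and the distance are hypothetical values chosen for illustration; real substitution rates vary between molecules and lineages and must be calibrated independently.

```python
def divergence_time(distance: float, rate_per_lineage: float) -> float:
    """Time since two sequences shared a common ancestor.

    Assumes a strict molecular clock: distance = 2 * rate * time,
    where `distance` is in substitutions per site and
    `rate_per_lineage` is in substitutions per site per unit time.
    """
    if distance < 0:
        raise ValueError("distance cannot be negative")
    if rate_per_lineage <= 0:
        raise ValueError("rate must be positive")
    return distance / (2.0 * rate_per_lineage)


# Hypothetical numbers: 0.02 substitutions per site between two
# sequences, and 0.01 substitutions per site per time unit per lineage.
t = divergence_time(distance=0.02, rate_per_lineage=0.01)
print(t)  # in the same time units as the assumed rate
```

This simple form ignores multiple changes at the same site, so it is reasonable only for small distances; it is meant to show the logic of the clock, not to serve as a dating method.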
Clock-like behavior depends mostly on functional constancy of the sequence; a change in the function (or functional properties) leads to large, selected (and therefore nonrandom) sequence change, i.e., adaptation. Clock-like behavior also depends on the sequence being long enough to provide statistically significant information and being made up of a large number of independently evolving “bits,” so that random changes in one part of the sequence do not influence changes in other parts of the sequence. The sequence must also have an appropriate amount of sequence variation; too little variation does not provide enough difference to be statistically meaningful, whereas too much makes alignment difficult or impossible and decreases the reliability of the treeing algorithm (see chapters 4 and 5 on evolutionary models). Nonfunctional sequences (e.g., some introns) usually change too fast to be useful for analysis of any but the very closest relatives.
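One quick way to get a feel for whether a candidate sequence has "an appropriate amount" of variation is to count how many alignment columns vary at all. The sketch below is only a rough screen under stated assumptions: the sequences are already aligned, gaps are written as "-" and are ignored when deciding whether a column varies, and the function name and the toy alignment are invented for illustration. The text gives no numerical threshold, so none is applied here.

```python
def variable_site_fraction(alignment: list[str], gap: str = "-") -> float:
    """Fraction of alignment columns with more than one distinct residue.

    Gap characters are ignored when deciding whether a column varies.
    """
    if len(alignment) < 2:
        raise ValueError("need at least two aligned sequences")
    lengths = {len(seq) for seq in alignment}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_columns = lengths.pop()
    if n_columns == 0:
        raise ValueError("alignment has no columns")
    variable = 0
    for column in zip(*alignment):
        residues = {c.upper() for c in column if c != gap}
        if len(residues) > 1:
            variable += 1
    return variable / n_columns


# Toy alignment, invented for illustration only.
toy_alignment = [
    "ACGUACGU",
    "ACGUACGA",
    "ACGAACGA",
    "ACG-ACGA",
]
print(variable_site_fraction(toy_alignment))
```

A fraction near zero suggests too little signal to resolve relationships among the organisms compared, while a fraction near one suggests the sequences may be too divergent to align reliably.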
Figure 3.1 Clock-like behavior. The extent of sequence divergence between a pair of specific sequences should be a measure of how long ago they separated. doi:10.1128/9781555818517.ch3.f3.1
Phylogenetic range
In order to be useful for a phylogenetic analysis, a sequence must be present and identifiable in all of the organisms to be analyzed and must exhibit clock-like behavior within this range. Watch out for gene families, because each member of the family is probably specialized for a slightly different function and it is often difficult to identify the correct ortholog or confirm that it really does have the same function.
Absence of horizontal transfer
Absence of horizontal transfer means that the gene must be acquired only by inheritance from parent to offspring (i.e., by descent), never by transfer from one organism to another. Examples of frequently horizontally transferred genes are those encoding antibiotic resistance, but any gene has the potential to be transferred horizontally. You can still generate a tree with sequences that have been horizontally transferred; if the sequence is otherwise a good molecular clock, the resulting tree is perfectly valid, but it reflects the phylogenetic relationships between the sequences, not between the organisms that carry them.
Availability of sequence information
It is of great pragmatic importance to choose a sequence, whenever possible, for which a great deal of the sequence data required is already available and annotated and perhaps already aligned. If you are interested in the phylogenetic placement of organism X, it is better if you do not have to obtain or identify the sequence data yourself for a large number of organisms to which it might (or might not) be related.
The standard: small-subunit ribosomal RNA
In most cases, the best molecular clock for phylogenetic analysis is the small-subunit ribosomal RNA (SSU rRNA) (Fig. 3.2). This sequence is always the best starting point; only after you know where your organism resides in an SSU rRNA phylogenetic tree can you decide what other sequences might provide additional information (see chapter 6 for alternatives).
The SSU rRNA is so often the best sequence of choice for the following reasons.
It is present in all living cells.
It has the same function in all cells.
It comprises 1,500 to 2,000 residues—large enough to be statistically useful but not too large to be onerous to sequence.
Figure 3.2 The Escherichia coli SSU rRNA secondary structure. (Courtesy of Robin Gutell. Adapted from Cannone JJ, Subramanian S, Schnare MN, Collett JR, D’Souza LM, Du Y, Feng B, Lin N, Madabusi LV, Müller KM, Pande N, Shang Z,