Bioinformatics. Group of authors

or motifs. The sequences being compared here are of average composition based on the small number of protein sequences available in 1978, so there is a bias toward small, globular proteins, even though efforts have been made to bring in additional sequence data over time (Gonnet et al. 1992; Jones et al. 1992). Finally, there is an implicit assumption that the forces responsible for sequence evolution over shorter time spans are the same as those for longer evolutionary time spans. Although there are significant drawbacks to the PAM matrices, it is important to remember that, given the information available in 1978, the development of these matrices marked an important advance in our ability to quantify the relationships between sequences. As these matrices are still available for use with numerous bioinformatic tools, the reader should keep these potential drawbacks in mind and use them judiciously.

      BLOSUM Matrices

      In 1992, Steve and Jorja Henikoff took a slightly different approach to the one described above, one that addressed many of the drawbacks of the PAM matrices. The groundwork for the development of new matrices was a study aimed at identifying conserved motifs within families of proteins (Henikoff and Henikoff 1991, 1992). This study led to the creation of the BLOCKS database, which used the concept of a block to identify a family of proteins. The idea of a block is derived from the more familiar notion of a motif, which usually refers to a conserved stretch of amino acids that confers a specific function or structure to a protein. When these individual motifs from proteins in the same family can be aligned without introducing a gap, the result is a block, with the term block referring to the alignment, not the individual sequences themselves. Obviously, any given protein can contain one or more blocks, corresponding to each of its structural or functional motifs. With these protein blocks in hand, it was then possible to look for substitution patterns only in the most conserved regions of a protein, the regions that (presumably) were least prone to change. Two thousand blocks representing more than 500 groups of related proteins were examined and, based on the substitution patterns in those conserved blocks, blocks substitution matrices (or BLOSUMs, for short) were generated.
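
      The idea of looking for substitution patterns inside an ungapped block can be made concrete with a small sketch. The code below is only an illustration, not the Henikoffs' implementation: the function name `count_block_pairs` and the toy block are hypothetical, and the tally is a plain count of residue pairs per column, before any clustering or weighting is applied.

```python
from collections import Counter
from itertools import combinations


def count_block_pairs(block):
    """Count unordered residue pairs in every column of an ungapped block.

    `block` is a list of equal-length sequence segments, one per protein.
    """
    lengths = {len(segment) for segment in block}
    if len(lengths) != 1:
        # A block is an alignment without gaps, so every row has the same length.
        raise ValueError("all segments in a block must have the same length")

    pairs = Counter()
    for column in zip(*block):                # walk the alignment column by column
        for a, b in combinations(column, 2):  # every pair of rows in this column
            pairs[tuple(sorted((a, b)))] += 1  # (A, G) and (G, A) are the same pair
    return pairs


# Hypothetical toy block of three aligned segments (not real data).
toy_block = ["AVLK", "AVIK", "GVLR"]
pair_counts = count_block_pairs(toy_block)
print(pair_counts[("V", "V")])  # identical pairs in the fully conserved column
print(pair_counts[("I", "L")])  # observed I/L substitutions
```

      A column with three rows contributes three pairs, so a fully conserved column adds only identity pairs, while a variable column adds substitution pairs to the tally.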

      Returning to the point of directly deriving the various matrices, each BLOSUM matrix is assigned a number (BLOSUMn), and that number represents the conservation level of the sequences that were used to derive that particular matrix. For example, the BLOSUM62 matrix is calculated from sequences sharing no more than 62% identity; sequences with more than 62% identity are clustered, and the combined contribution of each cluster is weighted to 1. The clustering reduces the contribution of closely related sequences, meaning that there is less bias toward substitutions that occur (and may be over-represented) in the most closely related members of a family. Reducing the value of n therefore yields a matrix derived from more distantly related sequences.
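
      The clustering step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the published procedure: the helper names (`percent_identity`, `cluster_weights`), the single-linkage grouping, the strict "more than n%" boundary, and the toy segments are all choices made here for the example. Each segment receives a weight of 1 divided by the size of its cluster, so every cluster contributes a total of 1.

```python
def percent_identity(a, b):
    """Percent of aligned positions at which two equal-length segments match."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def cluster_weights(block, threshold=62.0):
    """Return one weight per segment so that each cluster sums to 1.

    Assumption: segments are linked when their identity is strictly greater
    than `threshold`, and linked segments are merged transitively
    (single linkage) using a small union-find structure.
    """
    n = len(block)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if percent_identity(block[i], block[j]) > threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    roots = [find(i) for i in range(n)]
    sizes = {root: roots.count(root) for root in set(roots)}
    return [1.0 / sizes[root] for root in roots]


# Hypothetical toy segments: the first two are 80% identical, the third is distant.
toy_block = ["AVLKD", "AVLKE", "GTIRE"]
weights = cluster_weights(toy_block, threshold=62.0)
print(weights)  # the two similar segments share one unit of weight
```

      With these weights, the two near-identical segments together count as much as the single divergent one, which is the effect the clustering is meant to achieve.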

      Which Matrices Should be Used When?

       PAM250 is roughly equivalent to BLOSUM45

       PAM160 is roughly equivalent to BLOSUM62

       PAM120 is roughly equivalent to BLOSUM80

      In addition to the protein matrices discussed here, there are numerous specialized matrices that are either specific to a particular species, concentrate on particular classes of proteins (e.g. transmembrane proteins), focus on structural substitutions, or use hydrophobicity measures in attempting to assess similarity (see Wheeler 2003). Given this landscape, the most important take-home message for the reader is that no single matrix is the complete answer for all sequence comparisons. A thorough understanding of what each matrix represents is critical to performing proper sequence-based analyses.

Matrix     Best use                                               Similarity
PAM40      Short alignments that are highly similar               70–90%
PAM160     Detecting members of a protein family                  50–60%
PAM250     Longer alignments of more divergent sequences          ∼30%
BLOSUM90   Short alignments that are highly similar               70–90%
BLOSUM80   Detecting members of a protein family                  50–60%
BLOSUM62   Most effective in finding all potential similarities   30–40%
BLOSUM30   Longer alignments of more divergent sequences          <30%

      The Similarity column gives the range of similarities that the matrix is able to best detect (Wheeler 2003).
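
      As a rough aid, the table can be turned into a small lookup. The sketch below is illustrative only: the names `MATRIX_GUIDE` and `suggest_matrices` are hypothetical, and reading "∼30%" as the range 25–35% is an assumption made here so that the entry can be tested numerically. Similarity values that fall between the tabulated ranges (for example 65%) return an empty list rather than a guess.

```python
# Rows transcribed from the table: (matrix, best use, test on percent similarity).
# ASSUMPTION: "~30%" for PAM250 is interpreted as 25-35% purely for illustration.
MATRIX_GUIDE = [
    ("PAM40", "Short alignments that are highly similar", lambda s: 70 <= s <= 90),
    ("PAM160", "Detecting members of a protein family", lambda s: 50 <= s <= 60),
    ("PAM250", "Longer alignments of more divergent sequences", lambda s: 25 <= s <= 35),
    ("BLOSUM90", "Short alignments that are highly similar", lambda s: 70 <= s <= 90),
    ("BLOSUM80", "Detecting members of a protein family", lambda s: 50 <= s <= 60),
    ("BLOSUM62", "Most effective in finding all potential similarities", lambda s: 30 <= s <= 40),
    ("BLOSUM30", "Longer alignments of more divergent sequences", lambda s: s < 30),
]


def suggest_matrices(similarity):
    """Return the matrices whose tabulated range covers `similarity` (in percent)."""
    if not 0 <= similarity <= 100:
        raise ValueError("similarity must be a percentage between 0 and 100")
    return [name for name, _use, fits in MATRIX_GUIDE if fits(similarity)]


print(suggest_matrices(80))  # ['PAM40', 'BLOSUM90']
print(suggest_matrices(55))  # ['PAM160', 'BLOSUM80']
print(suggest_matrices(65))  # [] because the table has no entry for this range
```

      The lookup only restates the table; it does not replace an understanding of what each matrix represents.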

      Nucleotide Scoring Matrices

      Gaps and Gap Penalties

      Often, gaps are introduced to improve the alignment between two nucleotide or protein sequences. These gaps compensate for insertions and deletions between the sequences being studied so,
