Bioinformatics. Group of authors
Figure 3.15 Selecting algorithm parameters for a PSI-BLAST search. See text for details.
As an example, consider a case where an investigator wishes to map a cDNA clone from the Cancer Genome Anatomy Project (CGAP) to the rat genome. The BLAT query page is shown in Figure 3.18, with the sequence of the clone of interest pasted into the sequence box. Above the sequence box are several pull-down menus that specify which genome should be searched (organism), which assembly should be used (usually the most recent), and the query type (DNA, protein, translated DNA, or translated RNA). Once the appropriate choices have been made, the search is started by pressing the “Submit” button. The results of the query are shown in the upper panel of Figure 3.19; the hit with the highest score appears at the top of the list, a match having 98.1% identity with the query sequence. More details on this hit can be found by clicking the “details” hyperlink to the left of the entry. A long web page is then returned, providing information on the original query, the genomic sequence, and an alignment of the query against the matching genomic sequence (Figure 3.19, bottom panel). The genomic sequence here is labeled chr5, meaning that the query corresponds to a region of rat chromosome 5. Matching bases in the cDNA and genomic sequences are colored dark blue and capitalized. Lighter blue uppercase bases mark the boundaries of aligned regions and often signify splice sites. Gaps and unaligned regions are indicated by lowercase black type. In the Side by Side Alignment, exact matches are indicated by a vertical line between the two sequences. Clicking the “browser” hyperlink in the upper panel of Figure 3.19 takes the user to the UCSC Genome Browser, where detailed information about the genomic assembly in this region of rat chromosome 5 (specifically, at 5q31) can be obtained (cf. Chapter 4).
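The side-by-side display convention described above, with a vertical line marking each exact match, can be illustrated with a short Python sketch. The function names and the toy sequences are hypothetical, and the identity computed here is a simple fraction of matching non-gap columns; it is not claimed to reproduce the exact calculation BLAT uses for its reported percent identity.

```python
def side_by_side(query: str, target: str, width: int = 60) -> str:
    """Render two aligned, equal-length strings with '|' between exact matches."""
    if len(query) != len(target):
        raise ValueError("aligned sequences must have equal length")
    # A column counts as a match only if both characters agree and are not gaps.
    match_line = "".join(
        "|" if q == t and q != "-" else " "
        for q, t in zip(query.upper(), target.upper())
    )
    blocks = []
    for start in range(0, len(query), width):
        end = start + width
        blocks.append(
            "\n".join([query[start:end], match_line[start:end], target[start:end]])
        )
    return "\n\n".join(blocks)


def percent_identity(query: str, target: str) -> float:
    """Percentage of non-gap aligned columns in which the two bases are identical."""
    if len(query) != len(target):
        raise ValueError("aligned sequences must have equal length")
    pairs = [
        (q, t)
        for q, t in zip(query.upper(), target.upper())
        if q != "-" and t != "-"
    ]
    if not pairs:
        return 0.0
    matches = sum(q == t for q, t in pairs)
    return 100.0 * matches / len(pairs)


if __name__ == "__main__":
    # Toy aligned sequences (hypothetical, not taken from the figure).
    cdna = "ACGT-ACGT"
    genomic = "ACGTTACCT"
    print(side_by_side(cdna, genomic))
    print(percent_identity(cdna, genomic))  # 87.5
```

Gap columns are excluded from the denominator here; other tools make different choices, which is one reason identity values are not always comparable across programs.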
Figure 3.16 Results of the first round of a PSI-BLAST search. For each sequence found, the user is presented with the definition line from the corresponding UniProtKB/Swiss-Prot entry, the score value for the best high-scoring segment pair (HSP) alignment, the total of all scores across all HSP alignments, the percentage of the query covered by the HSPs, and the E value and percent identity for the best HSP alignment. The hyperlinked accession number allows for direct access to the source database record for that hit. Sequences whose “Select for PSI blast” box is checked will be used to calculate a position-specific scoring matrix (PSSM), and that PSSM then serves as the new “query” for the next round, the results of which are shown in Figure 3.17.
Figure 3.17 Results of the second round of a PSI-BLAST search. New sequences identified through the use of the position-specific scoring matrix (PSSM) calculated based on the results shown in Figure 3.16 are highlighted in yellow. Check marks in the right-most column indicate which sequences were used to build the PSSM producing these results.
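The idea of deriving a PSSM from the selected sequences, as described in the captions of Figures 3.16 and 3.17, can be sketched in a few lines of Python. This is a deliberately simplified illustration under stated assumptions: the input is a gap-free block of equal-length aligned segments, the background frequencies are uniform, and a fixed pseudocount is added to every residue. PSI-BLAST itself uses sequence weighting and more elaborate pseudocount schemes, so the names `build_pssm` and `score_segment` and all parameter values are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0, background=None):
    """Return one dict per alignment column mapping residue -> log2-odds score."""
    if not aligned:
        raise ValueError("need at least one aligned sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    if background is None:
        # Assumption: uniform background frequencies, for illustration only.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

    pssm = []
    for col in range(length):
        # Count residues observed in this column, ignoring unknown symbols.
        counts = Counter(seq[col] for seq in aligned if seq[col] in background)
        total = sum(counts.values()) + pseudocount * len(background)
        column = {}
        for aa, bg in background.items():
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / bg)
        pssm.append(column)
    return pssm


def score_segment(pssm, segment):
    """Sum the column scores for a segment; unknown residues contribute 0."""
    return sum(col.get(res, 0.0) for col, res in zip(pssm, segment))


if __name__ == "__main__":
    # Toy alignment (hypothetical sequences).
    selected = ["ACD", "ACE", "ACD"]
    matrix = build_pssm(selected)
    print(score_segment(matrix, "ACD"))
    print(score_segment(matrix, "WWW"))
```

Residues that are frequent in a column receive positive scores and residues never observed there receive negative ones, which is what lets the matrix act as a position-aware "query" in the following round.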
Figure 3.18 Submitting a BLAT query. A rat clone from the Cancer Genome Anatomy Project Tumor Gene Index (CB312815) is the query. The pull-down menus at the top of the page can be used to specify which genome should be searched (organism), which assembly should be used (usually, the most recent), and the query type (DNA, protein, translated DNA, or translated RNA). The “I'm feeling lucky” button returns only the highest scoring alignment and provides a direct path to the UCSC Genome Browser.
FASTA
While the most commonly used technique for detecting similarity between sequences is BLAST, it is not the only heuristic method that can be used to rapidly and accurately compare sequences with one another. In fact, the first widely used program designed for database similarity searching was FASTA (Lipman and Pearson 1985; Pearson and Lipman 1988; Pearson 2000). Like BLAST, FASTA enables the user to rapidly compare a query sequence against large databases, and various versions of the program are available (Table 3.3). In addition to the main implementations, a variety of specialized FASTA versions are available, described in detail in Pearson (2016). An interesting historical note is that the FASTA format for representing nucleotide and protein sequences originated with the development of the FASTA algorithm.
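Since the FASTA format is mentioned here, a minimal Python sketch of reading it may be useful. It relies only on the widely used convention that each record starts with a header line beginning with ">" followed by one or more lines of sequence. The function name `read_fasta` and the sample records are hypothetical, and the sketch does not attempt to handle every variant of the format found in practice.

```python
import io
from typing import Iterable, Iterator, Tuple


def read_fasta(handle: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    # Hypothetical two-record example held in memory instead of a file.
    text = ">seq1 example nucleotide record\nACGT\nTTGA\n>seq2\nMKV\n"
    for name, seq in read_fasta(io.StringIO(text)):
        print(name, len(seq))
```

Because the function accepts any iterable of lines, the same code works on an open file object, for example `read_fasta(open(path))` with a path supplied by the user.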
Figure 3.19 Results of a BLAT query. Based on the query submitted in Figure 3.18, the highest scoring hit is to a sequence on chromosome 5 of the rat genome having 98.1% sequence identity. Clicking on the “details” hyperlink brings the user to additional information on the found sequence, shown in the lower panel. Matching bases in the cDNA and genomic sequences are colored in dark blue and are capitalized. Lighter blue uppercase bases mark the boundaries of aligned regions and often signify splice sites. Gaps are indicated by lowercase black type. In the side-by-side alignment, exact matches are indicated by the vertical line between the sequences.
Table 3.3 Main FASTA algorithms.
Program | Query | Database | Corresponding BLAST Program |
FASTA | Nucleotide | Nucleotide | BLASTN |
FASTA | Protein | Protein | BLASTP |
FASTX/FASTY | DNA | Protein | BLASTX |
TFASTX/TFASTY | Protein | Translated DNA | TBLASTN |
The Method
The FASTA algorithm can be divided into four major steps. In the first step, FASTA determines all overlapping words of a certain length both in the query sequence