Bioinformatics. Group of authors
Figure 4.15 Computationally predicted orthologs of the human PAH gene, from the Comparative Genomics → Orthologues link in Figure 4.14. Ensembl provides a detailed analysis of the orthologs calculated for each gene. Orthologs are grouped by species, such as primates, rodents, and sauropsids. Links to individual orthologs are shown at the bottom of the page.
The default human gene set used by Ensembl is the GENCODE Comprehensive set (Box 4.2). Ensembl displays 18 PAH isoforms, each with a slightly different pattern of exons (Figure 4.16). Coding exons are depicted as solid blocks, non-coding exons as outlined blocks, and introns as the lines that connect them. The transcripts are color coded to indicate their status: gold transcripts are protein coding and have been annotated by both the Ensembl and HAVANA teams at the WTSI, red transcripts are protein coding and have been annotated by either Ensembl or HAVANA, and blue transcripts are processed transcripts that are non-protein coding. Clicking on a transcript opens a pop-up box with additional information about that feature, including its accession number, transcript type, and gene prediction source (Box 4.4; Figure 4.16).
Figure 4.16 The Location tab for the human PAH gene. The Location tab is divided into three sections. The top section shows a cartoon of human chromosome 12, with the region surrounding the PAH gene outlined in a red box. Other red and green lines on the cartoon indicate assembly exceptions, or regions of alternative sequence that differ from the primary assembly because of allelic sequence or incorrect sequence, as determined by the Genome Reference Consortium. The Region in detail shows a zoomed-in view of the region outlined by the red box in the top section of the page. Genes are indicated by rectangles, colored as described in the gene legend below the graphic. The gene identifiers, along with the direction of transcription, are shown below the rectangles. The bottom section shows a zoomed-in view of the region surrounded by the red box in the Region in detail. The blue bar represents the genomic contig in this region. In the Genes track, genes above the bar are transcribed from left to right; those below the contig are transcribed from right to left. A few of the PAH transcripts, which are transcribed from right to left, are visible in this view. Gold transcripts are merged HAVANA/Ensembl transcripts; red are Ensembl protein-coding transcripts; blue transcripts are non-protein-coding processed transcripts. The pop-up display, activated when clicking on a particular transcript, shows the details for the first transcript in the Genes track, PAH-215.
Box 4.4 Ensembl Stable IDs
Ensembl assigns accession numbers to many data types in its database. Each identifier begins with the organism prefix; for human, the prefix is ENS; for mouse, it is ENSMUS; and for anole lizard, it is ENSACA.
Next comes an abbreviation for the feature type: G for gene, T for transcript, P for protein, R for regulatory, and so forth. This is followed by a series of digits, and an optional version. The version number increments when there is a change in the underlying data. The gene version changes when the underlying transcripts are updated, and the transcript and protein versions increment when the sequence changes.
For example, the human PAH gene has the following identifiers:
ENSG00000171759.9: the identifier of the human PAH gene
ENST00000553106.5: the identifier of one transcript of the human PAH gene, transcript PAH-215
ENSP00000448059.1: the identifier of the protein translation of transcript PAH-215, ENST00000553106.5
ENSR00000056420: the identifier of a promoter of several PAH transcripts
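The identifier layout described in this box can be illustrated with a short Python sketch that splits a stable ID into its organism prefix, feature type, digits, and optional version. This is not an Ensembl tool: the function name `parse_stable_id`, the `StableId` tuple, and the regular expression are hypothetical. The sketch assumes that the species part of the prefix, when present, is three or four uppercase letters, and it recognizes only the four feature types named above.

```python
import re
from typing import NamedTuple, Optional

# Feature-type abbreviations named in the box; Ensembl defines others as well.
FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "protein", "R": "regulatory"}

# Assumption: the organism prefix is "ENS" plus an optional species code of
# 3-4 uppercase letters (empty for human), followed by one feature letter,
# a run of digits, and an optional ".version" suffix.
STABLE_ID = re.compile(r"(ENS(?:[A-Z]{3,4})?)([GTPR])(\d+)(?:\.(\d+))?")


class StableId(NamedTuple):
    prefix: str             # organism prefix, e.g. "ENS" or "ENSMUS"
    feature: str            # spelled-out feature type, e.g. "gene"
    number: str             # digit string, kept as text to preserve leading zeros
    version: Optional[int]  # None when the identifier carries no version


def parse_stable_id(text: str) -> StableId:
    """Split a stable ID into its parts, or raise ValueError if it does not fit."""
    match = STABLE_ID.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognized stable ID: {text!r}")
    prefix, code, number, version = match.groups()
    return StableId(prefix, FEATURE_TYPES[code], number,
                    int(version) if version is not None else None)


if __name__ == "__main__":
    # The four PAH-related identifiers listed in the box.
    for identifier in ("ENSG00000171759.9", "ENST00000553106.5",
                       "ENSP00000448059.1", "ENSR00000056420"):
        print(identifier, parse_stable_id(identifier))
```

Keeping the digits as a string preserves the leading zeros, so the unversioned identifier can be rebuilt by concatenating the prefix, the feature letter, and the digits.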
Navigation controls between the second and third panels of the Location tab allow the display to be zoomed or moved to the left or right. The blue bar at the top of the Region in detail allows users to toggle between Drag and Select. When the Drag option is highlighted, click on the graphical view window and drag it to the left or right to change the location. When the Select option is highlighted, click on a region of interest in the graphical view and then, holding the mouse button down, drag to the left or right to highlight the region (Figure 4.17a). The highlight can be left on for visualization purposes; alternatively, select Jump to region to zoom in to the selected region. Figure 4.17b shows the result of zooming in to the last exon of transcript PAH-203; since the gene is transcribed from right to left, the last exon is on the left. Note the track called All phenotype-associated short variants (SNPs and indels), which contains variants that have been associated with a phenotype or disease. SNPs are color coded by function, with dark green indicating coding sequence variants. Select the dark green SNP, highlighted with a red box near the left end of the window, and follow the link for additional information. The resulting Variant tab provides links to SNP-related resources. For example, the Phenotype Data for this SNP (rs76296470; Figure 4.18a) shows that this variant is pathogenic and is associated with the disease phenylketonuria. The most severe consequence for this SNP is a stop gained. Further details about the consequences are available under