Bioinformatics. Group of authors
1. Hide: the track is not displayed at all.
2. Dense: all features are collapsed into a single line; features are not labeled.
3. Squish: each feature is shown separately, but at 50% the height of full mode; features are not labeled.
4. Pack: each feature is shown separately, but not necessarily on separate lines; features are labeled.
5. Full: each feature is labeled and displayed on a separate line.
Figure 4.2 The default view of the UCSC Genome Browser, showing the genomic context of the human HIF1A gene.
To simplify the display, most tracks are in hide mode by default. To change the mode, use the pull-down menu below the track name or on the Track Settings page. Other settings, such as color or annotation details, can also be configured on the Track Settings page. For example, the NCBI RefSeq track allows users to choose whether to view all reference sequences or only those that are curated or predicted (Box 1.2). One possible point of confusion is that the UCSC Genome Browser will “remember” the mode in which each track is displayed from session to session. Custom settings can be cleared by selecting Reset all User Settings under the Genome Browser pull-down menu at the top of any page.
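Display modes can also be set outside the pull-down menus, because the Genome Browser accepts track visibility settings as URL parameters of the form trackName=mode. The following is a minimal sketch of building such a link; the helper name browser_url is hypothetical, and the track identifiers knownGene and ncbiRefSeq are assumptions that should be checked against the Track Settings page of the assembly in use.

```python
from urllib.parse import urlencode

# Visibility keywords corresponding to the five display modes.
MODES = ("hide", "dense", "squish", "pack", "full")


def browser_url(db, position, track_modes,
                base="https://genome.ucsc.edu/cgi-bin/hgTracks"):
    """Build a Genome Browser URL that sets a display mode for each named track.

    db          -- assembly identifier, e.g. "hg38" for GRCh38
    position    -- a coordinate range or search term
    track_modes -- mapping of track identifier to one of MODES
    """
    for track, mode in track_modes.items():
        if mode not in MODES:
            raise ValueError(f"unknown display mode for {track}: {mode}")
    params = {"db": db, "position": position, **track_modes}
    return f"{base}?{urlencode(params)}"


# Track identifiers below are assumptions, used for illustration only.
url = browser_url("hg38", "HIF1A", {"knownGene": "pack", "ncbiRefSeq": "dense"})
print(url)
```

A link built this way sets the listed tracks explicitly, so it is less affected by modes that the browser has remembered from an earlier session.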
The annotation tracks in the window below the chromosome are the focus of the Genome Browser (Figure 4.2). Tracks are depicted horizontally, with a title above the track and labels on the left. The first two lines show the scale and chromosomal position. The term that was searched for and matched (HIF1A in this case) is highlighted on the annotation tracks. The next tracks shown by default are gene prediction tracks. The default gene track on GRCh38 is the GENCODE Genes set, which replaces the UCSC Genes track that is still displayed on GRCh37 and older human assemblies. GENCODE genes are annotated using a combination of computational analysis and manual curation, and are used by the ENCODE Consortium and other groups as reference gene sets (Box 4.2). The GENCODE v24 track depicts all of the gene models from the GENCODE v24 release, which includes both protein-coding genes and non-coding RNA genes.
Box 4.2 GENCODE
The GENCODE gene set was originally developed by the ENCODE Consortium as a comprehensive source of high-quality human gene annotations (Harrow et al. 2012). It has now been expanded to include the mouse genome (Mudge and Harrow 2015). The goal of the GENCODE project is to include all alternative splice variants of protein-coding loci, as well as non-coding loci and pseudogenes. The GENCODE Consortium uses computational methods, manual curation, and experimental validation to identify these gene features. The first step is carried out by the same Ensembl gene annotation pipeline that is used to annotate all vertebrate genomes displayed at Ensembl (Aken et al. 2016). This pipeline aligns cDNAs, proteins, and RNA-seq data to the human genome in order to create candidate transcript models. All Ensembl transcript models are supported by experimental evidence; no models are created solely from ab initio predictions. The Human and Vertebrate Analysis and Annotation (HAVANA) group produces manually curated gene sets for several vertebrate genomes, including mouse and human. These manually curated genes are merged with the Ensembl transcript models to create the GENCODE gene sets for mouse and human. A subset of the human models has been confirmed by an experimental validation pipeline (Howald et al. 2012).
The consortium makes available two types of GENCODE gene sets. The Comprehensive set encompasses all gene models, and may include many alternatively spliced transcripts (isoforms) for each gene. The Basic set includes a subset of representative transcripts for each gene that prioritizes full-length protein-coding transcripts over partial- or non-protein-coding transcripts. The Ensembl Genome Browser displays the Comprehensive set by default. Although the UCSC Genome Browser displays the Basic set by default, the Comprehensive set can be selected by changing the GENCODE track settings. At the time of this writing, Ensembl is displaying GENCODE v27, released in August 2017. The GENCODE version available by default at the UCSC Genome Browser is v24, from December 2015. More recent versions of GENCODE can be added to the browser by selecting them in the All GENCODE super-track.
GENCODE and RefSeq both aim to provide a comprehensive gene set for mouse and human. Frankish et al. (2015) have shown that, in human, the RefSeq gene set is more similar to the GENCODE Basic set, while the GENCODE Comprehensive set contains more alternative splicing and exons, as well as more novel protein-coding sequences, thus covering more of the genome. They also sought to determine which gene set would provide the best reference transcriptome for annotating variants. They found that the GENCODE Comprehensive set, because of its better genomic coverage, was better for discovering new variants with functional potential, while the GENCODE Basic set may be better suited for applications where a less complex set of transcripts is needed. Similarly, Wu et al. (2013) compared the use of different gene sets to quantify RNA-seq reads and determine gene expression levels. Like Frankish et al., they recommend using less complex gene annotations (such as the RefSeq gene set) for gene expression estimates, but more complex gene annotations (such as GENCODE) for exploratory research on novel transcriptional or regulatory mechanisms.
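The distinction between the Basic and Comprehensive sets can also be applied when working with GENCODE annotation files directly. In the GENCODE GTF files, transcripts that belong to the Basic set carry the attribute tag "basic". The sketch below keeps only those transcript records; the function name is hypothetical, and the two example records use placeholder identifiers and coordinates rather than real annotations.

```python
import re

# Matches every  tag "value";  pair in the GTF attribute column.
TAG_RE = re.compile(r'tag "([^"]+)"')


def is_basic_transcript(gtf_line):
    """Return True if the line is a transcript record tagged as 'basic'."""
    if gtf_line.startswith("#"):          # skip header/comment lines
        return False
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) < 9 or fields[2] != "transcript":
        return False
    return "basic" in TAG_RE.findall(fields[8])


# Placeholder records for illustration only; not real GENCODE entries.
example = [
    'chr14\tHAVANA\ttranscript\t100\t900\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "T1"; gene_name "HIF1A"; tag "basic";',
    'chr14\tHAVANA\ttranscript\t100\t700\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "T2"; gene_name "HIF1A"; tag "mRNA_end_NF";',
]

basic = [line for line in example if is_basic_transcript(line)]
print(len(basic))
```

Filtering in this way yields the less complex transcript set; leaving the file unfiltered corresponds to the Comprehensive set.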
In the GENCODE track, as well as other gene tracks, exons (regions of the transcript that align with the genome) are depicted as blocks, while introns are drawn as the horizontal lines that connect the exons. The direction of transcription is indicated by arrowheads on the introns. Coding regions of exons are depicted as tall blocks, while non-coding exons are shorter. In this example, the GENCODE track depicts five alternatively spliced transcripts, labeled HIF1A on the left, for the HIF1A gene. As shown by the arrowheads, all transcripts are transcribed from left to right. The 5′-most exon of each transcript (on the left side of the display) is shorter on the left, indicating an untranslated region (UTR), and taller on the right, indicating a coding sequence. The reverse is true for the 3′-most exon of each transcript. A very close visual inspection of the Genome Browser shows that the last four HIF1A transcripts each have a different pattern of exons; a BLAST search (not shown) reveals that the first two transcripts differ by only three nucleotides in one exon. There is also a transcript labeled HIF1A-AS2, an anti-sense HIF1A transcript that is transcribed from right to left. Another transcript, labeled RP11-618G20.1, is a synthetic construct DNA. Zooming the display out by 3× allows a view of the genes immediately upstream and downstream of HIF1A (Figure 4.3). A second HIF1A antisense transcript, HIF1A-AS1, is also visible.
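The tall and short portions of an exon follow from where the coding sequence begins and ends within the transcript. As a rough sketch of that logic, the function below splits each exon into UTR and coding segments, given the start and end of the coding region. The function name, the 0-based half-open coordinate convention, and all coordinates are assumptions chosen for illustration; they are not the real HIF1A coordinates.

```python
def split_exon(exon_start, exon_end, cds_start, cds_end):
    """Split one exon into (start, end, kind) segments.

    kind is "utr" (drawn as a short block) or "coding" (drawn as a tall block).
    Coordinates are assumed to be 0-based and half-open.
    """
    code_start = max(exon_start, cds_start)
    code_end = min(exon_end, cds_end)
    if code_start >= code_end:
        # The exon lies entirely outside the coding region.
        return [(exon_start, exon_end, "utr")]
    segments = []
    if exon_start < code_start:
        segments.append((exon_start, code_start, "utr"))
    segments.append((code_start, code_end, "coding"))
    if code_end < exon_end:
        segments.append((code_end, exon_end, "utr"))
    return segments


# Hypothetical three-exon transcript on the forward strand.
exons = [(100, 250), (400, 500), (700, 900)]
cds = (200, 780)
for start, end in exons:
    print(split_exon(start, end, *cds))
```

For a transcript read from left to right, the first exon comes out as a UTR segment followed by a coding segment, the internal exon is entirely coding, and the last exon is a coding segment followed by a UTR segment, which matches the short-then-tall and tall-then-short shapes described above.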
The track below the GENCODE track is the RefSeq gene predictions from NCBI track. This is a composite track showing human protein-coding and non-protein-coding genes taken from the NCBI RNA reference sequences collection (RefSeq; Box 1.2). By default, the RefSeq track is shown in dense mode, with the exons of the individual transcripts condensed into a single line (Figure 4.2). Note that, in this dense mode, the exons are displayed as blocks, as