Bioinformatics. Group of authors
NM_* in the name field. The asterisk is a wildcard character that matches any text. Thus, this setting will limit the results to those curated RefSeqs whose name begins with NM_. (c) Create an intersection between the RefSeq track and the variants from the GWAS Catalog. Click on the intersection button shown in Figure 4.12a and select the appropriate track. The group is Phenotype and Literature and the track is called GWAS Catalog. Leave other selections set to the default. (d) Click on the get output button shown in Figure 4.12a. The output is a list of more than 3000 RefSeq mRNAs that overlap with a variant from the GWAS Catalog. Each RefSeq is hyperlinked to the Genome Browser. (e) The first link is to NM_001042682.1, a transcript of the gene arginine–glutamic acid dipeptide (RE) repeats (RERE). The genomic context of RERE shows the eight SNPs from the GWAS Catalog that it overlaps.
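The filter and intersection steps described in this caption can be sketched in a few lines of Python. This is only an illustration of the logic, not how the Table Browser is implemented: the record names, coordinates, and the helper functions filter_by_name and intersect are all hypothetical, and positions are assumed to be 0-based with exclusive ends.

```python
from fnmatch import fnmatchcase

# Hypothetical toy records; names and coordinates are illustrative only.
# Transcripts: (name, chromosome, start, end), 0-based start, exclusive end.
transcripts = [
    ("NM_TOY1", "chr1", 100, 500),
    ("NR_TOY2", "chr1", 450, 900),
    ("NM_TOY3", "chr2", 100, 300),
]
# Variants: (identifier, chromosome, 0-based position).
variants = [
    ("rsTOY1", "chr1", 250),
    ("rsTOY2", "chr1", 700),
    ("rsTOY3", "chr2", 900),
]


def filter_by_name(records, pattern):
    """Keep records whose name matches a shell-style wildcard pattern."""
    return [record for record in records if fnmatchcase(record[0], pattern)]


def intersect(records, points):
    """Map each record name to the variants that fall inside its interval."""
    hits = {}
    for name, chrom, start, end in records:
        overlapping = [
            variant_id
            for variant_id, variant_chrom, position in points
            if variant_chrom == chrom and start <= position < end
        ]
        if overlapping:
            hits[name] = overlapping
    return hits


curated = filter_by_name(transcripts, "NM_*")  # step (b): name filter
overlaps = intersect(curated, variants)        # step (c): intersection
print(overlaps)
```

With the toy data above, only NM_TOY1 survives both the name filter and the intersection, because NR_TOY2 fails the NM_* pattern and NM_TOY3 contains no variant.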
UCSC also provides a related tool called the Data Integrator. The Data Integrator has a more sophisticated intersection function than does the Table Browser, as it can intersect data from up to five separate tracks, and output fields from both the selected tracks and related tables. Thus, for example, output from the Data Integrator could include the gene symbol in addition to the accession number for each transcript on the RefSeq track, along with the dbSNP identifier for the variants in the GWAS Catalog. However, the Data Integrator does not allow for filtering, so it is not possible to restrict the output to only RefSeq mRNA genes.
ENSEMBL Genome Browser
The Ensembl Genome Browser (Cunningham et al. 2019) got its start in 1999 (Hubbard et al. 2002) with the display of the human genome assembly. Like the UCSC Genome Browser, it has grown significantly over the years. The main Ensembl site focuses on vertebrates and includes assemblies from almost 90 species. Ensembl has also created specialized sibling databases for other groups of organisms, including EnsemblPlants (nearly 50 species), EnsemblMetazoa (nearly 70 species), EnsemblProtist (more than 100 species), and EnsemblFungi (more than 800 species), and the very large EnsemblBacteria, with around 44 000 species. The amount of available genome data and annotations varies by organism, but the general browser navigation principles are the same for all. An additional resource is Pre!Ensembl, which displays genomes that are in the process of being annotated. Genomes on this site have an assembly and BLAST interface but, for the most part, no gene predictions.
Like the UCSC Genome Browser, the Ensembl Browser makes available multiple versions of genome assemblies. Integrated into the assemblies may be gene, genome variation, gene regulation, and comparative genomics annotation. Annotations are organized as sets of tracks. Ensembl incorporates data from a variety of public sources, including NCBI, UCSC, model organism databases, and more, and updates data and software in a formal release process, which can be tracked by release number. Importantly, previous Ensembl releases are archived on the web site and remain available for viewing. Thus, even after a genome assembly or annotation set has been updated, it is possible to view the older data using all the regular functions of the Ensembl web site. This archive process sets Ensembl apart from UCSC, where the genome assembly remains stable, but the annotations may change on a weekly basis. Each Ensembl page has a link at the bottom called View in archive site. The archive site provides links to older versions of that page, including previous annotation sets on the same genome assembly, as well as prior genome assemblies.
The Ensembl Browser provides many of the same types of resources and tools as does the UCSC Genome Browser. Sequences can be aligned to the assembled genomes using either BLAT or BLAST, and data can be returned in various tabular formats using BioMart (Kinsella et al. 2011). Data and software can be retrieved from the Downloads menu, available from most browser pages. In the Tools menu, Ensembl provides a number of additional tools to manipulate data, including the Variant Effect Predictor (VEP) (McLaren et al. 2016), which predicts functional consequences of known and unknown variants, File Chameleon, which reformats files available on the Ensembl FTP site, and Assembly Converter, which is like UCSC's liftOver and is used to convert coordinates between genome assemblies. The Help & Documentation menu provides substantial written and video-based information about how to navigate and interpret the Ensembl site, far beyond the level of detail presented in this chapter.
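To make the idea of converting coordinates between assemblies concrete, the following toy sketch maps a position through a list of aligned blocks. It is a deliberate simplification and not the Assembly Converter or liftOver algorithm: real tools work from chain files and handle strands, multiple chromosomes, and partial overlaps. The block values and the function convert_position are hypothetical.

```python
# Each block says: `length` bases starting at `old_start` in the old assembly
# correspond to the bases starting at `new_start` in the new assembly.
# Hypothetical values for a single chromosome, 0-based coordinates.
BLOCKS = [
    (0, 0, 1000),       # unchanged region
    (1000, 1200, 500),  # 200 bases were inserted upstream in the new assembly
    (1600, 1700, 400),  # old positions 1500-1599 have no counterpart
]


def convert_position(position, blocks=BLOCKS):
    """Return the new-assembly position, or None if the base is unmapped."""
    for old_start, new_start, length in blocks:
        if old_start <= position < old_start + length:
            return new_start + (position - old_start)
    return None


for old in (500, 1100, 1550, 1650):
    print(old, "->", convert_position(old))
```

A position that falls between blocks returns None, which mirrors the fact that some coordinates in one assembly simply have no equivalent in another.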
Ensembl also provides ways for users to upload their data into the browser. Properly formatted tracks can be added to the display by selecting the Custom tracks option from the left side of any species-specific page. The data can be uploaded to Ensembl from a file on the user's computer or, if it is saved on a web server, the browser can read it from a URL. Users who create an account at Ensembl can save track data to the Ensembl database server and view them later from any computer. To share custom tracks or even a customized view of the Genome Browser with colleagues, click on the Share this Page link on the left sidebar. Ensembl also supports Track Hubs, both public ones that are registered on the EMBL-EBI Track Hub Registry as well as private ones.
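As an example of what a properly formatted track might look like, the sketch below writes a small file in BED format, one of the common text formats for custom tracks. The helper names, the track name, and the features are hypothetical, and the chromosome naming (chr12 rather than 12) is an assumption; consult the Ensembl help pages for the formats and naming conventions currently accepted.

```python
def to_bed_line(chrom, start, end, name, score=0, strand="+"):
    """Format one feature as a six-column BED line (0-based start, exclusive end)."""
    if start < 0 or end <= start:
        raise ValueError("BED requires 0 <= start < end")
    if strand not in ("+", "-", "."):
        raise ValueError("strand must be '+', '-', or '.'")
    return "\t".join(str(field) for field in (chrom, start, end, name, score, strand))


def build_bed_track(track_name, features):
    """Return the text of a BED file: a track line followed by one line per feature."""
    lines = [f'track name="{track_name}"']
    lines.extend(to_bed_line(*feature) for feature in features)
    return "\n".join(lines) + "\n"


# Hypothetical features; coordinates are illustrative, not real annotations.
features = [
    ("chr12", 1000, 2000, "toy_region_1"),
    ("chr12", 5000, 5600, "toy_region_2", 0, "-"),
]
bed_text = build_bed_track("toy_custom_track", features)
print(bed_text, end="")
```

The resulting text could be saved to a file for upload from the user's computer, or placed on a web server so that the browser can read it from a URL.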
Figure 4.13 The home page of the Ensembl Genome Browser, showing a query for the human gene PAH. The browser suggests results based on the search term submitted. By default, the search box interfaces with the most recent version of the genome assembly, GRCh38, at the time of this writing. A link to the previous human genome assembly, GRCh37, is provided at the bottom of the page. Older assemblies from other organisms are available in the Ensembl archives.
Figure 4.14 The Gene tab for the human PAH gene. This landing page provides links to many gene-specific resources.
Like the UCSC Genome Browser home page, the home page of Ensembl is a stepping-off point for many Ensembl resources. Links to commonly used tools, such as BLAST and BLAT, are provided on the top and middle sections of the page, and recent data updates are highlighted in the right column. The home page for each genome can be accessed by selecting the organism name in the pull-down menu in the Browse a Genome section in the center of the page. A search box at the top of the page provides access to Ensembl. To search for the human PAH gene, select Human from the pull-down menu and type the term PAH in the search box. Ensembl will provide several suggested hits, including a direct link to the human PAH gene (Figure 4.13).
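The same gene lookup can also be attempted programmatically. The sketch below assumes the public Ensembl REST service at rest.ensembl.org and its lookup-by-symbol endpoint, neither of which is covered in this chapter; the function names are hypothetical, and the fields of the returned record should be checked against the current REST documentation rather than taken from this example.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumption: the public Ensembl REST server and its lookup/symbol endpoint.
REST_SERVER = "https://rest.ensembl.org"


def build_lookup_url(species, symbol):
    """Build the URL that looks up a gene by species name and gene symbol."""
    return f"{REST_SERVER}/lookup/symbol/{quote(species)}/{quote(symbol)}"


def fetch_gene(species, symbol):
    """Request the gene record as JSON (requires network access)."""
    request = Request(
        build_lookup_url(species, symbol),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(request, timeout=30) as response:
        return json.load(response)


url = build_lookup_url("homo_sapiens", "PAH")
print(url)
# To retrieve the record itself, call: fetch_gene("homo_sapiens", "PAH")
```

Only the URL is built when the script runs; the network call is left to the reader so that the example works offline.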
Ensembl data displays are organized in tabs. The Gene tab (Figure 4.14) has links to a number of gene-specific views and resources. For