Bioinformatics. Group of authors
The home page of the UCSC Genome Browser provides a stepping-off point for many of the resources developed by the Genome Bioinformatics group at UCSC, including the Genome Browser, BLAT, and the Table Browser, which will be described in detail later in this chapter. The Tools menu provides a link to liftOver, a widely used tool that converts genomic coordinates from one assembly to another. Using this tool, it is possible to update annotation files so that old data can be integrated into a new genome assembly. The Download menu provides an option to download all the sequence and annotation data for each genome assembly hosted by UCSC, as well as some of the source code. The What's New section provides updates on new genome assemblies, as well as new tools and features. Finally, there is an extensive Help menu, with detailed documentation as well as videos. Users may also submit questions to a mailing list, and most queries are answered within a day.
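The idea behind coordinate conversion can be illustrated with a toy sketch. It assumes the relationship between two assemblies is given as a list of ungapped aligned blocks on the forward strand. The `Block` layout, the function name, and the example coordinates are hypothetical and greatly simplified; this is not the liftOver implementation or its chain file format.

```python
from typing import List, Optional, Tuple

# Hypothetical representation of one ungapped aligned block:
# (start in old assembly, start in new assembly, block length), all 0-based.
Block = Tuple[int, int, int]


def lift_position(pos: int, blocks: List[Block]) -> Optional[int]:
    """Map a 0-based position from the old assembly to the new one.

    Returns None when the position falls outside every aligned block,
    i.e. it has no counterpart in the new assembly.
    """
    for old_start, new_start, length in blocks:
        if old_start <= pos < old_start + length:
            # Keep the same offset within the block.
            return new_start + (pos - old_start)
    return None


# Made-up example: 200 bases were inserted in the new assembly after
# position 1000, so everything downstream shifts by 200.
example_blocks = [(0, 0, 1000), (1000, 1200, 500)]

lifted_inside = lift_position(1100, example_blocks)
lifted_outside = lift_position(1600, example_blocks)
print(lifted_inside, lifted_outside)
```

Positions that fall in a gap between blocks cannot be converted, which is why a real conversion typically reports some features as unmapped.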
The UCSC Genome Browser provides multiple ways for both individual users and larger genome centers to share data with collaborators or even the entire bioinformatics community. These sharing options are available on the My Data link on the home page. Custom Tracks allow users to display their own data as a separate annotation track in the browser. User data must be formatted in a standard data structure in order to be interpreted correctly by the browser. Many commonly used file formats are supported, including Browser Extensible Data (BED), Binary Alignment/Map (BAM), and Variant Call Format (VCF; Box 4.1). Small data files can be uploaded or pasted into the Genome Browser for personal use. Larger files must be saved on the user's web server and accessed by URL through the Genome Browser. As anyone with the URL can access the data, this method can be used to share data with collaborators. Alternatively, Custom Tracks, along with track configurations and settings, can be shared with selected collaborators using a named Session. Some groups choose to make their Sessions available to the world at large in My Data → Public Sessions. Finally, groups with very large datasets can host their data in the form of a Track Hub so that it can be viewed on the UCSC Genome Browser. When a Track Hub is paired with an Assembly Hub, it can be used to create a browser for a genome assembly not already hosted by UCSC.
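As a rough sketch of what a small Custom Track might look like before it is pasted into the browser, the snippet below builds a `track` line followed by tab-delimited BED rows. The helper name, the track name and description, and the coordinates are placeholders invented for illustration, not real annotations.

```python
def make_custom_track(name, description, features):
    """Return custom-track text: one 'track' line plus tab-delimited BED rows.

    features: iterable of (chrom, start, end, label) tuples, where start is
    0-based and end is exclusive, as in BED format.
    """
    lines = [f'track name="{name}" description="{description}"']
    for chrom, start, end, label in features:
        if not 0 <= start < end:
            raise ValueError(f"invalid interval: {chrom}:{start}-{end}")
        lines.append(f"{chrom}\t{start}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Hypothetical regions on chromosome 14 (illustrative coordinates only).
example_features = [
    ("chr14", 1000, 1500, "regionA"),
    ("chr14", 2000, 2600, "regionB"),
]
track_text = make_custom_track("myTrack", "Example custom track", example_features)
print(track_text, end="")
```

The resulting text could be pasted into the Custom Tracks form, or saved to a file on a web server and loaded by URL.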
Box 4.1 Common File Types for Genomic Data
Both the UCSC and Ensembl Genome Browsers allow users to upload their own data so that they can be viewed in context with other genome-scale data. User data must be formatted in a commonly used data structure in order to be interpreted correctly by the browser.
Browser Extensible Data (BED) format is a tab-delimited format that is flexible enough to display many types of data. It can be used to display fairly simple features like the location of transcription factor binding sites, as well as more complex ones like transcripts and their exons.
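To make the distinction between simple and complex BED features concrete, here is a minimal parsing sketch. It assumes the standard BED column order, with a 0-based start and an exclusive end, and reads exon blocks from the 11th and 12th columns when they are present. The record type, function names, and example lines are invented for illustration.

```python
from typing import List, NamedTuple, Tuple


class BedRecord(NamedTuple):
    chrom: str
    start: int               # 0-based, inclusive
    end: int                 # exclusive
    name: str
    block_sizes: List[int]   # exon lengths (empty for simple features)
    block_starts: List[int]  # exon offsets relative to start


def parse_bed_line(line: str) -> BedRecord:
    """Parse one BED line with at least 3 columns; use blocks if 12 are given."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("BED lines need at least chrom, start, and end")
    name = fields[3] if len(fields) > 3 else ""
    sizes: List[int] = []
    starts: List[int] = []
    if len(fields) >= 12:
        # Block lists are comma-separated and may end with a trailing comma.
        sizes = [int(x) for x in fields[10].split(",") if x]
        starts = [int(x) for x in fields[11].split(",") if x]
    return BedRecord(fields[0], int(fields[1]), int(fields[2]), name, sizes, starts)


def exon_intervals(rec: BedRecord) -> List[Tuple[int, int]]:
    """Convert relative blocks to absolute (start, end) exon coordinates."""
    return [(rec.start + s, rec.start + s + size)
            for s, size in zip(rec.block_starts, rec.block_sizes)]


# A simple feature (e.g. a binding site) needs only a few columns.
site = parse_bed_line("chr14\t100\t120\tsite1")

# A made-up three-exon transcript in 12-column BED.
transcript = parse_bed_line(
    "chr14\t1000\t5000\ttx1\t0\t+\t1200\t4800\t0\t3\t300,400,500,\t0,1500,3500,"
)
print(site.end - site.start, exon_intervals(transcript))
```

Because the end coordinate is exclusive, the length of a feature is simply `end - start`.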
Binary Alignment/Map (BAM) format is the compressed binary version of the Sequence Alignment/Map (SAM) format. It is a compact format designed for use with very large files of nucleotide sequence alignments. Because it can be indexed, only the portion of the file that is needed for display is transferred to the browser. Many tools for next generation sequence analysis use BAM format as output or input.
Variant Call Format (VCF) is a flexible format for large files of variation data including single-nucleotide variants, insertions/deletions, copy number variants, and structural variants. Like BAM files, VCF files can be compressed and indexed (typically with bgzip and tabix), so that only the portion of the file that is needed for display is transferred to the browser. Many tools for variant analysis use VCF format as output or input.
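The kinds of variants a VCF file can hold can be illustrated with a small sketch that reads the fixed leading columns of a data line (CHROM, POS, ID, REF, ALT, and so on) and classifies each alternate allele by comparing allele lengths. The function names, the category labels, and the example lines are assumptions made for illustration; real files also carry header lines and optional genotype columns that are ignored here.

```python
def parse_vcf_line(line):
    """Parse the first five fixed columns of an uncompressed VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("VCF data lines have at least 8 tab-separated columns")
    chrom, pos, variant_id, ref, alt = fields[:5]
    return {
        "chrom": chrom,
        "pos": int(pos),         # VCF positions are 1-based
        "id": variant_id,
        "ref": ref,
        "alts": alt.split(","),  # several ALT alleles may be listed
    }


def classify_allele(ref, alt):
    """Label one ALT allele relative to REF (simplified categories)."""
    if alt.startswith("<") and alt.endswith(">"):
        return "symbolic"        # e.g. <DEL>, used for structural variants
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "other"               # e.g. a multi-nucleotide substitution


# Made-up data lines for illustration.
example_lines = [
    "chr14\t1000\t.\tA\tG\t50\tPASS\t.",
    "chr14\t2000\t.\tAT\tA\t.\tPASS\t.",
    "chr14\t3000\t.\tC\tCTT\t.\tPASS\t.",
    "chr14\t4000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=5000",
]
labels = []
for text in example_lines:
    record = parse_vcf_line(text)
    labels.extend(classify_allele(record["ref"], a) for a in record["alts"])
print(labels)
```

Note that VCF positions are 1-based, whereas BED start coordinates are 0-based, which matters when converting between the two formats.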
The UCSC Genome Browser home page lists commonly accessed tools, as well as a frequently updated news section that highlights major data and software updates. To reach the Genome Browser Gateway, the main entry point for text-based searches, click on the Gateway link on the home page (Figure 4.1). The default assembly is the most recent human assembly, GRCh38, from December 2013. The genomes of other species can be selected from the phylogenetic tree on the left side of the Gateway page, or by typing their name in the selection box. On the human Gateway page, there is also the option to select one of four older human genome assemblies. Details about the GRCh38 assembly and instructions for searching are available on the Gateway page.
To perform a search, enter text into the Position/Search Term box. If the query maps to a unique position in the genome, such as a search for a particular chromosome and position, the Go button links directly to the Genome Browser. However, if there is more than one hit for the query, such as a search for the term metalloprotease, the resulting page will contain a list of results that all contain that term. For some species, the terms have been indexed, and typing a gene symbol into the search box will bring up a list of possible matches. In this example, we will search for the human hypoxia inducible factor 1 alpha subunit (HIF1A) gene (Figure 4.1), which produces a single hit on GRCh38.
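The difference between a positional query and a text query can be sketched in a few lines. The snippet below assumes positions are written as `chrom:start-end` with optional thousands separators, and it builds a link to the browser using the `db` and `position` URL parameters. The helper names and the example coordinates are hypothetical, and the regular expression is a simplification rather than the browser's actual search logic.

```python
import re
from urllib.parse import urlencode

# Simplified pattern for "chr14:1,000-2,000" style positions (an assumption,
# not the browser's own parser).
POSITION_RE = re.compile(r"^(\w+):([\d,]+)-([\d,]+)$")


def parse_position(text):
    """Return (chrom, start, end) for a positional query, else None."""
    match = POSITION_RE.match(text.strip())
    if match is None:
        return None  # treat as a free-text query such as a gene symbol
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start < 1 or start > end:
        return None
    return chrom, start, end


def browser_url(db, query):
    """Build a Genome Browser link for an assembly and a search term."""
    return "https://genome.ucsc.edu/cgi-bin/hgTracks?" + urlencode(
        {"db": db, "position": query}
    )


# Illustrative coordinates only; not the actual location of any gene.
parsed = parse_position("chr14:1,000-2,000")
gene_query = parse_position("HIF1A")
gene_url = browser_url("hg38", "HIF1A")
print(parsed, gene_query, gene_url)
```

Here `hg38` is the UCSC database name for the GRCh38 assembly.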
The default Genome Browser view showing the genomic context of the HIF1A gene is shown in Figure 4.2. The navigation controls are presented across the top of the display. The arrows move the window to the left and right along the chromosome. Alternatively, the user can move the display left and right by holding down the mouse button and dragging the window. To zoom in and out, use the buttons at the top of the display. The base button zooms in so far that individual nucleotides are displayed, while the zoom out 100× button will show the entire chromosome if it is pressed a few times. The current genomic position and the length of the window (in nucleotides) are shown above a schematic of chromosome 14, where the current genomic position is highlighted with a red box. A new search term can be entered into the search box.
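The arithmetic behind zooming can be sketched as follows. This is a rough illustration, not the browser's own code: it assumes a 1-based inclusive window that is rescaled around its center and clamped to the chromosome ends. The function name and the chromosome length used in the example are hypothetical.

```python
def zoom_window(start, end, factor, chrom_len):
    """Rescale a 1-based inclusive window around its center.

    factor > 1 zooms out, factor < 1 zooms in. The result is clamped so
    that it stays within 1..chrom_len.
    """
    width = end - start + 1
    center = (start + end) // 2
    new_width = max(1, min(chrom_len, round(width * factor)))
    new_start = center - new_width // 2
    # Shift the window back inside the chromosome if it overhangs an end.
    new_start = max(1, min(new_start, chrom_len - new_width + 1))
    return new_start, new_start + new_width - 1


# Hypothetical chromosome of 1,000,000 bases, viewing a 100,000-base window.
CHROM_LEN = 1_000_000
zoomed_in = zoom_window(400_001, 500_000, 0.1, CHROM_LEN)
zoomed_out = zoom_window(400_001, 500_000, 10, CHROM_LEN)
print(zoomed_in, zoomed_out)
```

Clamping the width to the chromosome length is what makes repeated zooming out eventually settle on a view of the entire chromosome.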
Figure 4.1 The home page of the UCSC Genome Browser, showing a query for the gene HIF1A on the human GRCh38 genome assembly. The organism can be selected by clicking on its name in the phylogenetic tree. For many organisms, more than one genome assembly is available. Typing a term into the Position/Search Term box returns a list of matching gene symbols.
Below the browser window illustrated in Figure 4.2, one would find a list of tracks that are available for display on the assembly. The tracks are separated