Bioinformatics. Group of authors
Figure 4.22 Using BioMart to retrieve the mouse orthologs of the human RefSeqs from the GWAS Catalog.
(a) Enter the input RefSeq accession numbers into BioMart. First, create a list of RefSeq accession numbers from the UCSC Table Browser output in Figure 4.12d. BioMart does not accept the accession.version format, so all of the text after the accession number itself will need to be removed. This step can be implemented using a text editor that can perform a wildcard (regular expression) search and replace. For example, to remove the period and all following text from each line, replace \..* (an escaped period followed by any remaining characters) with an empty string. Although the resulting list of accession numbers will contain duplicates, as some RefSeqs have been mapped to alternate loci, any redundancy will be removed from the final BioMart results. At BioMart, click on Filters in the left sidebar, open the Gene menu, and click on Input external references ID list. In the pull-down menu, select RefSeq mRNA IDs as the type of identifier. Paste in the list of accession numbers, which should be of the form NM_001042682. Although BioMart instructions recommend limiting the number of accession numbers to 500, the interface will process the 3000+ RefSeq accession numbers from the UCSC Table Browser output.
(b) Set the BioMart Attributes (fields to be included in the output). Click on Attributes in the left sidebar, select Features at the top of the page, then open the Gene menu. Gene stable ID and Transcript stable ID should be selected by default, and will return the Ensembl gene (ENSG) and transcript (ENST) identifiers. Also select Gene name to return the gene symbols (e.g. ADAM18).
(c) Set additional Attributes. Close the Gene menu and open the External menu. Navigate to External References and select RefSeq mRNA ID. This step is needed to return the input RefSeq accession numbers so that they can be correlated later with the Ensembl identifiers.
(d) BioMart output, including the identifiers requested above. Click on the Results button at the top of the page to retrieve the output. Check the box Unique results only to ensure that duplicated RefSeqs are returned only once. The order of the columns in the results file depends on the order in which the items were added to the list of Attributes. The net result is that each human RefSeq accession from the Table Browser is correlated with its Ensembl Gene and Transcript ID, as well as a gene symbol.
(e) BioMart output, with human Ensembl Gene ID and gene symbol, as well as the orthologous mouse Ensembl Gene ID and gene symbol. Start a new query by clicking the New box at the top of the BioMart window. Select the same Database, Dataset, and Filters as before. Under Attributes, select the Homologues radio button. The human Ensembl Gene ID and gene symbol are in the Gene → Ensembl menu, called Gene stable ID and Gene name. The mouse Ensembl Gene ID and gene symbol are in the Orthologues → Mouse Orthologues menu, called Mouse gene stable ID and Mouse gene name. This step outputs the orthologous mouse Ensembl Gene ID and symbol for each human Ensembl Gene ID and symbol. The BioMart output from (d) and (e) can be merged to list the mouse ortholog of each human RefSeq from the GWAS Catalog (Figure 4.12d).
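The search-and-replace step described in panel (a) can also be scripted instead of being done in a text editor. The following is a minimal Python sketch, not part of the original workflow: the function name strip_versions is hypothetical, and the version suffixes and the placeholder accession in the sample input are made up for illustration. It removes the period and everything after it from each line. It also drops duplicates while preserving input order, which is optional because BioMart removes redundancy from its results anyway.

```python
import re

# Matches a literal period and everything after it on the line,
# i.e. the ".version" suffix of an accession.version identifier.
VERSION_SUFFIX = re.compile(r"\..*$")


def strip_versions(lines):
    """Return accessions without version suffixes, de-duplicated in input order."""
    seen = set()
    result = []
    for line in lines:
        accession = VERSION_SUFFIX.sub("", line.strip())
        # Skip blank lines and accessions that were already emitted.
        if accession and accession not in seen:
            seen.add(accession)
            result.append(accession)
    return result


if __name__ == "__main__":
    # Hypothetical sample input; version numbers and NM_PLACEHOLDER are illustrative only.
    sample = ["NM_001042682.1", "NM_001042682.2", "NM_PLACEHOLDER.1"]
    print("\n".join(strip_versions(sample)))
```

The printed list can be pasted directly into the BioMart Input external references ID list box.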
Retrieving the mouse orthologs of the NCBI reference sequences must be done as a separate step, as it is not possible to return an external identifier (i.e. the starting RefSeq accession number) and an ortholog in the same BioMart query. Starting with the same Filter and human RefSeq accession numbers as before, choose the Homologues section of the Attributes and select the human Ensembl gene identifier and gene name under Gene → Ensembl, as well as the mouse Ensembl gene identifier and gene name under Orthologues → Mouse Orthologues. The results are shown in Figure 4.22e. Note that not all of the human gene identifiers have been mapped to a corresponding mouse ortholog. The goal of this exercise was to identify the mouse orthologs of the human RefSeq accession numbers from the GWAS Catalog. Using the human Ensembl gene identifiers as a key, the human RefSeq accession numbers can be added to the list of mouse orthologs. This can be carried out by using the VLOOKUP function in Microsoft Excel, or by writing a script in your favorite programming language, and is left as an exercise for the reader.
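As one possible approach to the merge described above, the two BioMart outputs can be joined on the human Ensembl gene identifier with a short Python script that uses only the standard library. This is a sketch under stated assumptions rather than the authors' solution: the column headers are assumed to match the attribute names given in Figure 4.22 and may need to be adjusted to the actual export files, and the function names and the placeholder identifiers in the sample data are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Assumed column headers; adjust these to match the actual BioMart export.
GENE_ID = "Gene stable ID"
REFSEQ = "RefSeq mRNA ID"
MOUSE_ID = "Mouse gene stable ID"
MOUSE_NAME = "Mouse gene name"


def read_tsv(text):
    """Parse tab-separated text with a header line into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def merge_orthologs(refseq_rows, ortholog_rows):
    """Join output (d) to output (e) on the human Ensembl gene ID.

    Returns sorted, unique tuples of
    (RefSeq, human gene ID, mouse gene ID, mouse gene name).
    Genes without a mouse ortholog keep empty mouse fields.
    """
    mouse_by_gene = defaultdict(set)
    for row in ortholog_rows:
        if row.get(MOUSE_ID):
            mouse_by_gene[row[GENE_ID]].add((row[MOUSE_ID], row.get(MOUSE_NAME) or ""))

    merged = set()
    for row in refseq_rows:
        # A human gene may have zero, one, or several mouse orthologs.
        for mouse_id, mouse_name in mouse_by_gene.get(row[GENE_ID]) or {("", "")}:
            merged.add((row[REFSEQ], row[GENE_ID], mouse_id, mouse_name))
    return sorted(merged)


if __name__ == "__main__":
    # Placeholder identifiers for illustration only; not real BioMart results.
    output_d = read_tsv("Gene stable ID\tRefSeq mRNA ID\nENSG_X\tNM_X\n")
    output_e = read_tsv(
        "Gene stable ID\tGene name\tMouse gene stable ID\tMouse gene name\n"
        "ENSG_X\tGENEX\tENSMUSG_X\tGenex\n"
    )
    for record in merge_orthologs(output_d, output_e):
        print("\t".join(record))
```

In practice, the text passed to read_tsv would be read from the two tab-separated files exported from BioMart. Keeping rows with empty mouse fields makes it easy to see which human RefSeqs have no mapped mouse ortholog.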
JBrowse
While the UCSC and Ensembl Genome Browsers provide user-friendly interfaces for viewing genomic data from well-characterized organisms, there are fewer applications for displaying genome assemblies and annotations for newly sequenced organisms or non-standard assemblies. The source code and executables for the UCSC Genome Browser are freely available for academic, non-profit, and personal use, and can be set up to display custom data, not just those provided by UCSC. Thus, one option is for researchers to host their own UCSC Genome Browser and use it to share custom genomes with the bioinformatics community. An alternate method for sharing novel genome assemblies is to set up an Assembly Hub. Researchers host the specially formatted genomic sequence and data tracks on their own web site, and anyone with the URL can view the assembly through the UCSC Genome Browser.
Another way to share novel genome assemblies is to use JBrowse (Buels et al. 2016), a web-based genome browser that is part of the Generic Model Organism Database (GMOD) project, a suite of tools for generating genomic databases. JBrowse can handle data in a variety of formats, and is relatively easy to install on a Linux- or Mac OS X-based web server (Skinner and Holmes 2010). JBrowse browsers support plant genomes (e.g. Phytozome), animal genomes (e.g. the Rat Genome Database), and disease-related databases of human data (e.g. the COSMIC Genome Browser).
An example of JBrowse being used to view a customized genome assembly and associated annotations can be found at the Mnemiopsis Genome Project (MGP) Portal at the National Human Genome Research Institute (NHGRI) of the US National Institutes of Health (NIH). Mnemiopsis leidyi is a ctenophore, or comb jelly; the ctenophores are a phylum of gelatinous zooplankton found in all the world's seas. Members of this phylum are called comb jellies because of their highly ciliated comb rows, which provide their primary means of locomotion. These early-branching metazoans have proven to be important model organisms for understanding the diversity and complexity seen in the early evolution of animals. The Mnemiopsis data featured in this portal are the first set of whole genome sequencing data on any ctenophore species to be published and made available to the scientific community (Moreland et al. 2014). The portal provides not only genomic and protein model sequence data, but also a BLAST search interface, pathway and protein domain analysis, and a customized genome browser, implemented in JBrowse, to display the annotation data.
The Mnemiopsis genome was assembled into 5100 scaffolds using next generation sequence data from the Roche 454 and Illumina GA-II methods of sequencing (Ryan et al. 2013). The Mnemiopsis protein-coding gene models were predicted by integrating the results of ab initio gene prediction programs with RNA-seq transcript data and sequence similarity to other protein datasets. A view of one of those scaffolds is shown in Figure 4.23. As with the UCSC and Ensembl Genome Browsers, data are organized in horizontal tracks, and exons are shown as colored boxes. The first track, SCF, is the scaffold. The gene model track,