Bioinformatics. Group of authors


Tyra G. Wolfsberg

      The first complete sequence of a eukaryotic genome – that of Saccharomyces cerevisiae – was published in 1996 (Goffeau et al. 1996). The chromosomes of this organism, which range in size from 270 to 1500 kb, presented an immediate challenge in data management, as the upper limit for single database entries in GenBank at the time was 350 kb. To better manage the yeast genome sequence, as well as other chromosome and genome-length sequences being deposited into GenBank around that time, the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH) established the Genomes division of Entrez (Benson et al. 1997). Entries in this division were organized around a reference sequence onto which all other sequences from that organism were aligned. As these reference sequences have no size limit, “virtual” reference sequences of large genomes or chromosomes could be assembled from shorter GenBank sequences. For partially sequenced chromosomes, NCBI developed methods to integrate genetic, physical, and cytogenetic maps onto the framework of the whole chromosome. Thus, Entrez Genomes was able to provide the first graphical views of large-scale genomic sequence data.

      The working draft of the human genome, completed in February 2001 (Lander et al. 2001), generated virtual reference sequences for each human chromosome, ranging in size from 46 to 246 Mb. NCBI created the first version of its human Map Viewer (Wheeler et al. 2001) shortly thereafter, in order to display these longer sequences. Around the same time, the University of California, Santa Cruz (UCSC) Genome Bioinformatics Group was developing its own human genome browser, based on software originally designed for displaying the much smaller Caenorhabditis elegans genome (Kent and Zahler 2000). Similarly, the Ensembl project at the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) was also producing a system to automatically annotate the human genome sequence, as well as store and visualize the data (Hubbard et al. 2002). The three genome browsers all came online at about the same time, and researchers began using them to help navigate the human genome (Wolfsberg et al. 2002). Today, each site provides free access not only to human sequence data but also to a myriad of other assembled genomic sequences, from commonly used model organisms such as mouse to more recently released assemblies such as those of the domesticated turkey. Although the NCBI's Map Viewer is not being further developed and will be replaced by its new Genome Data Viewer (Sayers et al. 2019), the UCSC and Ensembl Genome Browsers continue to be popular resources, used by most members of the bioinformatics and genomics communities. This chapter will focus on the last two genome browsers.

      While the Genome Reference Consortium (GRC) also assembles the mouse, zebrafish, and chicken genomes, other genomes are sequenced and assembled by specialized sequencing consortia. The panda genome sequence, published in 2009, was the first mammalian genome to abandon the clone-based sequencing strategies used for human and mouse, relying entirely on next-generation sequencing methodologies (Li et al. 2010). Subsequent advances in sequencing technologies have led to rapid increases in the number of complete genome sequences. At the time of this writing, both the UCSC Genome Browser and the main Ensembl web site host genome assemblies of over 100 organisms. The look and feel of each genome browser is the same regardless of the species displayed; however, the types of annotation differ depending on what data are available for each organism.

      The backbone of each browser is an assembled genomic sequence. Although the underlying genomic sequence is, with a few exceptions, the same in both genome browsers, each team calculates its annotations independently. Depending on the type of analysis, a user may find that one genome browser has more relevant information than the other. The location of genes, both known and predicted, is a central focus of both genome browsers. For human, at present, both browsers feature the GENCODE gene predictions, an effort that is aimed at providing robust evidence-based reference gene sets (Harrow et al. 2012). Other types of genomic data are also mapped to the genome assembly, including NCBI reference sequences, single-nucleotide polymorphisms (SNPs) and other variants, gene regulatory regions, and gene expression data, as well as homologous sequences from other organisms. Both genome browsers can be accessed through a web interface that allows users to navigate through a graphical view of the genome. However, for those wishing to carry out their own calculations, sequences and annotations can also be retrieved in text format. Each browser also provides a sequence search tool – BLAT (Kent 2002) or BLAST (Camacho et al. 2009) – for interrogating the data via a nucleotide or protein sequence query. (Additional information on both BLAT and BLAST is provided in Chapter 3.)
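      As a rough illustration of text-format retrieval, the sketch below builds request URLs for the public REST interfaces that the two projects offer alongside their graphical browsers. It is a minimal sketch and not a description of any workflow from this chapter: the function names are hypothetical, the example region is arbitrary, and the endpoint paths reflect the publicly documented Ensembl REST and UCSC JSON APIs as understood at the time of writing, so they should be checked against the current documentation. The sketch also shows one practical difference between the two sites: Ensembl uses 1-based inclusive coordinates and bare chromosome names such as "7", whereas the UCSC API uses 0-based half-open coordinates and names such as "chr7".

```python
from urllib.request import urlopen

# Base URLs of the public APIs (assumed current; verify before relying on them).
ENSEMBL_REST = "https://rest.ensembl.org"
UCSC_API = "https://api.genome.ucsc.edu"


def ensembl_sequence_url(species, chrom, start, end, strand=1):
    """Build an Ensembl REST URL for a region given in 1-based inclusive coordinates."""
    region = f"{chrom}:{start}..{end}:{strand}"
    return f"{ENSEMBL_REST}/sequence/region/{species}/{region}?content-type=text/plain"


def ucsc_sequence_url(genome, chrom, start, end):
    """Build a UCSC API URL for the same 1-based inclusive region.

    The UCSC API expects a 0-based start and an exclusive end, so only the
    start coordinate is shifted down by one.
    """
    return (
        f"{UCSC_API}/getData/sequence"
        f"?genome={genome};chrom={chrom};start={start - 1};end={end}"
    )


def fetch_text(url, timeout=30):
    """Download a URL and return the response body as text (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Arbitrary example region: bases 1000-1100 of human chromosome 7.
    print(ensembl_sequence_url("human", "7", 1000, 1100))
    print(ucsc_sequence_url("hg38", "chr7", 1000, 1100))
```

      Passing either URL to `fetch_text` should return plain sequence text from Ensembl or a JSON document from UCSC, assuming the services are reachable and the endpoints have not changed.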

      In order to provide stability and ensure that old analyses can be reproduced, both genome browsers make available not only the current version of the genome assemblies but older ones as well. In addition, annotation tracks, such as the GENCODE gene track and the SNP track, may be based on different versions of the underlying data. Thus, users are encouraged to verify the version of all data (both genome assembly and annotations) when comparing a region of interest between the UCSC and Ensembl Genome Browsers.
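      The version check recommended above can be sketched in a few lines. This is an illustrative sketch only: the `RegionView` record, the helper names, and the gene-set labels are hypothetical, and the name table covers just a handful of well-known equivalences between UCSC assembly names and GRC names (for example, hg38 corresponds to GRCh38). Patch suffixes such as ".p13" are ignored here on the assumption that patch releases do not change coordinates on the primary assembly.

```python
from dataclasses import dataclass

# A few well-known equivalences between UCSC and GRC assembly names.
UCSC_TO_GRC = {
    "hg19": "GRCh37",
    "hg38": "GRCh38",
    "mm10": "GRCm38",
    "mm39": "GRCm39",
}


@dataclass(frozen=True)
class RegionView:
    """Hypothetical record of what a browser displayed for a region of interest."""
    browser: str    # e.g. "UCSC" or "Ensembl"
    assembly: str   # assembly name as reported by that browser
    gene_set: str   # version label of the gene annotation track


def normalize_assembly(name):
    """Map a UCSC name to its GRC equivalent and drop any patch suffix."""
    grc_name = UCSC_TO_GRC.get(name, name)
    return grc_name.split(".")[0]


def version_mismatches(a, b):
    """Return human-readable notes on every version difference between two views."""
    problems = []
    if normalize_assembly(a.assembly) != normalize_assembly(b.assembly):
        problems.append(
            f"assembly differs: {a.browser} uses {a.assembly}, "
            f"{b.browser} uses {b.assembly}"
        )
    if a.gene_set != b.gene_set:
        problems.append(
            f"gene set differs: {a.browser} uses {a.gene_set}, "
            f"{b.browser} uses {b.gene_set}"
        )
    return problems


if __name__ == "__main__":
    # Illustrative values, not taken from either browser.
    ucsc_view = RegionView("UCSC", "hg38", "GENCODE v32")
    ensembl_view = RegionView("Ensembl", "GRCh38.p13", "GENCODE v33")
    for note in version_mismatches(ucsc_view, ensembl_view):
        print(note)
```

      An empty list from `version_mismatches` suggests that coordinates and gene models can be compared directly; any entry in the list is a reason to look more closely before drawing conclusions from differences between the two displays.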

      After starting in 2000 with just a display of an early draft of the human genome assembly, the UCSC Genome Browser now provides access to assemblies and annotations from over 100 organisms (Haeussler et al. 2019). The majority of assemblies are of mammalian genomes, but other vertebrates, insects, nematodes, deuterostomes, and the Ebola virus are also included. The assemblies from some organisms, including human and mouse, are available in multiple versions. New organisms and assembly versions are added regularly.

      The
