Molecular Biotechnology. Bernard R. Glick
Genome Sequence Assembly
A genome sequence can be assembled by aligning the sequences of DNA reads with sequences from a previously determined and highly related (reference) genome. For example, reads from resequenced human genomes, that is, genomes from different individuals, are mapped to a reference human genome. Alternatively, when a reference sequence is not available, the reads can be assembled de novo using a computer program that aligns the matching ends of different reads. The process of generating successive overlapping sequences produces long, contiguous stretches of nucleotides called contigs. The presence of repetitive sequences in a genome can result in erroneous matching of overlapping sequences. This problem can be overcome by using the sequences from both ends of a DNA fragment (paired-end reads), which are a known distance apart (when genomic DNA fragments are size selected prior to sequencing), to order and orient the reads and to assemble the contigs into larger scaffolds (Fig. 2.42). Many overlapping reads are required to ensure that the nucleotide sequence is accurate and assembled correctly. Each nucleotide site in a genome is generally sequenced many times from different fragments. The extent of sequencing redundancy, called coverage or depth of coverage, varies from 10-fold to more than 100-fold, depending on the error rate of the sequencing method, the read length (shorter reads require greater coverage), the complexity of the genome, the assembly method, and the goal of the sequencing project. The assembly process generates a draft sequence; however, small gaps may remain between contigs. Although a draft sequence is sufficient for many purposes, for example, in resequencing projects that map a sequence onto a reference genome, in some cases it is preferable to close the gaps to complete the genome sequence. For de novo sequencing of genomes from organisms that lack a reference genome, gap closure is desirable.
The gaps can be closed by PCR amplification of high-molecular-weight genomic DNA across each gap, followed by sequencing of the amplification product, or by obtaining short sequences from primers designed to anneal to sequences adjacent to a gap. Sequencing of additional libraries containing fragments of different sizes may be required to complete the overall sequence.
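The depth of coverage described above can be estimated with simple arithmetic: the total number of sequenced nucleotides divided by the genome size. The following Python sketch illustrates the calculation. The function names and the example values (a 5-million-bp genome and 150-nucleotide reads) are assumptions chosen for illustration, and the estimate is an average that ignores uneven sampling and reads that fail to map.

```python
import math


def depth_of_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each nucleotide site is sequenced.

    Illustrative formula: (number of reads x read length) / genome size.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical example: 1,000,000 reads of 150 nucleotides, 5,000,000-bp genome.
print(depth_of_coverage(1_000_000, 150, 5_000_000))  # 30.0
print(reads_needed(30, 150, 5_000_000))              # 1000000
```

Because coverage is proportional to read length, halving the read length doubles the number of reads needed for the same average depth. Shorter reads may additionally need a higher target depth, as noted above.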
Figure 2.42 Genome sequence assembly. Sequence data generated from both ends of a DNA fragment are known as paired ends (paired ends are shown in blue for each fragment, and the distance between them is represented by a thin, black line). A large number of reads are generated and assembled into longer contiguous sequences (contigs) using a computer program that matches overlapping sequences. Paired ends help to determine the order and orientation of contigs as they are assembled into scaffolds. Shown is a scaffold consisting of three contigs.
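The idea of matching overlapping read ends to build contigs can be illustrated with a minimal greedy sketch in Python. This is not the algorithm used by any particular assembly program. It assumes error-free reads from a single strand, ignores paired-end information and repeats, and uses hypothetical function names and a hypothetical minimum overlap of three nucleotides.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that matches a prefix of b.

    Returns 0 if the longest match is shorter than min_len.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap.

    Stops when no pair overlaps by at least min_len nucleotides and
    returns the remaining sequences as contigs.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs
        # Join the two sequences, writing the shared region only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three hypothetical reads whose ends overlap by four nucleotides.
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))  # ['ATGGCGTACCTGA']
```

A greedy approach of this kind is easily misled by repetitive sequences, because a repeat produces overlaps between reads that are not adjacent in the genome. This is the problem that paired ends help to resolve.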
Sequencing Metagenomes
For more than 100 years, the identification of microorganisms and characterization of their biological functions has required cultivating each strain in the laboratory. In the 1990s, with the emergence of techniques to extract DNA directly from environmental samples such as soil and seawater, researchers began to examine the sequence diversity of bacteria using the universal 16S ribosomal RNA gene as a taxonomic marker. These studies revealed that less than 1% of all bacterial species could be cultured, and therefore, novel genes that might be of considerable interest for basic and applied research were inaccessible using methods that depended on growth of bacteria in the laboratory. Considering the wealth of biotechnologically important genes and proteins that had been obtained from the relatively few culturable microorganisms, the possibility of harvesting useful genes from the much greater number of unculturable microorganisms was exciting, if not daunting. With the development of high-throughput next-generation sequencing and algorithms for assembling genome sequences, it has become possible to access the genomes of uncultured organisms from complex environmental and clinical samples. The study of the collective genomes in these samples is known as metagenomics.
The primary objective of a metagenomic project is to construct a comprehensive DNA library from all the microorganisms of a particular ecosystem or location (Fig. 2.43). The entire library is sequenced using a massively parallel approach and assembled into contigs as described above with the aim of determining the sequence of as many different genomes as possible and identifying both novel gene sequences and those that are similar (homologous) to known gene sequences. For example, a massive study that included 50 ocean samples from locations in the North Atlantic through the Panama Canal to the South Pacific yielded 6.3 billion bp of sequence. Analysis of the assembled and nonassembled sequences indicated that there might be as many as 400 new bacterial species among the samples with about 1 × 10⁶ genes that lack significant sequence similarity with any known gene. The analysis also revealed sequences encoding potentially novel forms of many proteins including proteins for repair of ultraviolet light-induced DNA damage and RuBisCO (ribulose bisphosphate carboxylase), an enzyme that is important for carbon fixation.
Figure 2.43 Construction of metagenomic libraries. Bacteria and/or viruses in samples from various environments or tissues are concentrated before extracting and then fragmenting the DNA. Libraries containing the DNA fragments are sequenced or screened for novel genes.
Genomics
Genome sequence determination is only a first step in understanding an organism. The next steps require identification of the features encoded in a sequence and investigations of the biological functions of the encoded RNA, proteins, and regulatory elements that determine the physiology and ecology of the organism. The area of research that generates, analyzes, and manages the massive amounts of information about genome sequences is known as genomics.
Sequence data are deposited and stored in databases that can be searched using computer algorithms to retrieve sequence information (data mining or bioinformatics). Public databases such as GenBank (National Center for Biotechnology Information, Bethesda, MD), the European Molecular Biology Laboratory Nucleotide Sequence Database, and the DNA Data Bank of Japan receive sequence data from individual researchers and from large sequencing facilities and share the data as part of the International Nucleotide Sequence Database Collaboration. Sequences can be retrieved from these databases via the Internet. Many specialized databases also exist, for example, for storing genome sequences from individual organisms, protein coding sequences, regulatory sequences, sequences associated with human genetic diseases, gene expression data, protein structures, protein-protein interactions, and many other types of data.
One of the first analyses to be conducted on a new genome sequence is the identification of descriptive features, a process known as annotation. Some annotations are protein coding sequences (open reading frames), sequences that encode functional RNA molecules (e.g., rRNA and tRNA), regulatory elements, and repetitive sequences. Annotation relies on algorithms that identify features based on conserved sequence elements such as translation start and stop codons, intron-exon boundaries, promoters, transcription factor-binding sites, and known genes (Fig. 2.44). It is important to note that annotations are often predictions of sequence function based on homology to sequences of known functions. In many cases, the function of the sequence remains to be verified through experimentation.
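A simple form of the feature-based prediction described above is a scan for open reading frames: a translation start codon followed, in the same reading frame, by a stop codon. The Python sketch below is a deliberately simplified illustration and not a real annotation tool. It assumes a prokaryote-like sequence without introns, scans only the forward strand, treats ATG, GTG, and TTG as possible starts, and uses the standard stop codons TAA, TAG, and TGA. The function name and the minimum length are hypothetical.

```python
START_CODONS = {"ATG", "GTG", "TTG"}  # ATG is the usual start; GTG and TTG are rarer
STOP_CODONS = {"TAA", "TAG", "TGA"}   # stop codons of the standard genetic code


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int, str]]:
    """Find candidate open reading frames on the forward strand.

    Returns (start, end, sequence) tuples with 0-based start and
    exclusive end. The stop codon is included in each reported ORF.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon in START_CODONS:
                start = pos  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, seq[start:pos + 3]))
                start = None  # resume the search after this stop codon
    return orfs


# Hypothetical sequence containing one short ORF in the third reading frame.
print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14, 'ATGAAATTTTAG')]
```

A scan of this kind yields only predictions. Short ORFs arise by chance, and accepting GTG and TTG as starts increases the number of false candidates, which is one reason predicted coding sequences are compared with known genes and ultimately verified experimentally.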
Figure 2.44 Genome annotation utilizes conserved sequence features. Predicting protein coding sequences (open reading frames) in prokaryotes (A) and eukaryotes (B) requires identification of sequences that correspond to potential translation start (ATG or, more rarely, GTG or TTG)