Algorithms in Bioinformatics. Paul A. Gagniuc
Biological literature is probably the most sophisticated among all sciences and can be particularly overwhelming. An introduction was made to some important concepts that provide an overview of living organisms, such as the emergence of life, classification, the number of species, the origins of eukaryotic cells, the endosymbiosis theory, organelles, reductive evolution, the importance of horizontal gene transfer (HGT), and the main hypotheses regarding the origin of eukaryotic multicellularity. Among the biological concepts described here, some have wider implications. Examples of genome-less organelles, such as hydrogenosomes, or processes such as HGT, question life as we understand it. Endosymbionts best explain the significance of the environment and also explain the distribution of life in a blurry, nonunitary context. In other words, endosymbiosis widens the threshold of life and shows how difficult it is to draw a border between how much life resides inside or outside the cell. Moreover, HGT appears to connect all the species on Earth to a greater or lesser extent. Much evidence shows that some of these ancient processes (e.g. catalytic RNAs) are likely adding or subtracting innovative mechanisms for continuous adaptations among different species (if not all).
2 Tree of Life: Genomes (II)
2.1 Introduction
An insight into the context of biological information is of utmost importance for different approaches in bioinformatics. The first part of the chapter discusses the units of measurement and explains the meaning of some notations used here. A few interesting unit conversions, with accompanying algorithms, are also shown. Next, the eukaryotic and prokaryotic organisms with the largest and smallest genomes are presented in detail. Moreover, different computations performed for this chapter show the average genome size across the major kingdoms of life, including the average genome size of different organelles, plasmids, and viruses. Toward the end of the chapter, a comparative analysis is made between the average number of genes and the average number of proteins across the main kingdoms of life. This analysis highlights the frequency of a process called alternative splicing, which allows certain eukaryotic genes to encode several types of proteins.
2.2 Rules of Engagement
Genome size refers to the amount of DNA contained in a haploid genome (a single set of chromosomes). The genome size is expressed in base pairs (bp) and the related multiples: kilo base pairs (1 kbp = 1000 bp), mega base pairs (1 Mbp = 1 000 000 bp), giga base pairs (1 Gbp = 1000 Mbp), and so on. By their nature, base pairs are discrete units. Nonetheless, these units of measurement are also used to express averages. For single-stranded DNA (ssDNA)/RNA sequences, the unit of measurement is the nucleotide (nt), and the same quantity can be written as: 1 000 000 nt, 1000 knt, 1 Mnt, 0.001 Gnt, and so on. However, more often than not, base pairs are written as simple bases when the context is understood (e.g. 1 000 000 b, 1000 kb, 1 Mb, 0.001 Gb, and so on). For instance, the notations “b,” “kb,” “Mb,” and “Gb” are used when referring to DNA/RNA sequences in text format. FASTA files contain nucleic acid sequences in the 5′–3′ direction. Technically, all nucleic acids represented as FASTA are single-stranded; however, through complementarity, the reference can be considered as double-stranded. In this chapter, the GC% content is mentioned as an intuitive parameter for the overall composition of the genomes of different species. Note that the (C+G)% or GC% content represents the percentage of guanine and cytosine along a DNA or RNA sequence (e.g. a DNA/RNA fragment, a gene, an entire genome).
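The notations above can be illustrated with a minimal Python sketch. It is not taken from the book: the names `UNIT_FACTORS`, `convert_size`, and `gc_percent` are hypothetical, and the GC% function assumes that only the symbols A, C, G, T, and U count toward the sequence length (ambiguity codes such as N are ignored).

```python
# Multiples of the base pair, as defined in the text (hypothetical lookup table).
UNIT_FACTORS = {"bp": 1, "kbp": 1_000, "Mbp": 1_000_000, "Gbp": 1_000_000_000}


def convert_size(value, from_unit, to_unit):
    """Convert a genome size between bp, kbp, Mbp, and Gbp."""
    return value * UNIT_FACTORS[from_unit] / UNIT_FACTORS[to_unit]


def gc_percent(sequence):
    """Return the percentage of G and C among the A, C, G, T, U symbols."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T") + seq.count("U")
    if total == 0:
        raise ValueError("sequence contains no A, C, G, T, or U symbols")
    return 100.0 * gc / total


# Example: a size given in Mbp expressed in Gbp, and the GC% of a short fragment.
print(convert_size(32_396, "Mbp", "Gbp"))
print(gc_percent("GGCCAATT"))
```

The same conversion table would serve for nucleotides (nt, knt, Mnt, Gnt), since only the prefix changes.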
2.3 Genome Sizes in the Tree of Life
There is no direct correlation between the genome size of a species and the complexity of its phenotype. In any case, the intellectual curiosity regarding the size of genomes still remains. Determination of genome size based on DNA sequencing data is one of the most accurate methods to date. To observe the lack of correlation between genome size and phenotype, upper-bound extremes can be considered here. As intuitively expected, eukaryotes show the largest genomes. In animals, the amphibian Ambystoma mexicanum (the Mexican axolotl) shows the largest (sequenced) genome observed in nature to date. A. mexicanum shows a genome size of 32 396 Mbp (32 Gb), whereas its body length reaches only up to 30 cm [166]. In plants, the record is held by Pinus lambertiana (27 603 Mbp), followed by Sequoia sempervirens (26 537 Mbp). P. lambertiana is the tallest and most massive pine tree [167, 168]. The species S. sempervirens includes the tallest living trees on Earth (115.5 m in height or 379 ft) [169]. Among the prokaryotes, Minicystis rosea and Sorangium cellulosum So0157-2 show the largest genomes. The bacterial genome of M. rosea contains 16 Mbp of DNA (GC%: 69.1) and shows the maximum genome size found in prokaryotes [170]. Second to this species is the bacterial genome of S. cellulosum So0157-2, with 14.78 Mbp of DNA (GC%: 72.1) [171]. As discussed in the previous chapter, endosymbiosis challenges the notion of the smallest genome necessary for life. The smallest prokaryotic genomes were found in different obligate symbionts. One such case is Nasuia deltocephalinicola with a genome of 112 kbp (0.11 Mbp) [172, 173]. The eukaryotes with the smallest nuclear genome necessary for life are found in the kingdom of fungi. The spore-forming unicellular parasite Encephalitozoon intestinalis shows a genome size of ∼2.3 Mbp and a total of 1.8k protein-coding genes [174]. Nonetheless, the smallest free-living eukaryote is Ostreococcus tauri, a marine green alga with a diameter of about 0.8 μm and a genome size of 12.6 Mbp (8.2k protein-coding genes) [175].
2.3.1 Alternative Methods
The data mentioned above were determined by the DNA sequencing efforts made so far. DNA sequencing has been an ongoing process for several decades, and the species chosen for sequencing are usually of economic or research importance (or even of historical significance). There are many species that have not yet been sequenced, either due to their minor importance to humans or due to large genomes that cannot be easily managed. In such cases, the size of the genetic material can be estimated by methods other than sequencing. One of these methods is flow cytometry, which estimates the weight of the genetic material [176]. This weight, expressed in picograms (pg), can then be converted to base pairs. One picogram is equal to 978 mega base pairs (1 pg = 978 Mbp) [177]. For instance, Paris japonica (a flowering plant) shows a genome weight of 152.23 pg, which suggests a genome size of approximately 148 880 Mbp (152.23 pg × 978 Mbp/pg ≈ 149 Gbp) [178].
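The picogram conversion can be written as a short Python sketch. It assumes only the factor quoted in the text (1 pg = 978 Mbp); the names `MBP_PER_PG`, `pg_to_mbp`, and `mbp_to_pg` are hypothetical and chosen for illustration.

```python
# Conversion factor quoted in the text: 1 pg of DNA corresponds to 978 Mbp.
MBP_PER_PG = 978


def pg_to_mbp(picograms):
    """Convert a DNA weight in picograms to a genome size in Mbp."""
    return picograms * MBP_PER_PG


def mbp_to_pg(mbp):
    """Convert a genome size in Mbp back to a DNA weight in picograms."""
    return mbp / MBP_PER_PG


# Example with the Paris japonica weight given in the text (152.23 pg).
size_mbp = pg_to_mbp(152.23)
print(f"{size_mbp:.2f} Mbp")
print(f"{size_mbp / 1000:.1f} Gbp")
```

The inverse function is included because sequenced genome sizes are sometimes compared against published weights in picograms.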
2.3.2 The Weaving of Scales
To get a sense of genome size closer to our reference system, some transformations can express the mega base pairs as physical lengths. The linear length of a double-stranded DNA (dsDNA) molecule can be calculated by multiplying the average distance between bases (∼3.4 angstrom = 0.34 nm [179, 180]; 1 angstrom = 0.1 nm) by the total number of base pairs in a genome. Here, genomes are expressed in mega base pairs. Since 1 Mbp is equal to one million base pairs, the size of a genome can be multiplied by one million and then multiplied further by the average distance between bases (0.34 nm). One meter is equal to 1 000 000 000 nanometers (1 × 10⁹). Thus, the result expressed in nanometers is divided by 1 × 10⁹ for conversion to meters.
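The steps described above can be sketched in Python as follows. This is an illustration under the stated assumptions (0.34 nm per base pair, genome size given in Mbp); the function name `genome_length_m` and the constant names are hypothetical.

```python
NM_PER_BP = 0.34          # average distance between bases, in nanometers
BP_PER_MBP = 1_000_000    # base pairs in one mega base pair
NM_PER_M = 1_000_000_000  # nanometers in one meter


def genome_length_m(size_mbp):
    """Return the linear length, in meters, of a dsDNA genome given in Mbp."""
    length_nm = size_mbp * BP_PER_MBP * NM_PER_BP  # total length in nanometers
    return length_nm / NM_PER_M                    # nanometers to meters


# Example with the A. mexicanum genome size mentioned earlier (32 396 Mbp).
print(f"{genome_length_m(32_396):.2f} m")
```

For the axolotl genome this gives roughly 11 m of DNA for a single set of chromosomes, which shows how far the molecular scale can exceed the size of the organism itself.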
Depending on the organism, cells of different tissues can be characterized based on the number of sets of chromosomes present: monoploid (one set of chromosomes), diploid (two sets), triploid (three sets), tetraploid (four sets), pentaploid (five sets), and so on. For instance, the human genome contains 3.1 Gbp (3100 Mbp). Thus, in a human haploid (or monoploid) cell (e.g. a gamete, which holds a single set of chromosomes), the unfolded length of a single set of chromosomes, arranged linearly one after the other, would show an approximate length of:
(3100 × 1 000 000 × 0.34 nm) / (1 × 10⁹) = 1.054 m
Thus, a single set of human chromosomes (n = 23 Chr) can theoretically unfold up to 1 m. However, the human body is constituted