Algorithms in Bioinformatics. Paul A. Gagniuc
Therefore, the two sets of human chromosomes (2n = 46) found inside a somatic cell can theoretically unfold to a total length of about 2.1 m. The linear length of the dsDNA molecules from all chromosomes of a somatic cell, together with the estimated average number of somatic cells in the human body, can be used for various thought experiments (e.g. comparisons between DNA lengths and cosmic distances). These calculations can also be extended to ssDNA molecules placed linearly one after the other. For instance, the 2.1 m of dsDNA from a somatic cell doubles if the two strands are counted separately (2.1 m × 2 DNA strands = 4.2 m of ssDNA). The implementation found in Additional algorithm 2.1 uses the above formula to convert the number of bases of a genome to a physical length expressed in meters. Important: For convenience, from this point on all notations “b”, “kb”, “Mb”, “Gb” will refer to dsDNA (double-stranded DNA).
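As a worked check of these figures, the arithmetic can be written out step by step. This assumes the 0.34 nm per base pair constant and the 3100 Mb haploid genome size that appear in Additional algorithm 2.1. The symbol names are chosen here for illustration only.

```latex
\begin{align*}
L_{\text{haploid}} &= 3100 \times 10^{6}\ \text{bp} \times 0.34 \times 10^{-9}\ \text{m/bp} = 1.054\ \text{m} \\
L_{\text{somatic}} &= 2 \times L_{\text{haploid}} = 2.108\ \text{m} \approx 2.1\ \text{m} \\
L_{\text{ssDNA}}   &= 2 \times L_{\text{somatic}} = 4.216\ \text{m} \approx 4.2\ \text{m}
\end{align*}
```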
Additional algorithm 2.1 Note that the source code is in context and works with copy/paste.
<script>
document.write('Homo sapiens (3100 Mb): <br>');
document.write('DNA in a haploid cell nucleus: ');
document.write(f(3100) + ' meters<br>');
document.write('DNA in a somatic cell nucleus: ');
document.write((2 * f(3100)) + ' meters<br>');

// Converts a genome size in Mb to meters (0.34 nm per base pair)
function f(Mb){return (0.34 * 1000000 * Mb)/1000000000;}
</script>

Output:
Homo sapiens (3100 Mb):
DNA in a haploid cell nucleus: 1.054 meters
DNA in a somatic cell nucleus: 2.108 meters
Above, the example is given on Homo sapiens and the result shows the calculated total length of unfolded chromosomes for both haploid cells and diploid (somatic) cells. This computation can be applied to all genomes mentioned so far by calling function f repeatedly. Thus, Additional algorithm 2.1 is extended to perform this calculation for an arbitrary number of species (Additional algorithm 2.2).
Additional algorithm 2.2 Note that the source code is in context and works with copy/paste.
<script>
// DNA to meters
// Records: "species|size", each terminated by the "Mb" delimiter
var a = 'Ambystoma mexicanum|32396Mb' +
        'Pinus lambertiana|27603Mb' +
        'Sequoia sempervirens|26537Mb' +
        'Minicystis rosea|16Mb' +
        'Sorangium cellulosum So0157-2|14.78Mb' +
        'Escherichia coli|4.9Mb' +
        'Encephalitozoon intestinalis|2.3Mb' +
        'Ostreococcus tauri|12.6Mb' +
        'Homo sapiens|3100Mb';

var t = a.split('Mb');

// The last element of t is an empty string, hence t.length-1
for (var u=0; u<t.length-1; u++) {
    var r = t[u].split('|');
    document.write(r[0] + ' (' + r[1] + ' Mb) = ');
    document.write(f(r[1]) + ' meters<br>');
}

// Converts a genome size in Mb to meters (0.34 nm per base pair)
function f(Mb){return (0.34 * 1000000 * Mb)/1000000000;}
</script>

Output:
Ambystoma mexicanum (32396 Mb) = 11.01464 meters
Pinus lambertiana (27603 Mb) = 9.38502 meters
Sequoia sempervirens (26537 Mb) = 9.02258 meters
Minicystis rosea (16 Mb) = 0.00544 meters
Sorangium cellulosum So0157-2 (14.78 Mb) = 0.0050252 meters
Escherichia coli (4.9 Mb) = 0.0016660000000000002 meters
Encephalitozoon intestinalis (2.3 Mb) = 0.0007819999999999999 meters
Ostreococcus tauri (12.6 Mb) = 0.004284 meters
Homo sapiens (3100 Mb) = 1.054 meters
To call function f repeatedly, a parsing-based method is used. Above, variable a contains a series of records whose structure is based on two delimiters: “|” and “Mb”. The “|” delimiter separates the species name (r[0]) from the size of the genome (r[1]), while the “Mb” delimiter separates the records from each other (t[u]). Because the string ends with “Mb”, the split produces an empty final element, which the loop skips by stopping at t.length-1. Please note that 0.001 m equals 1 mm. For instance, the output of Additional algorithm 2.2 shows that Escherichia coli contains a genome of ∼1.7 mm in length (0.00167 m), or that E. intestinalis contains a genome of 0.78 mm in length (0.00078 m).
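The same computation can also be sketched without the delimiter-based string. The version below is only an illustrative variant, not the book's implementation: the names genomes, mbToMeters, formatLength, and report are hypothetical. It stores each record as an object and rounds the result, which hides the floating-point residue visible in the output above (e.g. 0.0016660000000000002). It uses console.log so that it can also be run outside a browser.

```javascript
// Hypothetical alternative to the string records of Additional algorithm 2.2:
// each record is an object, so no parsing step is needed.
const genomes = [
  { species: 'Ambystoma mexicanum', sizeMb: 32396 },
  { species: 'Escherichia coli', sizeMb: 4.9 },
  { species: 'Encephalitozoon intestinalis', sizeMb: 2.3 },
  { species: 'Homo sapiens', sizeMb: 3100 }
];

// Same conversion as function f in the text: 0.34 nm per base pair.
function mbToMeters(sizeMb) {
  return (0.34 * 1000000 * sizeMb) / 1000000000;
}

// Formats a length given in meters; values below 1 m are shown in millimeters.
function formatLength(meters) {
  if (meters < 1) {
    return (meters * 1000).toFixed(2) + ' mm';
  }
  return meters.toFixed(3) + ' m';
}

// Builds one output line per record.
function report(records) {
  return records.map(function (g) {
    return g.species + ' (' + g.sizeMb + ' Mb) = ' +
           formatLength(mbToMeters(g.sizeMb));
  });
}

report(genomes).forEach(function (line) { console.log(line); });
```

With this formatting, the Escherichia coli genome is reported as 1.67 mm and the haploid human genome as 1.054 m.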
2.3.3 Computations on the Average Genome Size
A series of computations show the average genome size observed for each division in the tree of life, as well as the average size of viral genomes and the average DNA length of plasmids (Figure 2.1 and Table 2.1). These values were calculated from the raw data extracted from the file transfer protocol (FTP) site of the National Center for Biotechnology Information (NCBI). The NCBI section for Genome Information by Organism contains general data in relation to each branch of the tree of life: eukaryotes (13k); prokaryotes (265k); viruses (41k); plasmids (23k); organelles (17k). These categories amount to ∼359k DNA/RNA sequences at different assembly levels, of which the 341k sequence samples of assembly level “complete” were used to calculate the averages presented here. Thus, filters were used to obtain a clean data set: only the levels “complete chromosomes” or “complete genomes” were considered for these calculations.
Moreover, the maximum values presented in the main text were extracted from these data and checked against the literature. The files containing the raw data can be found in the additional materials online. Important note: The number of samples shown on the last row of Table 1.4 can be misleading. Table 1.4 shows 252k prokaryote samples, whereas the cataloged prokaryotes in Table 1.1 show a total of 12k species. In the NCBI database, prokaryotes have more than one reference or representative genome per species. According to NCBI filters, around 3.2k of the prokaryote genomes are representative.
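The filtering and averaging steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually applied to the NCBI files: the records, the field names (group, level, sizeMb), and the level labels are hypothetical and are not taken from the raw data.

```javascript
// Hypothetical records; field names and values are illustrative only
// and do not come from the NCBI files.
const records = [
  { group: 'prokaryotes', level: 'complete genome', sizeMb: 4.0 },
  { group: 'prokaryotes', level: 'contig', sizeMb: 9.0 },
  { group: 'prokaryotes', level: 'complete genome', sizeMb: 6.0 },
  { group: 'viruses', level: 'complete genome', sizeMb: 0.05 },
  { group: 'eukaryotes', level: 'complete chromosome', sizeMb: 120.0 },
  { group: 'eukaryotes', level: 'scaffold', sizeMb: 900.0 }
];

// Assumed labels for the two assembly levels kept by the filter.
const ACCEPTED_LEVELS = ['complete genome', 'complete chromosome'];

// Returns the mean genome size (Mb) per group, using only accepted levels.
function averageSizeByGroup(rows) {
  const sums = {};
  rows.forEach(function (row) {
    if (ACCEPTED_LEVELS.indexOf(row.level) === -1) { return; }
    if (!sums[row.group]) { sums[row.group] = { total: 0, count: 0 }; }
    sums[row.group].total += row.sizeMb;
    sums[row.group].count += 1;
  });
  const result = {};
  Object.keys(sums).forEach(function (group) {
    result[group] = sums[group].total / sums[group].count;
  });
  return result;
}

const averages = averageSizeByGroup(records);
console.log(averages);
```

In this toy data set, the incomplete prokaryote and eukaryote assemblies are excluded before averaging, so the prokaryote mean is computed from two records rather than three.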
Figure 2.1 The average genome size. (a) Shows the proportion of known species in each kingdom of life. (b) Shows the tree of life with data on the main kingdoms of life. Each kingdom is labeled with the average genome size and the average GC content (%). (c) Shows the average organellar genome for a number of organelles investigated to date. Here, the organelles are sorted by GC%. (d) Shows a comparison between mitochondria and chloroplasts. (e) Shows a comparison between plasmids from bacteria, archaea, and eukaryotes. For each chart (c–e), the left axis indicates the GC content (%) and the right axis indicates the average size of the genome expressed in mega base pairs (written here as Mb instead of Mbp, for ease).
Table 2.1 The average genome size in the tree of life.
Genome size average (Mb) | |
---|---|
Eukaryotes (Mb)
|