internet file-size traffic follows a long-tailed distribution; that is, there are a few large files and many small files to be transferred. This distributional assumption is an important factor in designing a robust and reliable network, and the Pareto distribution is a suitable choice for modeling such traffic (internet applications exhibit many other heavy-tailed phenomena as well). Pareto distributions also arise in many other fields, such as economics.
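      As a brief illustrative sketch (not taken from the text), the Python below draws file sizes from a Pareto distribution by inverse-transform sampling and shows the long-tail signature: a small fraction of the files carries most of the bytes. The shape and scale parameters here are hypothetical choices.

        import random

        ALPHA = 1.16  # hypothetical shape; smaller alpha gives a heavier tail
        X_M = 1.0     # hypothetical scale: minimum file size (e.g. in KB)

        def pareto_sample(alpha, x_m):
            """Draw one Pareto(alpha, x_m) variate by inverting the CDF."""
            u = random.random()
            return x_m / (u ** (1.0 / alpha))

        random.seed(0)
        sizes = sorted((pareto_sample(ALPHA, X_M) for _ in range(100_000)),
                       reverse=True)
        top_1_percent = sum(sizes[:1000])
        print(f"top 1% of files carry {100 * top_1_percent / sum(sizes):.1f}%"
              " of the traffic")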

[Figure: schematic illustration of the Gaussian (Normal) distribution, shown with mean zero and variance equal to one.]

      2.6.3 Series

      A series is a mathematical object consisting of a sequence of numbers, variables, or observation values. When observations describe equilibrium, or "steady state," emergent phenomena familiar from physical reality, we often see series that are martingale. The martingale property appears in systems reaching equilibrium in both the physical setting and the algorithmic learning setting.

      A discrete-time martingale is a stochastic process in which a sequence of random variables {X_1, …, X_n} has conditional expected value of the next observation equal to the last observation: E(X_{n+1} | X_1, …, X_n) = X_n, where E(|X_n|) < ∞. Similarly, one sequence, say {Y_1, …, Y_n}, is said to be martingale with respect to another, say {X_1, …, X_n}, if for all n: E(Y_{n+1} | X_1, …, X_n) = Y_n, where E(|Y_n|) < ∞. Examples of martingales are rife in gambling. For our purposes, the most critical example is likelihood-ratio testing in statistics, with test statistic, the "likelihood ratio," given as Y_n = ∏_{i=1}^{n} g(X_i)/f(X_i), where f and g are the population densities considered for the data. If the better (actual) distribution is f, then Y_n is martingale with respect to X_n. This scenario arises throughout the hidden Markov model (HMM) Viterbi derivation when local "sensors" are used, such as with profile-HMMs or position-dependent Markov models in the vicinity of transitions between states. It also arises in HMM Viterbi recognition of regions (versus transitions out of those regions), where length-martingale side information will be shown explicitly in Chapter 7, providing a pathway for incorporating any martingale-series side information (this fits naturally with the clique-HMM generalizations described in Chapter 7). Given that the core ratio of cumulant probabilities employed is itself a martingale, this provides a means for incorporating side information in general (further details in Appendix C).
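      A minimal simulation sketch (an illustration under assumed densities, not code from the text) makes the likelihood-ratio martingale concrete: draw each X_i from the true density f, update Y_n = ∏_{i=1}^{n} g(X_i)/f(X_i), and observe that the average of Y_n over many runs stays near 1 for every n, as the martingale property requires (E(Y_n) = E(Y_1) = 1 when f is the true density). The choice of f and g as unit-variance Gaussians is a hypothetical example.

        import math
        import random

        def norm_pdf(x, mu, sigma):
            """Gaussian density N(mu, sigma^2) evaluated at x."""
            z = (x - mu) / sigma
            return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

        # Hypothetical densities: f is the true density, g a rival model.
        def f(x): return norm_pdf(x, 0.0, 1.0)
        def g(x): return norm_pdf(x, 0.5, 1.0)

        random.seed(1)
        n_steps, n_runs = 20, 50_000
        mean_Y = [0.0] * n_steps

        for _ in range(n_runs):
            Y = 1.0
            for t in range(n_steps):
                x = random.gauss(0.0, 1.0)  # sample from the true density f
                Y *= g(x) / f(x)            # likelihood-ratio update
                mean_Y[t] += Y / n_runs

        # Martingale check: E(Y_n) should remain 1 for all n.
        for t in (0, 4, 9, 19):
            print(f"n = {t + 1:2d}: average Y_n = {mean_Y[t]:.3f}")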

      1 2.1 Evaluate the Shannon entropy, by hand, for the fair die probability distribution (1/6, 1/6, 1/6, 1/6, 1/6, 1/6), the probability of rolling a 1 through a 6 (all equal to 1/6 for the uniform probability distribution). Also evaluate it for the loaded die: (1/10, 1/10, 1/10, 1/10, 1/10, 1/2).

      2 2.2 Evaluate the Shannon entropy for the fair and loaded probability distributions of Exercise 2.1 computationally, by running the program described in Section 2.1 (a minimal sketch of such a computation also appears after this exercise list).

      3 2.3 Now consider you have two dice, where each separately rolls "fair," but together they do not roll "fair"; i.e. each specific pair of die rolls does not have probability 1/36, but instead has the probability shown below:

        Die 1 roll   Die 2 roll   Probability
        1            1            (1/6)*(0.001)
        1            2            (1/6)*(0.125)
        1            3            (1/6)*(0.125)
        1            4            (1/6)*(0.125)
        1            5            (1/6)*(0.124)
        1            6            (1/6)*(0.5)
        2            Any          (1/6)*(1/6)
        3            Any          (1/6)*(1/6)
        4            Any          (1/6)*(1/6)
        5            Any          (1/6)*(1/6)
        6            1            (1/6)*(0.5)
        6            2            (1/6)*(0.125)
        6            3            (1/6)*(0.125)
        6            4            (1/6)*(0.125)
        6            5            (1/6)*(0.124)
        6            6            (1/6)*(0.001)

      What is the Shannon entropy of the Die 1 outcomes (call it H(1))? What is the Shannon entropy of the Die 2 outcomes (call it H(2))? What is the Shannon entropy of the two-dice outcomes with the probabilities shown in the table above (denote it H(1,2))? Compute the function MI(Die 1, Die 2) = H(1) + H(2) − H(1,2). Is it positive? (A computational sketch appears after this exercise list.)

      4 2.4 Go to GenBank (https://www.ncbi.nlm.nih.gov/genbank/) and select the genome of a small virus (~10 kb). Using the Python code shown in Section 2.1, determine the base frequencies for {a, c, g, t}. What is the Shannon entropy (if those frequencies are taken to be the probabilities of the associated outcomes)?

      5 2.5 Go to GenBank (https://www.ncbi.nlm.nih.gov/genbank/) and select the genomes of three medium-sized viruses (~100 kb). Using the Python code shown in Section 2.1, determine the trinucleotide frequencies. What is the Shannon entropy of the trinucleotide frequencies for each of the three virus genomes? Using this as a distance measure, phylogenetically speaking, which two viruses are most closely related?

      6 2.6 Repeat Exercise 2.5, but now use the symmetrized relative entropy between the trinucleotide probability distributions as the distance measure instead (reevaluated pairwise between the three viruses; see the sketch after this exercise list). Using this as a distance measure, phylogenetically speaking, which two viruses are most closely related?

      7 2.7 Prove that relative entropy is always non-negative (hint: use Jensen's inequality from Section 2.4).

      8 2.8 What is the expectation for the two-dice roll with the pair-outcome probabilities listed in Exercise 2.3?

      9 2.9 What is the expectation for the two-dice roll with fair dice? Is this expectation an actual outcome possibility? What does it mean if it is not?

      10 2.10 Survey the literature and write a report on common occurrences of distributions of the type: uniform, geometric, exponential, Gaussian, log‐normal, heavy‐tail.

      11 2.11
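      The following Python sketch (an illustration of the quantities in Exercises 2.1–2.3 and 2.6, not the program from Section 2.1; the function names are hypothetical) computes Shannon entropy, the mutual information of the loaded two-dice table, and the symmetrized relative entropy used as a distance measure in Exercise 2.6.

        import math

        def shannon_entropy(probs):
            """Shannon entropy in bits: H = -sum(p * log2 p), skipping zeros."""
            return -sum(p * math.log2(p) for p in probs if p > 0)

        def sym_relative_entropy(p, q):
            """Symmetrized relative entropy D(p||q) + D(q||p), in bits."""
            d_pq = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)
            d_qp = sum(qi * math.log2(qi / pi) for pi, qi in zip(p, q) if qi > 0)
            return d_pq + d_qp

        # Exercise 2.1: fair versus loaded die.
        fair = [1/6] * 6
        loaded = [1/10] * 5 + [1/2]
        print("H(fair)   =", shannon_entropy(fair))
        print("H(loaded) =", shannon_entropy(loaded))

        # Exercise 2.3: joint probabilities from the two-dice table.
        row_1 = [0.001, 0.125, 0.125, 0.125, 0.124, 0.5]  # Die 1 = 1, Die 2 = 1..6
        row_6 = [0.5, 0.125, 0.125, 0.125, 0.124, 0.001]  # Die 1 = 6, Die 2 = 1..6
        joint = {}
        for d2 in range(1, 7):
            joint[(1, d2)] = (1/6) * row_1[d2 - 1]
            joint[(6, d2)] = (1/6) * row_6[d2 - 1]
        for d1 in (2, 3, 4, 5):          # "Any" rows: fair and independent
            for d2 in range(1, 7):
                joint[(d1, d2)] = (1/6) * (1/6)

        marg_1 = [sum(joint[(d1, d2)] for d2 in range(1, 7)) for d1 in range(1, 7)]
        marg_2 = [sum(joint[(d1, d2)] for d1 in range(1, 7)) for d2 in range(1, 7)]
        mi = shannon_entropy(marg_1) + shannon_entropy(marg_2) \
             - shannon_entropy(joint.values())
        print("MI(Die 1, Die 2) =", mi)

      For Exercise 2.6, sym_relative_entropy(p, q) would then be applied pairwise to the three trinucleotide frequency distributions.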
