ML Estimate:

Provides an estimate of the r.v. X, given that Y = y_j, in terms of the maximum likelihood:

x̂_ML = argmax_x P(Y = y_j | X = x)
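As a concrete illustration, here is a minimal Python sketch of the ML estimate, assuming a small likelihood table (the table values and outcome names are hypothetical, purely for illustration):

# Hypothetical likelihood table: likelihood[x][y] = P(Y = y | X = x).
likelihood = {
    "x1": {"y1": 0.7, "y2": 0.3},
    "x2": {"y1": 0.2, "y2": 0.8},
}

def ml_estimate(y_obs, likelihood):
    # Return the x maximizing the likelihood P(Y = y_obs | X = x).
    return max(likelihood, key=lambda x: likelihood[x][y_obs])

print(ml_estimate("y2", likelihood))  # -> 'x2', since 0.8 > 0.3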
2.6 Emergent Distributions and Series
In this section we consider a r.v., X, with specific examples where the outcomes are fully enumerated (such as the 0 or 1 outcomes corresponding to a coin flip). We review a series of observations of the r.v. X to arrive at the Law of Large Numbers (LLN). The emergent structure describing a r.v. from a series of observations is often expressed in terms of probability distributions, the most famous being the Gaussian distribution (a.k.a. the Normal, or Bell curve).
2.6.1 The Law of Large Numbers (LLN)
The LLN will now be derived in the classic “weak” form. The “strong” form is derived in the modern mathematical context of Martingales in Appendix C.1.
Let X_k, for k = 1, 2, …, be independent identically distributed (iid) copies of X, where X has the real numbers as its "alphabet." Let μ = E(X), σ² = Var(X), and denote the sample mean by S_N = (X_1 + X_2 + … + X_N)/N, so that E(S_N) = μ and Var(S_N) = σ²/N.

From Chebyshev: P(|S_N − μ| ≥ ε) ≤ σ²/(Nε²), for any ε > 0.

As N → ∞ we get the LLN (weak):

If X_k are iid copies of X, for k = 1, 2, …, and X has a real and finite alphabet, with μ = E(X) and σ² = Var(X), then for any ε > 0: P(|S_N − μ| ≥ ε) → 0 as N → ∞.
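The weak LLN is easy to see numerically. The following Python sketch (sample sizes chosen purely for illustration) simulates fair-coin flips, for which μ = 0.5, and shows the sample mean S_N settling toward μ as N grows:

import random

random.seed(0)
mu = 0.5  # E(X) for a fair coin with outcomes {0, 1}
for N in [10, 100, 1000, 10000, 100000]:
    flips = [random.randint(0, 1) for _ in range(N)]
    S_N = sum(flips) / N  # sample mean of N iid copies of X
    print(N, S_N, abs(S_N - mu))

The deviation |S_N − μ| shrinks as N grows, in accord with the Chebyshev bound σ²/(Nε²).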
2.6.2 Distributions
2.6.2.1 The Geometric Distribution (Emergent Via Maxent)
Here, we consider the probability of seeing an event for the first time on the kth try, when the probability of seeing that event on each try is p. If the event is first seen on the kth try, then the first (k − 1) tries were nonevents (each with probability (1 − p)), and the final observation then occurs with probability p, giving rise to the classic formula for the geometric distribution:

P(X = k) = (1 − p)^(k−1) p, for k = 1, 2, …
Figure 2.3 The geometric distribution, P(X = k) = (1 − p)^(k−1) p, with p = 0.8.
As for normalization, i.e. checking that all outcomes sum to one, we have:

Total Probability = ∑_{k=1}^{∞} (1 − p)^(k−1) p = p[1 + (1 − p) + (1 − p)² + (1 − p)³ + …] = p[1/(1 − (1 − p))] = 1
So the total probability already sums to one, with no further normalization needed. Figure 2.3 shows the geometric distribution for the case p = 0.8.
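The geometric distribution is just as easy to check by simulation. Below is a minimal Python sketch (with p = 0.8 to match Figure 2.3) comparing empirical waiting-time frequencies against (1 − p)^(k−1) p:

import random

random.seed(0)
p = 0.8
trials = 100000
counts = {}
for _ in range(trials):
    k = 1
    while random.random() >= p:  # each try fails with probability (1 - p)
        k += 1
    counts[k] = counts.get(k, 0) + 1  # first success on try k

for k in sorted(counts)[:5]:
    empirical = counts[k] / trials
    theoretical = (1 - p) ** (k - 1) * p
    print(k, round(empirical, 4), round(theoretical, 4))

The empirical and theoretical columns agree to a few decimal places, as expected.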
2.6.2.2 The Gaussian (aka Normal) Distribution (Emergent Via LLN Relation and Maxent)
For the Normal distribution the normalization is easiest to get via complex integration (so we will skip that). With mean zero and variance equal to one (Figure 2.4) we get:

P(x) = (1/√(2π)) e^(−x²/2)
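Although we skip the normalization proof, it is easy to confirm numerically. Here is a short Python sketch (the integration range and step size are illustrative choices) that sums the standard normal density over a fine grid:

import math

def phi(x):
    # Standard normal density: mean 0, variance 1.
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Riemann sum over [-10, 10]; the tails beyond this range are negligible.
dx = 0.001
total = sum(phi(-10 + i * dx) * dx for i in range(int(20 / dx)))
print(total)  # approximately 1.0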
2.6.2.3 Significant Distributions That Are Not Gaussian or Geometric
Nongeometric duration distributions occur in many familiar settings, such as the lengths of spoken words in phone conversations, as well as in other areas of voice recognition. Although the Gaussian distribution occurs in many scientific fields (as an observed embodiment of the LLN, among other things), there are a huge number of significant (observed) skewed distributions, such as heavy‐tailed (or long‐tailed) distributions, multimodal distributions, etc.
Heavy‐tailed distributions are widespread in describing phenomena across the sciences. The log‐normal and Pareto distributions are heavy‐tailed distributions that are almost as common as the normal and geometric distributions in descriptions of physical or man‐made phenomena. The Pareto distribution was originally used to describe the allocation of wealth in society, known as the famous 80–20 rule: about 80% of the wealth was owned by a small number of people (roughly 20%), while "the tail," the large remaining part of the population, held only a small share.
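A Pareto sample can be drawn by inverse-transform sampling from its CDF, F(x) = 1 − (x_m/x)^α. The Python sketch below (using x_m = 1 and α ≈ 1.16, the shape value that reproduces roughly the 80–20 split; both are illustrative choices) estimates the wealth share held by the richest 20% of the sample:

import random

random.seed(0)
alpha, x_m = 1.16, 1.0  # shape ~1.16 corresponds to the 80-20 rule
n = 100000
# Inverse transform: x = x_m * u^(-1/alpha), with u uniform on (0, 1].
wealth = sorted(x_m * (1.0 - random.random()) ** (-1.0 / alpha)
                for _ in range(n))
top20 = wealth[int(0.8 * n):]  # the richest 20% of the sample
print(sum(top20) / sum(wealth))  # roughly 0.8 (noisy, since the tail is heavy)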