Informatics and Machine Learning. Stephen Winters-Hilt

The axiomatic approach is limited by the assumptions of its axioms, however. It was not until the fundamental role of relative entropy was established in an “information geometry” context [113–115] that a path opened (c. 1999) to showing that Shannon entropy is uniquely qualified as a measure. The fundamental (extremal optimum) aspect of relative entropy (with Shannon entropy as a simple case) is found by differential geometry arguments akin to those of Einstein on Riemannian spaces (here involving spaces defined by the family of exponential distributions). Whereas the “natural” local notion of metric and distance is given by the Minkowski metric and Euclidean distance, a similar analysis of comparing distributions (evaluating their “distance” from each other) indicates that the natural measure is relative entropy (which reduces to Shannon entropy in variational contexts when the relative entropy is taken relative to the uniform probability distribution). Further details on this derivation are given in Chapter 8.
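The reduction of relative entropy to Shannon entropy can be checked numerically. The sketch below assumes the standard definition D(p‖q) = ∑k pk log(pk/qk) and natural logarithms; for a uniform reference distribution u on n outcomes it gives D(p‖u) = log(n) − H(p), so minimizing relative entropy to the uniform distribution is the same as maximizing Shannon entropy. The function names and the example distribution are illustrative choices, not taken from the text.

```python
import math

def shannon_entropy(p):
    """H(p) = -sum_k p_k log(p_k), natural log; zero-probability terms contribute 0."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

def relative_entropy(p, q):
    """D(p||q) = sum_k p_k log(p_k / q_k); assumes q_k > 0 wherever p_k > 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

# Illustrative distribution on n = 4 outcomes, compared against the uniform one.
p = [0.5, 0.25, 0.125, 0.125]
n = len(p)
u = [1.0 / n] * n

lhs = relative_entropy(p, u)           # D(p || uniform)
rhs = math.log(n) - shannon_entropy(p) # log(n) - H(p)
print(lhs, rhs)                        # the two values agree up to rounding
```

Because log(n) is a constant for fixed n, any variational argument phrased in terms of D(p‖u) carries over directly to H(p).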

      3.1.1 The Khinchin Derivation

      In his now famous 1948 paper [106], Claude Shannon provided a quantitative measure for entropy in connection with communication theory. The Shannon entropy measure was later put on a more formal footing by A. I. Khinchin in an article where he proves that, under certain assumptions, the Shannon entropy is unique [107]. (Dozens of similar axiomatic proofs have since been made.) A statement of the theorem is as follows:

      Khinchin Uniqueness Theorem: Let H(p1, p2, …, pn) be a function defined for any integer n and for all values p1, p2, …, pn such that pk ≥ 0 (k = 1, 2, …, n) and ∑kpk = 1. If for any n this function is continuous with respect to its arguments, and if the function obeys the three properties listed below, then H(p1, p2, …, pn) = −λ∑kpklog(pk), where λ is a positive constant (with Shannon entropy recovered for the convention λ = 1). The three properties are:

      1 For given n and for ∑kpk = 1, the function takes its largest value for pk = 1/n (k = 1, 2, …, n). This is equivalent to Laplace’s principle of insufficient reason, which says that if you do not know anything, assume the uniform distribution (this also agrees with the Occam’s Razor assumption of minimum structure).

      2 H(ab) = H(a) + Ha(b), where Ha(b) = −∑a,b p(a,b)log(p(b|a)) is the conditional entropy. This is consistent with H(ab) = H(a) + H(b) when a and b are independent, with the modification involving conditional probability being used when they are not.

      3 H(p1, p2, …, pn, 0) = H(p1, p2, …, pn). This reductive relationship, or something like it, is implicitly assumed when describing any system in “isolation.”

      Note that the above axiomatic derivation is still “weak” in that it assumes the existence of the conditional entropy in property (2).
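The three properties can be spot-checked numerically for the Shannon form with λ = 1. The sketch below is only a sanity check on hand-picked examples, not a proof of uniqueness. It assumes natural logarithms and the standard conditional entropy Ha(b) = −∑a,b p(a,b) log p(b|a); the joint distribution and variable names are made up for illustration.

```python
import math

def H(p):
    """Shannon entropy with lambda = 1 (natural log); p_k = 0 terms contribute 0."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

# Property 1: for fixed n, the uniform distribution gives the largest value, log(n).
n = 4
uniform = [1.0 / n] * n
skewed = [0.7, 0.1, 0.1, 0.1]

# Property 2: H(ab) = H(a) + H_a(b) for a small, non-independent joint distribution.
joint = {("a1", "b1"): 0.3, ("a1", "b2"): 0.2,
         ("a2", "b1"): 0.1, ("a2", "b2"): 0.4}
p_a = {}
for (a, b), pab in joint.items():
    p_a[a] = p_a.get(a, 0.0) + pab          # marginal p(a)
H_ab = H(joint.values())
H_a = H(p_a.values())
# Conditional entropy: -sum_{a,b} p(a,b) log p(b|a), with p(b|a) = p(a,b) / p(a).
H_b_given_a = -sum(pab * math.log(pab / p_a[a])
                   for (a, b), pab in joint.items() if pab > 0)

# Property 3: appending an impossible outcome leaves the entropy unchanged.
padded = skewed + [0.0]

print(H(uniform), math.log(n))              # equal up to rounding
print(H_ab, H_a + H_b_given_a)              # equal up to rounding
print(H(padded), H(skewed))                 # equal
```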

      3.1.2 Maximum Entropy Principle

      If there is no constraint on the probabilities other than that they sum to 1, the Lagrangian form for the optimization is as follows:

$$L(\{p_k\}) = -\sum_k p_k \log(p_k) - \lambda\left(1 - \sum_k p_k\right)$$

      where ∂L/∂pk = 0 → pk = exp(λ − 1) for all k, thus pk = 1/n for a system with n outcomes. Thus, the maximum entropy hypothesis in this circumstance results in Laplace’s Principle of Insufficient Reason, a.k.a. the principle of indifference, where if you do not know any better, use the uniform distribution.
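As a worked sketch of this step, the stationarity condition can be written out explicitly. It takes the Lagrangian with the multiplier term −λ(1 − ∑k pk); the opposite sign convention for the multiplier only replaces λ by −λ and leaves the conclusion pk = 1/n unchanged.

```latex
\begin{aligned}
\frac{\partial L}{\partial p_k} &= -\log(p_k) - 1 + \lambda = 0
  \;\Longrightarrow\; p_k = e^{\lambda - 1}
  \quad \text{(the same value for every } k\text{)}, \\
\sum_{k=1}^{n} p_k &= n\, e^{\lambda - 1} = 1
  \;\Longrightarrow\; p_k = \frac{1}{n}, \qquad \lambda = 1 - \log(n), \\
H\!\left(\tfrac{1}{n}, \ldots, \tfrac{1}{n}\right)
  &= -\sum_{k=1}^{n} \frac{1}{n} \log\frac{1}{n} = \log(n).
\end{aligned}
```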

      If you have as prior information the existence of the mean, μ, of some quantity x, then you have the Lagrangian:

$$L(\{p_k\}) = -\sum_k p_k \log(p_k) - \lambda\left(1 - \sum_k p_k\right) - \delta\left(\mu - \sum_k p_k x_k\right)$$

      where ∂L/∂pk = 0 → pk = A exp(δxk), leading to the exponential distribution (δ < 0 gives the familiar decaying form). If instead we had the mean of a function, f(xk), of some random variable X, then a similar derivation would again yield an exponential distribution, pk = A exp(δf(xk)), where now the normalization is not simply a constant factor: 1/A = ∑k exp(δf(xk)) is known as the partition function, and it has a variety of generative properties vis‐à‐vis statistical mechanics and thermal physics.
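The mean-constrained case can be illustrated on a finite support. The sketch below assumes a six-sided die with a prescribed mean (a standard textbook illustration, not an example from this chapter). It writes the solution as pk proportional to exp(theta · xk), where the hypothetical parameter `theta` plays the role of the multiplier δ up to sign convention, and it finds `theta` by bisection on the mean constraint. The function name, search bracket, and tolerance are illustrative choices.

```python
import math

def maxent_given_mean(xs, mu, lo=-50.0, hi=50.0, tol=1e-12):
    """Return (theta, pmf) with p_k proportional to exp(theta * x_k) and mean mu.

    Assumes min(xs) < mu < max(xs) and that the solution lies in [lo, hi].
    """
    def pmf(theta):
        # Subtract the largest exponent before exponentiating, for stability.
        m = max(theta * x for x in xs)
        w = [math.exp(theta * x - m) for x in xs]
        z = sum(w)                      # partition function (up to the shift m)
        return [wi / z for wi in w]

    def mean(theta):
        return sum(p * x for p, x in zip(pmf(theta), xs))

    # mean(theta) increases with theta (its derivative is the variance),
    # so bisection on the constraint mean(theta) = mu is sufficient.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(mid) < mu:
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    return theta, pmf(theta)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

faces = [1, 2, 3, 4, 5, 6]
theta, p = maxent_given_mean(faces, 4.5)   # die constrained to have mean 4.5
print(theta, p, entropy(p))
```

With the mean set to 3.5 the same routine returns `theta` close to 0 and the uniform distribution, which recovers the unconstrained case.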

      If you have as prior information the existence of the mean and variance of some quantity (the first and second statistical moments), then you have the Lagrangian:

$$L(\{p_k\}) = -\sum_k p_k \log(p_k) - \lambda\left(1 - \sum_k p_k\right) - \delta\left(\mu - \sum_k p_k x_k\right) - \gamma\left(\nu - \sum_k p_k (x_k)^2\right)$$

      where ∂L/∂pk = 0 → the Gaussian distribution (see Exercise 3.3).
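A sketch of why the stationary solution has Gaussian form, obtained by completing the square. It takes the Lagrangian with multipliers λ, δ, γ entering as −λ(…) − δ(…) − γ(…) and assumes γ < 0 so that the result is normalizable. The symbols m and s² are introduced here for illustration only; on a discrete support they match the mean and variance only approximately, and in the continuum limit they become μ and ν − μ².

```latex
\begin{aligned}
\frac{\partial L}{\partial p_k}
  &= -\log(p_k) - 1 + \lambda + \delta x_k + \gamma x_k^2 = 0
  \;\Longrightarrow\;
  p_k = \exp\!\left(\lambda - 1 + \delta x_k + \gamma x_k^2\right), \\
\delta x_k + \gamma x_k^2
  &= \gamma\left(x_k + \frac{\delta}{2\gamma}\right)^{2} - \frac{\delta^{2}}{4\gamma}, \\
p_k &= A \exp\!\left(-\frac{(x_k - m)^2}{2 s^2}\right),
  \qquad m = -\frac{\delta}{2\gamma}, \quad
  s^2 = -\frac{1}{2\gamma}, \quad
  A = \exp\!\left(\lambda - 1 - \frac{\delta^{2}}{4\gamma}\right).
\end{aligned}
```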

      With
