Hardness of Approximation Between P and NP. Aviad Rubinstein
Чтение книги онлайн.
Читать онлайн книгу Hardness of Approximation Between P and NP - Aviad Rubinstein страница 13
■
2.5 Learning Theory
For a universe (or ground set) U, a concept C is simply a subset of U and a concept class C is a collection of concepts. For convenience, we sometimes relax the definition and allow the concepts to not be subsets of U; all definitions here extend naturally to this case.
The VC and Littlestone’s Dimensions can be defined as follows.
Definition 2.8
VC Dimension [Vapnik and Chervonenkis 1971]. A subset S ⊆ U is said to be shattered by a concept class C if, for every T ⊆ S, there exists a concept C ∈ C such that T = S ⋂ C.
The VC Dimension VC-dim(C, U) of a concept class C with respect to the universe U is the largest d such that there exists a subset S ⊆ U of size d that is shattered by C.
Definition 2.9
Mistake tree and Littlestone’s Dimension [Littlestone 1987]. A depth-d instance-labeled tree of U is a full binary tree of depth d such that every internal node of the tree is assigned an element of U. For convenience, we will identify each node in the tree canonically by a binary string s of length at most d.
A depth-d mistake tree (aka shattered tree [Ben-David et al. 2009]) for a universe U and a concept class C is a depth-d instance-labeled tree of U such that, if we let vs ∈ U denote the element assigned to the vertex s for every s ∈ {0, 1}<d, then, for every leaf ℓ ∈ {0, 1}d, there exists a concept C ∈ C that agrees with the path from root to it, i.e., that, for every
The Littlestone’s Dimension L-dim(C, U) of a concept class C with respect to the universe U is defined as the maximum d such that there exists a depth-d mistake tree for U, C.
An equivalent formulation of Littlestone’s Dimension is through mistakes made in online learning, as stated below. This interpretation will be useful in the proof of Theorem 14.4.
Definition 2.10
Mistake bound. An online algorithm A is an algorithm that, at time step i, is given an element xi ∈ U and the algorithm outputs a prediction pi ∈ {0, 1} whether x is in the class. After the prediction, the algorithm is told the correct answer hi ∈ {0, 1}. For a sequence (x1, h1),…, (xn, hn), a prediction mistake of A is defined as the number of incorrect predictions, i.e., Σi∈n 1[pi ≠ hi]. The mistake bound of A for a concept class C is defined as the maximum prediction mistake of A over all the sequences (x1, h1),…, (xn, hn), which corresponds to a concept C ∈ C (i.e., hi = 1[xi ∈ C] for all i ∈ [n]).
Theorem 2.3
Littlestone [1987]. For any universe U and any concept class C, L-dim(C, U) is equal to the minimum mistake bound of C, U over all online algorithms.
The following facts are well-known and follow easily from the above definitions.
Fact 2.2
For any universe U and concept class C, we have
Fact 2.3
For any two universes U1, U2 and any concept class C,
2.6 Information Theory
In this section, we introduce information-theoretic quantities used mostly in Chapter 12. For a more thorough introduction, the reader should refer to Cover and Thomas [2012]. Unless stated otherwise, all log’s in this book paper are base-2.
Definition 2.11
Let μ be a probability distribution on sample space Ω. The Shannon entropy (or just entropy) of μ, denoted by H(μ), is defined as
Definition 2.12
Binary entropy function. For p ∈ [0, 1], the binary entropy function is defined as follows (with a slight abuse of notation): H(p) := − p log p − (1 − p) log(1 − p).
Fact 2.4
Concavity of binary entropy. Let μ be a distribution on [0, 1], and let p ~ μ. Then
For a random variable A, we shall write H(A) to denote the entropy of the induced distribution on the support of A. We use the same abuse of notation for other information-theoretic quantities appearing later in this section.
Definition 2.13
The conditional entropy of a random variable A conditioned on Β is defined as
Fact 2.5
Chain rule.
Fact 2.6
Conditioning decreases entropy.
Another measure we will use (briefly) in our proof is that of mutual information, which informally captures the correlation between two random variables.
Definition 2.14
Conditional mutual information. The mutual information between two random variables A and Β, denoted by I(A; B), is defined as
The conditional mutual information between A and Β given C, denoted by I(A; B|C), is defined as