Active Learning. Burr Settles
Чтение книги онлайн.
Читать онлайн книгу Active Learning - Burr Settles страница 7
One reasonable way to choose this threshold value is to be as non-committal as possible: set θ to be halfway between the known ⊕ and ⊖ fruits which are closest together (this is what so-called max-margin learning algorithms do). Such a classifier should be fairly confident about its predictions for fruits which are far away from the thresholded classification boundary, but as x (the fruit’s irregularity measure) approaches θ the model becomes much less certain. Intuitively, the instances that are least certain would offer the most information about the problem, since the more confident classifications are probably correct. What if we adopt a simple active learning strategy that queries the instance closest to the decision boundary? In fact, we would recover the binary search algorithm from Section 1.1, as illustrated by Figure 2.1.
This type of active learning strategy is commonly known as uncertainty sampling (Lewis and Catlett, 1994). The basic premise is that the learner can avoid querying the instances it is already confident about, and focus its attention instead on the unlabeled instances it finds confusing. The example in Figure 2.1 quantifies uncertainty by |θ − x|, the distance of instance x from the boundary θ, which is a fine measure for hypothesis classes that provide such a distance measure. However, we are interested in generalizing active learning to more complex problems that go beyond binary classification in a relatively noise-free environment like this one. An elegant way to extend the approach is to use a probabilistic classifier which can output a posterior distribution Pθ(Y|x) over the label variable Y given the input and learned model parameters. Under this interpretation, we would want to query the instance for which Pθ(ŷ|x)—where ŷ refers to the classifier’s most likely prediction for x—is closest to a uniform distribution (0.5 in the case of binary classification). While a probabilistic interpretation is not strictly necessary, there has been significant work in machine learning on probabilistic classifiers, and graphical models in particular (for a thorough overview, see Koller and Friedman, 2009). By framing our discussion of uncertainty sampling in the language of probability, we can easily generalize the techniques in this chapter to a variety of interesting cases, including problems with many input features, multiple output labels, and even structured prediction tasks, which we will discuss later in this chapter.
Figure 2.1: The binary search from Figure 1.3, re-interpreted as an uncertainty sampling approach. The best instance to query is deemed to be the one closest to the threshold θ.
2.2 AN EXAMPLE
To visualize the way in which uncertainty sampling generalizes to a noisy, two-dimensional classification problem, consider Figure 2.2. Figure 2.2(a) shows a toy data set constructed from two Gaussians centered at (-2,0) and (2,0) with standard deviation σ = 1. There are 400 instances total, 200 drawn from each class distribution. In a real-world setting, these instances may be available but their labels would not. Figure 2.2(b) illustrates the traditional supervised learning approach of randomly selecting instances for labeling. The line shows the linear decision boundary of a logistic regression model (i.e., where the posterior label probability equals 0.5) trained using 30 points. Notice that most of the labeled instances in this training set are far from zero on the horizontal axis, which is where the Bayes optimal decision boundary should be. As a result, this classifier only achieves 70% accuracy on the remaining unlabeled data. Figure 2.2(c) tells a different story, however: the active learner uses uncertainty sampling to focus on the instances closest to its decision boundary, assuming it can adequately explain the data in other parts of the input space. As a result, it avoids requesting labels for redundant or irrelevant instances, and achieves 90% accuracy using the same budget of 30 labeled instances.
Figure 2.2: Uncertainty sampling with a toy data set. (a) 400 instances, evenly sampled from two class Gaussians. Instances are represented as points in a 2D input space. (b) A logistic regression model trained with 30 labeled instances randomly drawn from the problem domain. The line represents the decision boundary of the classifier. (c) A logistic regression model trained with 30 actively queried instances using uncertainty sampling.
Figure 2.3: Generic pool-based uncertainty sampling algorithm.
2.3 MEASURES OF UNCERTAINTY
A general active learning algorithm is presented in Figure 2.3. The key component of the algorithm with respect to designing an active learning system is line 5, and we need a way to measure the uncertainty of candidate queries in the pool. For binary classification, the “closest to the decision boundary (probability ≈ 0.5)” heuristic will suffice. But when we deal with problems and models that have posterior distributions over three or more labels—or even multiple output structures—we need a more general measure of uncertainty or information content. From this point on, let
denote the best instance that the utility measure A would select for querying.Least Confident. A basic uncertainty sampling strategy is to query the instance whose predicted output is the least confident:
where ŷ = argmaxy Pθ(y|x), the prediction with the highest posterior probability under the model θ. In other words, this strategy prefers the instance whose most likely labeling is actually the least likely among the unlabeled instances available for querying. One way to interpret this uncertainty measure is the expected 0/1-loss, i.e., the model’s belief that it has mislabeled x. A drawback of this strategy is that it only considers information about the best prediction. Thus, it effectively throws away information about the rest of the posterior distribution.
Margin. A different active learning strategy is based on the output margin:
where ŷ1 and ŷ2 are the first and second most likely predictions under the model, respectively. Margin sampling addresses a shortcoming of the least confident strategy by incorporating the second best labeling in its assessment. Intuitively, instances with large margins are easy, since the learner has little doubt in how to differentiate between the two most probable alternatives. Instances with small margins are more ambiguous, though, and knowing the true labeling should help the model discriminate more effectively between them. However, for problems with a very large number of alternatives, the margin approach still ignores much of the output distribution.
Entropy. Perhaps the most general (and most common) uncertainty sampling strategy uses entropy (Shannon, 1948), usually denoted by H, as the utility measure:
where y ranges over all possible labelings of x. Entropy is a measure of a variable’s average information content. As such, it is often thought of as an uncertainty or impurity measure in machine learning. One interpretation of this strategy is the expected log-loss, i.e., the expected number of bits it will take to “encode” the model’s posterior label distribution.