Active Learning. Burr Settles
Supposing we have a way to quantify the “irregularity” of a fruit’s shape, we can formalize this classification task using a simple function. Let X be the set of data instances, e.g., fruits that have been harvested, where each x ∊ ℝ represents the irregularity measure of a particular fruit. Each data instance has a hidden label y, which in this example comes from the finite set Y = {safe, noxious}. Our classifier, then, is a function mapping h : X → Y, parameterized by a threshold θ:

h(x; θ) = safe if x < θ, and noxious otherwise.
The remaining challenge is to figure out what this threshold really is.
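In code, the classifier h is nothing more than a one-parameter threshold rule. A minimal sketch (the function name and the convention that rounder fruits are safe follow the running example; they are illustrative, not part of any library):

```python
# Threshold classifier h : X -> Y for the alien fruits example.
# x is a fruit's irregularity measure; theta is the threshold parameter.
# By the example's convention, fruits rounder than theta are "safe".
def h(x, theta):
    return "safe" if x < theta else "noxious"
```

Everything that follows is about choosing a good value of theta.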
One way to pick the threshold is with supervised learning techniques, where we estimate θ from a set of labeled data instances. Here, 〈x, y〉 denotes a fruit x which has been tested and assigned a label y, e.g., by having someone eat it and then observing whether or not he or she becomes ill. A simple learning algorithm in this case is to harvest a large number of fruits, arrange them in order from round to irregular in shape, and have them all tested. The point at which the fruits switch from being safe to noxious is the best threshold θ* (assuming for the moment that everyone responds to a fruit in the same way). Figure 1.2 illustrates this method for a handful of labeled instances.
Figure 1.2: Supervised learning for the alien fruits example. Given a set of 〈x, y〉 instance-label pairs, we want to choose the threshold θ* that classifies them most accurately.
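The supervised procedure above can be sketched directly: sort the labeled fruits and try a cut between every adjacent pair, keeping the cut that classifies the sample most accurately. This is an illustrative sketch, assuming the labels "safe"/"noxious" and the safe-below-threshold convention from the text:

```python
# Pick the threshold theta* that classifies a labeled sample of
# (irregularity, label) pairs most accurately, by evaluating a
# candidate cut between every adjacent pair of sorted instances.
def best_threshold(labeled):
    data = sorted(labeled)
    xs = [x for x, _ in data]
    # Candidate cuts: below all points, midway between neighbors, above all.
    candidates = [xs[0] - 1.0]
    candidates += [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
    candidates += [xs[-1] + 1.0]

    def accuracy(theta):
        correct = sum(1 for x, y in data
                      if ("safe" if x < theta else "noxious") == y)
        return correct / len(data)

    return max(candidates, key=accuracy)
```

Note that this requires every fruit in the sample to have been tested, which is exactly the cost the next section tries to avoid.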
According to the PAC learning framework (Valiant, 1984), if the underlying data distribution can be perfectly classified by some hypothesis h in the hypothesis class H (in this case, the set of all values for θ), then it is enough to test O(1/∊) randomly selected instances, where ∊ is the maximum desired error rate. This means that if we want to achieve 99% accuracy in our “fruit safety” classifier, we might need to harvest and test on the order of hundreds of fruits using this approach. That is a lot of fruits to test, though, and many of your family and friends might get sick in the process!
Can we do better? While we certainly want a threshold that classifies these fruits as accurately as possible, we also want to discover this threshold as quickly and as economically as possible, by requiring fewer tests while achieving the same results (ideally). Note some key properties of this problem:
• the fruits themselves are plentiful, easily harvested, and easily measured;
• testing for a fruit’s safety is what mainly incurs a cost: the fruit must be consumed, and the person who eats it risks falling ill.
Figure 1.3: A binary search through the set of ordered, untested alien fruits. By only testing this subset of fruits, we can exponentially reduce costs while achieving the same result. The labels shown in light blue can be inferred, and therefore do not need to be tested.
As it turns out, we can do better by performing a directed search through the set of fruit shapes. Taking the two observations above into account, we can augment our supervised learning strategy in the following way. First, let us gather an arbitrary number of unlabeled instances for free (or very inexpensively) by simply harvesting a lot of fruits without testing them. Next, we can arrange these fruits in a sequence from round to irregular. Our goal is similar to the previous one: to discover where in this ordered series of fruits the transition from safe to noxious occurs, but this time by conducting as few tests as possible. If we execute a simple binary search through this ordered set of fruits, we only need to test ⌈log2 U⌉ items, instead of testing them all as we did before.
The process is illustrated in Figure 1.3. Given a set of unlabeled instances, this sequential algorithm would first test x = 5, then x = 7, and finally x = 6 before arriving at the exact same parameter value for θ*, alleviating the need to test the other six fruits (two of which happen to be harmful). This algorithmic speedup means that a classifier with error ∊ or less can be found with a mere O(log2 1/∊) tests, since all the other labels can be inferred. To put this in perspective, assume that 100 fruits were needed to obtain 99% accuracy in the earlier supervised setting. Then we would still need to harvest 100 fruits for our new binary search algorithm, but we would only need to test 6 or 7 of them to get the same guarantee. This is an exponential reduction in the number of tests necessary, which drastically reduces the number of people whose health is at risk, and helps the colony to make use of both safe and noxious fruits as quickly as possible.
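The binary search can be sketched as follows. Here `test` stands in for the costly oracle (a person eating the fruit); the function name and the safe-below-threshold convention are assumptions carried over from the running example, and the exact sequence of queries depends on the tie-breaking of the midpoint, so it need not match Figure 1.3 fruit for fruit:

```python
# Binary search for the safe->noxious transition in a batch of fruits
# sorted from round to irregular, querying the oracle as few times
# as possible. Assumes all fruits below theta* are safe, the rest noxious.
def find_threshold(xs, test):
    lo, hi = 0, len(xs)      # invariant: xs[:lo] safe, xs[hi:] noxious
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if test(xs[mid]) == "safe":
            lo = mid + 1     # transition lies to the right of mid
        else:
            hi = mid         # transition is at mid or to its left
    # theta* separates the last safe fruit from the first noxious one.
    theta = xs[lo] if lo < len(xs) else xs[-1] + 1.0
    return theta, queries
```

Only ⌈log2 U⌉ of the U harvested fruits are actually eaten; every other label is inferred for free by comparing the fruit’s irregularity with the returned threshold.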
1.2 ACTIVE LEARNING
The alien fruits example is a simple illustration of active learning. Active learning is a subfield of artificial intelligence and machine learning: the study of computer systems that improve with experience and training. Traditional “passive” learning systems induce a hypothesis to explain whatever training data happens to be available (e.g., a collection of labeled instances). By contrast, the hallmark of an active learning system is that it eagerly develops and tests new hypotheses as part of a continuing, interactive learning process. Another way to think about it is that active learners develop a “line of inquiry,” much in the way a scientist would design a series of experiments to help him or her draw conclusions as efficiently as possible. For example, the binary search algorithm in Figure 1.3 selects the next fruit to test—or query—after it has obtained an answer for the previous query from some oracle (or labeling source). For this reason, active learning is sometimes called “query learning” in the computational learning theory literature, and is closely related to work in “optimal experimental design” or “sequential design” in statistics.
Of course, the alien fruits example is a bit contrived and overly simplistic. Yet it illustrates some of the basic properties of many real-world problems, and shows how much can be gained from having the learner ask questions or be more involved in its own training process. In today’s information-drenched society, unlabeled data are often abundant, but the labels required to do supervised learning from these instances are difficult, time-consuming, or expensive to obtain. Consider a few examples:
• Classification and filtering. Learning to classify documents (e.g., articles or web pages) or any other kind of media (e.g., image, audio, and video files) usually requires people to annotate each item with particular labels, such as relevant or not-relevant. Unlabeled instances abound from electronic resources like the Internet, but annotating thousands of these instances can be tedious and even redundant.
• Speech recognition. Accurate labeling of speech utterances is extremely time consuming and requires trained linguists. While unannotated audio data is easily obtained from recording devices,