Multi-Objective Decision Making. Diederik M. Roijers
Чтение книги онлайн.
Читать онлайн книгу Multi-Objective Decision Making - Diederik M. Roijers страница 7
Finally, we present one more scenario, which we call the known weights scenario, that requires explicit multi-objective methods. In this scenario, we assume that w is known at the time of planning and thus scalarization is possible. However, it may be undesirable because of the difficulty of the second step in the conversion. In particular, if f is nonlinear, then the resulting single-objective problem may be much more complex than the original multi-objective problem. As a result, finding the optimal policy may be intractable while the original multi-objective problem is tractable. For example, in multi-objective Markov decision processes (MOMDPs2), which we formally define in Chapter 2, nonlinear scalarization destroys the additivity property on which single-objective solution methods rely (see Section 3.2.3).
Figure 1.1c depicts the known weights scenario. In contrast to the other scenarios, in the known weights scenario, the multi-objective method only produces one policy, which is then executed, i.e., there is no separate selection phase.
The scenarios we have presented here require explicit multi-objective methods because a priori scalarization of the multi-objective decision problems, and subsequent solving with standard single-objective methods, does not apply. In this book, we focus on the two multi-policy scenarios, i.e., the unknown weights and decision support scenarios, in which the goal of a multi-objective planning method is to produce a coverage set. From this coverage set, the policy that maximizes user utility is selected in the selection phase.
Of course, computing a coverage is in general more difficult than finding a single policy, and thus multi-objective methods are typically more expensive than their single-objective counterparts. It is natural then to wonder whether the complications of a multi-objective approach are merited. After all, many single-objective problems are already intractable. In this book, we take a pragmatic perspective: multi-objective methods are not a panacea and are not always the best option, even if the problem is naturally modeled with multiple objectives. If the scalarization weights are known (or can be reasonably estimated) before planning begins, and a priori scalarization does not yield an intractable problem, then converting a multi-objective problem to a single-objective one may be the best option. However, in any of the many cases where such a conversion is not possible or practical, then the multi-objective methods discussed in this book may prove indispensable.
1.2 UTILITY-BASED APPROACH
The goal of solving all—including multi-objective—decision problems is to maximize user utility. However, in the unknown weights and decision support scenarios, we cannot optimize user utility directly because, at the time when planning or learning takes place, the parameters w of the scalarization function f, which maps the multi-objective values to a scalar utility, are unknown. Therefore, we must compute a coverage set: a set of policies such that, for every possible scalarization, a maximizing policy is in the set (see Definition 3.5).
In this book, we argue that which policies should be included in the coverage set should be derived from what we know about f. We call this the utility-based approach, in contrast to the axiomatic approach that axiomatically assumes the coverage set is the Pareto front, which we define formally in Chapter 3. In short, the Pareto front is the set of all policies for which there is no other policy that has at least equal value in all objectives and has a higher value in at least one objective. Indeed, the Pareto front contains at least one optimal policy for most, if not all, scalarization functions that occur in practice. However, we argue that, while the Pareto front is sufficient, it is often not necessary. In fact, we show in Chapter 3 that the full Pareto front is required only in the special case where the scalarization function is nonlinear and policies must be deterministic. A utility-based approach thus often results in a much smaller coverage set, which is typically less expensive to compute and easier for the user to examine.
Another advantage of the utility-based approach is that it is possible to derive how much utility is maximally lost when an exact coverage set cannot be computed [Zintgraf et al., 2015]. Such bounds are often crucial for a meaningful interpretation of the quality of approximate methods for decision-theoretic planning or learning, especially when comparing algorithms [Oliehoek et al., 2015]. Furthermore, the bounds provide insight into the quality and reliability of the selected final policy.
1We provide a formal definition of the term coverage set in Chapter 3, Definition 3.5.
2This abbreviation is also used for mixed-observability MDPs [Ong et al., 2010], which we do not consider here; we use the abbreviation MOMDPs solely for multi-objective MDPs.
CHAPTER 2
Multi-Objective Decision Problems
In this chapter, we introduce the concept of a multi-objective decision problem. Then we describe two concrete classes of multi-objective decision problems that we use throughout the book: multi-objective coordination graphs and multi-objective Markov decision processes. However, before introducing the concrete multi-objective decision problems, we first introduce their single-objective counterparts.
2.1 MULTIPLE OBJECTIVES
In this book, we focus on different (cooperative) multi-objective decision problems. Multi-objective decision problems contrast single-objective decision problems, which are more common in the literature. In their most general form, single-objective decision problems can be defined as a set of policies and a value function that we wish to maximize:
Definition 2.1 A cooperative single-objective decision problem (SODP), consists of:
• a set of allowed (joint) policies Π,
• a value function that assigns a real numbered value, Vπ ∈ ℝ, to each joint policy π ∈ Π, corresponding to the desirability, i.e., the utility, of the policy.
Definition 2.2 In a cooperative multi-objective decision problem (MODP), Π is the same as in
an SODP, but
• there are d ≥ 1 objectives, and
• the value function assigns a value vector, Vπ ∈ ℝd, to each joint policy π ∈ Π, corresponding to the desirability of the policy with respect to each objective.
We denote the value of policy π in the i-th objective as
Both Vπ and Π may have particular forms that affect the way they should be computed, as we discuss in Chapter 3. For example, there may be multiple agents, each with its own action space. In such settings, a solution consists of a joint policy containing a local policy for each agent. We assume that Π is known and that it is in principle possible to compute the value of each (joint) policy. Furthermore, we assume that, if there are multiple agents, these agents are cooperative.