Multi-Objective Decision Making. Diederik M. Roijers. Synthesis Lectures on Artificial Intelligence and Machine Learning


is not limited, but there is a discount factor, 0 ≤ γ < 1, that specifies the relative importance of future rewards with respect to immediate rewards:

R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}.
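The discounted sum of rewards described above can be illustrated on a finite prefix of a reward sequence. This is a minimal sketch: the function name `discounted_return` and the list-of-floats representation are assumptions made for illustration, not notation from the book.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite list of rewards.

    `rewards[k]` plays the role of the reward received k + 1 steps after
    the current timestep (an assumed, illustrative representation).
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    total = 0.0
    # Work backwards so each step is a single multiply-add:
    # total_k = rewards[k] + gamma * total_{k+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

With γ = 0.5 and three rewards of 1, the sum is 1 + 0.5 + 0.25 = 1.75; with γ = 0 only the immediate reward counts.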

Which action a_t is chosen by the agent at each timestep t depends on its policy π. If the policy is stationary, i.e., it conditions only on the current state, then it can be formalized as π : S × A → [0, 1]: it specifies, for each state and action, the probability of taking that action in that state. We can then specify the state value function of a policy π:

V^π(s) = E[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | π, s_t = s ],

for all t when s_t = s. The Bellman equation restates this expectation recursively for stationary policies:

V^π(s) = Σ_{a ∈ A} π(s, a) Σ_{s′ ∈ S} T(s, a, s′) [ R(s, a, s′) + γ V^π(s′) ].

Note that the Bellman equation, which forms the heart of most standard solution algorithms such as dynamic programming [Bellman, 1957a] and temporal difference methods [Sutton and Barto, 1998], explicitly relies on the assumption of additive returns. This is important because nonlinear scalarization functions f can interfere with this additivity property, making planning and learning methods that rely on the Bellman equation not directly applicable, as we discuss in Section 3.2.3.
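The recursive form of the Bellman equation suggests a simple fixed-point iteration for evaluating a given stationary policy. The sketch below assumes a tabular MDP stored as NumPy arrays, with `T[s, a, s2]` the transition probability, `R[s, a, s2]` the expected immediate reward, and `pi[s, a]` the action probabilities; the function name and the two-state example are hypothetical, chosen only for illustration.

```python
import numpy as np


def evaluate_policy(T, R, pi, gamma, tol=1e-10, max_iters=100_000):
    """Iterate the Bellman equation for a fixed stationary policy.

    T, R: arrays of shape (n_states, n_actions, n_states) (assumed layout).
    pi:   array of shape (n_states, n_actions), each row summing to 1.
    """
    # Expected immediate reward per (s, a): sum over s' of T * R.
    r_sa = np.sum(T * R, axis=2)
    V = np.zeros(T.shape[0])
    for _ in range(max_iters):
        # Q[s, a] = r_sa[s, a] + gamma * sum_s' T[s, a, s'] * V[s']
        Q = r_sa + gamma * (T @ V)
        # Average over actions according to the policy.
        V_new = np.sum(pi * Q, axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V


# Hypothetical two-state, two-action MDP used only as an example.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = 1.0  # state 0, action 0: stay in state 0
T[0, 1, 1] = 1.0  # state 0, action 1: move to state 1
T[1, :, 1] = 1.0  # state 1 is absorbing under both actions
R = np.zeros((2, 2, 2))
R[0, 0, 0] = 1.0  # staying in state 0 pays 1
R[1, :, 1] = 3.0  # every step in state 1 pays 3
pi = np.array([[1.0, 0.0], [1.0, 0.0]])  # always take action 0

print(evaluate_policy(T, R, pi, gamma=0.5))
```

For this toy MDP the fixed point can be checked by hand: V^π(0) = 1 / (1 − 0.5) = 2 and V^π(1) = 3 / (1 − 0.5) = 6. The iteration relies on the return being an additive (discounted) sum, which is exactly the assumption the paragraph above highlights.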

State value functions induce a partial ordering over policies, i.e., π is better than or equal to π′ if and only if its value is at least as great in every state:

π ⪰ π′ ⇔ ∀s ∈ S : V^π(s) ≥ V^{π′}(s).
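Because the comparison must hold in every state, two policies can be incomparable: each may be better in some state. A minimal sketch of the comparison, assuming each value function is given as an array indexed by state (the function name `weakly_better` and the example values are hypothetical):

```python
import numpy as np


def weakly_better(V_a, V_b):
    """True if V_a[s] >= V_b[s] in every state s."""
    return bool(np.all(np.asarray(V_a) >= np.asarray(V_b)))


V_a = [2.0, 6.0]
V_b = [3.0, 6.0]
V_c = [4.0, 1.0]

print(weakly_better(V_b, V_a))  # b is at least as good as a in every state
print(weakly_better(V_a, V_c), weakly_better(V_c, V_a))  # neither dominates
```

Here `V_b` is at least as good as `V_a` everywhere, while `V_a` and `V_c` are incomparable, which is why the ordering is only partial.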

A special case of a stationary policy is a deterministic stationary policy, in which one action is chosen with probability 1 for every state. A deterministic stationary policy can be seen as a mapping from states to actions: π : S → A. For single-objective MDPs, there is always at least one optimal policy π, i.e., one with V^π(s) ≥ V^{π′}(s) for every policy π′ and every state s, that is stationary and deterministic.

Theorem 2.7 For any additive infinite-horizon single-objective MDP, there exists a deterministic stationary optimal policy [Boutilier et al., 1999, Howard, 1960].

If more than one optimal policy exists, they share the same value function, known as the optimal value function V*(s) = max_π V^π(s). The Bellman optimality equation defines the optimal value function recursively:

V*(s) = max_{a ∈ A} Σ_{s′ ∈ S} T(s, a, s′) [ R(s, a, s′) + γ V*(s′) ].

Note that, because it maximizes over actions, this equation makes use of the fact that there is an optimal deterministic stationary policy. Because an optimal policy maximizes the value for every state, such a policy is optimal regardless of the initial state distribution μ₀. However, the state-independent value (Equation 2.1) can be different for different initial state distributions. Using μ₀, the state value function can be translated back into the state-independent value function (Equation 2.1):

V^π = Σ_{s ∈ S} μ₀(s) V^π(s).
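The Bellman optimality equation and the μ₀-weighted value can be sketched with standard value iteration. This is one common way to solve a tabular MDP, not necessarily the procedure the book uses; the array layout (`T[s, a, s2]`, `R[s, a, s2]`), the function names, and the two-state example are assumptions for illustration.

```python
import numpy as np


def value_iteration(T, R, gamma, tol=1e-10, max_iters=100_000):
    """Return (V_star, greedy_policy) for a tabular MDP.

    T, R: arrays of shape (n_states, n_actions, n_states) (assumed layout).
    """
    r_sa = np.sum(T * R, axis=2)  # expected immediate reward per (s, a)
    V = np.zeros(T.shape[0])
    for _ in range(max_iters):
        Q = r_sa + gamma * (T @ V)
        V_new = Q.max(axis=1)  # maximize over actions
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    # A deterministic stationary policy: one greedy action per state.
    greedy = (r_sa + gamma * (T @ V)).argmax(axis=1)
    return V, greedy


def state_independent_value(V, mu0):
    """Weight the state values by the initial state distribution mu0."""
    return float(np.dot(mu0, V))


# Hypothetical two-state, two-action MDP used only as an example.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = 1.0  # state 0, action 0: stay in state 0
T[0, 1, 1] = 1.0  # state 0, action 1: move to state 1
T[1, :, 1] = 1.0  # state 1 is absorbing under both actions
R = np.zeros((2, 2, 2))
R[0, 0, 0] = 1.0  # staying in state 0 pays 1
R[1, :, 1] = 3.0  # every step in state 1 pays 3

V_star, policy = value_iteration(T, R, gamma=0.5)
print(V_star, policy)
print(state_independent_value(V_star, np.array([1.0, 0.0])))
print(state_independent_value(V_star, np.array([0.5, 0.5])))
```

In this example the greedy policy moves from state 0 to state 1, giving V*(0) = 0.5 · 6 = 3 and V*(1) = 6. The same policy is optimal for both initial distributions, but the state-independent value differs (3 versus 4.5), matching the remark above.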

2.3.2 MULTI-OBJECTIVE MARKOV DECISION PROCESSES

In many decision problems, such as the social robot in Figure 2.2, it is impossible, undesirable, or infeasible to define a scalar reward function, and we need a vector-valued reward function, leading to an MOMDP.

Definition 2.8 A multi-objective Markov decision process (MOMDP) [Roijers et al., 2013a] is a tuple 〈S, A, T, R〉 where,

• S, A, and T are the same as in an MDP, but,

• R : S × A × S → ℝ^d is now a d-dimensional reward function, specifying the expected immediate vector-valued reward corresponding to a transition.
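As a data structure, the only change from a single-objective MDP is that the reward array gains a trailing axis of length d. A minimal sketch, assuming finite state and action sets indexed by integers; the class name `MOMDP`, its field layout, and the validation checks are illustrative choices rather than anything prescribed by the definition.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True, eq=False)
class MOMDP:
    """Tabular multi-objective MDP (assumed array-based representation)."""

    T: np.ndarray  # shape (n_states, n_actions, n_states)
    R: np.ndarray  # shape (n_states, n_actions, n_states, d)

    def __post_init__(self):
        if self.T.ndim != 3 or self.T.shape[0] != self.T.shape[2]:
            raise ValueError("T must have shape (S, A, S)")
        if self.R.ndim != 4 or self.R.shape[:3] != self.T.shape:
            raise ValueError("R must have shape (S, A, S, d)")
        # Each (s, a) row of T must be a probability distribution over s'.
        if np.any(self.T < 0) or not np.allclose(self.T.sum(axis=2), 1.0):
            raise ValueError("T[s, a, :] must be a probability distribution")

    @property
    def n_states(self):
        return self.T.shape[0]

    @property
    def n_actions(self):
        return self.T.shape[1]

    @property
    def n_objectives(self):
        return self.R.shape[3]
```

Setting d = 1 recovers an ordinary scalar-reward MDP with a redundant last axis.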

MOMDPs have recently been applied to many real-world decision problems, including: water reservoir control [Castelletti et al., 2013, 2008, Giuliani et al., 2015], where policies for releasing water from a dam must be found while balancing multiple uses of the reservoir, including hydroelectric production and flood mitigation; office building environmental control [Kwak et al., 2012], in which energy consumption must be minimized while maximizing the comfort of the building’s occupants; and medical treatment planning [Lizotte et al., 2010, 2012], in which the effectiveness of the treatment must be maximized, while minimizing the severity of the side effects.

In an MOMDP, when an agent executes a policy π, its value V^π is vector-valued, as it is an expectation of the sum over vector-valued rewards, i.e.,

V^π = E[ Σ_{k=0}^{h} r_{k+1} | π, μ₀ ]

in the finite-horizon setting, and

V^π = E[ Σ_{k=0}^{∞} γ^k r_{k+1} | π, μ₀ ]

in the infinite-horizon setting.
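For the infinite-horizon case, the vector-valued value of a fixed stationary policy can be computed by running the usual policy-evaluation iteration on every objective at once. The sketch below assumes a tabular MOMDP with `T[s, a, s2]` and a reward array `R[s, a, s2, :]` of length-d vectors; the function name and the two-objective example are hypothetical.

```python
import numpy as np


def evaluate_policy_mo(T, R, pi, gamma, mu0, tol=1e-10, max_iters=100_000):
    """Return the d-dimensional value of policy pi under initial distribution mu0.

    T:  (S, A, S) transition probabilities (assumed layout).
    R:  (S, A, S, d) vector-valued expected rewards.
    pi: (S, A) action probabilities; mu0: (S,) initial state distribution.
    """
    # r_sa[s, a, :] = sum over s' of T[s, a, s'] * R[s, a, s', :]
    r_sa = np.einsum("sap,sapd->sad", T, R)
    V = np.zeros((T.shape[0], R.shape[-1]))
    for _ in range(max_iters):
        # (S, A, S) @ (S, d) -> (S, A, d): expected next-state value vectors.
        Q = r_sa + gamma * (T @ V)
        V_new = np.einsum("sa,sad->sd", pi, Q)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    return mu0 @ V


# Hypothetical two-state, two-action, two-objective MOMDP.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = 1.0  # state 0, action 0: stay in state 0
T[0, 1, 1] = 1.0  # state 0, action 1: move to state 1
T[1, :, 1] = 1.0  # state 1 is absorbing under both actions
R = np.zeros((2, 2, 2, 2))
R[0, 0, 0] = [1.0, 1.0]  # staying in state 0 pays 1 in both objectives
R[1, :, 1] = [3.0, 0.0]  # state 1 pays 3 in the first objective only
mu0 = np.array([1.0, 0.0])

stay = np.array([[1.0, 0.0], [1.0, 0.0]])
move = np.array([[0.0, 1.0], [1.0, 0.0]])
print(evaluate_policy_mo(T, R, stay, 0.5, mu0))
print(evaluate_policy_mo(T, R, move, 0.5, mu0))
```

With γ = 0.5, staying yields the value vector (2, 2) and moving yields (3, 0), so neither policy is better in both objectives. This works objective by objective only because the return is additive; no scalarization is applied here.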
