Multi-Objective Decision Making. Diederik M. RoijersЧитать онлайн книгу.
A is the action space, i.e., the set of actions the agent can take,
• T : S × A × S → [0,1] is the transition function, giving the probability of a next state given an action and a current state,
• R : S × A × S → ℝ is the reward function, specifying the immediate expected scalar reward corresponding to a transition.
• μ0 is the distribution over initial states s0.
At each timestep t, the agent observes the current state of the environment s ∈ S. When the agent takes an action a ∈ A the environment transitions to a new state s′.
The state in an MDP is Markovian, i.e., the current state s of the environment and the current action of the agent a are a sufficient statistic for predicting the next transition probabilities T(s′|s, a) and the associated expected immediate reward. The agent’s history, i.e., the states and actions that led to the current state, do not provide additional information in that respect.
The agent’s goal in an MDP is to find a policy π that maximizes the return Rt, which is some function of the rewards received from timestep t and onward. In the broadest sense, a policy can condition on everything that is known to the agent.
A state-indepedent value function Vπ specifies the expected return when following π from the initial state:
We further specify first the return, Rt, and then the policy, π.
Typically, the return is additive [Boutilier et al., 1999], i.e., it is a sum of the rewards attained at each timestep. When the returns are additive, Vπ becomes an expectation over this sum. In a finite-horizon setting there is a limited number of timesteps, h, and the sum is typically undiscounted:
In a discounted infinite-horizon setting, the number of timesteps is not limited, but there is a discount factor, 0 ≤ γ < 1, that specifies the relative importance of future rewards with respect to immediate rewards:
Which action at is chosen by the agent at each timestep t depends on its policy π. If the policy is stationary, i.e., it conditions only on the current state, then it can be formalized as π : S × A → [0, 1]: it specifies, for each state and action, the probability of taking that action in that state. We can then specify the state value function of a policy π:
for all t when st = s. The Bellman equation restates this expectation recursively for stationary policies:
Note that the Bellman equation, which forms the heart of most standard solution algorithms such as dynamic programming [Bellman, 1957a] and temporal difference methods [Sutton and Barto, 1998], explicitly relies on the assumption of additive returns. This is important because nonlinear scalarization functions f can interfere with this additivity property, making planning and learning methods that rely on the Bellman equation not directly applicable, as we discuss in Section 3.2.3.
State value functions induce a partial ordering over policies, i.e., π is better than or equal to π′ if and only if its value is greater for all states:
A special case of a stationary policy is a deterministic stationary policy, in which one action is chosen with probability 1 for every state. A deterministic stationary policy can be seen as a mapping from states to actions: π : S → A. For single-objective MDPs, there is always at least one optimal policy π, i.e.,
Theorem 2.7 For any additive infinite-horizon single-objective MDP, there exists a deterministic stationary optimal policy [Boutilier et al., 1999, Howard, 1960].
If more than one optimal policy exists, they share the same value function, known as the optimal value function V*(s) = maxπ Vπ(s). The Bellman optimality equation defines the optimal value function recursively:
Note that, because it maximizes over actions, this equation makes use of the fact that there is an optimal deterministic stationary policy. Because an optimal policy maximizes the value for every state, such a policy is optimal regardless of the initial state distribution μ0. However, the state-independent value (Equation 2.1) can be different for different initial state distributions. Using μ0, the state value function can be translated back into the state-independent value function (Equation 2.1):
2.3.2 MULTI-OBJECTIVE MARKOV DECISION PROCESSES
In many decision problems, such as the social robot in Figure 2.2, it is impossible, undesirable, or infeasible to define a scalar reward function, and we need a vector-valued reward function, leading to an MOMDP.
Definition 2.8 A multi-objective Markov decision process (MOMDP) [Roijers et al., 2013a] is a tuple 〈S, A, T, R〉 where,
• S, A, and T are the same as in an MDP, but,
• R : S × A × S → ℝd is now a d-dimensional reward function, specifying the expected immediate vector-valued reward corresponding to a transition.
MOMDPs have recently been applied to many real-world decision problems, including: water reservoir control [Castelletti et al., 2013, 2008, Giuliani et al., 2015], where policies for releasing water from a dam must be found while balancing multiple uses of the reservoir, including hydroelectric production and flood mitigation; office building environmental control [Kwak et al., 2012], in which energy consumption must be minimized while maximizing the comfort of the building’s occupants; and medical treatment planning [Lizotte et al., 2010, 2012], in which the effectiveness of the treatment must be maximized, while minimizing the severity of the side effects.
In an MOMDP, when an agent executes a policy π, its value, Vπ is vector-valued, as it is an expectation of the sum over vector-valued rewards, i.e.,
in the finite-horizon setting, and,
in the infinite horizon setting.