CHAPTER 2
Multi-Objective Decision Problems
In this chapter, we introduce the concept of a multi-objective decision problem. Then we describe two concrete classes of multi-objective decision problems that we use throughout the book: multi-objective coordination graphs and multi-objective Markov decision processes. However, before introducing the concrete multi-objective decision problems, we first introduce their single-objective counterparts.
2.1 MULTIPLE OBJECTIVES
In this book, we focus on different classes of (cooperative) multi-objective decision problems. Multi-objective decision problems contrast with single-objective decision problems, which are more common in the literature. In their most general form, single-objective decision problems consist of a set of policies and a value function that we wish to maximize:
Definition 2.1 A cooperative single-objective decision problem (SODP) consists of:
• a set of allowed (joint) policies Π,
• a value function that assigns a real-valued scalar, Vπ ∈ ℝ, to each joint policy π ∈ Π, corresponding to the desirability, i.e., the utility, of the policy.
Definition 2.2 In a cooperative multi-objective decision problem (MODP), Π is the same as in
an SODP, but
• there are d ≥ 1 objectives, and
• the value function assigns a value vector, Vπ ∈ ℝd, to each joint policy π ∈ Π, corresponding to the desirability of the policy with respect to each objective.
We denote the value of policy π in the i-th objective, i.e., the i-th component of Vπ, as Vᵢπ.
Both Vπ and Π may have particular forms that affect the way they should be computed, as we discuss in Chapter 3. For example, there may be multiple agents, each with its own action space. In such settings, a solution consists of a joint policy containing a local policy for each agent. We assume that Π is known and that it is in principle possible to compute the value of each (joint) policy. Furthermore, we assume that, if there are multiple agents, these agents are cooperative.
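To make Definitions 2.1 and 2.2 concrete, the following minimal sketch (in Python) represents an MODP explicitly as a set of joint policies Π together with a value function that maps each policy to a d-dimensional value vector; the i-th component of that vector is Vᵢπ. The policy names and numbers are hypothetical and chosen purely for illustration; in realistic MODPs, Π is typically far too large to enumerate and Vπ must be computed from the problem structure.

from typing import Dict, Tuple

Policy = str
ValueVector = Tuple[float, ...]

# A toy MODP with d = 2 objectives (e.g., expected gold and silver in the
# mining example of Figure 1.2). All names and numbers are hypothetical.
V: Dict[Policy, ValueVector] = {
    "pi_1": (8.0, 1.0),
    "pi_2": (5.0, 5.0),
    "pi_3": (1.0, 8.0),
}
Pi = set(V.keys())  # the set of allowed (joint) policies

def value(pi: Policy, i: int) -> float:
    """Return the value of policy pi in the i-th objective (0-indexed)."""
    return V[pi][i]

print(value("pi_2", 1))  # 5.0: the value of pi_2 in the second objective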
Definition 2.3 A multi-agent MODP is cooperative if and only if all agents get the same (team) value, Vπ, for executing a joint policy π ∈ Π, i.e., there are no individual rewards. A single-agent MODP is cooperative by default.
This definition of cooperative is common in the field of decision theory, e.g., in multi-agent MDPs [Boutilier, 1996, Scharpff et al., 2016] and Dec-POMDPs [Oliehoek and Amato, 2016]. However, the term “cooperative” is used differently in cooperative game theory [Chalkiadakis et al., 2011, Igarashi and Roijers, 2017], in which agents form coalitions on the basis of their individual utilities. In this book, we consider only decision problems that are cooperative according to Definition 2.3.
In an SODP, the value function provides a complete ordering on the joint policies, i.e., for each pair of policies π and π′, Vπ must be greater than, equal to, or less than Vπ′.
In the unknown weights and decision support scenarios (Figure 1.1), the parameters w of the scalarization function, or even f itself, are unknown during the planning or learning phase. An algorithm for solving an MODP should therefore return a set of policies that contains an optimal policy for each possible w. Given such a solution set, the user can pick the policy that maximizes her utility in the selection phase. We want the solution set to contain at least one optimal policy for every possible scalarization (in order to guarantee optimality), but we also want it to be as small as possible, in order to make the selection phase as efficient as possible. In Chapter 3, we discuss which solution sets are optimal, and how this follows from different assumptions about the scalarization function f (Definition 1.1) and the set of permitted policies Π. In the rest of this chapter, we introduce two different MODP problem classes.
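As an illustration of the unknown weights scenario, the sketch below assumes, purely for exposition, a linear scalarization f(Vπ, w) = w · Vπ and approximates a solution set by collecting every policy that is optimal for at least one weight vector w on a finite grid. The value vectors and the grid resolution are hypothetical; which solution sets are exactly optimal under which assumptions on f is the topic of Chapter 3.

import numpy as np

# Hypothetical value vectors (d = 2) for a small set of joint policies.
values = {
    "pi_1": np.array([8.0, 1.0]),
    "pi_2": np.array([5.0, 5.0]),
    "pi_3": np.array([1.0, 8.0]),
    "pi_4": np.array([2.0, 2.0]),  # never optimal for any weight w
}

def grid_solution_set(values, num_weights=101):
    """Collect every policy that maximizes the linearly scalarized value
    w . V^pi for at least one weight vector w on a uniform grid over [0, 1]."""
    solution = set()
    for w1 in np.linspace(0.0, 1.0, num_weights):
        w = np.array([w1, 1.0 - w1])
        best = max(values, key=lambda pi: float(w @ values[pi]))
        solution.add(best)
    return solution

print(grid_solution_set(values))  # {'pi_1', 'pi_2', 'pi_3'} (order may vary)

In the selection phase, the user then only needs to compare the (typically few) policies in this set, rather than all of Π.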
2.2 MULTI-OBJECTIVE COORDINATION
The first class of MODPs that we treat is the multi-objective coordination graph (MO-CoG).1 In a MO-CoG, multiple agents need to coordinate their actions in order to be effective. For example, in the mining problem of Figure 1.2, each agent represents a van with workers from a single village. Each of these vans can go to different mines within reasonable traveling distance, leading to a set of different possible actions for each agent. Each mine yields a different expected amount of gold (the first objective) and silver (the second objective). Because mining can be done more efficiently when more workers are present at a mine, it is vitally important that the different agents (i.e., vans) coordinate which mines they go to.
Other examples of problems that can be modeled as a MO-CoG are: risk-sensitive combinatorial auctions, in which we want to maximize the total revenue, while minimizing the risk for the auctioneer [Marinescu, 2011], and maintenance scheduling for offices in which the energy consumption, costs, and overtime for the maintenance staff must all be minimized [Marinescu, 2011].
2.2.1 SINGLE-OBJECTIVE COORDINATION GRAPHS
Before we formally define MO-CoGs, we first define the corresponding single-objective problem, i.e., coordination graphs (CoGs) [Guestrin et al., 2002, Kok and Vlassis, 2004]. In the context of coordination graphs, the notion of reward is typically referred to as payoff in the literature. Payoff is usually denoted u (for utility). We adopt this terminology and notation.
Definition 2.4 A coordination graph (CoG) [Guestrin et al., 2002, Kok and Vlassis, 2004] is a tuple 〈D, A, U〉, where
• D = {1,…, n} is the set of n agents,
• A = A1 × … × An is the joint action space: the Cartesian product of the finite action spaces of all agents. A joint action is thus a tuple containing an action for each agent, a = 〈a1,…, an〉, and
• U = {u1,…, uρ} is the set of ρ scalar local payoff functions, each of which has limited scope, i.e., it depends on only a subset of the agents. The total team payoff is the sum of the local payoffs: u(a) = Σe ue(ae), where ae is the local joint action, i.e., the actions of the agents in the scope of ue.
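The following sketch illustrates Definition 2.4 with a hypothetical three-agent instance loosely based on the mining example (but with a single, scalar payoff): each local payoff function depends only on the agents in its scope, and the team payoff u(a) is the sum of the local payoffs. All payoff numbers are made up for illustration.

from itertools import product

# Hypothetical CoG: three agents (vans), each choosing a mine to work in.
agents = [1, 2, 3]
actions = {1: ["mine_a", "mine_b"],
           2: ["mine_a", "mine_b"],
           3: ["mine_b", "mine_c"]}

# Local payoff functions with limited scope: u1 depends only on agents 1 and 2,
# u2 only on agents 2 and 3 (more workers at the same mine pay off more).
def u1(a1, a2):
    return 3.0 if a1 == a2 else 1.0

def u2(a2, a3):
    return 4.0 if a2 == a3 else 1.5

local_payoffs = [
    ((1, 2), lambda a: u1(a[1], a[2])),  # scope: agents 1 and 2
    ((2, 3), lambda a: u2(a[2], a[3])),  # scope: agents 2 and 3
]

def team_payoff(a):
    """u(a): the sum of the local payoffs, each applied only to the actions
    of the agents in its scope."""
    return sum(f(a) for _, f in local_payoffs)

# Brute-force coordination: enumerate all joint actions and pick the best one.
best = max(
    (dict(zip(agents, choice)) for choice in product(*(actions[i] for i in agents))),
    key=team_payoff,
)
print(best, team_payoff(best))  # {1: 'mine_b', 2: 'mine_b', 3: 'mine_b'} 7.0

The brute-force enumeration above is only feasible because the example is tiny; the number of joint actions grows exponentially in the number of agents, which is why CoG solution methods exploit the graph structure instead.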
In