
Multiagent rational decision making: searching and learning for “good” strategies


Presentation Transcript


  1. Enrique Munoz de Cote. Multiagent rational decision making: searching and learning for “good” strategies. What is “good” multiagent learning?

  2. The prescriptive non-cooperative agenda [Shoham et al. 07] • We are interested in problems where an agent needs to interact in open environments made up of other agents. What's a “good” strategy in this situation? Can the monkey find a “good” strategy, or does it need to learn? • View: a single-agent perspective of the multiagent problem. • Environment dependent.

  3. Multiagent Reinforcement Learning Framework (a 2x2 view)
  • known world (solving) / single agent: MDPs, decision theory, planning
  • known world (solving) / multiple agents: matrix games
  • unknown world (learning) / single agent: MDPs, reinforcement learning
  • unknown world (learning) / multiple agents: stochastic games

  4. Game theory and multiagent learning: brief backgrounds • Game theory • Stochastic games • Solution concepts • Multiagent learning • Solution concepts • Relation to game theory

  5. backgrounds→game theory. Stochastic games (SG) [Figure: grid world with agents A, B and a goal $] • SGs are good examples of how agents' behaviours depend on each other. • Strategies represent the way agents behave. • Strategies might change as a function of other agents' strategies. Game theory mathematically captures behaviour in strategic situations.

  6. backgrounds→game theory. A Computational Example: SG version of chicken [Hu & Wellman, 03] [Figure: grid world with agents A, B and a goal $] • actions: U, D, R, L, X • coin flip on collision • semiwalls (crossed with 50% probability) • collision = -5 • step cost = -1 • goal = +100 • discount factor = 0.95 • both agents can reach the goal.
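The reward parameters listed above can be collected into a small sketch. This is a minimal, hypothetical encoding: only the constants come from the slide; the grid layout and the rule for combining rewards in one step are assumptions.

```python
# Reward parameters of the SG version of chicken, as listed on the slide.
# How the rewards combine in a single joint step is an illustrative assumption.
COLLISION_REWARD = -5
STEP_COST = -1
GOAL_REWARD = 100
DISCOUNT = 0.95


def joint_step_reward(collided: bool, a_reached_goal: bool, b_reached_goal: bool):
    """Return the (agent A, agent B) reward for one joint step."""
    r_a = r_b = STEP_COST
    if collided:  # on a collision, a coin flip decides who passes; both pay the penalty
        r_a += COLLISION_REWARD
        r_b += COLLISION_REWARD
    if a_reached_goal:
        r_a += GOAL_REWARD
    if b_reached_goal:
        r_b += GOAL_REWARD
    return r_a, r_b
```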

  7. backgrounds→game theory. Strategies on the SG of chicken [Figure: candidate strategy pairs on the grid] Average expected reward: • (88.3, 43.7) • (43.7, 88.3) • (66, 66) • (43.7, 43.7) • (38.7, 38.7) • (83.6, 83.6)

  8. backgrounds→game theory. Equilibrium values. Average total reward at equilibrium: • Nash: (88.3, 43.7) very imbalanced, inefficient; (43.7, 88.3) very imbalanced, inefficient; (53.6, 53.6) ½ mix, still inefficient • Correlated: ([43.7, 88.3], [43.7, 88.3]) • Minimax: (43.7, 43.7) • Friend: (38.7, 38.7). Computationally difficult to find in general.

  9. Repeated Games • What if agents are allowed to play multiple times? • Strategies: • can be a function of history • can be randomized • A Nash equilibrium still exists.

  10. Computing strategies for repeated SGs • Complete information: solve • Exact or approximate solutions • Incomplete information: learn • The environment (as perceived by the agent) is not Markovian • Convergence is not guaranteed • Exceptions: zero-sum and team games • Unwanted cycles and unpredictable behaviour appear. There are algorithms for solving and learning that use the same successive approximations to the Bellman equations to derive solution policies.

  11. Learning equilibrium strategies in SGs • Multiagent RL updates are based on the Bellman equations (just as in RL): • A value iteration (VI) algorithm solves for the optimal Q function. • Finding a solution via VI depends on the operator Eq{·}. How can multiagent RL learn any of those strategies?
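A hedged reconstruction of the backup the slide refers to, following the standard multiagent value-iteration form; the exact notation on the slide is not recoverable, so the symbols below are assumed.

```latex
% Standard multiagent value-iteration backup (notation assumed, not from the slide).
% Q_i(s,\vec{a}) is agent i's value for joint action \vec{a} in state s, and
% Eq\{\cdot\} is the chosen equilibrium-value operator (Nash, correlated, minimax,
% friend, ...), which replaces the single-agent max over actions:
Q_i(s,\vec{a}) \;\leftarrow\; R_i(s,\vec{a})
  \;+\; \gamma \sum_{s'} T(s' \mid s,\vec{a})\;\mathrm{Eq}\{Q_1(s',\cdot),\dots,Q_n(s',\cdot)\}
```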

  12. Defining optimality. What's A's optimal strategy? • the safest • the one that minimizes the opponent's reward • the one that maximizes the opponent's reward • the socially stable one. In an open environment, an optimal strategy is arguable and may be defined by several criteria.

  13. Defining optimality: our criteria • Optimality: should obtain close to maximum utility against other best-response algorithms. • Security: should guarantee a minimum lower-bound utility. • Simplicity: should be intuitive to understand and implement. • Adaptivity: should learn how to behave optimally, and remain optimal (even if the environment changes).

  14. backgrounds→multiagent RL. Observation: Reinforcement Learning updates • Q-learning converges to a BR strategy in MDPs. Definition [best response]: a best response function BR(·) returns the set of all strategies that are optimal against the environment's joint strategy. Observation 1: a learner's BR is optimal against fixed strategies. Observation 2: a learner's BR can be modified by a change in the environment's fixed strategy. Example environment: only agents with fixed strategies.
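A hedged formal restatement of the definition above; the symbols are my notation, not the slide's.

```latex
% Best-response set (notation assumed): against a fixed environment joint
% strategy \pi_{-i}, agent i's best responses are the strategies that
% maximize its expected utility U_i:
BR(\pi_{-i}) \;=\; \arg\max_{\pi_i}\; U_i(\pi_i, \pi_{-i})
```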

  15. Social Rewards. Joint work with: Monica Babes and Michael L. Littman • Shaping rewards and intrinsic motivations • Leader and follower strategies • Open questions

  16. Social Rewards→motivations Social rewards: hints from the brain • We’re smart, but evolution doesn’t trust us to plan all that far ahead. • Evolution programs us to want things likely to bring about what we need: • taste -> nutrition • pleasure -> procreation • eye contact -> care • generosity -> cooperation

  17. Social Rewards→motivations. Is cooperation “pleasurable”? • An fMRI study during repeated prisoner's dilemma showed that humans perceive “internal rewards” (activity in the brain's reward center): positive for mutual cooperation, negative for defection.

  18. Social Rewards→snapshot. Social rewards: their telescoping effect • Objective: change the behavior of the learner by influencing its early experience. [Diagram: social rewards positioned in relation to shaping rewards [Ng et al., 99] and intrinsic motivation [Singh et al., 04]]

  19. Social Rewards→snapshot. Social rewards: guiding learners to better equilibria • Objective: change the behavior of the learner by influencing its early experience. [Diagram: social rewards positioned in relation to shaping rewards [Ng et al., 99] and intrinsic motivation [Singh et al., 04]]

  20. Social Rewards→introduction. Leader and follower reasoning [Littman and Stone, 01]. Leaders: • A leader strategy is able to guide a best-response learner. • Assumption: the opponent will adapt to its decisions. Followers: • A best-response learner is a follower. • Assumption: its behaviour doesn't hurt anybody. In the example, A is a leader and B is a follower.

  21. Social Rewards→introduction. Leader strategies. Matrix game of chicken (rows: agent A, columns: agent B; entries are R_A, R_B):
               wall       center
    wall       0, 0       -1, 1
    center     1, -1      -10, -10
  • Assumption: the opponent is playing a best response.
  • Follower view: BR_B(wall) = center, giving R_A(wall, center) = -1.
  • Leader view: BR_B(center) = wall, giving R_A(center, wall) = 1.
  Leader fixed strategies.
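The follower/leader reasoning above can be checked mechanically. A minimal Python sketch: the payoffs are the slide's, the dictionary encoding and helper name are mine.

```python
# Best-response reasoning on the chicken matrix from the slide.
# Rows are A's action, columns are B's action; entries are (R_A, R_B).
PAYOFFS = {
    ("wall",   "wall"):   (0, 0),
    ("wall",   "center"): (-1, 1),
    ("center", "wall"):   (1, -1),
    ("center", "center"): (-10, -10),
}
ACTIONS = ("wall", "center")


def best_response_B(a_action: str) -> str:
    """B's best response to a fixed action by A."""
    return max(ACTIONS, key=lambda b: PAYOFFS[(a_action, b)][1])


# Leader reasoning: A evaluates each fixed strategy assuming B best-responds,
# then commits to the one that gives A the highest payoff.
for a in ACTIONS:
    b = best_response_B(a)
    print(a, "->", b, "R_A =", PAYOFFS[(a, b)][0])
# Output shows BR_B(wall) = center with R_A = -1, and BR_B(center) = wall
# with R_A = 1, so the leader commits to "center".
```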

  22. Social Rewards→introduction. Leader mutual advantage strategies. [Figure: the SG version of the prisoner's dilemma, with the one-shot Nash and the mutual-advantage Nash of the repeated game marked [Munoz de Cote and Littman, 2008]] Mutual advantage Nash in the repeated game: • Easy-to-state way: compute the convex hull. • Easy-to-compute way: compute attack and defence strategies; compute a mutual advantage strategy; use the attack strategy as a threat against deviations.

  23. Social Rewards→methodology. How can a learner also be a leader? • We influence the best-response learner's early experience with special shaping rewards called “social rewards”. • The learner starts as a leader. • If the opponent is not a BR follower, the social shaping is washed away.

  24. Social Rewards→methodology. Shaping Based on Potentials • Idea: each state is assigned a potential Φ(s) [Ng et al., 1999]. • On each transition, the reward is augmented with the difference in potential, F(s, s') = γΦ(s') − Φ(s).

  25. Social Rewards→algorithm. The Q+shaping algorithm • Compute attack and defence strategies. • Compute the mutual advantage strategy: • for repeated matrix games use the [Littman and Stone, 2003] algorithm • for repeated stochastic games use the [Munoz de Cote and Littman, 2008] algorithm. • Compute the state values (potentials) for the mutual advantage strategy. • Initialize the Q-table with the potential-based function F(s, s'). • Theorem [Wiewiora, 03]: shaping based on potentials has the same effect as initializing the Q function with the potential values. • Using the attack strategy as a threat against deviations teaches BR learners better mutual advantage strategies. Q+shaping's main objective is to lead or follow, as appropriate.
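A rough sketch of how the initialization step could be wired up. The helper functions (compute_mutual_advantage, state_values) are hypothetical placeholders for the cited algorithms, not the authors' implementation; only the overall flow follows the slide.

```python
# Sketch of the Q+shaping initialization described above. `game.states` and
# `game.actions` are assumed attributes of a hypothetical game object.
def q_plus_shaping_init(game, gamma, compute_mutual_advantage, state_values):
    # Steps 1-2: attack/defence and mutual-advantage strategies come from the
    # cited algorithms [Littman & Stone, 2003; Munoz de Cote & Littman, 2008].
    mutual_strategy = compute_mutual_advantage(game)

    # Step 3: use the mutual-advantage state values as potentials Phi(s).
    phi = state_values(game, mutual_strategy, gamma)

    # Step 4: initialize Q with the potentials; by [Wiewiora, 03] this is
    # equivalent to shaping with F(s, s') = gamma * Phi(s') - Phi(s).
    return {(s, a): phi[s] for s in game.states for a in game.actions}
```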

  26. A Polynomial-time Nash Equilibrium Algorithm for Repeated Stochastic Games. Joint work with: Michael L. Littman

  27. Repeated SG Nash algorithm→result. Main Result • Given a repeated stochastic game, return, in polynomial time, a strategy profile that is a Nash equilibrium of the average-payoff repeated stochastic game (specifically, one whose payoffs match the egalitarian point). • Concretely, we address the following computational problem: [Figure: convex hull of the average payoffs in the (v1, v2) plane, with the egalitarian line]
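For orientation, one common formalization of the egalitarian point is sketched below; the notation (feasible set V, security values m1, m2) is assumed here, and the paper's exact formulation may differ in detail.

```latex
% Hedged formalization (notation assumed, not taken from the slides):
% V is the convex hull of feasible average payoff profiles and (m_1, m_2)
% are the players' security (minimax) values. The egalitarian point is the
% feasible profile maximizing the worse-off player's advantage over its
% security level:
(v_1^{*}, v_2^{*}) \;\in\; \arg\max_{(v_1, v_2) \in V}\; \min\{\, v_1 - m_1,\; v_2 - m_2 \,\}
```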

  28. Repeated SG Nash algorithm→result. How? (the short story version) • Compute minimax (security) strategies. • Solve two linear programming problems. • The algorithm searches for a point P, where P is the point with the highest egalitarian value. [Figure: convex hull of a hypothetical SG in the (v1, v2) plane, with the egalitarian line and point P]

  29. Repeated SG Nash algorithm→result. How? (the search for point P) folkEgal(U1, U2, ε): • Compute R = friend1, L = friend2 and attack1, attack2 strategies. • Find the egalitarian point and its policy: • if R is left of the egalitarian line: P = R • else if L is right of the egalitarian line: P = L • else: egalSearch(R, L, T). [Figure: three cases on the convex hull of a hypothetical SG: P = R, P = L, and a point on the egalitarian line between them]
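A minimal sketch of the case analysis listed above. The geometric helpers and egal_search are hypothetical placeholders; only the branching mirrors the slide.

```python
# Sketch of folkEgal's case analysis. R is the friend-of-player-1 payoff
# point, L the friend-of-player-2 point, T the search budget. The helper
# predicates and egal_search are hypothetical stand-ins.
def folk_egal(R, L, T, is_left_of_egalitarian_line,
              is_right_of_egalitarian_line, egal_search):
    """Return the point P whose payoffs the equilibrium strategy targets."""
    if is_left_of_egalitarian_line(R):
        return R                      # case 1 on the slide: P = R
    if is_right_of_egalitarian_line(L):
        return L                      # case 2 on the slide: P = L
    return egal_search(R, L, T)       # otherwise search between R and L
```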

  30. Repeated SG Nash algorithm→result. Complexity • The algorithm involves solving MDPs (polynomial time) and other steps that also take polynomial time. • The algorithm is polynomial iff T is bounded by a polynomial. Result: • Running time polynomial in: the discount factor term 1 / (1 − γ); the approximation factor 1 / ε.

  31. Repeated SG Nash algorithm→experiments. SG version of the PD game [Figure: grid world with agents A and B and goals $A, $B]

  32. Repeated SG Nash algorithm→experiments. Compromise game [Figure: grid world with agents A and B and goals $A, $B]

  33. Repeated SG Nash algorithm→experiments. Asymmetric game [Figure: grid world with agents A and B and goals $A, $B]

  34. Thanks for your attention!
