This talk examines collaboration in repeated games: Nash equilibrium pairs, tit-for-tat and its generalization, security levels, and alternation as ways to stabilize cooperation. It presents an algorithm that finds a Nash pair of a repeated matrix game in polynomial time, and a symmetric Markov game approach that converges to Nash outcomes.
Collaboration in Repeated Games
Michael L. Littman (mlittman@cs.rutgers.edu)
Rutgers University
Motivation
Create agents that achieve their goals, perhaps by working together.
• Separate payoff functions
• General sum
Compute Nash equilibria (stable strategies).
• The algorithm assigns strategies
• Not learning (at present)
Grid Game 3 (Hu & Wellman 01)
[Grid figure: players A and B]
Actions: U, D, R, L, X
No move on collision
Semiwalls (50%)
-1 per step, -10 per collision, +100 for the goal
Both players can reach the goal.
Repeated Markov Game
S: finite set of states
A1, A2: finite sets of action choices
R1(s, a1, a2): payoff to the first player
R2(s, a1, a2): payoff to the second player
P(s' | s, a1, a2): transition function
G: goal (terminal) states (a subset of S)
Objective: maximize the average (over repetitions) total reward.
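A minimal sketch (not from the talk) of how this tuple could be represented in code; the class and field names are hypothetical.

```python
# Hypothetical container for the two-player repeated Markov game defined above.
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

State = str
Action = str

@dataclass
class TwoPlayerMarkovGame:
    states: FrozenSet[State]                              # S
    actions1: FrozenSet[Action]                           # A1
    actions2: FrozenSet[Action]                           # A2
    reward1: Dict[Tuple[State, Action, Action], float]    # R1(s, a1, a2)
    reward2: Dict[Tuple[State, Action, Action], float]    # R2(s, a1, a2)
    transition: Dict[Tuple[State, Action, Action], Dict[State, float]]  # P(s' | s, a1, a2)
    goals: FrozenSet[State]                               # G: terminal states
```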
Nash Equilibrium
A pair of strategies such that neither player has an incentive to deviate unilaterally.
• Can be a function of history
• Can be randomized
Always exists.
The claim assumes games are repeated and players choose best responses.
Nash in Grid Game
[Grid figure: players A and B]
Average total payoffs:
• (97, 48)
• (48, 97)
• (-, -) (not Nash)
• (64, 64) (not Nash)
• (75, 75)?
Collaborative Solution
[Grid figure: players A and B]
Average total: (96, 96) (not Nash)
A won't wait. B changes the incentives.
Repeated Matrix Game
A one-state Markov game.
A1 = A2 = {cooperate, defect}: the Prisoner's Dilemma (PD).
One (single-step) Nash.
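A small illustration, assuming the conventional PD payoffs (temptation 5, reward 3, punishment 1, sucker 0), which match the numbers on the tit-for-tat slides below.

```python
# Prisoner's Dilemma as a one-state Markov game: two payoff tables.
# Rows index player 1's action, columns player 2's action, in order (C, D).
ACTIONS = ("C", "D")
R1 = [[3, 0],    # player 1 cooperates
      [5, 1]]    # player 1 defects
R2 = [[3, 5],    # R2[a1][a2] is player 2's payoff
      [0, 1]]
# The single stage-game Nash is (D, D), worth (1, 1) to the players.
```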
Nash-Value Problem
Computational problem:
• Given a one-state Markov game (two payoff tables)
• Find a Nash equilibrium (one always exists)
• Return each player's value
In NP ∩ co-NP; exact complexity open.
A useful subproblem for Markov games.
Two Special Cases
Saddle-point equilibrium
• Deviation helps the other player.
• The value is the unique solution of the zero-sum game.
Coordination equilibrium
• Both players get the maximum reward possible.
• The value is the unique maximum value.
Tit-for-Tat
The PD's single-step Nash is a saddle point, not a coordination equilibrium.
Consider tit-for-tat (TFT): cooperate, and defect iff defected on.
Better (3) than defect-defect (1).
Tit-For-Tat is Nash
PD payoffs: (C,C) = 3, (C,D) = 0, (D,C) = 5, (D,D) = 1.
Average payoff of each reply policy (response to C, response to D) against TFT:
• C: C, D: D → 3
• C: C, D: C → 3
• C: D, D: D → 1
• C: D, D: C → 2.5
Cooperation (TFT) is a best response.
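A small simulation sketch (not from the talk) that reproduces these response values. It assumes TFT opens with cooperation and then echoes the responder's previous move, and that each reply policy reacts to the move TFT is about to play.

```python
# Limit-average payoff of stationary reply policies against tit-for-tat.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0,    # PAYOFF[(mine, theirs)] = my payoff
          ("D", "C"): 5, ("D", "D"): 1}

def avg_vs_tft(reply, rounds=10_000):
    """reply maps TFT's upcoming action to the responder's action."""
    tft_action, total = "C", 0.0           # TFT opens with cooperation
    for _ in range(rounds):
        my_action = reply[tft_action]
        total += PAYOFF[(my_action, tft_action)]
        tft_action = my_action             # TFT echoes the responder's last move
    return total / rounds

for name, reply in [("C:C, D:D", {"C": "C", "D": "D"}),
                    ("C:C, D:C", {"C": "C", "D": "C"}),
                    ("C:D, D:D", {"C": "D", "D": "D"}),
                    ("C:D, D:C", {"C": "D", "D": "C"})]:
    print(name, round(avg_vs_tft(reply), 2))   # 3.0, 3.0, 1.0, 2.5
```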
Generalized TFT
TFT stabilizes a mutually beneficial outcome.
General class of policies:
• Play the beneficial action
• Punish deviation to suppress temptation
Both components need to be generalized.
Security Level
Values achievable without collaboration:
• Solution of a zero-sum game (LP)
• The opponent can force a player down to this value
• The player can guarantee at least this value
• Possibly stochastic (mixed)
Useful as a punishment level, and also as a threshold.
PD: (1, 1)
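A sketch of one standard LP formulation of the security level, assuming SciPy is available; the function name is hypothetical, and this is not necessarily the formulation used in the talk.

```python
# Maxmin (security-level) value of a payoff matrix via linear programming.
import numpy as np
from scipy.optimize import linprog

def security_level(R):
    """R: payoff matrix with rows = own actions, columns = opponent actions."""
    R = np.asarray(R, dtype=float)
    m, n = R.shape
    # Variables: mixed strategy x_1..x_m and the guaranteed value v; minimize -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # For every opponent column j: v - sum_i x_i * R[i, j] <= 0.
    A_ub = np.hstack([-R.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])   # probabilities sum to 1
    b_eq = [1.0]
    bounds = [(0, None)] * m + [(None, None)]               # v is unbounded in sign
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

# For the PD tables above, both players' security levels come out to 1.
print(security_level([[3, 0], [5, 1]]))    # ≈ 1.0
```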
Two-Player Plot
Mark the payoff pair for each combination of actions; mark the security level.
[Plot: the four payoff points (C,C), (C,D), (D,C), (D,D), plus the security level]
Dominating Pair
Let (s1, s2) be the security-level values.
A dominating pair of actions (a1, a2) has:
c1 = R1(a1, a2), c2 = R2(a1, a2), with c1 > s1 and c2 > s2.
t_i is the temptation payoff (t1 = max_a R1(a, a2)).
n_i > (t_i - c_i) / (c_i - s_i) punishments are sufficient to stabilize cooperation (folk theorem).
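A worked check (not on the original slide) using the PD numbers from earlier: c_i = 3, s_i = 1, and the temptation is t_i = 5, so n_i > (5 - 3) / (3 - 1) = 1; punishing each deviation for two rounds is enough to make defection unprofitable.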
Alternation
Alternate: repeat one outcome, then the other, and repeat.
If the security level lies below the convex hull of the payoff points, the alternation can be stabilized.
Algorithm
Finds a Nash pair for a repeated matrix game in polynomial time.
Idea:
• If a point of the convex hull dominates the security levels, use generalized TFT.
• Else, a one-step Nash can be found quickly.
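A rough sketch of the first branch only, reusing the security_level helper sketched above. It checks pure action pairs rather than the full convex hull and leaves the one-step-Nash fallback as a stub, so it illustrates the idea rather than reproducing the paper's algorithm.

```python
# Look for a dominating pure pair; if found, prescribe a generalized-TFT strategy.
import numpy as np

def repeated_game_nash(R1, R2):
    R1, R2 = np.asarray(R1, float), np.asarray(R2, float)
    s1 = security_level(R1)        # player 1's punishment threshold
    s2 = security_level(R2.T)      # player 2's (rows = player 2's own actions)
    best = None
    for a1 in range(R1.shape[0]):
        for a2 in range(R1.shape[1]):
            c1, c2 = R1[a1, a2], R2[a1, a2]
            if c1 > s1 and c2 > s2:                      # dominating pair
                if best is None or c1 + c2 > best[2] + best[3]:
                    best = (a1, a2, c1, c2)
    if best is not None:
        a1, a2, c1, c2 = best
        # Play (a1, a2); punish any deviation by dropping to the security strategy.
        return ("generalized TFT", (a1, a2), (c1, c2))
    # Fallback: the talk argues a one-step Nash can be found quickly here (omitted).
    return ("one-step Nash", None, None)

print(repeated_game_nash([[3, 0], [5, 1]], [[3, 5], [0, 1]]))
```

For the PD this returns the mutual-cooperation pair (indices (0, 0)) worth (3, 3), enforced by TFT-style punishment.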
Proof Sketch
Try to improve player 1's policy S1. If that fails, try to improve player 2's policy S2. If neither can be improved, (S1, S2) is a Nash pair.
If S2 can be improved to S2*, the change does not hurt player 1, and player 1 still cannot improve, so the result is Nash.
[Diagram: (S1, S2) with deviations (S1*, S2) and (S1, S2*)]
Symmetric Case
R1(a, a') = R2(a', a)
The value of the game is just the maximum average payoff!
Alternate, or accept the security level.
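A tiny sketch of this rule with a hypothetical function name; the common security level is passed in (for example, from the LP helper above).

```python
# Symmetric repeated-game value: best alternating average, or the security level.
import numpy as np

def symmetric_repeated_value(R1, s):
    """R1: player 1's payoffs in a symmetric game (R1(a, a') = R2(a', a)); s: security level."""
    R1 = np.asarray(R1, dtype=float)
    n = R1.shape[0]
    # Alternating between profiles (a, b) and (b, a) gives both players the same average.
    best_avg = max((R1[a, b] + R1[b, a]) / 2.0 for a in range(n) for b in range(n))
    return max(best_avg, s)

print(symmetric_repeated_value([[3, 0], [5, 1]], 1.0))   # PD: 3.0 (mutual cooperation)
```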
Symmetric Markov Game
[Grid figure: players A and B]
Episodic; roles chosen randomly.
Algorithm:
• Maximize the sum of rewards (an MDP)
• Compute the security level (zero-sum game)
• Choose the max-sum policy if it is better
Converges to Nash.
Conclusion
Threats can help (Littman & Stone 01).
A repeated-game Nash can be found in polynomial time.
Symmetric games have a very simple structure.
The approach applies to Markov games.
Discussion
• What are the right objectives in game theory / RL for agents? Desiderata?
• How to learn the state space when the game is repeated?
• Multiobjective negotiation?
• Learning: combine leading and following?
• Different, unknown discount rates?
• Incomplete rationality?
• Incomplete information about rewards?