This talk examines collaboration in repeated games: Nash equilibrium pairs, tit-for-tat and its generalization, security levels, and alternation as ways to stabilize cooperation. It presents an algorithm that finds a Nash pair of a repeated matrix game in polynomial time, and a symmetric Markov game approach that converges to Nash outcomes.
Collaboration in Repeated Games
Michael L. Littman (mlittman@cs.rutgers.edu)
Rutgers University
Motivation
Create agents that achieve their goals, perhaps by working together.
• Separate payoff functions
• General sum
Compute Nash equilibria (stable strategies).
• The algorithm assigns strategies
• Not learning (at present)
Grid Game 3 (Hu & Wellman 01)
[Grid figure: players A and B]
Actions: U, D, R, L, X
No move on collision
Semiwalls (50%)
-1 per step, -10 per collision, +100 for the goal
Both players can reach the goal.
Repeated Markov Game
S: finite set of states
A1, A2: finite sets of action choices
R1(s, a1, a2): payoff to the first player
R2(s, a1, a2): payoff to the second player
P(s' | s, a1, a2): transition function
G: goal (terminal) states (a subset of S)
Objective: maximize the average (over repetitions) total reward.
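A minimal sketch (not from the talk) of how this tuple could be represented in code; the class and field names are hypothetical.

```python
# Hypothetical container for the two-player repeated Markov game defined above.
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

State = str
Action = str

@dataclass
class TwoPlayerMarkovGame:
    states: FrozenSet[State]                              # S
    actions1: FrozenSet[Action]                           # A1
    actions2: FrozenSet[Action]                           # A2
    reward1: Dict[Tuple[State, Action, Action], float]    # R1(s, a1, a2)
    reward2: Dict[Tuple[State, Action, Action], float]    # R2(s, a1, a2)
    transition: Dict[Tuple[State, Action, Action], Dict[State, float]]  # P(s' | s, a1, a2)
    goals: FrozenSet[State]                               # G: terminal states
```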
Nash Equilibrium
A pair of strategies such that neither player has an incentive to deviate unilaterally.
• Can be a function of history
• Can be randomized
Always exists.
The claim assumes games are repeated and players choose best responses.
Nash in Grid Game
[Grid figure: players A and B]
Average total payoffs:
• (97, 48)
• (48, 97)
• (-, -) (not Nash)
• (64, 64) (not Nash)
• (75, 75)?
Collaborative Solution
[Grid figure: players A and B]
Average total: (96, 96) (not Nash)
A won't wait. B changes the incentives.
Repeated Matrix Game
A one-state Markov game.
A1 = A2 = {cooperate, defect}: the Prisoner's Dilemma (PD).
One (single-step) Nash.
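A small illustration, assuming the conventional PD payoffs (temptation 5, reward 3, punishment 1, sucker 0), which match the numbers on the tit-for-tat slides below.

```python
# Prisoner's Dilemma as a one-state Markov game: two payoff tables.
# Rows index player 1's action, columns player 2's action, in order (C, D).
ACTIONS = ("C", "D")
R1 = [[3, 0],    # player 1 cooperates
      [5, 1]]    # player 1 defects
R2 = [[3, 5],    # R2[a1][a2] is player 2's payoff
      [0, 1]]
# The single stage-game Nash is (D, D), worth (1, 1) to the players.
```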
Nash-Value Problem
Computational problem:
• Given a one-state Markov game (two payoff tables)
• Find a Nash equilibrium (one always exists)
• Return each player's value
In NP ∩ co-NP; exact complexity open.
A useful subproblem for Markov games.
Two Special Cases
Saddle-point equilibrium
• Deviation helps the other player.
• The value is the unique solution of the zero-sum game.
Coordination equilibrium
• Both players get the maximum reward possible.
• The value is the unique maximum value.
Tit-for-Tat
The PD's single-step Nash is a saddle point, not a coordination equilibrium.
Consider tit-for-tat (TFT): cooperate, and defect iff defected on.
Better (3) than defect-defect (1).
Tit-For-Tat is Nash
PD payoffs: (C,C) = 3, (C,D) = 0, (D,C) = 5, (D,D) = 1.
Average payoff of each reply policy (response to C, response to D) against TFT:
• C: C, D: D → 3
• C: C, D: C → 3
• C: D, D: D → 1
• C: D, D: C → 2.5
Cooperation (TFT) is a best response.
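A small simulation sketch (not from the talk) that reproduces these response values. It assumes TFT opens with cooperation and then echoes the responder's previous move, and that each reply policy reacts to the move TFT is about to play.

```python
# Limit-average payoff of stationary reply policies against tit-for-tat.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0,    # PAYOFF[(mine, theirs)] = my payoff
          ("D", "C"): 5, ("D", "D"): 1}

def avg_vs_tft(reply, rounds=10_000):
    """reply maps TFT's upcoming action to the responder's action."""
    tft_action, total = "C", 0.0           # TFT opens with cooperation
    for _ in range(rounds):
        my_action = reply[tft_action]
        total += PAYOFF[(my_action, tft_action)]
        tft_action = my_action             # TFT echoes the responder's last move
    return total / rounds

for name, reply in [("C:C, D:D", {"C": "C", "D": "D"}),
                    ("C:C, D:C", {"C": "C", "D": "C"}),
                    ("C:D, D:D", {"C": "D", "D": "D"}),
                    ("C:D, D:C", {"C": "D", "D": "C"})]:
    print(name, round(avg_vs_tft(reply), 2))   # 3.0, 3.0, 1.0, 2.5
```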
Generalized TFT
TFT stabilizes a mutually beneficial outcome.
General class of policies:
• Play the beneficial action
• Punish deviation to suppress temptation
Both components need to be generalized.
Security Level
Values achievable without collaboration:
• Solution of a zero-sum game (LP)
• The opponent can force a player down to this value
• The player can guarantee at least this value
• Possibly stochastic (mixed)
Useful as a punishment level, and also as a threshold.
PD: (1, 1)
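A sketch of one standard LP formulation of the security level, assuming SciPy is available; the function name is hypothetical, and this is not necessarily the formulation used in the talk.

```python
# Maxmin (security-level) value of a payoff matrix via linear programming.
import numpy as np
from scipy.optimize import linprog

def security_level(R):
    """R: payoff matrix with rows = own actions, columns = opponent actions."""
    R = np.asarray(R, dtype=float)
    m, n = R.shape
    # Variables: mixed strategy x_1..x_m and the guaranteed value v; minimize -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # For every opponent column j: v - sum_i x_i * R[i, j] <= 0.
    A_ub = np.hstack([-R.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])   # probabilities sum to 1
    b_eq = [1.0]
    bounds = [(0, None)] * m + [(None, None)]               # v is unbounded in sign
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

# For the PD tables above, both players' security levels come out to 1.
print(security_level([[3, 0], [5, 1]]))    # ≈ 1.0
```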
Two-Player Plot
Mark the payoff pair for each combination of actions; mark the security level.
[Plot: the four payoff points (C,C), (C,D), (D,C), (D,D), plus the security level]
Dominating Pair
Let (s1, s2) be the security-level values.
A dominating pair of actions (a1, a2) has:
c1 = R1(a1, a2), c2 = R2(a1, a2), with c1 > s1 and c2 > s2.
t_i is the temptation payoff (t1 = max_a R1(a, a2)).
n_i > (t_i - c_i) / (c_i - s_i) punishments are sufficient to stabilize cooperation (folk theorem).
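A worked check (not on the original slide) using the PD numbers from earlier: c_i = 3, s_i = 1, and the temptation is t_i = 5, so n_i > (5 - 3) / (3 - 1) = 1; punishing each deviation for two rounds is enough to make defection unprofitable.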
Alternation
Alternate: repeat one outcome, then the other, and repeat.
If the security level lies below the convex hull of the payoff points, the alternation can be stabilized.
Algorithm
Finds a Nash pair for a repeated matrix game in polynomial time.
Idea:
• If a point of the convex hull dominates the security levels, use generalized TFT.
• Else, a one-step Nash can be found quickly.
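A rough sketch of the first branch only, reusing the security_level helper sketched above. It checks pure action pairs rather than the full convex hull and leaves the one-step-Nash fallback as a stub, so it illustrates the idea rather than reproducing the paper's algorithm.

```python
# Look for a dominating pure pair; if found, prescribe a generalized-TFT strategy.
import numpy as np

def repeated_game_nash(R1, R2):
    R1, R2 = np.asarray(R1, float), np.asarray(R2, float)
    s1 = security_level(R1)        # player 1's punishment threshold
    s2 = security_level(R2.T)      # player 2's (rows = player 2's own actions)
    best = None
    for a1 in range(R1.shape[0]):
        for a2 in range(R1.shape[1]):
            c1, c2 = R1[a1, a2], R2[a1, a2]
            if c1 > s1 and c2 > s2:                      # dominating pair
                if best is None or c1 + c2 > best[2] + best[3]:
                    best = (a1, a2, c1, c2)
    if best is not None:
        a1, a2, c1, c2 = best
        # Play (a1, a2); punish any deviation by dropping to the security strategy.
        return ("generalized TFT", (a1, a2), (c1, c2))
    # Fallback: the talk argues a one-step Nash can be found quickly here (omitted).
    return ("one-step Nash", None, None)

print(repeated_game_nash([[3, 0], [5, 1]], [[3, 5], [0, 1]]))
```

For the PD this returns the mutual-cooperation pair (indices (0, 0)) worth (3, 3), enforced by TFT-style punishment.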
Proof Sketch
Try to improve player 1's policy S1. If that fails, try to improve player 2's policy S2. If neither can be improved, (S1, S2) is a Nash pair.
If S2 can be improved to S2*, the change does not hurt player 1, and player 1 still cannot improve, so the result is Nash.
[Diagram: (S1, S2) with deviations (S1*, S2) and (S1, S2*)]
Symmetric Case
R1(a, a') = R2(a', a)
The value of the game is just the maximum average payoff!
Alternate, or accept the security level.
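A tiny sketch of this rule with a hypothetical function name; the common security level is passed in (for example, from the LP helper above).

```python
# Symmetric repeated-game value: best alternating average, or the security level.
import numpy as np

def symmetric_repeated_value(R1, s):
    """R1: player 1's payoffs in a symmetric game (R1(a, a') = R2(a', a)); s: security level."""
    R1 = np.asarray(R1, dtype=float)
    n = R1.shape[0]
    # Alternating between profiles (a, b) and (b, a) gives both players the same average.
    best_avg = max((R1[a, b] + R1[b, a]) / 2.0 for a in range(n) for b in range(n))
    return max(best_avg, s)

print(symmetric_repeated_value([[3, 0], [5, 1]], 1.0))   # PD: 3.0 (mutual cooperation)
```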
Symmetric Markov Game
[Grid figure: players A and B]
Episodic; roles chosen randomly.
Algorithm:
• Maximize the sum of rewards (an MDP)
• Compute the security level (zero-sum game)
• Choose the max-sum policy if it is better
Converges to Nash.
Conclusion
Threats can help (Littman & Stone 01).
A repeated-game Nash can be found in polynomial time.
Symmetric games have a very simple structure.
The approach applies to Markov games.
Discussion
• What are the right objectives in game theory / RL for agents? Desiderata?
• How to learn the state space when the game is repeated?
• Multiobjective negotiation?
• Learning: combine leading and following?
• Different, unknown discount rates?
• Incomplete rationality?
• Incomplete information about rewards?