This research paper introduces the Convergence with Model Learning and Safety (CMLeS) algorithm, which achieves convergence to Nash equilibrium, targeted optimality against memory-bounded opponents, and safety against every other unknown agent in a multiagent learning setting.
 
                
Convergence, Targeted Optimality, and Safety in Multiagent Learning
Doran Chakraborty, Peter Stone
Learning Agent Research Group, University of Texas, Austin
Multiagent Learning
• Multiple autonomous agents
• Non-stationarity in the environment
• Overlap of spheres of influence
• Ignoring other agents and treating everything else as the environment can be sub-optimal
• Each agent needs to learn the behavior of other agents in its sphere of influence
Multiagent Learning from a Game-Theoretic Perspective
• Agents are involved in a repeated matrix game
• N-player, N-action matrix game
• On each time step, each agent sees only the joint action and hence the payoffs for every agent
Is there any way for an agent to ensure certain payoffs (if not the best possible) against unknown opponents?
Contributions
• First multiagent learning algorithm, called Convergence with Model Learning and Safety (CMLeS), that in an n-player, n-action repeated game achieves all of the following:
  • Convergence: converges to a Nash equilibrium with probability 1 in self-play
  • Targeted optimality: against a set of memory-bounded counterparts of memory size at most Kmax, converges to playing close to the best response with very high probability
    • Also holds for opponents that eventually become memory bounded
    • Achieves it in the best reported time complexity
  • Safety: against every other unknown agent, ensures the maximin payoff
High-level overview of CMLeS
• Try to coordinate to a Nash equilibrium, assuming all other agents are CMLeS agents
  • If all agents are CMLeS agents: Convergence achieved
• When other agents are not CMLeS agents: try to model the opponents as memory bounded with maximum memory size Kmax (plays MLeS)
  • If other agents are memory bounded with memory size at most Kmax: Targeted optimality achieved
  • If other agents are arbitrary: Safety achieved
A motivating example: Battle of the Sexes
Payoffs are listed as (Alice, Bob); Alice chooses the row, Bob the column.

              Bob: B    Bob: S
  Alice: B    1,2       0,0
  Alice: S    0,0       2,1

• 3 Nash equilibria
  • 2 in pure strategies
  • 1 in mixed strategies (each player goes to its preferred event 2/3 of the time)
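The mixed equilibrium on this slide can be checked with a short calculation. The sketch below is illustrative only: it assumes payoffs are written as (Alice, Bob) with Alice choosing the row, and the function name `mixed_equilibrium` is made up for this example. Each player's probability of choosing B is set so that the other player is indifferent between B and S.

```python
from fractions import Fraction

# Payoffs indexed as [alice_action][bob_action], with 0 = B and 1 = S,
# copied from the Battle of the Sexes matrix on the slide.
ALICE = [[1, 0], [0, 2]]
BOB = [[2, 0], [0, 1]]


def mixed_equilibrium(alice, bob):
    """Return (p, q): the probabilities that Alice and Bob play B.

    Hypothetical helper: assumes a 2x2 game with an interior mixed
    equilibrium, so the indifference conditions have a unique solution.
    """
    # p makes Bob indifferent between his two columns:
    # p*bob[0][0] + (1-p)*bob[1][0] == p*bob[0][1] + (1-p)*bob[1][1]
    p = Fraction(bob[1][1] - bob[1][0],
                 bob[0][0] - bob[1][0] - bob[0][1] + bob[1][1])
    # q makes Alice indifferent between her two rows:
    # q*alice[0][0] + (1-q)*alice[0][1] == q*alice[1][0] + (1-q)*alice[1][1]
    q = Fraction(alice[1][1] - alice[0][1],
                 alice[0][0] - alice[0][1] - alice[1][0] + alice[1][1])
    return p, q


p_alice_b, q_bob_b = mixed_equilibrium(ALICE, BOB)
```

With the slide's payoffs this gives Alice playing B with probability 1/3 and Bob playing B with probability 2/3, so each player attends its preferred event 2/3 of the time.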
Assume Hypothesis H0 = Bob is a CMLeS agent
• Assume the agents choose the mixed-strategy Nash equilibrium
• Alice plays B with probability 1/3 while Bob plays B with probability 2/3
Flowchart (p = 0, 1, 2, ...):
• Use a Nash equilibrium solver to compute a Nash strategy for Alice and Bob
• Compute a schedule Np and εp (here Np = 100 and εp = 0.1)
• Play your own part of the Nash strategy for Np episodes
• Has any agent deviated by εp from its Nash strategy?
  • Here Alice played a1 31% of the time and Bob played a1 65% of the time, so the answer is NO
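The deviation test in the flowchart can be sketched as follows. This is a minimal illustration under assumptions that are not on the slide: a1 is taken to be action B, "deviated by εp" is read as the empirical frequency of some action differing from its Nash probability by more than εp, and the name `deviated` is hypothetical.

```python
def deviated(counts, nash, eps):
    """Return True if any action's empirical frequency is more than eps
    away from its Nash probability (assumed reading of the slide's test)."""
    total = sum(counts.values())
    return any(abs(counts.get(action, 0) / total - prob) > eps
               for action, prob in nash.items())


# Numbers from the slide: Np = 100 episodes, eps_p = 0.1,
# Alice played a1 31% of the time and Bob 65% (a1 assumed to be B).
alice_counts = {"B": 31, "S": 69}
bob_counts = {"B": 65, "S": 35}
alice_nash = {"B": 1 / 3, "S": 2 / 3}
bob_nash = {"B": 2 / 3, "S": 1 / 3}

any_deviation = (deviated(alice_counts, alice_nash, 0.1)
                 or deviated(bob_counts, bob_nash, 0.1))
```

Both empirical frequencies are within 0.1 of the Nash probabilities, which matches the NO branch shown on the slide.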
Assume Hypothesis H0 = Bob is a CMLeS agent
• Assume the agents choose the mixed-strategy Nash equilibrium
• Alice plays B with probability 1/3 while Bob plays B with probability 2/3
Flowchart (p = 0, 1, 2, ...):
• Use a Nash equilibrium solver to compute a Nash strategy for Alice and Bob
• Compute a schedule Np and εp
• Play your own part of the Nash strategy for Np episodes
• Has any agent deviated by εp from its Nash strategy?
  • NO: proceed to the next value of p
  • YES: Signal (play according to a fixed behavior)
• Check for consistency?
  • YES: keep H0 and continue
Assume Hypothesis H0 = Bob is a CMLeS agent (case: when Bob is a memory-bounded agent)
Flowchart (p = 0, 1, 2, ...):
• Use a Nash equilibrium solver to compute a Nash strategy for Alice and Bob
• Compute a schedule Np and εp
• Play your own part of the Nash strategy for Np episodes
• Has any agent deviated by εp from its Nash strategy?
  • NO: proceed to the next value of p
  • YES: Signal, using a counter C
    • C == 0: play a1 Kmax+1 times
    • C == 1: play a1 Kmax times followed by a random action other than a1
    • C > 1: play a1 Kmax+1 times
    • C++
• Check for consistency?
  • YES: keep H0 and continue
  • NO: reject H0 and play MLeS
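The three signalling cases can be written as one small function. This is only a sketch of the action sequence named on the slide: the function name `signal_actions`, the choice of a1 as the first entry of `actions`, and the `rng` parameter are assumptions made for illustration.

```python
import random


def signal_actions(c, k_max, actions, rng=random):
    """Return the action sequence played during the signal phase.

    c is the signal counter C from the flowchart; a1 is assumed to be
    actions[0] (hypothetical convention, not stated on the slide).
    """
    a1 = actions[0]
    if c == 1:
        # Kmax plays of a1, then one random action different from a1.
        other = rng.choice([a for a in actions if a != a1])
        return [a1] * k_max + [other]
    # C == 0 and C > 1: a1 played Kmax + 1 times.
    return [a1] * (k_max + 1)


first_signal = signal_actions(0, 3, ["B", "S"])
second_signal = signal_actions(1, 3, ["B", "S"])
```

With two actions the C == 1 sequence is deterministic, because only one action other than a1 exists.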
Contributions of CMLeS
• First MAL algorithm that, in an n-player, n-action repeated game:
  • Converges to a Nash equilibrium with probability 1 in self-play (Convergence)
  • Against a set of memory-bounded counterparts of memory size at most Kmax, converges to playing close to the best response with very high probability (Targeted optimality)
    • Also holds for opponents that eventually become memory bounded
    • Achieves it in the best reported time complexity
  • Against every other unknown agent, eventually ensures safety (Safety)
How to play against memory-bounded opponents?
• Play against memory-bounded opponents can be modeled as a Markov Decision Process (MDP) (Chakraborty and Stone, ECML'08)
• The adversary induces the MDP, hence the name Adversary Induced MDP (AIM)
• The state space of the AIM is all feasible joint histories of size K
• The transition and reward functions of the AIM are determined by the opponent's strategy
• Both K and the opponent's strategy are unknown and hence need to be figured out
Adversary Induced MDP (AIM)
• Assume Bob is a memory-bounded opponent with K = 2
• At time t the state is (B,B)(S,S)
• Alice plays action S, so the next state is (S,S)(S,?), where ? is Bob's action
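The state update on this slide is a sliding window over the last K joint actions. The sketch below assumes each joint action is an (Alice, Bob) pair and enumerates every joint history of size K, which is an upper bound on the feasible ones; the names `all_histories` and `next_state` are hypothetical.

```python
from itertools import product

ACTIONS = ("B", "S")
# Every joint action as an (Alice, Bob) pair.
JOINT_ACTIONS = tuple(product(ACTIONS, repeat=2))


def all_histories(k):
    """All joint histories of size k (a superset of the feasible ones)."""
    return list(product(JOINT_ACTIONS, repeat=k))


def next_state(state, alice_action, bob_action):
    """Drop the oldest joint action and append the newest one."""
    return state[1:] + ((alice_action, bob_action),)


state = (("B", "B"), ("S", "S"))          # (B,B)(S,S) at time t, K = 2
after_ss = next_state(state, "S", "S")    # Bob answers S
after_sb = next_state(state, "S", "B")    # Bob answers B
```

For two players with two actions each and K = 2 there are at most 4^2 = 16 such histories, so the state space grows exponentially with K.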
Adversary Induced MDP (AIM)
• State at time t: (B,B)(S,S)
• Probability with which Bob plays S for a memory of (B,B)(S,S) = 0.3: next state (S,S)(S,S), reward = 2
• Probability with which Bob plays B for a memory of (B,B)(S,S) = 0.7: next state (S,S)(S,B), reward = 0
• The optimal policy for this AIM is the optimal way of playing against Bob
• How to achieve it? Use MLeS
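The transition and reward numbers on this slide follow directly from Bob's strategy and Alice's payoffs. A minimal sketch, assuming Alice's rewards come from the Battle of the Sexes matrix and using the hypothetical names `ALICE_PAYOFF`, `outcomes` and `bob_policy`:

```python
# Alice's payoff for each (Alice, Bob) joint action, from the payoff matrix.
ALICE_PAYOFF = {("B", "B"): 1, ("B", "S"): 0, ("S", "B"): 0, ("S", "S"): 2}


def outcomes(state, alice_action, bob_policy):
    """List (probability, next_state, reward) triples for one AIM step."""
    result = []
    for bob_action, prob in bob_policy[state].items():
        joint = (alice_action, bob_action)
        result.append((prob, state[1:] + (joint,), ALICE_PAYOFF[joint]))
    return result


state = (("B", "B"), ("S", "S"))
# Bob's response to this memory, as given on the slide.
bob_policy = {state: {"S": 0.3, "B": 0.7}}

steps = outcomes(state, "S", bob_policy)
expected_reward = sum(prob * reward for prob, _, reward in steps)
```

Under these numbers, playing S from (B,B)(S,S) yields an expected immediate reward of 0.3 × 2 + 0.7 × 0 = 0.6 for Alice.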
Flowchart of MLeS
• Start of episode t
• Compute the best estimate of K using the Find-K algorithm; let that be k
• Is k a valid value?
  • YES: run RMax assuming that the true underlying AIM is of size k
  • NO: play the safety strategy
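The per-episode decision in this flowchart can be sketched as a single dispatch function. Everything below is a placeholder for illustration: `find_k`, `rmax_action` and `safety_action` are hypothetical callables standing in for Find-K, RMax and the safety strategy, and an invalid estimate is assumed to be signalled by `None`.

```python
def mles_step(history, find_k, rmax_action, safety_action):
    """Choose the action for one episode, following the MLeS flowchart.

    find_k(history) returns the current estimate k of the opponent's
    memory size, or None when no valid value exists (assumed convention).
    """
    k = find_k(history)
    if k is None:
        # No valid k: fall back to the safety strategy.
        return safety_action()
    # Valid k: act as RMax would on the AIM with histories of size k.
    return rmax_action(history, k)


# Toy stand-ins so the sketch can be run end to end.
toy_history = [("B", "B"), ("S", "S")]
with_model = mles_step(toy_history, lambda h: 2,
                       lambda h, k: ("rmax", k), lambda: "safety")
without_model = mles_step(toy_history, lambda h: None,
                          lambda h, k: ("rmax", k), lambda: "safety")
```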
Find-K algorithm: figuring out the opponent memory size
• Candidate memory sizes: 0, 1, 2, 3, 4, …, K, K+1, K+2, …, Kmax, Kmax+1
• Statistics shown along the candidates: Δ_0^t, Δ_1^t, …, Δ_K^t, Δ_{K+1}^t, …, Δ_{Kmax}^t
• Δ_k^t: amount of information lost by not modeling Bob as a k+1 memory-sized opponent as opposed to a k memory-sized opponent
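Find-K algorithm: figuring out the opponent memory size
• Candidate memory sizes: 0, 1, 2, 3, 4, …, K, K+1, K+2, …, Kmax, Kmax+1
• Statistics Δ_0^t, Δ_1^t, …, Δ_K^t, Δ_{K+1}^t, …, Δ_{Kmax}^t; example values shown: 0.05, 0.001, 0.01, 0.4, 0.3
• Quantities σ_0^t, σ_1^t, …, σ_K^t, σ_{K+1}^t, …, σ_{Kmax}^t; example values shown: 0.07, 0.0001, 0.0002, 0.002, 0.02
• δ/Kmax is marked on individual candidates
• Picks K with probability at least 1 − δ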
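One way to read this figure is as a sequence of comparisons between Δ_k^t and σ_k^t. The sketch below is an assumption, not the rule stated on the slide: it treats σ_k^t as a threshold and selects the smallest k such that Δ_j^t < σ_j^t for every j from k up to Kmax, returning `None` when even Kmax fails. The function name `find_k` and the input values are hypothetical; the slide's numbers are not used because their pairing with k cannot be recovered.

```python
def find_k(deltas, sigmas):
    """Return the smallest k with deltas[j] < sigmas[j] for all j >= k.

    deltas[k] and sigmas[k] stand for the statistics at candidate memory
    size k, for k = 0..Kmax. Returns None if no k qualifies. This
    selection rule is an assumed reading of the figure.
    """
    best = None
    # Scan downward from Kmax and stop at the first failed comparison.
    for k in range(len(deltas) - 1, -1, -1):
        if deltas[k] < sigmas[k]:
            best = k
        else:
            break
    return best


# Hypothetical values: modeling Bob with k < 2 loses a lot of information.
example_deltas = [0.4, 0.3, 0.001, 0.01]
example_sigmas = [0.0001, 0.0002, 0.002, 0.02]
estimate = find_k(example_deltas, example_sigmas)
```

If each of the Kmax comparisons is allowed to fail with probability at most δ/Kmax, a union bound gives an overall failure probability of at most δ, which is consistent with the "probability at least 1 − δ" label on the slide.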
Theoretical properties of MLeS
• Find-K needs only a polynomial number of visits to every feasible joint history of size K to find the true opponent memory size K with probability at least 1 − δ
  • Polynomial in 1/δ and Kmax
• The overall time complexity of computing an ε-best response against a memory-bounded opponent is then polynomial in the number of feasible joint histories of size K, Kmax, 1/δ and 1/ε
• For opponents that cannot be modeled as Kmax memory-bounded opponents, it converges to the safety strategy with probability 1 in the limit
Conclusion and Future Work
• A new multiagent learning algorithm: CMLeS
  • Convergence
  • Targeted optimality against memory-bounded adversaries in the best reported time complexity
  • Safety
• What if there is a mixed population of agents?
• How to incorporate no-regret or bounded regret?
• Agents in graphical games