
Multiple timescales for multiagent learning


Presentation Transcript


  1. Multiple timescales for multiagent learning David Leslie and E. J. Collins University of Bristol David Leslie is supported by CASE Research Studentship 00317214 from the UK Engineering and Physical Sciences Research Council in cooperation with BAE SYSTEMS.

  2. Introduction • Learning in iterated normal form games. • Simple environment. • Theoretical properties of multiagent Q-learning.

  3. Notation • N players. • Player i plays mixed strategy π^i. • The opponents play joint mixed strategy π^{−i}. • The expected reward for player i playing action a is Q^i(a). • Q^i(a) is estimated by Q^i_n(a).

  4. Mixed strategies • Mixed equilibria are necessary. • Mixed strategies are derived from the Q values. • Boltzmann smoothing with fixed temperature parameter τ: π^i(a) ∝ exp(Q^i(a)/τ).
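
A minimal sketch of this smoothing step, assuming the standard Boltzmann (softmax) form π(a) ∝ exp(Q(a)/τ); the function name and example values below are illustrative, not from the talk:

```python
import numpy as np

def boltzmann(q, tau):
    """Boltzmann (softmax) strategy from a vector of Q values.

    Subtracting the max before exponentiating avoids overflow without
    changing the resulting distribution.
    """
    z = np.exp((q - q.max()) / tau)
    return z / z.sum()

print(boltzmann(np.array([0.6, 0.4]), tau=0.05))  # ~[0.982, 0.018]: near best response
print(boltzmann(np.array([0.6, 0.4]), tau=10.0))  # ~[0.505, 0.495]: near uniform
```

A fixed τ interpolates smoothly between the exact best response (τ → 0) and uniform play (τ → ∞), which is what makes the fixed-temperature analysis on the next slide possible.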

  5. Fixed temperatures • The Nash distribution (the fixed point of the Boltzmann-smoothed best responses) approximates a Nash equilibrium. • Smoothing removes the discontinuities of the exact best-response map. • Allows true convergence of play to mixed strategies.

  6. Q-learning • Q^i_{n+1}(a) = Q^i_n(a) + λ_n (1{a^i_n = a} / π^i_n(a)) (r^i_n − Q^i_n(a)). • Standard Q-learning, except for the division by π^i_n(a). • 1{·} is the indicator function, r^i_n is the reward. • Learning parameters satisfy Σ_n λ_n = ∞ and Σ_n (λ_n)² < ∞.
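
A sketch of this update rule under the reconstruction above (the names are illustrative). Only the played action is updated, and dividing by its probability compensates for how often it is sampled, so every action's value is updated at the same asymptotic rate:

```python
import numpy as np

def q_update(q, pi, action, reward, lam):
    """One step of the modified Q-learning rule for a single player.

    q      : current Q-value estimates, one per action
    pi     : the mixed strategy the action was sampled from
    action : the action actually played
    reward : the reward received
    lam    : the learning rate lambda_n at this step
    """
    q = q.copy()
    # indicator / probability: only the played action moves, but its
    # effective step is rescaled by how rarely it is chosen
    q[action] += (lam / pi[action]) * (reward - q[action])
    return q
```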

  7. Three player pennies • Player 1 receives 1 point if its choice matches player 2's. • Player 2 receives 1 point if its choice matches player 3's. • Player 3 receives 1 point if its choice is opposite to player 1's.
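
The payoff structure as code (a sketch; actions are encoded 0/1 for heads/tails). The unique Nash equilibrium of this game has every player mixing 50:50, which is why mixed strategies and fixed temperatures matter here:

```python
import numpy as np

def pennies_rewards(a1, a2, a3):
    """Three player pennies: one payoff (0 or 1) per player."""
    return np.array([
        float(a1 == a2),  # player 1 scores by matching player 2
        float(a2 == a3),  # player 2 scores by matching player 3
        float(a3 != a1),  # player 3 scores by mismatching player 1
    ])

print(pennies_rewards(0, 0, 1))  # -> [1. 0. 0.]
```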

  8. A plot of Q values

  9. Stochastic approximation • Relate the discrete process to an ODE. • Stochastic approximation theory implies the Q values track the ODE dQ^i(a)/dt = R^i(a) − Q^i(a), where R^i(a) is the expected reward to player i for action a when all players follow the Boltzmann strategies induced by the current Q values. • A deterministic, continuous-time system.

  10. Analysis of the example • The ODE has a unique fixed point. • Small temperatures make the fixed point unstable; a periodic orbit is stable instead. • This explains the observed cycling of the Q values (see the sketch below).
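
A sketch that integrates this ODE for three player pennies with Euler steps (the temperature, step size, and names are illustrative). With a small temperature the trajectory does not settle at the fixed point but keeps oscillating, matching the cycling in the plotted Q values:

```python
import numpy as np

def boltzmann(q, tau):
    z = np.exp((q - q.max()) / tau)
    return z / z.sum()

def expected_rewards(Q, tau):
    """R[i, a]: expected reward to player i for action a, given that the
    other players follow the Boltzmann strategies induced by Q."""
    pi = [boltzmann(Q[i], tau) for i in range(3)]
    R = np.zeros((3, 2))
    for a in (0, 1):
        R[0, a] = pi[1][a]      # player 1 wants to match player 2
        R[1, a] = pi[2][a]      # player 2 wants to match player 3
        R[2, a] = pi[0][1 - a]  # player 3 wants to mismatch player 1
    return R

rng = np.random.default_rng(0)
Q, tau, dt = rng.random((3, 2)), 0.05, 0.01
for step in range(200_000):
    Q += dt * (expected_rewards(Q, tau) - Q)  # dQ/dt = R(Q) - Q
    if step % 50_000 == 0:
        print(np.round(Q[:, 0], 3))  # keeps moving: a periodic orbit, not a point
```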

  11. Multiple timescales - I • Generalise stochastic approximation to coupled processes with different learning rates. • Player i uses rates λ^i_n with λ^i_n / λ^{i+1}_n → 0 for each i < N. • The quicker λ^i_n → 0, the slower process i adapts, so player 1 learns on the slowest timescale (see the schedule sketch below).
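
One concrete way to realise such schedules (a sketch; the exponents are illustrative): polynomial rates n^(−ρ_i) with ρ_i in (1/2, 1] satisfy the conditions from slide 6, and choosing ρ_1 > ρ_2 > ρ_3 makes player 1's rate vanish relative to the others:

```python
def rates(n, exponents=(1.0, 0.9, 0.8)):
    """Learning rates for players 1..3 at step n (illustrative exponents).

    Each schedule has infinite sum and finite sum of squares, and
    rates(n)[0] / rates(n)[1] = n**-0.1 -> 0, so player 1 is slowest.
    """
    return [n ** -e for e in exponents]

print(rates(10))      # [0.1, 0.126, 0.158]
print(rates(10_000))  # [0.0001, 0.000251, 0.000631]: the separation grows
```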

  12. Multiple timescales - II • Fast processes can fully adapt to slow processes. • Slow processes see fast processes as having completely converged. • Will work if the fast processes converge to a unique value for each fixed value of the slow processes.

  13. Multiple-timescales Q-learning assumption • Assume that, for fixed Q^1, the values Q^2, …, Q^N converge to a unique value, resulting in a joint (smoothed) best response β(Q^1) to player 1. • For example, this holds for two-player games and for cyclic games.

  14. Convergence of multiple-timescales Q-learning • Behaviour is determined by the ODE dQ^1(a)/dt = R^1(a, β(Q^1)) − Q^1(a), where β(Q^1) is the converged joint response of players 2, …, N from the previous slide. • Convergence can be proved if player 1 has only two actions. • Hence the process converges for three player pennies (simulated below).
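
Putting the pieces together, a self-contained sketch of multiple-timescales Q-learning on three player pennies. The temperature, exponents, and the clipping of the importance-weighted step are illustrative choices for numerical stability in this sketch, not the authors' exact settings:

```python
import numpy as np

def boltzmann(q, tau):
    z = np.exp((q - q.max()) / tau)
    return z / z.sum()

def pennies_rewards(a):
    return np.array([float(a[0] == a[1]),   # player 1 matches player 2
                     float(a[1] == a[2]),   # player 2 matches player 3
                     float(a[2] != a[0])])  # player 3 mismatches player 1

rng = np.random.default_rng(1)
tau = 0.1
exponents = (1.0, 0.9, 0.8)  # player 1 slowest, player 3 fastest
Q = rng.random((3, 2))

for n in range(1, 300_001):
    pi = [boltzmann(Q[i], tau) for i in range(3)]
    a = [rng.choice(2, p=pi[i]) for i in range(3)]
    r = pennies_rewards(a)
    for i in range(3):
        lam = n ** -exponents[i]
        # importance-weighted step (slide 6), clipped at 1 so Q stays a
        # convex combination of past values and rewards in this sketch
        step = min(lam / pi[i][a[i]], 1.0)
        Q[i, a[i]] += step * (r[i] - Q[i, a[i]])

print(np.round(Q, 2))  # each entry should be near 0.5, the expected reward
                       # at the unique mixed equilibrium
```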

  15. Another plot of Q values

  16. Conclusion • A theoretical study of multiagent learning. • A fixed temperature parameter achieves mixed equilibria from Q values. • Multiple timescales assist convergence and enable theoretical study.

  17. Future work • Investigate when the convergence assumption is guaranteed to hold. • Experiments with multiple-timescales learning in Markov games. • Theoretical results for multiple-timescales learning in Markov games.
