
Mean Field Equilibria of Multi-Armed Bandit Games


Presentation Transcript


  1. Mean Field Equilibria of Multi-Armed Bandit Games Ramki Gummadi (Stanford) Joint work with: Ramesh Johari (Stanford) Jia Yuan Yu (IBM Research, Dublin)

  2. Motivation • Classical MAB models have a single agent. • What happens when other agents influence arm rewards? • Do standard learning algorithms lead to any equilibrium?

  3. Examples • Wireless transmitters learning unknown channels with interference • Sellers learning about product categories: e.g. eBay • Positive externalities: social gaming.

  4. Example: Wireless Transmitters [Diagram: a single transmitter choosing between Channel A (success rate 0.8) and Channel B (success rate 0.6); the rates are unknown to the transmitter]

  5. Example: Wireless Transmitters [Diagram: the same choice with other transmitters present; each channel now shows two numbers (Channel A: 0.8 ; 0.9, Channel B: 0.6 ; 0.1), since the reward on a channel depends on what the other transmitters do]

  6. Modeling the Bandit Game • Perfect Bayesian equilibrium: requires implausible agent behavior. • Mean field model: agents behave under an assumption of stationarity.

  7. Outline • Model • The equilibrium concept • Existence • Dynamics • Uniqueness and convergence • From finite system to limit model • Conclusion

  8. Mean Field Model of MAB Games • Discrete time; finitely many arms; Bernoulli rewards. • An agent at any time has a state s (its per-arm reward history) and a type θ. • Agents `regenerate’ at random: at each time slot an agent regenerates with a fixed probability. • On regeneration, the type θ is sampled i.i.d. from a fixed distribution. • The state s is reset to the zero vector.
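
To make the model primitives concrete, here is a minimal Python sketch. It is not the authors' code: the names (Agent, n_arms, regen_prob, sample_type) and the choices of a per-slot regeneration probability and a uniform type distribution are illustrative assumptions.

    import numpy as np

    n_arms = 3          # number of arms (illustrative)
    regen_prob = 0.05   # assumed per-slot regeneration probability
    rng = np.random.default_rng(0)

    def sample_type():
        # a type: per-arm base success probabilities, drawn i.i.d. from a
        # fixed distribution (uniform here, purely for illustration)
        return rng.uniform(0.1, 0.9, size=n_arms)

    class Agent:
        def __init__(self):
            self.regenerate()

        def regenerate(self):
            # fresh type; state (per-arm success/failure counts) reset to zero
            self.theta = sample_type()
            self.wins = np.zeros(n_arms)
            self.losses = np.zeros(n_arms)

        def maybe_regenerate(self):
            # regeneration happens at random, with fixed probability per slot
            if rng.random() < regen_prob:
                self.regenerate()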

  9. Mean Field Model of MAB Games • Policy σ: maps the agent’s state to a (randomized) arm choice, e.g. UCB, Gittins index. • Population profile f: the distribution of agents over arms. • Reward distribution: Bernoulli, with a per-arm mean μ(θ, f) that depends on the agent’s type and the population profile.
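
Continuing the sketch above: a policy mapping an agent's state to an arm (UCB1, one of the slide's examples), the population profile as the empirical arm distribution, and a placeholder reward-mean function mu(theta, f). The concrete form of mu is an assumption; more variants appear under slide 11.

    def ucb_policy(wins, losses, c=2.0):
        # UCB1 index computed from the agent's own per-arm counts
        pulls = wins + losses
        if np.any(pulls == 0):
            return int(np.argmin(pulls))   # play each arm once first
        t = pulls.sum()
        index = wins / pulls + np.sqrt(c * np.log(t) / pulls)
        return int(np.argmax(index))

    def population_profile(arm_choices, n_arms):
        # fraction of agents currently on each arm: the profile f
        return np.bincount(arm_choices, minlength=n_arms) / len(arm_choices)

    def mu(theta, f):
        # per-arm Bernoulli means as a function of the agent's type theta and
        # the profile f; this particular (negative-externality) form is
        # illustrative only -- see the examples under slide 11
        return theta * (1.0 - f)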

  10. A Single Agent’s Evolution • Current state: s. • Current type: θ. • Agent picks an arm using σ(s). • Population profile: f. • Transitions to a new state where the chosen arm’s count records a success with probability given by μ(θ, f), and a failure otherwise.
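
One time slot of a single agent's evolution, reusing the sketches above (Agent, ucb_policy, mu and rng are the illustrative names introduced there):

    def agent_step(agent, f, policy=ucb_policy):
        # the policy picks an arm from the agent's current state
        arm = policy(agent.wins, agent.losses)
        # Bernoulli reward whose mean depends on the agent's type and on
        # the current population profile f
        if rng.random() < mu(agent.theta, f)[arm]:
            agent.wins[arm] += 1     # success on the chosen arm
        else:
            agent.losses[arm] += 1   # failure on the chosen arm
        # regeneration resets the type and the state
        agent.maybe_regenerate()
        return arm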

  11. Examples of Reward Functions • Negative externality: an arm’s reward decreases in the fraction of agents choosing it (e.g. wireless interference). • Positive externality: an arm’s reward increases in the fraction of agents choosing it (e.g. social gaming). • Non-separable rewards: an arm’s reward can depend on the whole population profile, not just on that arm’s own share.
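
The slide's concrete formulas are not recoverable from the transcript; the functions below are hedged stand-ins showing the three shapes of dependence on the profile f (a numpy array of arm shares, as above):

    def mu_negative(theta, f):
        # negative externality: more agents on an arm -> lower success rate
        # (wireless interference)
        return theta * (1.0 - f)

    def mu_positive(theta, f):
        # positive externality: more agents on an arm -> higher success rate
        # (social gaming)
        return theta * f

    def mu_nonseparable(theta, f):
        # non-separable: an arm's mean depends on the whole profile
        # (here, on the most congested arm), not just on its own share
        return theta * (1.0 - 0.5 * f.max())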

  12. The Equilibrium Concept • What constitutes an MFE? • A joint distribution π for (state, type). • A population profile f. • A policy σ that maps state to arm choice. • Equilibrium conditions for (π, f): • π has to be the unique invariant distribution of the single-agent dynamics under σ, for the fixed population profile f. • f arises from π when agents adopt policy σ.
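
In symbols (using the notation reconstructed above; the kernel symbol P_{σ,f} is ours), the two equilibrium conditions can be written as:

    \pi \;=\; \pi\, P_{\sigma, f}
        \qquad \text{($\pi$ is invariant for the single-agent dynamics under $\sigma$, given $f$),}

    f(i) \;=\; \int \Pr[\sigma(s) = i]\, \pi(ds, d\theta) \;\; \text{for every arm } i
        \qquad \text{($f$ is the arm distribution induced by $\pi$ under $\sigma$).}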

  13. Optimality in Equilibrium • In an MFE, the population profile f doesn’t change over time. • σ can be any “optimal” policy for learning in an i.i.d. reward environment.

  14. Existence of MFE Theorem: At least one MFE exists if the reward mean μ(θ, f) is continuous in f for every θ. • Proved using Brouwer’s fixed point theorem.

  15. Beyond Existence • MFE exists, but when is it unique? • Can agent dynamics find such an equilibrium even if it is unique? • How does the mean field model approximate a system with finitely many agents?

  16. Dynamics [Diagram: a population of agents spread over arms 1, 2, 3, …, i, …, n]

  17. Dynamics [Diagram: as above, with the policy σ mapping each agent’s state to an arm]

  18. Dynamics [Diagram: as above, with the transition kernel carrying the population profile f_t to f_{t+1}]

  19. Dynamics [Diagram: as above; one more step of the population-level transition]

  20. Dynamics Theorem: Let Φ denote the map carrying the population profile f_t to f_{t+1}. Assume the reward mean μ is Lipschitz in f for every θ. Then Φ is a contraction map (in total variation) if the Lipschitz constant is small enough relative to the regeneration rate. • Proof uses a coupling argument on the bandit process.
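
The role of the contraction property is the standard Banach fixed-point argument (stated here for completeness; it is not spelled out on the slide): a contraction has a unique fixed point, and the iterates converge to it geometrically,

    \|\Phi(f) - \Phi(g)\|_{TV} \;\le\; c\, \|f - g\|_{TV}, \quad c < 1
    \;\;\Longrightarrow\;\;
    \exists!\; f^* \text{ with } \Phi(f^*) = f^*, \qquad
    \|f_t - f^*\|_{TV} \;\le\; c^{\,t}\, \|f_0 - f^*\|_{TV}.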

  21. Uniqueness and Convergence • Fixed points of Φ correspond to MFE. • For an arbitrary initial profile f_0, the mean field evolution is f_{t+1} = Φ(f_t). When Φ is a contraction (w.r.t. total variation): • There exists a unique MFE. • The mean field trajectory of measures converges to it.
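
A Monte Carlo sketch of the mean field evolution, reusing the illustrative Agent, agent_step and population_profile definitions above: every agent's rewards are driven by the same deterministic profile f_t, and the next profile is the arm distribution that results. With many agents this approximates f_{t+1} = Φ(f_t); when Φ is a contraction, the trajectory settles near the unique MFE profile.

    def mean_field_trajectory(n_agents=5_000, n_steps=200):
        # approximate the deterministic mean field map with a large population
        # whose rewards all use the *same* profile f_t (not the empirical one)
        agents = [Agent() for _ in range(n_agents)]
        f = np.ones(n_arms) / n_arms          # arbitrary initial profile f_0
        trajectory = [f]
        for _ in range(n_steps):
            choices = np.array([agent_step(a, f) for a in agents])
            f = population_profile(choices, n_arms)
            trajectory.append(f)
        return trajectory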

  22. Finite Systems to Limit Model • Rewards depend on the empirical population profile of the n agents. • The empirical measure is a random probability measure on the (state, type) space. • (In what sense) does the finite system converge to the mean field limit as n → ∞? i.e. Could trajectories diverge after a long time even for large n?
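
For contrast with the sketch under slide 21, the finite-n system differs in exactly one place: rewards are driven by the random empirical profile of the n agents themselves, not by a fixed mean-field profile. Comparing the two trajectories is what the approximation result on the next slide is about.

    def finite_system_step(agents, f_emp):
        # one slot of the n-agent system: every agent's rewards use the
        # *empirical* profile f_emp, which is itself a random measure
        choices = np.array([agent_step(a, f_emp) for a in agents])
        return population_profile(choices, n_arms)   # next empirical profile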

  23. Approximation Property Theorem: As n → ∞, the finite-system trajectory converges to the mean field trajectory, uniformly in time, when Φ is a contraction. • Proof uses an artificial “auxiliary” system with rewards based on the mean field profile. • Coupling of transitions enables a bridge from the finite system to the mean field limit via the auxiliary system.

  24. Conclusion • Agent populations converge to a mean field equilibrium using classical bandit algorithms. • A large agent population effectively mitigates non-stationarity in MAB games. • Interesting theoretical results beyond existence: uniqueness, convergence and approximation. • The insights are more general than the theorem conditions strictly imply.
