
Optimism in the Face of Uncertainty: A Unifying Approach


Presentation Transcript


  1. Optimism in the Face of Uncertainty: A Unifying Approach • István Szita & András Lőrincz, Eötvös Loránd University, Hungary

  2. Outline • background • quick overview of exploration methods • construction of the new algorithm • analysis & experimental results • outlook

  3. Background • Markov decision processes • finite, discounted • (…but wait until the end of the talk) • value function-based methods • Q(x,a) values • the efficient exploration problem

  4. Basic exploration: ε-greedy • extremely simple • sufficient for convergence in the limit • for many classical methods like Q-learning, Dyna, Sarsa • …under suitable conditions • extremely inefficient
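A minimal sketch of the ε-greedy rule mentioned on this slide, assuming a dictionary-style Q-table; the names and defaults are illustrative, not from the talk.

```python
import numpy as np

def epsilon_greedy(Q, x, n_actions, epsilon=0.1, rng=np.random.default_rng()):
    """Pick a uniformly random action with probability epsilon, otherwise act greedily.

    Q is assumed to be a dict-like table mapping (state, action) to a value;
    the table layout and parameter names are assumptions of this sketch.
    """
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))          # explore uniformly at random
    values = [Q.get((x, a), 0.0) for a in range(n_actions)]
    return int(np.argmax(values))                    # exploit the current estimates
```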

  5. Advanced exploration • in case of uncertainty, be optimistic! • …details vary • we will use concepts from • R-max • optimistic initial values • exploration bonus methods • model-based interval estimation • there are many others, • Bayesian methods • UCT • delayed Q-learning • …

  6. R-max (Brafman & Tennenholtz, 2001) • builds a model from observations • uses an optimistic model • unknown transitions go to the “garden of Eden” (a hypothetical state with maximum reward) • transitions are declared known after a fixed, polynomial number of visits • + poly-time convergence • − slow in practice
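As a rough illustration of the mechanism on this slide, the sketch below builds R-max's optimistic model from visit counters; the array layout, the `known_threshold` parameter, and the other names are assumptions of the sketch, not the original pseudocode.

```python
import numpy as np

def rmax_optimistic_model(counts, reward_sums, n_states, n_actions,
                          known_threshold, r_max):
    """Build an R-max-style optimistic model (schematically).

    counts[x, a, y]    -- observed visits to (x, a, y)
    reward_sums[x, a]  -- sum of rewards observed after taking a in x
    Unknown (x, a) pairs are redirected to a fictitious "Eden" state that
    loops onto itself with the maximum reward.
    """
    eden = n_states                              # index of the extra, hypothetical state
    P = np.zeros((n_states + 1, n_actions, n_states + 1))
    R = np.zeros((n_states + 1, n_actions))
    P[eden, :, eden] = 1.0                       # Eden loops onto itself...
    R[eden, :] = r_max                           # ...with maximal reward
    for x in range(n_states):
        for a in range(n_actions):
            n = counts[x, a].sum()
            if n >= known_threshold:             # "known": use the empirical estimates
                P[x, a, :n_states] = counts[x, a] / n
                R[x, a] = reward_sums[x, a] / n
            else:                                # "unknown": pretend the pair leads to Eden
                P[x, a, eden] = 1.0
                R[x, a] = r_max
    return P, R
```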

  7. Optimistic initial values • set initial values high • usually combined with other techniques • with very high initial values, no need for additional exploration • + no extra work • − the optimism wears off only slowly • − model-free only
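A small sketch of tabular Q-learning whose only exploration device is a high initial value, in the spirit of this slide; the environment interface (`reset`/`step`) and all constants are assumptions for illustration.

```python
import numpy as np

def optimistic_q_learning(env, n_states, n_actions, q0=100.0,
                          alpha=0.1, gamma=0.95, episodes=500):
    """Tabular Q-learning that explores only through a high initial value q0.

    `env` is assumed to expose reset() -> state and step(a) -> (next_state,
    reward, done); this interface is an assumption of the sketch.
    """
    Q = np.full((n_states, n_actions), q0)      # optimistic start: every action looks great
    for _ in range(episodes):
        x, done = env.reset(), False
        while not done:
            a = int(np.argmax(Q[x]))            # purely greedy; the optimism drives exploration
            y, r, done = env.step(a)
            target = r + (0.0 if done else gamma * Q[y].max())
            Q[x, a] += alpha * (target - Q[x, a])   # each update slowly wears off the optimism
            x = y
    return Q
```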

  8. Exploration bonus methods (e.g. Meuleau & Bourgine, 1999; many others) • bonus reward for “interesting” states • rarely visited, large TD-error, etc. • exact size/form varies • can oscillate fervently • regular/bonus rewards accumulated in separate value functions • + can be efficient in practice • − ad-hoc method • − bonuses do not converge
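One possible form of such a bonus, for illustration only: a count-based term added to the observed reward. The 1/√n shape and the coefficient `beta` are assumptions, since the slide stresses that the exact size and form vary.

```python
import numpy as np

def reward_with_bonus(r, n_visits, beta=1.0):
    """Return the observed reward plus a count-based exploration bonus.

    Rarely visited state-actions get a large bonus; the bonus shrinks as the
    visit count grows. This particular form is illustrative, not the talk's.
    """
    return r + beta / np.sqrt(n_visits + 1.0)
```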

  9. Model-based interval estimation (Wiering, 1998; Strehl & Littman, 2006) • builds a model from observations • estimates confidence intervals of state values • exploration bonus: the half-widths of the intervals • + poly-time convergence • − ???
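For intuition, here is a Hoeffding-style interval half-width that shrinks with the number of visits; MBIE's actual intervals (over rewards and transition probabilities) and constants differ, so this is only a schematic stand-in.

```python
import numpy as np

def hoeffding_half_width(n_visits, delta=0.05):
    """Half-width of a Hoeffding-style confidence interval for a mean estimate.

    Used here only to show the shape of an interval-based exploration bonus:
    it shrinks roughly as 1/sqrt(n_visits). Constants are illustrative.
    """
    return float(np.sqrt(np.log(2.0 / delta) / (2.0 * max(n_visits, 1))))
```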

  10. Assembling the new algorithm • model estimation from visit counts: • N_t(x,a,y): number of visits to (x,a,y) up to t; N_t(x,a): number of visits to (x,a) up to t; C_t(x,a,y): sum of rewards received for (x,a,y) up to t • estimated transition probabilities: P̂_t(y|x,a) = N_t(x,a,y) / N_t(x,a) • estimated rewards: R̂_t(x,a,y) = C_t(x,a,y) / N_t(x,a,y)
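The counters listed above translate directly into a small count-based model class; a sketch (the variable names are mine, the quantities are the ones on the slide):

```python
import numpy as np

class EmpiricalModel:
    """Count-based model estimate.

    N[x, a, y] -- number of visits to (x, a, y) so far
    C[x, a, y] -- sum of rewards received for (x, a, y) so far
    """
    def __init__(self, n_states, n_actions):
        self.N = np.zeros((n_states, n_actions, n_states))
        self.C = np.zeros((n_states, n_actions, n_states))

    def update(self, x, a, r, y):
        self.N[x, a, y] += 1
        self.C[x, a, y] += r

    def estimates(self, x, a):
        # Assumes at least one (possibly fictitious) visit to (x, a).
        n_xa = self.N[x, a].sum()
        p_hat = self.N[x, a] / n_xa              # empirical transition probabilities
        r_hat = self.C[x, a].sum() / n_xa        # empirical mean one-step reward from (x, a)
        return p_hat, r_hat
```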

  11. Assembling the new algorithm II • Optimistic initial model: a single fictitious visit to the “Eden” state x_E from each (x,a) • really optimistic!

  12. Assembling the new algorithm II • Optimistic initial model: a single fictitious visit to the “Eden” state x_E from each (x,a) • really optimistic! • cf. R-max: hypothetical “Eden” state with max. reward • cf. optimistic initial values: no extra work after initialization
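A sketch of the optimistic initialisation described on slides 11-12: every (x,a) gets one fictitious visit to the Eden state x_E with reward Rmax. The extra state index and array names follow the model sketch above and are assumptions of this illustration.

```python
import numpy as np

def init_optimistic_model(n_states, n_actions, r_max):
    """Counters initialised as if every (x, a) had already been tried once,
    leading to the hypothetical "Eden" state x_E with the maximum reward.

    The Eden state is appended as index n_states; Eden's own actions also
    point back to Eden, so it behaves as an absorbing, maximally rewarding state.
    """
    eden = n_states
    N = np.zeros((n_states + 1, n_actions, n_states + 1))
    C = np.zeros((n_states + 1, n_actions, n_states + 1))
    N[:, :, eden] = 1.0                          # one fictitious visit to x_E from every (x, a)
    C[:, :, eden] = r_max                        # ...with the maximal reward Rmax
    return N, C
```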

  13. Assembling the new algorithm III • in each step t: • a_t := greedy with respect to Q_t(x_t, ·) • perform a_t, observe the next state and reward • update counters and model parameters • solve the model MDP (can be done incrementally & fast, e.g. with a few steps of value iteration, or asynchronously by prioritized sweeping) • get the new value function Q_{t+1}
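A schematic version of the per-step loop on this slide, assuming the count-based model sketched earlier and an `env.step` interface; full value-iteration sweeps stand in for the incremental / prioritized-sweeping update mentioned on the slide.

```python
import numpy as np

def oim_step(x, Q, model, env, gamma=0.95, sweeps=3):
    """One interaction step in the spirit of slide 13 (schematic).

    Q is a table over all states the model tracks (including the Eden state),
    `model` is an EmpiricalModel-style object with optimistically initialised
    counters, and env.step(a) is assumed to return (next_state, reward, done).
    """
    a = int(np.argmax(Q[x]))                    # action selection is always greedy
    y, r, done = env.step(a)                    # perform a_t, observe next state and reward
    model.update(x, a, r, y)                    # update counters / model parameters

    n_states_total, n_actions = Q.shape
    for _ in range(sweeps):                     # a few value-iteration steps on the model MDP
        V = Q.max(axis=1)
        for s in range(n_states_total):
            for b in range(n_actions):
                p_hat, r_hat = model.estimates(s, b)
                Q[s, b] = r_hat + gamma * (p_hat * V).sum()
    return y, Q, done
```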

  14. Assembling the new algorithm IV • Potential problem: Rmax is too large! • solution: keep the real and bonus rewards in separate value functions • real part: initialized to 0, accumulates the “real” rewards • bonus part: initialized to 0 or Rmax, no further rewards are added • the combined value function can be used at any time!

  15. Assembling the new algorithm IV • Potential problem: Rmax is too large! • solution: keep the real and bonus rewards in separate value functions • real part: initialized to 0, accumulates the “real” rewards • bonus part: initialized to 0 or Rmax, no further rewards are added; this part acts as the exploration bonus (cf. exploration bonus methods) • the combined value function can be used at any time!
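A possible reading of the real/bonus split as code: two tables backed up along the action that is greedy for their sum. This is a sketch under my own assumptions (the array names, the `model.estimates` helper, and the convention that the fictitious Rmax reward lives only in the bonus part's initialisation), not the authors' implementation.

```python
import numpy as np

def split_backup(Q_real, Q_bonus, model, x, a, gamma=0.95):
    """One Bellman backup of the split value function (schematic).

    Q_real  -- initialised to 0, receives only the observed "real" rewards
    Q_bonus -- initialised from the optimistic Eden transitions, receives no new reward
    Both parts are backed up along the action that is greedy for their sum, so
    Q_real + Q_bonus stays usable for greedy action selection at any time.
    """
    # r_hat is assumed to average only the observed rewards; the fictitious
    # Rmax reward is represented in Q_bonus instead.
    p_hat, r_hat = model.estimates(x, a)
    greedy = (Q_real + Q_bonus).argmax(axis=1)             # greedy actions of the combined value
    rows = np.arange(Q_real.shape[0])
    v_real = Q_real[rows, greedy]
    v_bonus = Q_bonus[rows, greedy]
    Q_real[x, a] = r_hat + gamma * (p_hat * v_real).sum()  # real rewards only
    Q_bonus[x, a] = gamma * (p_hat * v_bonus).sum()        # bonus flows in only via Eden's value
    return Q_real, Q_bonus
```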

  16. Convergence results • one parameter: Rmax • for large Rmax, converges to the near-optimum (with high probability) • the proof is based on MBIE’s proof (and on R-max, E3) • by the time the bonus becomes small → the number of visits is large → the model estimate is accurate • the bonus shrinks at a different rate than MBIE’s, giving a looser (but still polynomial) bound

  17. Experimental results I (Strehl & Littman, 2006) • “RiverSwim” • “SixArms”

  18. Experimental results II (Meuleau & Bourgine, 1999; Strens, 2000; Dearden, 2000) • “Chain” • “Loop”

  19. Experimental results III (Meuleau & Bourgine, 1999; Strens, 2000; Dearden, 2000) • “FlagMaze”

  20. Experimental results IV (Wiering & Schmidhuber, 1998) • “Maze with subgoals” (rewards of +1000, +500 and +500 shown in the figure)

  21. Outlook • extension to factored MDPs: almost ready • (we need benchmarks) • extension to general function approximation: in progress

  22. Advantages of OIM • polynomial-time convergence (to near-optimum, with high probability) • convincing performance in practice • extremely simple to implement • all work done at initialization • decision making is always greedy • Matlab source code to be released soon

  23. Thank you for your attention! check our web pages at http://szityu.web.eotvos.elte.hu http://inf.elte.hu/lorincz or my reinforcement learning blog “Gimme Reward” at http://gimmereward.wordpress.com

  24. Full pseudocode of the OIM algorithm

  25. Exact statement of the convergence theorem
