
Fitted/batch/model-based RL: A (sketchy, biased) overview(?)






Presentation Transcript


  1. Fitted/batch/model-based RL: A (sketchy, biased) overview(?) Csaba Szepesvári, University of Alberta

  2. Contents
  • What, why?
  • Constraints
  • How?
  • Model-based learning
    • Model learning
    • Planning
  • Model-free learning
    • Averagers
    • Fitted RL

  3. Motto
  "Nothing is more practical than a good theory" [Lewin]
  "He who loves practice without theory is like the sailor who boards ship without a rudder and compass and never knows where he may cast." [Leonardo da Vinci]

  4. What? Why?
  • What is batch RL?
    • Input: samples (the algorithm cannot influence the samples)
    • Output: a good policy
  • Why?
    • Common problem
    • Sample efficiency -- data is expensive
    • Building block
  • Why not?
    • Too much work (for nothing?) -- "Don't worry, be lazy!"
    • Old samples are irrelevant
    • Missed opportunities (evaluate a policy!?)

  5. Constraints
  • Large (infinite) state/action spaces
  • Limits on
    • Computation
    • Memory use

  6. How?
  • Model learning + planning
  • Model-free
    • Policy search
    • DP
      • Policy iteration
      • Value iteration

  7. Model-based learning

  8. Model learning

  9. Model-based methods
  • Model learning: How?
    • Model: What happens if ..?
    • Features vs. observations vs. states
    • System identification? → Satinder! Carlos! Eric! …
  • Planning: How?
    • Sample + learning! (batch RL? ..but you can influence the samples)
    • What else? (Discretize? Nay..)
  • Pro: A model is good for multiple things
  • Contra: The problem is doubled: we need high-fidelity models and good planning
  Problem 1: Should planning take into account the uncertainties in the model? ("robustification")
  Problem 2: How to learn relevant, compact models? For example: how to reject irrelevant features and keep the relevant ones?
  Need: Tight integration of planning and learning!
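To make the "model learning + planning" route concrete, here is a minimal certainty-equivalence sketch for a finite MDP: estimate p and r by counting over the batch, then plan by value iteration on the learned model. The tabular setting, the (s, a, r, s') sample format and all function names are illustrative assumptions, not the feature-based models the slide is asking about.

```python
import numpy as np

def learn_model(transitions, n_states, n_actions):
    """Estimate p(s'|s,a) and r(s,a) from a batch of (s, a, r, s_next) samples."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1.0
        reward_sum[s, a] += r
    n_sa = counts.sum(axis=2)
    visited = n_sa > 0
    # unvisited (s, a) pairs fall back to a uniform next-state distribution
    p_hat = np.where(visited[:, :, None],
                     counts / np.maximum(n_sa, 1)[:, :, None],
                     1.0 / n_states)
    r_hat = reward_sum / np.maximum(n_sa, 1)
    return p_hat, r_hat

def plan_on_model(p_hat, r_hat, gamma=0.95, n_iter=1000):
    """Value iteration on the learned ('certainty equivalent') model."""
    v = np.zeros(p_hat.shape[0])
    for _ in range(n_iter):
        v = (r_hat + gamma * p_hat @ v).max(axis=1)   # V <- max_a [r + gamma * E V]
    q = r_hat + gamma * p_hat @ v
    return q.argmax(axis=1), v                        # greedy policy w.r.t. the learned model
```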

  10. Planning

  11. Bad news..
  • Theorem (Chow & Tsitsiklis '89)
    • Markovian Decision Problems
    • d-dimensional state space
    • Bounded transition probabilities and rewards
    • Lipschitz-continuous transition probabilities and rewards
    → Any algorithm computing an ε-approximation of the optimal value function needs Ω(ε^{-d}) values of p and r.
  • What's next then??
  • Open: Policy approximation?

  12. The joy of laziness • Don’t worry, be lazy: • “If something is too hard to do, then it's not worth doing” • Luckiness factor: • “If you really want something in this life, you have to work for it - Now quiet, they're about to announce the lottery numbers!”

  13. Sparse lookahead trees [Kearns et al., '02]
  • Idea: Computing a good action ≡ planning → build a lookahead tree
  • Size of the tree: S = c·|A|^{H(ε)} (unavoidable), where H(ε) = Kr/(ε(1-γ))
  • Good news: S is independent of d!
  • Bad news: S is exponential in H(ε)
  • Still attractive: Generic, easy to implement
  • Problem: Not really practical
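A minimal sketch of the sparse lookahead idea: estimate the root action values by recursively drawing C successor samples per action from a generative model, down to depth H, and act greedily at the root. The `simulator(s, a) -> (reward, next_state)` interface and the parameter names are assumptions for illustration.

```python
import numpy as np

def sparse_sampling_q(simulator, state, actions, depth, width, gamma):
    """Estimate Q(state, a) for every action with a sparse lookahead tree.

    simulator(s, a) must return one sampled (reward, next_state) pair."""
    if depth == 0:
        return np.zeros(len(actions))
    q = np.zeros(len(actions))
    for i, a in enumerate(actions):
        total = 0.0
        for _ in range(width):                       # C sampled children per action
            r, s_next = simulator(state, a)
            v_next = sparse_sampling_q(simulator, s_next, actions,
                                       depth - 1, width, gamma).max()
            total += r + gamma * v_next
        q[i] = total / width
    return q

# Planning at a state = acting greedily at the root of the tree:
# best = actions[int(np.argmax(sparse_sampling_q(sim, s0, actions, depth=H, width=C, gamma=0.95)))]
```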

  14. Idea.. → Remi
  • Be more lazy
  • Need to propagate values from good leaves as early as possible
  • Why sample suboptimal actions at all?
  • Breadth-first → Depth-first!
  • Bandit algorithms → Upper Confidence Bounds → UCT
  • Similar ideas: [Peret and Garcia, '04], [Chang et al., '05], [KoSze '06]
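A minimal sketch of the bandit-style rule behind UCT: at each node of the search tree, pick the action that maximises its empirical mean plus an upper-confidence bonus. The node layout (per-action visit counts and summed returns) and the constant c are illustrative assumptions.

```python
import math

def ucb1_select(node, c=1.4):
    """Pick the action maximising mean return + exploration bonus.

    node.counts[a] and node.values[a] hold the visit count and summed
    returns per action (illustrative field names)."""
    total = sum(node.counts.values())
    best_a, best_score = None, -float("inf")
    for a in node.counts:
        if node.counts[a] == 0:
            return a                                   # try every action once first
        mean = node.values[a] / node.counts[a]
        bonus = c * math.sqrt(math.log(total) / node.counts[a])
        if mean + bonus > best_score:
            best_a, best_score = a, mean + bonus
    return best_a
```

UCT then wraps this rule in a recursive tree descent: select with `ucb1_select` until a leaf is reached, expand, roll out, and back up the return along the visited path.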

  15. Results: Sailing
  • 'Sailing': Stochastic shortest path
  • State-space size = 24 × problem-size
  • Extension to two-player, full-information games
  • Good results in Go! (→ Remi, David!)
  Open: Why (when) does UCT work so well?
  Conjecture: When being (very) optimistic does not abuse the search
  How to improve UCT?

  16. Random Discretization Method [Rust '97]
  • Method:
    • Random base points
    • Value function computed at these points (weighted importance sampling)
    • Values at other points computed at run-time ("half-lazy method")
    • Why Monte Carlo? Avoid grids!
  • Result:
    • State space: [0,1]^d
    • Action space: finite
    • p(y|x,a), r(x,a) Lipschitz continuous, bounded
    • Theorem [Rust '97]:
    • Theorem [Sze '01]: Polynomially many samples are enough to come up with ε-optimal actions (poly dependence on H). Smoothness of the value function is not required.
  Open: Can we improve the result by changing the distribution of samples? Idea: Presample + follow the obtained policy
  Open: Can we get poly dependence on both d and H without representing a value function? (e.g. lookahead trees)
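A minimal sketch of the random-discretisation idea: draw random base points in [0,1]^d, run value iteration on those points with the transition density re-weighted (normalised) over the base points, and read values elsewhere off the same weights at run-time. `density(y, x, a)` and `reward(x, a)` are assumed callables; this is a caricature of the estimator, not Rust's exact construction.

```python
import numpy as np

def random_discretization_vi(density, reward, n_points, n_actions, d,
                             gamma=0.95, n_iter=200, seed=0):
    """Value iteration on random base points (in the spirit of Rust '97)."""
    rng = np.random.default_rng(seed)
    base = rng.uniform(size=(n_points, d))                  # random base points in [0,1]^d
    rew = np.array([[reward(base[i], a) for a in range(n_actions)]
                    for i in range(n_points)])              # (N, A)
    # normalised transition weights between base points, per action
    w = np.zeros((n_actions, n_points, n_points))
    for a in range(n_actions):
        for i in range(n_points):
            row = np.array([density(base[j], base[i], a) for j in range(n_points)])
            w[a, i] = row / (row.sum() + 1e-12)
    v = np.zeros(n_points)
    for _ in range(n_iter):
        q = rew + gamma * np.stack([w[a] @ v for a in range(n_actions)], axis=1)
        v = q.max(axis=1)
    return base, v                                          # values at the base points
```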

  17. Pegasus [Ng & Jordan '00]
  • Idea: Policy search + the method of common random numbers ("scenarios")
  • Results:
    • Condition: Deterministic simulative model
    • Thm: Finite action space, finite-complexity policy class → polynomial sample complexity
    • Thm: Infinite action spaces, Lipschitz continuity of transition probabilities + rewards → polynomial sample complexity
    • Thm: Finitely computable models + policies → polynomial sample complexity
  • Pro: Nice results
  • Contra: Global search? What policy space?
  Problem 1: How to avoid global search?
  Problem 2: When can we find a good policy efficiently? How?
  Problem 3: How to choose the policy class?
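A minimal sketch of the "scenarios" trick: pre-draw all randomness, treat the simulator as a deterministic map of (state, action, noise), and evaluate every candidate policy on the same fixed scenarios, so that policy search optimises a deterministic objective. The interfaces (`det_step`, `policy`) and the data layout are assumptions for illustration.

```python
import numpy as np

def pegasus_value(policy, det_step, x0s, noises, gamma=0.95):
    """Evaluate a policy on fixed scenarios (common random numbers).

    det_step(x, a, u) -> (reward, next_state) is a deterministic simulative
    model driven by the pre-drawn noise u; x0s are start states and
    noises[k] is the noise sequence of scenario k."""
    returns = []
    for x0, noise_seq in zip(x0s, noises):
        x, ret, disc = x0, 0.0, 1.0
        for u in noise_seq:
            a = policy(x)
            r, x = det_step(x, a, u)
            ret += disc * r
            disc *= gamma
        returns.append(ret)
    return np.mean(returns)

# Policy search then optimises this now-deterministic objective over the
# policy's parameters, reusing the SAME scenarios for every candidate policy.
```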

  18. Other planning methods
  • Your favorite RL method! + Planning is easier than learning: you can reset the state!
  • Dyna-style planning with prioritized sweeping → Rich (a tabular sketch follows below)
  • Conservative policy iteration
    • Problem: Policy search, guaranteed improvement in every iteration
    • [K&L '00]: Bound for finite MDPs, policy class ≡ all policies
    • [K '03]: Arbitrary policies, reduction-style result
  • Policy search by DP [Bagnell, Kakade, Ng & Schneider '03]
    • Similar to [K '03], finite-horizon problems
  • Fitted value iteration
  • ..
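As a concrete instance of one item on this list, here is a tabular sketch of prioritized sweeping run against an explicit model: states with the largest Bellman error are backed up first, and predecessors of an updated state are re-queued. The array layout, threshold and update budget are illustrative assumptions.

```python
import heapq
import numpy as np

def prioritized_sweeping(p, r, gamma=0.95, theta=1e-6, max_updates=100_000):
    """Prioritized sweeping on a model p[s, a, s'] (shape (S, A, S)), r[s, a]."""
    n_states = r.shape[0]
    v = np.zeros(n_states)

    def bellman_error(s):
        return abs((r[s] + gamma * p[s] @ v).max() - v[s])

    heap = [(-bellman_error(s), s) for s in range(n_states)]
    heapq.heapify(heap)
    for _ in range(max_updates):
        if not heap:
            break
        _, s = heapq.heappop(heap)
        v[s] = (r[s] + gamma * p[s] @ v).max()               # full backup of the chosen state
        for s_pred in range(n_states):                       # re-queue predecessors of s
            if s_pred != s and p[s_pred, :, s].max() > 0:
                err = bellman_error(s_pred)
                if err > theta:
                    heapq.heappush(heap, (-err, s_pred))
    return v
```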

  19. Model-free: Policy Search
  • ????
  Open: How to do it?? (I am serious)
  Open: How to evaluate a policy/policy gradient given some samples? (partial result: in the limit, under some conditions, policies can be evaluated [AnSzeMu '08])

  20. Model-free: Dynamic Programming
  • Policy Iteration
    • How to evaluate policies?
    • Do good value functions give rise to good policies?
  • Value Iteration
    • Use action-value functions
  • How to represent value functions?
  • How to do the updates?
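For the tabular case, where the slide's representation questions do not yet bite, here is a minimal policy-iteration sketch: evaluate the current policy exactly with a linear solve, then improve greedily through the action-value function. Array shapes and names are illustrative.

```python
import numpy as np

def policy_iteration(p, r, gamma=0.95):
    """Tabular policy iteration. p has shape (S, A, S), r has shape (S, A)."""
    n_s, _ = r.shape
    pi = np.zeros(n_s, dtype=int)
    while True:
        # policy evaluation: solve (I - gamma * P_pi) v = r_pi
        p_pi = p[np.arange(n_s), pi]                   # (S, S)
        r_pi = r[np.arange(n_s), pi]                   # (S,)
        v = np.linalg.solve(np.eye(n_s) - gamma * p_pi, r_pi)
        # policy improvement via the action-value function
        q = r + gamma * p @ v                          # (S, A)
        new_pi = q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            return pi, v
        pi = new_pi
```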

  21. Value-function based methods
  • Questions:
    • What representation to use?
    • How are errors propagated?
  • Averagers [Gordon '95] ~ kernel methods
    • V_{t+1} = Π_F T V_t
    • L1 theory
    • Can we have an L2 (Lp) theory?
  • Counterexamples [Boyan & Moore '95, Baird '95, BeTsi '96]
  • L2 error propagation [Munos '03, '05]
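The update V_{t+1} = Π_F T V_t is easiest to see in code with an averager as Π_F: a normalised kernel smoother assigns non-negative weights summing to one, so the projection cannot expand sup-norm distances. The data layout below (one sampled transition per centre/action pair) is an illustrative assumption.

```python
import numpy as np

def kernel_weights(x, centers, bandwidth=0.1):
    """An 'averager': non-negative weights over the centers that sum to one."""
    w = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2 * bandwidth ** 2))
    return w / (w.sum() + 1e-12)

def fitted_vi_averager(centers, rewards, next_states, gamma=0.95, n_iter=100):
    """Iterate V_{t+1} = Pi_F T V_t with Pi_F the averager above.

    rewards[a][i] and next_states[a][i] come from taking action a at centers[i]."""
    centers = np.asarray(centers, dtype=float)
    n, n_actions = len(centers), len(rewards)
    v = np.zeros(n)                                          # current V at the centers
    for _ in range(n_iter):
        # sampled Bellman backup (T V) at every (center, action) pair
        backed_up = np.array([[rewards[a][i]
                               + gamma * kernel_weights(next_states[a][i], centers) @ v
                               for a in range(n_actions)] for i in range(n)])
        targets = backed_up.max(axis=1)
        # projection Pi_F: represent the new function through the averager itself
        v = np.array([kernel_weights(c, centers) @ targets for c in centers])
    return v
```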

  22. Fitted methods
  • Idea: Use regression/classification with value/policy iteration
  • Notable examples:
    • Fitted Q-iteration (a sketch follows below)
      • Use trees (→ averagers; Damien!)
      • Use neural nets (→ L2, Martin!)
    • Policy iteration
      • LSTD [Bradtke & Barto '96, Boyan '99], BRM [AnSzeMu '06, '08]
      • LSPI: Use action-value functions + iterate [Lagoudakis & Parr '01, '03]
      • RL as classification [La & Pa '03]
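A minimal fitted Q-iteration sketch: repeatedly regress bootstrapped targets r + γ max_a' Q(s', a') on (state, action) features over a fixed batch. The `regressor_factory` interface, the feature layout and the assumption that states are feature vectors are all illustrative.

```python
import numpy as np

def fitted_q_iteration(transitions, regressor_factory, actions, gamma=0.95, n_iter=50):
    """Fitted Q-iteration on a fixed batch of (s, a, r, s_next) transitions.

    regressor_factory() must return a fresh regressor with fit/predict."""
    s = np.array([t[0] for t in transitions], dtype=float)
    a = np.array([t[1] for t in transitions], dtype=float).reshape(-1, 1)
    r = np.array([t[2] for t in transitions], dtype=float)
    s_next = np.array([t[3] for t in transitions], dtype=float)
    xs = np.hstack([s, a])                           # regress Q on (state, action) features
    q = None
    for _ in range(n_iter):
        if q is None:
            targets = r                              # first iteration: Q_1 ~ immediate reward
        else:
            q_next = np.column_stack(
                [q.predict(np.hstack([s_next, np.full((len(s_next), 1), b)]))
                 for b in actions])
            targets = r + gamma * q_next.max(axis=1)
        q = regressor_factory()
        q.fit(xs, targets)
    return q
```

With scikit-learn, for example, `regressor_factory=lambda: ExtraTreesRegressor(n_estimators=50)` gives a tree-based variant, while any network wrapped with fit/predict gives the neural one.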

  23. Results for fitted algorithms
  • Results for LSPI/BRM-PI, FQI:
    • Finite action-, continuous state-space
    • Smoothness conditions on the MDP
    • Representative training set
    • Function class F large (Bellman error of F is small), but of controlled complexity
    → Polynomial rates (similar to supervised learning)
  • FQI, continuous action spaces:
    • Similar conditions + restricted policy class → polynomial rates, but bad scaling with the dimension of the action space
  Open: How to choose the function space in an adaptive way? ~ model selection in supervised learning
  Supervised learning does not work without model selection. Why would RL work? → NO, IT DOES NOT.
  Idea: Regularize! (a sketch follows below) → Problem: How to evaluate policies? [AnSzeMu '06-'08]
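A minimal sketch of the "regularize!" direction applied to the policy-evaluation step: ridge-penalised LSTD on a batch of transitions generated by the policy under evaluation. This only illustrates the idea of penalising the fit; it is not the exact estimator of the cited papers, and the feature and parameter names are assumptions.

```python
import numpy as np

def regularized_lstd(phis, rewards, phis_next, gamma=0.95, lam=1.0):
    """L2-regularised LSTD policy evaluation from a batch of transitions.

    phis[t] = phi(x_t) and phis_next[t] = phi(x_{t+1}) are feature vectors along
    trajectories of the policy being evaluated; lam is the ridge penalty."""
    phis, phis_next = np.asarray(phis), np.asarray(phis_next)
    d = phis.shape[1]
    a_mat = phis.T @ (phis - gamma * phis_next) + lam * np.eye(d)
    b_vec = phis.T @ np.asarray(rewards)
    theta = np.linalg.solve(a_mat, b_vec)
    return theta                                  # value estimate: V(x) ~ phi(x) @ theta
```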

  24. Regularization

  25. Final thoughts
  • Batch RL: A flourishing area
  • Many open questions
  • More should come soon!
  • Some good results in practice
  • Take computation cost seriously?
  • Connect to on-line RL?

  26. Batch RL
  "Let's switch to that policy – after all, the paper says that learning converges at an optimal rate!"
