
Reinforcement Learning: Learning algorithms for Large state space

This presentation provides an overview of reinforcement learning algorithms for large state spaces. It covers basics, the mathematical model (MDP), planning, value iteration, policy iteration, learning algorithms, and function approximation.


Presentation Transcript


  1. For immediate action! • Split into groups • 2 or 3 per group • Hand in the group list • today, at the end of class! • The Reinforcement Learning book • the book is available online (access via the workshop website)

  2. Reinforcement Learning: Learning Algorithms, Function Approximation Yishay Mansour, Tel-Aviv University

  3. Outline • Week I: Basics • Mathematical Model (MDP) • Planning • Value iteration • Policy iteration • Week II: Learning Algorithms • Model based • Model Free • Week III: Large state space

  4. Learning Algorithms Given access to the environment only through performing actions: 1. policy evaluation. 2. control - find an optimal policy. Two approaches: 1. Model based (Dynamic Programming). 2. Model free (Q-Learning, SARSA).

  5. Learning: Policy improvement • Assume that we can compute, for a given policy π, its V and Q functions • Then we can perform policy improvement: π' = Greedy(Q) • The process converges if the estimates are accurate.
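A minimal sketch of the greedy improvement step, assuming a tabular Q stored as a Python dict keyed by (state, action) pairs; the names and the default value of 0.0 are illustrative, not from the slides.

```python
def greedy_policy(Q, actions):
    """Policy improvement: in every state pick the action with the highest estimated Q-value."""
    def policy(state):
        # Unseen (state, action) pairs default to 0.0 in this sketch.
        return max(actions, key=lambda a: Q.get((state, a), 0.0))
    return policy
```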

  6. Learning - Model Free, Optimal Control: off-policy Learn the Q function online: Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α Δ_t, where Δ_t = r_t + γ max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t). OFF-POLICY: Q-Learning. Note the maximization operator!
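The update above as a hedged Python sketch: Q is again an assumed dict of (state, action) estimates, and alpha (learning rate) and gamma (discount) are illustrative hyper-parameter choices.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Off-policy Q-Learning step: the target maximizes over all actions in s_next,
    regardless of which action the behaviour policy actually takes there."""
    target = r + gamma * max(Q.get((s_next, b), 0.0) for b in actions)
    delta = target - Q.get((s, a), 0.0)            # the estimation error Δ_t
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * delta
```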

  7. Learning - Model Free, Policy evaluation: TD(0) An online view: at state s_t we performed action a_t, received reward r_t and moved to state s_{t+1}. Our "estimation error" is Δ_t = r_t + γ V_t(s_{t+1}) − V_t(s_t). The update: V_{t+1}(s_t) = V_t(s_t) + α Δ_t. No maximization over actions!
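A corresponding sketch of the TD(0) step for policy evaluation, again with an assumed dict-based V and illustrative alpha/gamma values.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """TD(0) step: move V(s_t) toward r_t + γ·V(s_{t+1}); no maximization over actions."""
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)   # estimation error Δ_t
    V[s] = V.get(s, 0.0) + alpha * delta
```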

  8. Learning - Model Free, Optimal Control: on-policy Learn the optimal Q* function online: Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α [ r_t + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t) ]. ON-POLICY: SARSA, where a_{t+1} is the ε-greedy action for Q_t. The policy selects the action! Need to balance exploration and exploitation.
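A sketch of the on-policy SARSA step together with an ε-greedy action selector; the dict-based Q, the default of 0.0 and the hyper-parameters are assumptions for illustration.

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """With probability eps explore a random action, otherwise exploit the greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy step: the target uses the action a_next actually selected by the policy."""
    delta = r + gamma * Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * delta
```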

  9. Modified Notation • Rather than Q(s,a) have Qa(s) • Greedy(Q)(s) = argmax_a Qa(s) • Each action has a function Qa(s) • Learn each Qa(s) independently!
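One way to realize this notation, sketched under the assumption that each action owns an independent estimator exposing a predict(state) method (e.g. a small neural network, a decision tree, or a linear model as on slide 13); the interface names are placeholders.

```python
class PerActionQ:
    """Keep one value estimator per action and learn each Qa(s) independently."""
    def __init__(self, actions, make_estimator):
        self.models = {a: make_estimator() for a in actions}   # one model per action

    def value(self, s, a):
        return self.models[a].predict(s)                       # Qa(s)

    def greedy(self, s, legal_actions):
        # Greedy(Q): the legal action whose own estimator gives the highest value.
        return max(legal_actions, key=lambda a: self.models[a].predict(s))
```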

  10. Large state space • Reduce the number of states • Symmetries (x-o) • Cluster states • Define attributes • Limited number of attributes • Some states will be identical • Action view of a state

  11. Example X-O • For each action (square) • Consider row/diagonal/column through it • The state will encode the status of “rows”: • Two X’s • Two O’s • Mixed (both X and O) • One X • One O • empty • Only Three types of squares/actions
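A small sketch of the attribute described above: classify the line (row, column or diagonal) through a candidate square into the slide's categories, given its three cells as 'X', 'O' or None; the exact encoding is an illustrative assumption.

```python
def line_type(cells):
    """Map one row/column/diagonal to its status: two X, two O, mixed, one X, one O or empty."""
    xs, os = cells.count('X'), cells.count('O')
    if xs >= 2 and os == 0:
        return 'two X'
    if os >= 2 and xs == 0:
        return 'two O'
    if xs >= 1 and os >= 1:
        return 'mixed'
    if xs == 1:
        return 'one X'
    if os == 1:
        return 'one O'
    return 'empty'
```

The attribute vector for a move would then combine the types of the lines through that square, which is what collapses many raw boards onto the same small set of square/action descriptions.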

  12. Clustering states • Need to create attributes • Attributes should be “game dependent” • Different “real” states - same representation • How do we differentiate states? • We estimate action value. • Consider only legal actions. • Play “best” action.

  13. Function Approximation • Use a limited model for Qa(s) • Have an attribute vector: • Each state s has a vector vec(s) = (x_1, ..., x_k) • Normally k << |S| • Examples: • Neural Network • Decision tree • Linear Function • Weights θ = (θ_1, ..., θ_k) • Value: Σ_i θ_i x_i
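For the linear case in particular, the value is just a dot product between the weight vector θ and the attribute vector vec(s); a minimal sketch in plain Python, with illustrative names:

```python
def linear_value(theta, x):
    """Value of a state under a linear approximator: sum_i theta_i * x_i = <theta, x>."""
    return sum(t_i * x_i for t_i, x_i in zip(theta, x))
```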

  14. Gradient Descent • Minimize the Squared Error • Squared Error = ½ Σ_s P(s) [V^π(s) − V_θ(s)]^2 • P(s) is some weighting over the states • Algorithm: • θ(t+1) = θ(t) + α [V^π(s_t) − V_{θ(t)}(s_t)] ∇_{θ(t)} V_{θ(t)}(s_t) • ∇_{θ(t)} = vector of partial derivatives with respect to θ(t) • Replace V^π(s_t) by a sample • Monte Carlo: use the return R_t for V^π(s_t) • TD(0): use Δ_t for [V^π(s_t) − V_{θ(t)}(s_t)]

  15. Linear Functions • Linear function: Σ_i θ_i x_i = <θ, x> • Derivative: ∇_{θ(t)} V_t(s_t) = vec(s_t) • Update rule: • θ_{t+1} = θ_t + α [V^π(s_t) − V_t(s_t)] vec(s_t) • MC: θ_{t+1} = θ_t + α [ R_t − <θ_t, vec(s_t)> ] vec(s_t) • TD: θ_{t+1} = θ_t + α Δ_t vec(s_t)
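A sketch of these update rules with NumPy, instantiating the gradient-descent scheme of slide 14 for the linear case; theta and the feature vectors are assumed to be NumPy arrays, and the step size is an illustrative choice.

```python
import numpy as np

def linear_td0_update(theta, x_s, x_next, r, alpha=0.05, gamma=0.99):
    """TD update: for a linear approximator the gradient of <theta, x> is the feature vector itself."""
    delta = r + gamma * np.dot(theta, x_next) - np.dot(theta, x_s)   # Δ_t
    return theta + alpha * delta * x_s

def linear_mc_update(theta, x_s, R_t, alpha=0.05):
    """Monte Carlo update: regress the value of s_t toward the observed return R_t."""
    return theta + alpha * (R_t - np.dot(theta, x_s)) * x_s
```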

  16. Example: 4 in a row • Select attributes for each action (column): • 3 in a row (type X or type O) • 2 in a row (type X or O) and blocked / not blocked • Next move completes 3 in a row • Next move might lose • Other "features" • RL will learn the weights. • Lookahead helps significantly • use a max-min tree

  17. Bootstrapping • Playing against a "good" player • Using .... • Self play • Start with a random player • play against oneself. • Choose a starting point. • Max-Min tree with a simple scoring function. • Add some simple guidance • add "compulsory" moves.
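A hedged skeleton of such a self-play loop, combining the ε-greedy selection and Q-Learning update from the earlier slides; game.reset(), game.step(a) and game.legal_actions(s) are an assumed environment interface, and the sketch glosses over the two-player detail that the opponent's moves would come from the same (sign-flipped) value estimates.

```python
import random

def self_play(game, Q, episodes=10_000, eps=0.2, alpha=0.1, gamma=1.0):
    """Bootstrap from an (initially random) Q by repeatedly playing the learner against itself."""
    for _ in range(episodes):
        s, done = game.reset(), False
        while not done:
            acts = game.legal_actions(s)
            if random.random() < eps:                  # exploration
                a = random.choice(acts)
            else:                                      # exploitation
                a = max(acts, key=lambda b: Q.get((s, b), 0.0))
            s_next, r, done = game.step(a)
            if done:
                target = r
            else:
                target = r + gamma * max(Q.get((s_next, b), 0.0)
                                         for b in game.legal_actions(s_next))
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
            s = s_next
    return Q
```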

  18. Scoring Function • Checkers: • Number of pieces • Number of Queens • Chess: • Weighted sum of pieces • Othello/Reversi: • Difference in number of pieces • Can be used with a Max-Min Tree • (α, β) pruning
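A generic sketch of max-min search with (α, β) pruning; score(state) is whichever scoring function the slide describes and children(state) enumerates the legal successor positions, both assumed game-supplied callbacks.

```python
def alphabeta(state, depth, alpha, beta, maximizing, score, children):
    """Max-min tree search with (alpha, beta) pruning over game-supplied callbacks."""
    kids = children(state)
    if depth == 0 or not kids:                 # leaf: fall back to the scoring function
        return score(state)
    if maximizing:
        value = float('-inf')
        for child in kids:
            value = max(value, alphabeta(child, depth - 1, alpha, beta, False, score, children))
            alpha = max(alpha, value)
            if alpha >= beta:                  # beta cut-off: opponent will avoid this branch
                break
        return value
    value = float('inf')
    for child in kids:
        value = min(value, alphabeta(child, depth - 1, alpha, beta, True, score, children))
        beta = min(beta, value)
        if beta <= alpha:                      # alpha cut-off
            break
    return value
```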

  19. Example: Reversi (Othello) • Use a simple score function: • difference in pieces • edge pieces • corner pieces • Use a Max-Min Tree • RL: optimize the weights.
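A sketch of such a score function for Reversi, assuming the board is an n×n list of lists of player markers; the particular weights are placeholders, since these are exactly the values RL would be used to tune.

```python
def reversi_score(board, me, opp, weights=(1.0, 5.0, 25.0)):
    """Weighted sum of piece difference, edge pieces and corner pieces (weights to be learned)."""
    w_piece, w_edge, w_corner = weights
    n = len(board)
    corners = {(0, 0), (0, n - 1), (n - 1, 0), (n - 1, n - 1)}
    score = 0.0
    for i, row in enumerate(board):
        for j, cell in enumerate(row):
            if cell not in (me, opp):
                continue
            sign = 1.0 if cell == me else -1.0
            if (i, j) in corners:
                score += sign * w_corner
            elif i in (0, n - 1) or j in (0, n - 1):
                score += sign * w_edge
            else:
                score += sign * w_piece
    return score
```

A function of this kind can also sit at the leaves of the max-min tree from slide 18.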

  20. Advanced issues • Time constraints • fast and slow modes • Opening • can help • End game • many cases with few pieces • can be solved efficiently • Train on a specific state • might be helpful / not sure that it's worth the effort.

  21. What is Next? • Create teams: • at least 2, at most 3 students • Group size will influence our expectations! • Choose a game! • Submit the names and the game • GUI for the game • Deadline Dec. 25, 2005

  22. Schedule (more) • System specification • Project outline • High level components planning • Jan. 29, 2006 • Build system • Project completion • May 1, 2006 • All supporting documents in html!

  23. Next week • GUI interface (using C++) • Afterwards: • Each group works by itself
