
Hierarchical Reinforcement Learning Using Graphical Models

Hierarchical Reinforcement Learning Using Graphical Models. Victoria Manfredi and Sridhar Mahadevan. Rich Representations for Reinforcement Learning, ICML’05 Workshop, August 7, 2005. Introduction: abstraction is necessary to scale RL, which motivates hierarchical RL; the goal is to learn abstractions automatically.


Presentation Transcript


  1. Hierarchical Reinforcement Learning Using Graphical Models Victoria Manfredi and Sridhar Mahadevan Rich Representations for Reinforcement Learning ICML’05 Workshop August 7, 2005

  2. Introduction • Abstraction necessary to scale RL → hierarchical RL • Want to learn abstractions automatically • Other approaches • Find subgoals: McGovern & Barto’01, Simsek & Barto’04, Simsek, Wolfe, & Barto’05, Mannor et al.’04 … • Build policy hierarchy: Hengst’02 • Potentially proto-value functions: Mahadevan’05 • Our approach • Learn initial policy hierarchy using graphical model framework, then learn how to use policies using reinforcement learning and reward • Related to imitation • Price & Boutilier’03, Abbeel & Ng’04

  3. Outline • Dynamic Abstraction Networks • Approach • Experiments • Results • Summary • Future Work

  4. Dynamic Abstraction Network [Figure: a DAN unrolled over two time slices (t=1, t=2). Policy Hierarchy: P1 (example: Attend ICML’05) above P0 (example: Register). State Hierarchy: S1 (example: Bonn) above S0 (example: Conference Center). Also shown: F, F1, F0 nodes and Obs nodes.] • Related models: HHMM (Fine, Singer, & Tishby’98), AHMM (Bui, Venkatesh, & West’02), DAN (Manfredi & Mahadevan’05) • Just one realization of a DAN; others are possible

  5. Approach [Flowchart; box labels: Expert, Hand-code Skills] • Phase 1: Observe Trajectories, then Learn DAN using EM • Open choices: Discrete variables? Continuous? How many state values? Levels? • Phase 2: Extract Abstractions, then Policy Improvement (e.g., SMDP Q-Learning)

  6. DANs vs MAXQ/HAMs DANs infer from training sequences • DANs • # of levels in state/policy hierarchies • # of values for each (abstract) state/policy node • Training sequences: (flat state,action) pairs • MAXQ [Dietterich’00] • # of levels, # of tasks at each level • Connections between levels • Initiation set for each task • Termination set for each task • HAMs [Parr & Russell’98] • # of levels • Hierarchy of stochastic finite state machines • Explicit action, call, choice, stop states

  7. Why Graphical Models? • Advantages of Graphical Models • Joint learning of multiple policy/state abstractions • Continuous/hidden domains • Full machinery of inference can be used • Disadvantages • Parameter learning with hidden variables is expensive • Expectation-Maximization can get stuck in local maxima

  8. Domain • Dietterich’s Taxi (2000) • States • Taxi Location (TL): 25 • Passenger Location (PL): 5 • Passenger Destination (PD): 5 • Actions • North, South, East, West • Pickup, Putdown • Hand-coded policies • GotoRed • GotoGreen • GotoYellow • GotoBlue • Pickup, Putdown

  9. Experiments • Phase 1 • |S1| = 5, |S0| = 25, |Π1| = 6, |Π0| = 6 • 1000 sequences from SMDP Q-learner: {TL, PL, PD, A}_1, …, {TL, PL, PD, A}_n • Bayes Net Toolbox (Murphy’01) • Phase 2 • SMDP Q-learning • Choose policy π1 using ε-greedy • Compute most likely abstract state s0 given TL, PL, PD • Select action π0 using Pr(Π0 | Π1 = π1, S0 = s0) • [Figure: Taxi DAN with Policy, Action, S1, S0, F, F1, and F0 nodes and observed TL, PL, PD]
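The three Phase 2 steps on slide 9 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the dictionary-based posterior, and the conditional probability table `action_cpt` are hypothetical stand-ins, and in the actual system the posterior over abstract states would come from inference in the learned DAN rather than from a hand-written table.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Return a random index with probability epsilon, else the argmax index."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

def most_likely_abstract_state(posterior):
    """Return the abstract state with the highest posterior probability.

    `posterior` maps abstract-state ids to probabilities; here it is a
    hypothetical placeholder for DAN inference given TL, PL, PD.
    """
    return max(posterior, key=posterior.get)

def sample_action(action_cpt, policy, abstract_state, rng):
    """Sample an action index from Pr(action | policy, abstract state)."""
    probs = action_cpt[(policy, abstract_state)]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)

# Step 1: choose an abstract policy (6 candidates, as on the slide).
q_row = [0.1, 0.7, 0.2, 0.0, 0.0, 0.0]          # made-up Q-values
policy = epsilon_greedy(q_row, epsilon=0.0, rng=rng)

# Step 2: pick the most likely abstract state (made-up posterior).
abstract_state = most_likely_abstract_state({0: 0.1, 3: 0.8, 7: 0.1})

# Step 3: sample a primitive action from the (made-up) conditional table.
action_cpt = {(1, 3): [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]}
action = sample_action(action_cpt, policy, abstract_state, rng)
```

Setting `epsilon=0.0` makes the choice purely greedy, which keeps the example deterministic; a learner would use a small positive value to keep exploring.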

  10. Policy Improvement • Policy learned over DAN policies performs well • Each plot is average over 10 RL runs and 1 EM run

  11. Policy Recognition [Figure labels: Initial, Passenger Loc, Passenger Dest, PU, PD, DAN, Policy 1, Policy 6] • Can (sometimes!) recognize a specific sequence of actions as composing a single policy

  12. Summary • Two-phased method for automating hierarchical RL using graphical models • Advantages • Limited info needed (# of levels, # of values) • Permits continuous and partially observable state/actions • Disadvantages • EM is expensive • Need mentor • Abstractions learned can be hard to decipher (local maxima?)

  13. Future Work • Approximate inference in DANs • Saria & Mahadevan’04: Rao-Blackwellized particle filtering for multi-agent AHMMs • Johns & Mahadevan’05: variational inference for AHMMs • Take advantage of ability to do inference in hierarchical RL phase • Incorporate reward in DAN

  14. Thank You Questions?

  15. Abstract State Transitions: S0 • Regardless of abstract P0 policy being executed, abstract S0 states self-transition with high probability • Depending on abstract P0 policy, may alternatively transition to one of a few abstract S0 states • Similarly for abstract S1 states and abstract P1 policies

  16. State Abstractions Abstract state to which agent is most likely to transition is a consequence, in part, of the learned state abstractions

  17. Semi-MDP Q-learning • $Q(s,o) \leftarrow Q(s,o) + \alpha \left[ r + \gamma^{k} \max_{o' \in O} Q(s',o') - Q(s,o) \right]$ • $Q(s,o)$: activity-value for state $s$ and activity $o$ • $\alpha$: learning rate • $\gamma^{k}$: discount rate $\gamma$ raised to the number of time steps $k$ that $o$ took • $r$: accumulated discounted reward since $o$ began • $s'$: state in which $o$ terminated
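The Semi-MDP Q-learning update on slide 17 can be written as a single tabular step. The sketch below assumes a dictionary-backed Q-table, and the names `smdp_q_update`, `s_next` (the state in which the activity terminated), and `k` (the number of time steps the activity took) are illustrative choices, not taken from the slides. The state labels and reward are made up.

```python
from collections import defaultdict

def smdp_q_update(Q, s, o, r, s_next, k, options, alpha=0.1, gamma=0.9):
    """Apply one SMDP Q-learning update for activity o started in state s.

    r      : discounted reward accumulated while o was running
    s_next : state in which o terminated
    k      : number of primitive time steps o took
    """
    # Best activity-value available from the termination state.
    best_next = max(Q[(s_next, o2)] for o2 in options)
    # Discount the bootstrap term by gamma raised to the activity's duration.
    target = r + (gamma ** k) * best_next
    Q[(s, o)] += alpha * (target - Q[(s, o)])
    return Q[(s, o)]

Q = defaultdict(float)                  # unseen (state, activity) pairs start at 0
options = ["GotoRed", "Pickup"]
Q[("s1", "Pickup")] = 10.0              # made-up value for the next state

new_value = smdp_q_update(
    Q, "s0", "GotoRed", r=-3.0, s_next="s1", k=3,
    options=options, alpha=0.5, gamma=0.9,
)
```

The only difference from one-step Q-learning is the exponent `k`: an activity that runs longer discounts the value of its termination state more heavily.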

  18. Abstract State S1 Transitions • Abstract state S1 transitions under abstract policy P1

  19. Expectation-Maximization (EM) • Hidden variables and unknown parameters • E(xpectation)-step • Assume parameters known and compute the conditional expected values for variables • M(aximization)-step • Assume variables observed and compute the argmax parameters
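The E-step and M-step on slide 19 can be illustrated on a problem far smaller than a DAN. The sketch below is a toy example, not the authors' procedure: it assumes two coins with unknown head probabilities, where each trial of `n_tosses` flips uses one coin chosen with equal prior probability, and the coin identity is the hidden variable. All names and data are made up for illustration.

```python
def em_two_coins(head_counts, n_tosses, theta_a, theta_b, n_iters=20):
    """Estimate head probabilities of two coins when the coin used is hidden."""
    for _ in range(n_iters):
        heads_a = tails_a = heads_b = tails_b = 0.0
        for h in head_counts:
            t = n_tosses - h
            # E-step: with parameters fixed, compute the posterior
            # responsibility of each coin for this trial.
            like_a = theta_a ** h * (1 - theta_a) ** t
            like_b = theta_b ** h * (1 - theta_b) ** t
            w_a = like_a / (like_a + like_b)
            w_b = 1.0 - w_a
            # Accumulate expected head/tail counts per coin.
            heads_a += w_a * h
            tails_a += w_a * t
            heads_b += w_b * h
            tails_b += w_b * t
        # M-step: treat expected counts as observed and re-estimate
        # each parameter by maximum likelihood.
        theta_a = heads_a / (heads_a + tails_a)
        theta_b = heads_b / (heads_b + tails_b)
    return theta_a, theta_b

# Five trials of ten tosses each; only the number of heads is observed.
theta_a, theta_b = em_two_coins([5, 9, 8, 4, 7], n_tosses=10,
                                theta_a=0.6, theta_b=0.5)
```

Different starting values for `theta_a` and `theta_b` can lead to different answers, which is the local-maxima issue the slides list as a disadvantage of EM.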

  20. Abstract State S0 Transitions • Abstract state S0 transitions under abstract policy P0
