Hierarchical Reinforcement Learning Mausam [A Survey and Comparison of HRL techniques]
The Outline of the Talk • MDPs and Bellman’s curse of dimensionality. • RL: Simultaneous learning and planning. • Explore avenues to speed up RL. • Illustrate prominent HRL methods. • Compare prominent HRL methods. • Discuss future research. • Summarise
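Decision Making • [Figure: agent-environment loop. The environment sends a percept to the agent, the agent decides "What action next?", and its action goes back to the environment.] • Slide courtesy Dan Weld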
Personal Printerbot • States (S): {loc, has-robot-printout (h-r-po), user-loc, has-user-printout (h-u-po)}, map • Actions (A): {move_n, move_s, move_e, move_w, extend-arm, grab-page, release-pages} • Reward (R): +20 if h-u-po, else -1 • Goal (G): All states with h-u-po true. • Start state: A state with h-u-po false.
Episodic Markov Decision Process • Episodic MDP ≡ MDP with absorbing goals • ⟨S, A, P, R, G, s0⟩ • S: Set of environment states. • A: Set of available actions. • P: Probability transition model, P(s’|s,a)*. • R: Reward model, R(s)*. • G: Absorbing goal states. • s0: Start state. • γ: Discount factor**. • * Markovian assumption. • ** Bounds R for infinite horizon.
Goal of an Episodic MDP • Find a policy (S → A) which maximises expected discounted reward for a fully observable* episodic MDP, if the agent is allowed to execute for an indefinite horizon. • * Non-noisy, complete-information perceptors.
Solution of an Episodic MDP • Define V*(s) : Optimal reward starting in state s. • Value Iteration : Start with an estimate of V*(s) and successively re-estimate it to converge to a fixed point.
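The fixed-point idea can be made concrete with a minimal sketch. It assumes the backup V(s) ← R(s) + γ max_a Σ_s’ P(s’|s,a) V(s’), which matches the R(s) reward model defined earlier, and runs it on a hypothetical four-cell corridor. The corridor, the action names and the function name `value_iteration` are illustrative assumptions, not part of the talk.

```python
def value_iteration(states, actions, P, R, goals, gamma=0.9, tol=1e-8):
    """P[(s, a)] maps next states to probabilities; R[s] is the reward in s."""
    V = {s: 0.0 for s in states}  # initial estimate of V*
    while True:
        delta = 0.0
        for s in states:
            if s in goals:
                new_v = R[s]  # absorbing goal: collect its reward, nothing after
            else:
                new_v = R[s] + gamma * max(
                    sum(p * V[s2] for s2, p in P[(s, a)].items())
                    for a in actions
                )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:  # estimates stopped changing: fixed point reached
            return V


# Hypothetical corridor: cell 3 is the goal (+20), every other cell costs -1.
states = [0, 1, 2, 3]
actions = ["left", "right"]
goals = {3}
R = {0: -1.0, 1: -1.0, 2: -1.0, 3: 20.0}
P = {}
for s in states:
    P[(s, "left")] = {max(s - 1, 0): 1.0}
    P[(s, "right")] = {min(s + 1, 3): 1.0}

V = value_iteration(states, actions, P, R, goals)

# Greedy policy read off the converged values.
policy = {
    s: max(actions, key=lambda a: sum(p * V[s2] for s2, p in P[(s, a)].items()))
    for s in states
    if s not in goals
}
```

Each sweep touches every state once, which is why the cost per iteration grows with |S|.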
Complexity of Value Iteration • Each iteration: polynomial in |S|. • Number of iterations: polynomial in |S|. • Overall: polynomial in |S|. • But |S| itself is exponential in the number of features in the domain*. • * Bellman’s curse of dimensionality.
Learning • Gain knowledge • Gain understanding • Gain skills • Modification of behavioural tendency • [Figure: the environment supplies data to the learning component.]
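Decision Making while Learning* • Gain knowledge • Gain understanding • Gain skills • Modification of behavioural tendency • [Figure: agent-environment loop. Percepts from the environment serve as data for learning; the agent decides "What action next?" and returns an action.] • * Known as Reinforcement Learning.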
Reinforcement Learning • Unknown P and reward R. • Learning component: Estimate the P and R values via data observed from the environment. • Planning component: Decide which actions to take to maximise reward. • Exploration vs. exploitation • GLIE (Greedy in the Limit with Infinite Exploration)
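One common way to obtain GLIE behaviour is ε-greedy action selection with an exploration rate that decays with the number of visits to a state. The sketch below assumes the schedule ε = 1/k on the k-th visit; the function names and the Q-values are hypothetical.

```python
import random


def glie_epsilon(visit_count):
    """Exploration rate for the k-th visit: it tends to 0 (greedy in the limit),
    but decays slowly enough that every action keeps being tried."""
    return 1.0 / max(1, visit_count)


def epsilon_greedy(q_row, epsilon, rng):
    """q_row maps each action to its current Q estimate in one state."""
    if rng.random() < epsilon:
        return rng.choice(sorted(q_row))  # explore: uniformly random action
    return max(q_row, key=q_row.get)      # exploit: current best action


rng = random.Random(0)
q_row = {"move_e": 1.5, "move_w": -0.2}  # hypothetical estimates
first = epsilon_greedy(q_row, glie_epsilon(1), rng)  # epsilon = 1: pure exploration
late = epsilon_greedy(q_row, 0.0, rng)               # limiting case: pure greedy
```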
Learning • Model-based learning: learn the model, then plan. Requires less data, more computation. • Model-free learning: plan without learning an explicit model. Requires a lot of data, less computation.
Q-Learning • Instead of learning P and R, learn Q* directly. • Q*(s,a): Optimal reward starting in s, if the first action is a and the optimal policy is followed afterwards. • Q* directly defines the optimal policy: in each state, pick the action with the maximum Q* value.
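Q-Learning • Given an experience tuple ⟨s, a, s’, r⟩, update: Q(s,a) ← (1 − α) Q(s,a) + α [r + γ max_a’ Q(s’,a’)], where α is the learning rate, the left-hand side is the new estimate of the Q value, and Q(s,a) on the right is the old estimate. • Under suitable assumptions and GLIE exploration, Q-Learning converges to optimal.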
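As a minimal sketch, the standard tabular Q-learning update for one experience tuple ⟨s, a, s’, r⟩ blends the old estimate with a one-step target. The state names, the learning rate `alpha` and the helper name `q_update` are assumptions made for illustration.

```python
from collections import defaultdict


def q_update(Q, s, a, s_next, r, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step for the experience tuple <s, a, s', r>."""
    old_estimate = Q[(s, a)]
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * old_estimate + alpha * target  # new estimate
    return Q[(s, a)]


actions = ["move_e", "move_w"]
Q = defaultdict(float)            # unseen (state, action) pairs start at 0
Q[("cell1", "move_e")] = 10.0     # hypothetical value already learned

# Experience: in cell0, move_e led to cell1 with reward -1.
new_q = q_update(Q, "cell0", "move_e", "cell1", -1.0, actions, alpha=0.5)
```

Here the target is -1 + 0.9 × 10 = 8, so with α = 0.5 the estimate moves from 0 to 4.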
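Semi-MDP: When actions take time. • The Semi-MDP equation: Q*(s,a) = E[r + γ^N max_a’ Q*(s’,a’)], with the expectation taken over the outcomes (s’, r, N) of executing a in s. • Semi-MDP Q-Learning equation: Q(s,a) ← (1 − α) Q(s,a) + α [r + γ^N max_a’ Q(s’,a’)], where the experience tuple is ⟨s, a, s’, r, N⟩, N is the number of time steps a took, and r is the accumulated discounted reward while action a was executing.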
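A minimal sketch of the Semi-MDP variant, assuming the standard form in which the bootstrap term is discounted by γ^N for an action that ran for N steps. The macro-action name, state names and helper names are hypothetical.

```python
from collections import defaultdict


def discounted_return(step_rewards, gamma):
    """r = sum over k of gamma**k * r_k, accumulated while the action ran."""
    return sum((gamma ** k) * r_k for k, r_k in enumerate(step_rewards))


def smdp_q_update(Q, s, a, s_next, r, n_steps, actions, alpha=0.1, gamma=0.9):
    """One Semi-MDP Q-learning step for the tuple <s, a, s', r, N>."""
    target = r + (gamma ** n_steps) * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return Q[(s, a)]


actions = ["go_to_hall_end", "move_w"]
Q = defaultdict(float)
Q[("hall_end", "move_w")] = 10.0      # hypothetical value already learned

step_rewards = [-1.0, -1.0, -1.0]     # the macro-action ran for N = 3 steps
r = discounted_return(step_rewards, gamma=0.5)
new_q = smdp_q_update(
    Q, "hall_start", "go_to_hall_end", "hall_end",
    r, len(step_rewards), actions, alpha=1.0, gamma=0.5,
)
```

With N = 1 this reduces to the ordinary Q-learning update.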
Printerbot • Paul G. Allen Center has 85000 sq ft of space. • Each floor ~ 85000/7 ~ 12000 sq ft. • Discretise location on a floor: 12000 parts. • State space (without map): 2 × 2 × 12000 × 12000 = 576 million states, which is very large. • How do humans do the decision making?
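The size of the flat state space follows directly from the four state variables. The short calculation below spells it out; the Q-table figure additionally assumes the seven actions listed for the Printerbot, and the variable names are illustrative.

```python
has_robot_printout_values = 2   # true / false
has_user_printout_values = 2    # true / false
robot_locations = 12000         # discretised cells on one floor
user_locations = 12000

num_states = (
    has_robot_printout_values
    * has_user_printout_values
    * robot_locations
    * user_locations
)

num_actions = 7                 # four moves, extend-arm, grab-page, release-pages
q_table_entries = num_states * num_actions
```

A flat tabular learner would therefore need roughly four billion Q-values before the map is even considered.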
1. The Mathematical Perspective: A Structure Paradigm • S: Relational MDP • A: Concurrent MDP • P: Dynamic Bayes Nets • R: Continuous-state MDP • G: Conjunction of state variables • V: Algebraic Decision Diagrams • π: Decision List (RMDP)
2. Modular Decision Making • Go out of room • Walk in hallway • Go in the room
2. Modular Decision Making • Humans plan modularly at different granularities of understanding. • Going out of one room is similar to going out of another room. • Navigation steps do not depend on whether we have the print out or not.
3. Background Knowledge • Classical Planners using additional control knowledge can scale up to larger problems. • (E.g. : HTN planning, TLPlan) • What forms of control knowledge can we provide to our Printerbot? • First pick printouts, then deliver them. • Navigation – consider rooms, hallway, separately, etc.
A mechanism that exploits all three avenues : Hierarchies • Way to add a special (hierarchical) structure on different parameters of an MDP. • Draws from the intuition and reasoning in human decision making. • Way to provide additional control knowledge to the system.
Hierarchy • Hierarchy of: Behaviour, Skill, Module, SubTask, Macro-action, etc. • picking the pages • collision avoidance • fetch pages phase • walk in hallway • HRL ≡ RL with temporally extended actions
Hierarchical Algos ≡ Gating Mechanism • Hierarchical Learning • Learning the gating function • Learning the individual behaviours • Learning both • (g is a gate, b_i is a behaviour; this can be a multi-level hierarchy.)
Option: move_e until end of hallway • Start: Any state in the hallway. • Execute: policy as shown. • Terminate: when s is the end of the hallway.
Options [Sutton, Precup, Singh’99] • An option is a well defined behaviour. • o = ⟨I_o, π_o, β_o⟩ • I_o: Set of states (I_o ⊆ S) in which o can be initiated. • π_o(s): Policy (S → A*) when o is executing. • β_o(s): Probability that o terminates in s. • * Can be a policy over lower level options.
Learning • An option is a temporally extended action with a well-defined policy. • The set of options (O) replaces the set of actions (A). • Learning occurs outside options. • Learning over options ≡ Semi-MDP Q-Learning.
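To see why learning over options reduces to Semi-MDP Q-Learning, here is a minimal sketch of an option as the triple (initiation set, policy, termination probability). Running it yields the (s’, r, N) part of a Semi-MDP experience tuple. The hallway, the `step` function and all names are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, Set


@dataclass
class Option:
    initiation_set: Set[int]              # states in which the option may start
    policy: Callable[[int], str]          # action to take while executing
    termination: Callable[[int], float]   # probability of stopping in a state


def run_option(option, s, step, gamma=0.9, rng=random):
    """Execute the option from s until it terminates; return (s', r, N)."""
    assert s in option.initiation_set
    total, n = 0.0, 0
    while True:
        a = option.policy(s)
        s, reward = step(s, a)
        total += (gamma ** n) * reward    # accumulate discounted reward
        n += 1
        if rng.random() < option.termination(s):
            return s, total, n


def step(s, a):
    """Hypothetical hallway of cells 0..4: move_e goes one cell east, cost -1."""
    assert a == "move_e"
    return min(s + 1, 4), -1.0


go_east = Option(
    initiation_set={0, 1, 2, 3},
    policy=lambda s: "move_e",
    termination=lambda s: 1.0 if s == 4 else 0.0,  # stop at the end of the hallway
)

s_next, r, n = run_option(go_east, 1, step, gamma=0.5)
```

The learner only sees the start state, the option chosen and this (s’, r, N) outcome, so nothing inside the option needs to be learned.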
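[Figure: Machine for move_e + collision avoidance. The top-level machine executes move_e; on an obstacle, a Choose node calls sub-machine M1 or M2; at the end of the hallway it returns. The two sub-machines are the action sequences move_w, move_n, move_n, Return and move_w, move_s, move_s, Return.]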
Hierarchies of Abstract Machines [Parr, Russell’97] • A machine is a partial policy represented by a Finite State Automaton. • Node types: • Execute a ground action. • Call a machine as a subroutine. • Choose the next node. • Return to the calling machine.
Learning • Learning occurs within machines, as machines are only partially defined. • Flatten all machines out and consider states [s,m], where s is a world state and m a machine node ≡ MDP. • reduce(S∘M): Consider only states where the machine node is a choice node ≡ Semi-MDP. • Learning ≈ Semi-MDP Q-Learning
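The reduction to choice points can be illustrated with a small counting sketch. It assumes a hypothetical five-cell hallway and a single machine with five nodes, and it ignores reachability, so it only shows how restricting to choice nodes shrinks the set of states where something must be learned.

```python
from itertools import product

world_states = range(5)  # hypothetical hallway cells

# Hypothetical machine: node name -> node type.
machine_nodes = {
    "go_east": "action",
    "pick_detour": "choice",
    "call_M1": "call",
    "call_M2": "call",
    "done": "return",
}

# Flattened state space: every pair [s, m].
joint_states = list(product(world_states, machine_nodes))

# Reduced state space: keep only pairs whose machine node is a choice node.
choice_states = [(s, m) for s, m in joint_states if machine_nodes[m] == "choice"]
```

Between two consecutive choice states the machine runs without any decision, which is why the reduced process is a Semi-MDP.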
Task Hierarchy: MAXQ Decomposition [Dietterich’00] • [Figure: task hierarchy with Root at the top; subtasks Fetch and Deliver; below them Take, Give and Navigate(loc); primitive actions Extend-arm, Grab, Release, move_n, move_s, move_w, move_e.] • Children of a task are unordered.
MAXQ Decomposition • Augment the state s by adding the subtask i: [s,i]. • Define C([s,i],j) as the reward received in i after j finishes. • Q([s,Fetch],Navigate(prr)) = V([s,Navigate(prr)]) + C([s,Fetch],Navigate(prr))*, where V([s,Navigate(prr)]) is the reward received while navigating and C([s,Fetch],Navigate(prr)) is the reward received after navigation. • Express V in terms of C. • Learn C, instead of learning Q. • * Observe the context-free nature of the Q-value.
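A minimal sketch of how V is expressed in terms of C: the value of a composite task is the best Q over its children, and each Q is the child's own value plus the completion term. The parent-child links, the single state "s0" and every number below are assumptions chosen for illustration.

```python
def v(task, s, children, C, primitive_v):
    """Value of doing `task` from state s."""
    if task not in children:              # primitive action: stored directly
        return primitive_v[(task, s)]
    return max(q(task, s, j, children, C, primitive_v) for j in children[task])


def q(parent, s, child, children, C, primitive_v):
    """Q([s, parent], child) = V([s, child]) + C([s, parent], child)."""
    return v(child, s, children, C, primitive_v) + C[(parent, s, child)]


# Hypothetical fragment of a task hierarchy.
children = {
    "Fetch": ["Navigate", "Take"],
    "Navigate": ["move_n", "move_e"],
    "Take": ["extend-arm", "grab"],
}
primitive_v = {(a, "s0"): -1.0 for a in ["move_n", "move_e", "extend-arm", "grab"]}
C = {
    ("Navigate", "s0", "move_n"): -5.0,
    ("Navigate", "s0", "move_e"): -2.0,
    ("Take", "s0", "extend-arm"): -1.0,
    ("Take", "s0", "grab"): 0.0,
    ("Fetch", "s0", "Navigate"): -2.0,
    ("Fetch", "s0", "Take"): -10.0,
}

v_navigate = v("Navigate", "s0", children, C, primitive_v)
q_fetch_navigate = q("Fetch", "s0", "Navigate", children, C, primitive_v)
v_fetch = v("Fetch", "s0", children, C, primitive_v)
```

Only the C table and the primitive values are stored; the value of Navigate is computed once and could be reused by any parent that calls it.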
1. State Abstraction • Abstract state: A state having fewer state variables; different world states map to the same abstract state. • If we can drop some state variables, we can reduce the learning time considerably! • We may use different abstract states for different macro-actions.
State Abstraction in MAXQ • Relevance: Only some variables are relevant for the task. • Fetch: user-loc irrelevant. • Navigate(printer-room): h-r-po, h-u-po, user-loc irrelevant. • Fewer params for V of lower levels. • Funnelling: A subtask maps many states to a smaller set of states. • Fetch: All states map to h-r-po=true, loc=pr.room. • Fewer params for C of higher levels.
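The relevance idea can be sketched by projecting each world state onto the variables a subtask cares about and counting the distinct abstract states. The tiny domain (four locations) and the helper name `abstract` are assumptions; the relevant-variable lists follow the Fetch and Navigate examples above.

```python
from itertools import product

# Hypothetical small Printerbot domain.
variables = {
    "loc": range(4),
    "has_robot_printout": [False, True],
    "has_user_printout": [False, True],
    "user_loc": range(4),
}

# Variables each subtask still needs once irrelevant ones are dropped.
relevant = {
    "Navigate": ["loc"],
    "Fetch": ["loc", "has_robot_printout", "has_user_printout"],
}


def abstract(state, task):
    """Project a world state onto the variables relevant for `task`."""
    return tuple(state[name] for name in relevant[task])


all_states = [dict(zip(variables, values)) for values in product(*variables.values())]

abstract_counts = {
    task: len({abstract(state, task) for state in all_states})
    for task in relevant
}
```

In this toy domain the 64 world states collapse to 4 abstract states for Navigate and 16 for Fetch, so each subtask needs far fewer parameters.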
State Abstraction in Options, HAM • Options: Learning required only in states that are terminal states for some option. • HAM: Original work has no abstraction. • Extension: Three-way value decomposition*: Q([s,m],n) = V([s,n]) + C([s,m],n) + C_ex([s,m]) • Similar abstractions are employed. • * [Andre, Russell’02]
2. Optimality Hierarchical Optimality vs. Recursive Optimality
Optimality • Options: Hierarchical • Use (A ∪ O): Global** • Interrupt options • HAM: Hierarchical* • MAXQ: Recursive* • Interrupt subtasks • Use pseudo-rewards • Iterate! • * Can define equations for both optimalities. • ** Advantage of using macro-actions may be lost.
3. Language Expressiveness • Option • Can only input a complete policy • HAM • Can input a complete policy. • Can input a task hierarchy. • Can represent “amount of effort”. • Later extended to partial programs. • MAXQ • Cannot input a policy (full/partial)
4. Knowledge Requirements • Options • Requires complete specification of policy. • One could learn option policies – given subtasks. • HAM • Medium requirements • MAXQ • Minimal requirements
5. Models advanced • Options : Concurrency • HAM : Richer representation, Concurrency • MAXQ : Continuous time, state, actions; Multi-agents, Average-reward. • In general, more researchers have followed MAXQ • Less input knowledge • Value decomposition
6. Structure Paradigm • S: Options, MAXQ • A: All • P: None • R: MAXQ • G: All • V: MAXQ • π: All
Directions for Future Research • Bidirectional State Abstractions • Hierarchies over other RL research • Model based methods • Function Approximators • Probabilistic Planning • Hierarchical P and Hierarchical R • Imitation Learning
Directions for Future Research • Theory • Bounds (goodness of hierarchy) • Non-asymptotic analysis • Automated Discovery • Discovery of Hierarchies • Discovery of State Abstraction • Apply…
Applications • Toy Robot • Flight Simulator • AGV Scheduling • Keepaway soccer • [Figure: layout with a parts warehouse, assemblies, and stations labelled P1 to P4 and D1 to D4.] • Images courtesy various sources