1 / 54

Apprenticeship Learning Using Linear Programming

Apprenticeship Learning Using Linear Programming. ICML 2008. Apprenticeship Learning: An apprentice learns to behave by observing an expert. Learning algorithm. Output : Apprentice policy that is at least as good as expert policy (and possibly better).

vprice
Télécharger la présentation

Apprenticeship Learning Using Linear Programming

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Apprenticeship LearningUsingLinear Programming ICML 2008

  2. Apprenticeship Learning: An apprentice learns to behave by observing an expert. Learning algorithm Output: Apprentice policy that is at least as good as expert policy (and possibly better). Input: Demonstrations by expert policy.

  3. Main Contribution A new apprenticeship learning algorithm that: Produces simpler apprentice policies, and Is empirically faster than previous algorithms.

  4. Outline Introduction Apprenticeship Learning Prior Work Summary of Advantages Over Prior Work Background: Occupancy Measure Linear Program for Apprenticeship Learning (LPAL) Experiments and Demos Other Topics

  5. Given:Same as Markov Decision Process, except no reward function R. Also given: Basis reward functions R1, …, Rk. Demonstrations by an expert policy ¼E. Assume: True reward function R is a weighted combination of the k basis reward functions: R(s, a) = iwi* ¢ Ri(s, a) where weight vector w* is unknown. Goal: Find apprentice policy ¼A such that V(¼A) ¸ V(¼E) where value V(¼) of policy ¼ is with respect to unknown reward function. Apprenticeship Learning

  6. Outline Introduction Apprenticeship Learning Prior Work Summary of Advantages Over Prior Work Background: Occupancy Measure Linear Program for Apprenticeship Learning (LPAL) Experiments and Demos Other Topics

  7. Define Vi(¼) = ith “basis value” of ¼ = Value of ¼ with respect to ith basis reward function. Then true value of a policy is a weighted combination of its basis values, i.e. V(¼) = iwi*¢Vi(¼) Proof: Linearity of expectation. Prior Work – Key Idea

  8. Introduced the apprenticeship learning framework. Algorithm Idea: Estimate Vi(¼E) for all i from expert’s demonstrations. Find ¼A such that Vi(¼A) = Vi(¼E) for all i. Theorem: V(¼A) = V(¼E) Algorithm “type”:Geometric Prior Work – Abbeel & Ng (2004)

  9. Assumed that all wi* are non-negative and sum to 1. Algorithm Idea: Estimate Vi(¼E) for all i from expert’s demonstrations. Find ¼A and M such that: Vi(¼A) ¸Vi(¼E) + M for all i, and M is as large as possible. Theorem: V(¼A) ¸ V(¼E), and possibly V(¼A) À V(¼E). Algorithm “type”:Boosting Prior Work – Syed & Schapire (2007)

  10. Outline Introduction Apprenticeship Learning Prior Work Summary of Our Approach Background: Occupancy Measure Linear Program for Apprenticeship Learning (LPAL) Experiments and Demos Other Topics

  11. Same algorithm idea as Syed & Schapire (2007), but formulated as a single linear program, which we give to an off-the-shelf solver. Summary of Our Approach

  12. Previous algorithms: Didn’t actually output a single stationary policy ¼A, but instead output a distributionD over a set of stationary policies, such that E¼ »D[V(¼A)] ¸ V(¼E) Our algorithm: Outputs a single stationary policy ¼A such that V(¼A) ¸ V(¼E), and possibly V(¼A) À V(¼E). Advantage: Apprentice policy is simpler and more intuitive. Advantages of Our Approach A

  13. Advantages of Our Approach • Previous algorithms: Ran for several rounds, and each round required solving a standard MDP (expensive). • Our algorithm: A single linear program. • Advantage: Empirically faster than previous algorithms. • We informally conjecture that this is because it solves the problem “all at once”.

  14. Outline Introduction Apprenticeship Learning Prior Work Summary of Our Approach Background: Occupancy Measure Linear Program for Apprenticeship Learning (LPAL) Experiments and Demos Other Topics

  15. The occupancy measurex¼2R|S||A| of policy ¼is an alternate way of describing how ¼ moves through the state-action space. x¼sa = Expected (discounted) number of visits by policy ¼ to state-action pair (s, a). Example: Suppose policy ¼ visits state-action pair (s, a): With probability 1 at time 1. With probability 1/2 at time 2. With probability 1/3 at time 3. With probability 0 a time ¸ 4. Then x¼sa = 1 + °(1/2) + °2(1/3) Occupancy Measure

  16. Occupancy Measure – Equivalent Representation The relationship between a stationary policy ¼and its occupancy measure x¼is given by: Proof: Left-hand side = Probability of taking action a in state s. Right-hand side = No. of visits to state-action (s, a) No. of visits to state s. Significance: It is easy to recover a stationary policy from its occupancy measure.

  17. Define º(x) ,s,a R(s, a) ¢xsa Then V(¼) = º(x¼). In other words, º(x) is the value of a policy whose occupancy measure is x. Proof: A policy earns reward R(s, a) (suitably discounted) every time it visits state-action pair (s, a). Significance: The value of a policy is a linear function of its occupancy measure. Occupancy Measure – Calculating Value

  18. Occupancy Measure – Bellman Flow Constraints The Bellman flow constraints are a set of constraints that any vector x2R|S||A| must satisfy to be a valid occupancy measure. The Bellman flow constraints say: “Under any policy, the number of visits into a state s must equal the number of visits leaving state s.” s Must be equal

  19. Occupancy Measure – Bellman Flow Constraints In fact, the Bellman flow constraints completely characterize the set of occupancy measures. x satisfies the Bellman flow constraints m x is the occupancy measure of some policy ¼ Significance: The Bellman flow constraints are linear in x. All policies All occupancy measures ¢¼ x ¢ Bellman flow constraints

  20. Outline Introduction Apprenticeship Learning Prior Work Summary of Our Approach Background: Occupancy Measure Linear Program for Apprenticeship Learning (LPAL) Experiments and Demos Other Topics

  21. Derivation of LPAL Algorithm 1. Estimate Vi(¼E) for all i from expert’s demonstrations. 2. Find ¼A and M that solve: max M subject to: Vi(¼A) - Vi(¼E) ¸ M for all i. ¼A is a policy. Start with algorithm idea from Syed & Schapire (2007)

  22. Derivation of LPAL Algorithm 1. Estimate Vi(¼E) for all i from expert’s demonstrations. 2. Find ¼A and M that solve: max M subject to: Vi(¼A) - Vi(¼E) ¸ M for all i. ¼A is a policy. xA satisfies the Bellman flow constraints. xA

  23. Derivation of LPAL Algorithm 1. Estimate Vi(¼E) for all i from expert’s demonstrations. 2. Find ¼A and M that solve: max M subject to: Vi(¼A) - Vi(¼E) ¸ M for all i. ¼A is a policy. xA satisfies the Bellman flow constraints. xA ºi(xA)

  24. Derivation of LPAL Algorithm 1. Estimate Vi(¼E) for all i from expert’s demonstrations. 2. Find¼Aand M that solve: max M subject to: Vi(¼A) - Vi(¼E) ¸ M for all i. ¼A is a policy xA satisfies the Bellman flow constraints. xA ¼A ºi(xA) Vi(¼A) ¼A is a policy.

  25. Derivation of LPAL Algorithm 1. Estimate Vi(¼E) for all i from expert’s demonstrations. 2. Find xA and M that solve: max M subject to: ºi(xA) - Vi(¼E) ¸ M for all i. xA satisfies the Bellman flow constraints.

  26. Derivation of LPAL Algorithm 1. Estimate Vi(¼E) for all i from expert’s demonstrations. 2. Find xA and M that solve: max M subject to: ºi(xA) - Vi(¼E) ¸ M for all i. xA satisfies the Bellman flow constraints. This is a linear program!

  27. Derivation of LPAL Algorithm 1. Estimate Vi(¼E) for all i from expert’s demonstrations. 2. Find xA and M that solve: max M subject to: ºi(xA) - Vi(¼E) ¸ M for all i. xA satisfies the Bellman flow constraints. “Of all occupancy measures …

  28. Derivation of LPAL Algorithm 1. Estimate Vi(¼E) for all i from expert’s demonstrations. 2. Find xA and M that solve: max M subject to: ºi(xA) - Vi(¼E) ¸ M for all i. xA satisfies the Bellman flow constraints. “Of all occupancy measures … find one corresponding to a policy that is better than the expert’s policy …

  29. Derivation of LPAL Algorithm 1. Estimate Vi(¼E) for all i from expert’s demonstrations. 2. Find xA and M that solve: max M subject to: ºi(xA) - Vi(¼E) ¸ M for all i. xA satisfies the Bellman flow constraints. “Of all occupancy measures … find one corresponding to a policy that is better than the expert’s policy … by as much as possible.”

  30. Derivation of LPAL Algorithm 1. Estimate Vi(¼E) for all i from expert’s demonstrations. 2. Find xA and M that solve: max M subject to: ºi(xA) - Vi(¼E) ¸ M for all i xA satisfies the Bellman flow constraints. “Of all occupancy measures … find one corresponding to a policy that is better than the expert’s policy … by as much as possible.” 3. Convert occupancy measure to a stationary policy:

  31. Theorem: V(¼A) ¸ V(¼E), and possibly V(¼A) À V(¼E). (same as Syed & Schapire (2007)) Proof: Almost immediate. Remark: We could have applied the same occupancy measure “trick” to the algorithm idea from Abbeel & Ng (2004), and likewise derived a linear program. LPAL Algorithm

  32. Outline Introduction Apprenticeship Learning Prior Work Summary of Our Approach Background: Occupancy Measure Linear Program for Apprenticeship Learning (LPAL) Experiments and Demos Other Topics

  33. Experiment – Setup • Actions & transitions: North, South, East and West, with 30% chance of moving to a random state. • Basis rewards: One indicator basis reward per region. • Expert: Optimal policy for randomly chosen weight vector w*. “Gridworld” environment, divided into regions Region State

  34. Experiment – Setup • Compare: • Projection algorithm (Abbeel & Ng 2004) • MWAL algorithm (Syed & Schapire 2007) • LPAL algorithm (this work) • Evaluation Metric: Time required to learn apprentice policy whose value is 95% of the optimal value.

  35. Experiment – Results Note: Y-axis is log scale.

  36. Experiment – Results Note: Y-axis is log scale.

  37. Demo – Mimicking the Expert Output of LPAL algorithm Expert

  38. Demo – Improving Upon the Expert Output of LPAL algorithm Expert

  39. Outline Introduction Apprenticeship Learning Prior Work Summary of Our Approach Background: Occupancy Measure Linear Program for Apprenticeship Learning (LPAL) Experiments and Demos Other Topics

  40. Also Discussed in the Paper… • We observe that the MWAL algorithm often performs better than its theory predicts it should. • We have new results explaining this behavior (in preparation).

  41. Constrained MDPs and RL with multiple rewards: Feinberg and Schwartz (1996) Gabor, Kalmar and Szepesvari (1998) Altman (1999) Shelton (2000) Dolgov and Durfree (2005) … Max margin planning: Ratliff, Bagnell and Zinkevich (2006) Related to Our Approach

  42. Recap A new apprenticeship learning algorithm that: Produces simpler apprentice policies, and Is empirically faster than previous algorithms. Thanks! Questions?

  43. Algorithms for finding ¼A: Max-Margin: Based on quadratic programming. Projection: Based on a geometric approach. MWAL: Based on a multiplicative weights approach, similar to boosting. Actually, existing algorithms don’t find a single stationary ¼A, but instead find a distributionD over a set of stationary policies, such that E¼»D[V(¼A)] ¸ V(¼E). Drawbacks: Apprentice policy is nonintuitive and complicated to describe. Algorithms have slow empirical running time. Prior Work – Details

  44. Same algorithm idea: Find ¼A such that Vi(¼A) ¸Vi(¼E) for all i. Algorithm to find ¼A based on linear programming. Allows us to leverage the efficiency of modern LP solvers. Outputs a single stationary policy ¼A such that V(¼A) ¸ V(¼E). Benefits: Apprentice policy is simpler. Algorithm is empirically much faster. This Work

  45. Same algorithm idea: Find ¼A such that Vi(¼A) ¸Vi(¼E) for all i. Algorithm to find ¼A based on linear programming. Allows us to leverage the efficiency of modern LP solvers. Outputs a single stationary policy ¼A such that V(¼A) ¸ V(¼E). Benefits: Apprentice policy is simpler. Algorithm is empirically much faster. This Work

  46. The occupancy measurex¼2R|S||A| of policy ¼is an alternate way of describing how ¼ moves through the state-action space. x¼sa = Expected (discounted) number of visits by policy ¼ to state-action pair (s, a). Occupancy Measure – Equivalent Representation

  47. Given the occupancy measure x¼of a policy ¼, computing the basis values of ¼is easy. Define ºi(x) ,s,a Ri(s, a) ¢xsa Then V_i(\pi) = \nu_i(x^\pi). In other words, \nu_i(x) is the ith basis value of a policy whose occupancy measure is x. Proof: Policy ¼ earns (suitably discounted) reward Ri(s, a) for each visit to state-action pair (s, a). And x¼sa is the expected (and suitably discounted) number of visits by policy ¼ to state-action pair (s, a). Occupancy Measure – Calculating Value

  48. Occupancy Measure – Calculating Value Given the occupancy measure x¼of a policy ¼, computing the basis values of ¼is easy: Vi(¼) = s,a Ri(s, a) ¢x¼(s, a) Proof: Linearity of expectation. For convenience, define: ºi(x) , s,a Ri(s, a) ¢x(s, a)

  49. Define º(x) ,s,a R(s, a) ¢xsa Then V(¼) = º(x¼). In other words, º(x) is the value of a policy whose occupancy measure is x. Proof: Policy ¼ earns (discounted) reward R(s, a) for each visit to state-action pair (s, a). And x¼sa is the expected (discounted) number of visits by policy ¼ to state-action pair (s, a). So V(\pi) is Occupancy Measure – Calculating Value

  50. Occupancy Measure – Bellman Flow Constraints The Bellman flow constraints are a set of linear constraints that define the set of all occupancy measures. x satisfies the Bellman flow constraints m x is the occupancy measure of some policy ¼ All policies All occupancy measures ¢¼ x ¢ Bellman flow constraints

More Related