Apprenticeship Learning for Robotic Control, with Applications to Quadruped Locomotion and Autonomous Helicopter Flight Pieter Abbeel Stanford University (Variation hereof) presented at Cornell/USC/UCSD/Michigan/UNC/Duke/UCLA/UW/EPFL/Berkeley/CMU Winter/Spring 2008
In collaboration with: Andrew Y. Ng, Adam Coates, J. Zico Kolter, Morgan Quigley, Dmitri Dolgov, Sebastian Thrun.
Big picture and key challenges • Dynamics model Psa: probability distribution over next states given current state and action. • Reward function R: describes the desirability of being in a state. • Reinforcement learning / optimal control takes the dynamics model and the reward function and produces a controller/policy π, which prescribes the action to take for each state. • Key challenges: • Providing a formal specification of the control task. • Building a good dynamics model. • Finding closed-loop controllers.
Overview • Apprenticeship learning algorithms • Leverage expert demonstrations to learn to perform a desired task. • Formal guarantees • Running time • Sample complexity • Performance of resulting controller • Enabled us to solve highly challenging, previously unsolved, real-world control problems in • Quadruped locomotion • Autonomous helicopter flight
Problem setup • Input: • Dynamics model / Simulator Psa(st+1 | st, at) • No reward function • Teacher’s demonstration: s0, a0, s1, a1, s2, a2, … (= trace of the teacher’s policy π*) • Desired output: • Policy π, which (ideally) has performance guarantees, i.e., U_R*(π) ≥ U_R*(π*) − ε, where U_R(π) is the expected sum of rewards R collected when running π. • Note: R* is unknown.
Prior work: behavioral cloning • Formulate as standard machine learning problem • Fix a policy class • E.g., support vector machine, neural network, decision tree, deep belief net, … • Estimate a policy from the training examples (s0, a0), (s1, a1), (s2, a2), … • E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002.
Prior work: behavioral cloning • Limitations: • Fails to provide strong performance guarantees • Underlying assumption: policy simplicity
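The supervised-learning formulation of behavioral cloning can be sketched in a few lines. This is a minimal illustration, not any of the cited systems: the policy class is a 1-nearest-neighbor lookup (a stand-in for the SVMs, neural networks, or decision trees named on the slide), and the one-dimensional "lane offset" states and steering actions are hypothetical.

```python
def fit_nearest_neighbor_policy(demos):
    """Behavioral cloning with a 1-nearest-neighbor policy class.

    demos: list of (state, action) pairs, where state is a tuple of floats.
    Returns a function mapping a state to the action the teacher took
    in the closest demonstrated state.
    """
    examples = list(demos)

    def policy(state):
        def squared_distance(example):
            demo_state, _ = example
            return sum((x - y) ** 2 for x, y in zip(demo_state, state))

        _, action = min(examples, key=squared_distance)
        return action

    return policy


# Hypothetical teacher data: state = (lateral offset from lane center,).
demos = [((-1.0,), "right"), ((-0.5,), "right"), ((0.5,), "left"), ((1.0,), "left")]
cloned_policy = fit_nearest_neighbor_policy(demos)
```

The cloned policy only imitates state-action pairs; nothing in it encodes what the teacher was trying to achieve, which is the limitation the slide points at.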
Problem structure • Dynamics model Psa and reward function R feed into reinforcement learning / optimal control, which outputs the controller/policy π. • Controller/policy π prescribes the action to take for each state: typically very complex. • Reward function R: often fairly succinct.
Apprenticeship learning [Abbeel & Ng, 2004]
• Assume the reward function is linear in known features: Rw(s) = w · φ(s).
• Initialize: pick some controller π0.
• Iterate for i = 1, 2, … :
• “Guess” the reward function: find a reward function Rw such that the teacher maximally outperforms all previously found controllers.
• Find the optimal control policy πi for the current guess of the reward function Rw.
• If the margin by which the teacher outperforms the previously found controllers is at most ε, exit the algorithm: there is no reward function for which the teacher significantly outperforms the thus-far found policies.
• Key idea: learning through reward functions rather than directly learning policies.
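The iteration above can be sketched on a toy problem. This is a sketch under stated assumptions, not the authors' implementation: it uses the simpler projection variant described in Abbeel & Ng (2004) in place of the max-margin "guess" step, and the 5-state chain world, one-hot features `phi`, discount `GAMMA`, and all function names are hypothetical.

```python
import math

N_STATES, GAMMA, HORIZON = 5, 0.9, 200
ACTIONS = (-1, +1)  # move left / move right along a chain of states


def step(s, a):
    """Deterministic toy dynamics: move along the chain, clipped at the ends."""
    return min(max(s + a, 0), N_STATES - 1)


def phi(s):
    """One-hot reward features (an assumption made for this sketch)."""
    return [1.0 if i == s else 0.0 for i in range(N_STATES)]


def feature_expectations(policy, start=0):
    """Discounted sum of features along the policy's trajectory from `start`."""
    mu, s = [0.0] * N_STATES, start
    for t in range(HORIZON):
        mu = [m + GAMMA ** t * f for m, f in zip(mu, phi(s))]
        s = step(s, policy[s])
    return mu


def optimal_policy(w, sweeps=500):
    """Value iteration for the guessed reward R_w(s) = w . phi(s)."""
    reward = [sum(wi * fi for wi, fi in zip(w, phi(s))) for s in range(N_STATES)]
    V = [0.0] * N_STATES
    for _ in range(sweeps):
        V = [reward[s] + GAMMA * max(V[step(s, a)] for a in ACTIONS)
             for s in range(N_STATES)]
    return [max(ACTIONS, key=lambda a: V[step(s, a)]) for s in range(N_STATES)]


def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def apprenticeship_learning(mu_expert, initial_policy, eps=1e-6, max_iters=50):
    policies = [initial_policy]
    mu_bar = feature_expectations(initial_policy)
    for _ in range(max_iters):
        # Reward guess: the direction in which the teacher most outperforms us.
        w = [e - m for e, m in zip(mu_expert, mu_bar)]
        if distance(mu_expert, mu_bar) <= eps:
            break  # no reward lets the teacher significantly outperform us
        policy = optimal_policy(w)
        policies.append(policy)
        # Projection step: move mu_bar toward the new policy's feature expectations.
        d = [a - b for a, b in zip(feature_expectations(policy), mu_bar)]
        denom = sum(x * x for x in d)
        if denom == 0.0:
            break
        coef = sum(x * y for x, y in zip(d, w)) / denom
        mu_bar = [m + coef * x for m, x in zip(mu_bar, d)]
    return policies, distance(mu_expert, mu_bar)


teacher_policy = [+1] * N_STATES  # hypothetical teacher: always move right
mu_expert = feature_expectations(teacher_policy)
policies, margin = apprenticeship_learning(mu_expert, [-1] * N_STATES)
```

On this toy chain the first reward guess already makes "always move right" optimal, so the loop exits after one policy update. In general the procedure returns a set of policies, and the final margin measures how closely their feature expectations can be combined to match the teacher's.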
Theoretical guarantees • Guarantee w.r.t. unrecoverable reward function of teacher. • Sample complexity does not depend on complexity of teacher’s policy π*.
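One way to see why a guarantee is possible even though R* is never recovered is the following short derivation. It is a sketch under assumptions taken from Abbeel & Ng (2004) rather than from this slide: the reward is linear in known features, R*(s) = w* · φ(s) with ‖w*‖₂ ≤ 1, γ is a discount factor, and μ(π) denotes the feature expectations of policy π.

```latex
\begin{align*}
\mu(\pi) &= \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \,\Big|\, \pi\Big],
\qquad
U_{w^*}(\pi) = \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \,\Big|\, \pi\Big]
             = {w^*}^{\top} \mu(\pi), \\
\big| U_{w^*}(\pi) - U_{w^*}(\pi^*) \big|
&= \big| {w^*}^{\top} \big(\mu(\pi) - \mu(\pi^*)\big) \big| \\
&\le \|w^*\|_2 \, \|\mu(\pi) - \mu(\pi^*)\|_2
\;\le\; \|\mu(\pi) - \mu(\pi^*)\|_2 .
\end{align*}
```

So any policy whose feature expectations are within ε of the teacher's performs within ε of the teacher on the unknown reward, whatever w* happens to be.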
Related work • Prior work: • Behavioral cloning. (covered earlier) • Utility elicitation / Inverse reinforcement learning, Ng & Russell, 2000. • No strong performance guarantees. • Closely related later work: • Ratliff et al., 2006, 2007; Neu & Szepesvari, 2007; Syed & Schapire, 2008. • Work on specialized reward function: trajectories. • E.g., Atkeson & Schaal, 1997.
Highway driving [Videos: teacher in training world; learned policy in testing world] • Input: • Dynamics model / Simulator Psa(st+1 | st, at) • Teacher’s demonstration: 1 minute in “training world” • Note: R* is unknown. • Reward features: 5 features corresponding to lanes/shoulders; 10 features corresponding to presence of other car in current lane at different distances
More driving examples Driving demonstration Learned behavior Driving demonstration Learned behavior In each video, the left sub-panel shows a demonstration of a different driving “style”, and the right sub-panel shows the behavior learned from watching the demonstration.
Parking lot navigation • Reward function trades off: • curvature, • smoothness, • distance to obstacles, • alignment with principal directions. [Abbeel et al., submitted]
Experimental setup • Demonstrate parking lot navigation on “train parking lot.” • Run our apprenticeship learning algorithm to find a set of reward weights w. • Receive “test parking lot” map + starting point and destination. • Find a policy for navigating the test parking lot. Learned reward weights
Quadruped • Reward function trades off 25 features. [Kolter, Abbeel & Ng, 2008]
Experimental setup • Demonstrate path across the “training terrain.” • Run our apprenticeship learning algorithm to find a set of reward weights w. • Receive the “testing terrain” as a height map. • Find a policy for crossing the testing terrain. Learned reward weights
Apprenticeship learning • Teacher’s flight (s0, a0, s1, a1, …) → Learn R → Reward function R. • Dynamics model Psa + reward function R → reinforcement learning / optimal control → controller π.
Motivating example • Two routes to an accurate dynamics model Psa: • Textbook model plus specification. • Collect flight data and learn the model from data. • How to fly for data collection? • How to ensure that the entire flight envelope is covered?
Desired properties • Never any explicit exploration (neither manual nor autonomous). • Near-optimal performance (compared to teacher). • Small number of teacher demonstrations. • Small number of autonomous trials.
Apprenticeship learning of the model • Teacher’s flight (s0, a0, s1, a1, …) → Learn Psa → Dynamics model Psa. • Autonomous flight (s0, a0, s1, a1, …) → Learn Psa → Dynamics model Psa. • Dynamics model Psa + reward function R → reinforcement learning / optimal control → controller π → autonomous flight.
Theoretical guarantees No explicit exploration required. [Abbeel & Ng, 2005]
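Apprenticeship learning summary • Teacher’s flight (s0, a0, s1, a1, …) → Learn Psa → Dynamics model Psa. • Autonomous flight (s0, a0, s1, a1, …) → Learn Psa → Dynamics model Psa. • Teacher’s flight → Learn R → Reward function R. • Dynamics model Psa + reward function R → reinforcement learning / optimal control → controller π → autonomous flight.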
Other relevant parts of the story • Learning the dynamics model from data • Locally weighted models • Exploiting structure from physics • Simulation accuracy at time-scales relevant for control • Reinforcement learning / Optimal control • Model predictive control • Receding horizon differential dynamic programming [Abbeel et al. 2005, 2006a, 2006b, 2007]
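As a rough illustration of what a "locally weighted model" means, here is a generic one-dimensional locally weighted linear regression. It is a textbook sketch, not the helicopter model from the cited papers; the Gaussian kernel, the `bandwidth` value, and the input/response data are all assumptions made for the example.

```python
import math


def locally_weighted_predict(xs, ys, query, bandwidth=0.2):
    """Fit a weighted straight line around `query` and evaluate it there.

    Training points close to `query` get weights near 1; distant points
    get weights near 0, so the fit adapts to the local shape of the data.
    """
    weights = [math.exp(-((x - query) ** 2) / (2 * bandwidth ** 2)) for x in xs]
    total = sum(weights)
    x_mean = sum(w * x for w, x in zip(weights, xs)) / total
    y_mean = sum(w * y for w, y in zip(weights, ys)) / total
    cov = sum(w * (x - x_mean) * (y - y_mean) for w, x, y in zip(weights, xs, ys))
    var = sum(w * (x - x_mean) ** 2 for w, x in zip(weights, xs))
    slope = cov / var
    return y_mean + slope * (query - x_mean)


# Hypothetical data: control input -> observed response.
inputs = [0.0, 0.25, 0.5, 0.75, 1.0]
linear_response = [2.0 * u + 1.0 for u in inputs]
curved_response = [u ** 2 for u in inputs]

linear_prediction = locally_weighted_predict(inputs, linear_response, 0.3)
curved_prediction = locally_weighted_predict(inputs, curved_response, 0.5)
```

When the data are exactly linear the local fit reproduces the line; on curved data it returns a local approximation whose quality depends on the bandwidth and on how densely the neighborhood of the query was sampled.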
Related work • Bagnell & Schneider, 2001; LaCivita, Papageorgiou, Messner & Kanade, 2002; Ng, Kim, Jordan & Sastry 2004a (2001); • Roberts, Corke & Buskey, 2003; Saripalli, Montgomery & Sukhatme, 2003; Shim, Chung, Kim & Sastry, 2003; Doherty et al., 2004. • Gavrilets, Martinos, Mettler and Feron, 2002; Ng et al., 2004b. • Maneuvers presented here are significantly more challenging than those flown by any other autonomous helicopter.
Autonomous aerobatic flips (attempt) before apprenticeship learning • Task description: meticulously hand-engineered. • Model: learned from (commonly used) frequency sweeps data.
Experimental setup for helicopter
1. Our expert pilot demonstrates the airshow several times.
2. Learn a reward function (the target trajectory).
3. Learn a dynamics model.
4. Find the optimal control policy for the learned reward and dynamics model.
5. Autonomously fly the airshow.
6. Learn an improved dynamics model. Go back to step 4.
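The plan / fly / re-learn loop in this setup can be sketched on a toy problem. Everything here is hypothetical and far simpler than a helicopter: a 7-cell chain world, a "walk" and a "jump" action, a real system in which one jump silently fails, and a tabular model that falls back on a nominal guess wherever it has no data. The sketch only illustrates the structure of the loop, with no explicit exploration step.

```python
GOAL = 6
ACTIONS = (1, 2)  # 1 = walk one cell, 2 = jump two cells


def real_step(s, a):
    """Hypothetical real system: jumping from cell 2 fails and leaves the robot in place."""
    if s == 2 and a == 2:
        return s
    return min(s + a, GOAL)


def model_step(data, s, a):
    """Learned model: an observed transition if available, else a nominal guess."""
    return data.get((s, a), min(s + a, GOAL))


def plan(data):
    """Optimal controller for the current model: reach GOAL in the fewest steps."""
    dist = [float("inf")] * (GOAL + 1)
    dist[GOAL] = 0
    for _ in range(GOAL + 1):  # enough sweeps for a chain this short
        for s in range(GOAL):
            dist[s] = 1 + min(dist[model_step(data, s, a)] for a in ACTIONS)
    return {s: min(ACTIONS, key=lambda a: dist[model_step(data, s, a)])
            for s in range(GOAL)}


def fly(policy, data, max_steps=20):
    """Run the controller on the real system, logging every transition into `data`."""
    s, steps = 0, 0
    while s != GOAL and steps < max_steps:
        a = policy[s]
        predicted, actual = model_step(data, s, a), real_step(s, a)
        data[(s, a)] = actual
        steps += 1
        if actual != predicted:
            return False, steps  # the model was wrong here; the log now corrects it
        s = actual
    return s == GOAL, steps


# Teacher demonstration: walk one cell at a time from cell 0 to GOAL (6 steps).
data = {(s, 1): real_step(s, 1) for s in range(GOAL)}

for trials in range(1, 11):
    success, steps = fly(plan(data), data)  # plan on the model, then fly for real
    if success:
        break
```

In this toy run the first autonomous trial fails at the mis-modeled jump, which is exactly the transition that gets added to the data; the second trial succeeds and is faster than the teacher's demonstration.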
Accuracy White: target trajectory. Black: autonomously flown trajectory.
Summary • Apprenticeship learning algorithms • Learn to perform a task from observing expert demonstrations of the task. • Formal guarantees • Running time. • Sample complexity. • Performance of resulting controller. • Enabled us to solve highly challenging, previously unsolved, real-world control problems.
Current and future work • Applications: • Autonomous helicopters to assist in wildland fire fighting. • Fixed-wing formation flight: Estimated fuel savings for three-aircraft formation: 20%. • Learning from demonstrations only scratches the surface of the potential impact of work at the intersection of machine learning and control on robotics. • Safe autonomous learning. • More general advice taking.
Model Learning: Proof Idea • From initial pilot demonstrations, our model/simulator Psa will be accurate for the part of the state space (s,a) visited by the pilot. • Our model/simulator will correctly predict the helicopter’s behavior under the pilot’s controller π*. • Consequently, there is at least one controller (namely π*) that looks capable of flying the helicopter well in our simulation. • Thus, each time we solve for the optimal controller using the current model/simulator Psa, we will find a controller that successfully flies the helicopter according to Psa. • If, on the actual helicopter, this controller fails to fly the helicopter, despite the model Psa predicting that it should, then it must be visiting parts of the state space that are inaccurately modeled. • Hence, we get useful training data to improve the model. This can happen only a small number of times.