
An Application of Reinforcement Learning to Autonomous Helicopter Flight

Presentation Transcript


  1. An Application of Reinforcement Learning to Autonomous Helicopter Flight Pieter Abbeel, Adam Coates, Morgan Quigley and Andrew Y. Ng Stanford University

  2. Overview • Autonomous helicopter flight is widely accepted to be a highly challenging control/reinforcement learning (RL) problem. • Human expert pilots significantly outperform autonomous helicopters. • Apprenticeship learning algorithms use expert demonstrations to obtain good controllers. • Our experimental results significantly extend the state of the art in autonomous helicopter aerobatics.

  3. Apprenticeship learning and RL • Hard to specify the reward function for complex tasks such as helicopter aerobatics. • Unknown dynamics: flight data is required to obtain an accurate model. • Apprenticeship learning: uses an expert demonstration to help select the model and the reward function. [Diagram: dynamics model Psa and reward function R feed into reinforcement learning, which outputs a control policy π.]

  4. Learning the dynamical model • State of the art: the E3 algorithm, Kearns and Singh (2002), and its variants/extensions (Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002). [Diagram: "Have a good model of the dynamics?" — if NO, "explore"; if YES, "exploit".]

  5. Learning the dynamical model • State of the art: the E3 algorithm, Kearns and Singh (2002), and its variants/extensions (Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002). • Exploration policies are impractical: they do not even try to perform well. • Can we avoid explicit exploration and just exploit? [Diagram: "Have a good model of the dynamics?" — if NO, "explore"; if YES, "exploit".]

  6. Aggressive manual exploration

  7. Apprenticeship learning of the model [Diagram: trajectories (a1, s1, a2, s2, a3, s3, …) from both expert human pilot flight and autonomous flight are used to learn the dynamics model Psa, which together with the reward function R is fed to reinforcement learning to produce a control policy π.] Theorem. The described procedure will return a policy as good as the expert’s policy in a polynomial number of iterations. [Abbeel & Ng, 2005]
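A minimal sketch of the alternation this slide describes: fit the dynamics model to all flight data gathered so far, compute a policy against that model, fly the policy, and fold the new data back in. The helper names (`fit_dynamics_model`, `run_rl`, `fly_policy`) are illustrative placeholders, not from the paper.

```python
# Sketch of the apprenticeship-learning loop for the dynamics model
# (in the spirit of Abbeel & Ng, 2005). Helper functions are placeholders.

def apprenticeship_model_learning(expert_trajectories, reward_fn, n_iters=10):
    """Alternate between fitting the dynamics model and flying the resulting
    policy, adding the newly collected autonomous flight data each round."""
    data = list(expert_trajectories)          # (state, action, next_state) triples
    for _ in range(n_iters):
        model = fit_dynamics_model(data)      # e.g. regularized regression on the data
        policy = run_rl(model, reward_fn)     # any RL / optimal-control solver
        autonomous_traj = fly_policy(policy)  # collect data under the learned policy
        data.extend(autonomous_traj)          # no explicit exploration policy needed
    return policy, model
```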

  8. Learning the dynamics model • Details of algorithm for learning dynamics model: • Gravity subtraction [Abbeel, Ganapathi & Ng, 2005] • Lagged criterion [Abbeel & Ng, 2004]
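As a rough illustration of the gravity-subtraction idea (the exact formulation is in Abbeel, Ganapathi & Ng, 2005): subtract the known gravitational acceleration, expressed in the body frame, from the measured accelerations, so the regression only has to model the unknown aerodynamic effects. The frame and sign conventions below are assumptions.

```python
import numpy as np

def subtract_gravity(accel_body, R_body_to_world, g=9.81):
    """Remove the known gravity contribution from measured body-frame
    accelerations, leaving only the part the learned model must explain.
    Simplified sketch; assumes a z-up world frame and a rotation matrix
    R_body_to_world mapping body coordinates to world coordinates."""
    g_world = np.array([0.0, 0.0, -g])        # gravitational acceleration, world frame
    g_body = R_body_to_world.T @ g_world      # the same vector in the body frame
    return accel_body - g_body                # residual acceleration to be modeled
```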

  9. Autonomous nose-in funnel

  10. Autonomous tail-in funnel

  11. Apprenticeship learning: reward Hard to specify the reward function for complex tasks such as helicopter aerobatics. Dynamics Model Psa Reinforcement Learning Reward Function R Control policy p

  12. Example task: flip • Ideal flip: rotate 360 degrees around the horizontal axis going from right to left through the helicopter. [Figure: eight snapshots (1–8) of the helicopter through the flip, showing the thrust vector T and the gravity vector g at each orientation.]

  13. Example task: flip (2) • Specify the flip task as: an idealized trajectory, plus a reward function that penalizes deviation from it.

  14. Example of a bad reward function

  15. Apprenticeship learning for the reward function • Our approach: observe the expert’s demonstration of the task, then infer the reward function from the demonstration. [see also Ng & Russell, 2000] • Algorithm: iterate for t = 1, 2, … • Inverse RL step: estimate the expert’s reward function R(s) = w^T φ(s) such that under R(s) the expert outperforms all previously found policies {π_i}. • RL step: compute the optimal policy π_t for the estimated reward function.
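For concreteness, here is a sketch of the projection variant of this inverse-RL loop (Abbeel & Ng, 2004), phrased in terms of feature expectations μ(π) = E[Σ_t γ^t φ(s_t)]. The helpers `compute_optimal_policy` (an RL solver for reward w^T φ) and `feature_expectations` (e.g. a Monte Carlo estimate of μ) are assumed, not part of the original slides.

```python
import numpy as np

def apprenticeship_irl(mu_expert, compute_optimal_policy, feature_expectations,
                       eps=1e-3, max_iters=50):
    """Projection variant of the Abbeel & Ng (2004) algorithm (a sketch).
    compute_optimal_policy(w) returns the optimal policy for R(s) = w^T phi(s);
    feature_expectations(pi) estimates mu(pi), the discounted feature counts."""
    # Start from the feature expectations of an arbitrary initial policy.
    pi = compute_optimal_policy(np.random.randn(len(mu_expert)))
    mu_bar = feature_expectations(pi)
    for _ in range(max_iters):
        w = mu_expert - mu_bar                 # inverse-RL step: reward weights
        if np.linalg.norm(w) <= eps:           # expert is (nearly) matched
            break
        pi = compute_optimal_policy(w)         # RL step for the estimated reward
        mu = feature_expectations(pi)
        # Projection update: move mu_bar toward mu along the line segment.
        d = mu - mu_bar
        mu_bar = mu_bar + (d @ w) / (d @ d) * d
    return pi, w
```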

  16. Theoretical Results: Convergence • Theorem. After a number of iterations polynomial in the number of features and the horizon, the algorithm outputs a policy π that performs nearly as well as the expert, as evaluated on the unknown reward function R*(s) = w*^T φ(s). [Abbeel & Ng, 2004]
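The key step behind this guarantee (under the standard assumption that the true weights satisfy ‖w*‖₂ ≤ 1): since V(π) = w*^T μ(π), matching the expert’s feature expectations bounds the performance gap by Cauchy-Schwarz. Written here with a discount factor for concreteness:

```latex
\left| V(\pi) - V(\pi_E) \right|
  = \left| {w^*}^{\top}\!\left( \mu(\pi) - \mu(\pi_E) \right) \right|
  \le \|w^*\|_2 \, \left\| \mu(\pi) - \mu(\pi_E) \right\|_2
  \le \varepsilon,
\qquad \text{where } \mu(\pi) = \mathbb{E}\!\left[ \textstyle\sum_t \gamma^t \phi(s_t) \,\middle|\, \pi \right].
```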

  17. Overview [Diagram: dynamics model Psa and reward function R feed into reinforcement learning, which outputs a control policy π.]

  18. Optimal control algorithm • Differential dynamic programming [Jacobson & Mayne, 1970; Anderson & Moore, 1989] • An efficient algorithm to (locally) optimize a policy for continuous state/action spaces.
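To give a flavor of what each DDP iteration does (a generic sketch, not the authors’ implementation): linearize the dynamics around the current nominal trajectory, approximate the cost quadratically, and run a Riccati backward pass to obtain time-varying feedback gains; the improved trajectory is then used to re-linearize, and the process repeats until convergence.

```python
import numpy as np

def lqr_backward_pass(A_list, B_list, Q, R, Q_final):
    """One backward pass of (iterative) LQR around a nominal trajectory:
    given time-varying linearized dynamics x_{t+1} = A_t x_t + B_t u_t and
    quadratic costs x^T Q x + u^T R u, compute feedback gains K_t so that
    u_t = -K_t x_t is locally optimal. A simplified stand-in for the inner
    loop of DDP."""
    T = len(A_list)
    P = Q_final                  # cost-to-go Hessian at the final time
    gains = [None] * T
    for t in reversed(range(T)):
        A, B = A_list[t], B_list[t]
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # Riccati gain
        P = Q + A.T @ P @ (A - B @ K)                       # Riccati recursion
        gains[t] = K
    return gains
```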

  19. DDP design choices and lessons learned • Simplest reward function: penalize for deviation from the target state at each time. Insufficient: the resulting controllers perform very poorly. • Penalizing for high-frequency control inputs significantly improves the controllers. • To allow aggressive maneuvering, we use a two-step procedure: make a plan off-line, then penalize for high-frequency deviations from the planned inputs. • Penalize for integrated orientation error. [See paper for details.] • Process noise has little influence on the controllers’ performance. • Observation noise and delay in observations greatly affect the controllers’ performance.
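A toy version of the kind of cost these bullets describe (the weights and exact structure are illustrative, not the paper’s actual reward): penalize deviation from the target states, plus high-frequency deviations of the control inputs from the off-line plan.

```python
import numpy as np

def trajectory_cost(states, controls, target_states, planned_controls,
                    w_state=1.0, w_highfreq=0.1):
    """Illustrative trajectory cost (weights are made-up values): quadratic
    penalty on state error plus a penalty on high-frequency deviations of
    the control inputs from the off-line plan."""
    state_err = np.sum((states - target_states) ** 2)
    control_dev = controls - planned_controls
    highfreq = np.sum(np.diff(control_dev, axis=0) ** 2)   # frame-to-frame changes
    return w_state * state_err + w_highfreq * highfreq
```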

  20. Autonomous stationary flips

  21. Autonomous stationary rolls

  22. Related work • Bagnell & Schneider, 2001; LaCivita et al., 2006; Ng et al., 2004a; Roberts et al., 2003; Saripalli et al., 2003; Ng et al., 2004b; Gavrilets, Martinos, Mettler and Feron, 2002. • The maneuvers presented here are significantly more difficult than those flown by any other autonomous helicopter.

  23. Conclusion • Apprenticeship learning for the dynamics model avoids explicit exploration in our experiments. • Our procedure based on inverse RL for the reward function gives performance similar to that of human pilots. • Our results significantly extend the state of the art in autonomous helicopter flight: first autonomous completion of stationary flips and rolls, tail-in funnels and nose-in funnels.

  24. Acknowledgments • Ben Tse, Garett Oku, Antonio Genova. • Mark Woodward, Tim Worley.

  25. Continuous flips
