
Machine reconstruction of human control strategies






Presentation Transcript


  1. Machine reconstruction of human control strategies Dorian Šuc Artificial Intelligence Laboratory Faculty of Computer and Information Science University of Ljubljana, Slovenia

  2. Overview • Skill reconstruction and behavioural cloning • The learning problem • A problem decomposition for behavioural cloning (indirect controllers, experiments, advantages) • Symbolic and qualitative skill reconstruction • Learning qualitative strategies: QUIN algorithm • QUIN in skill reconstruction • Conclusions

  3. Skill reconstruction and behavioural cloning • Motivation: • understanding of human skill • development of an automatic controller • ML approach to skill reconstruction: learn a control strategy from data logged from skilled human operators (execution traces). Later called behavioural cloning (Michie, 93). • Early work: Chambers and Michie (69); learning control by imitation also by Donaldson (60, 64)

  4. Behavioural cloning: some applications • Original approach: clones usually induced as a direct mapping from states to actions in the form of trees or rule sets • Successfully used in domains such as: • pole balancing (Michie et al., 90) • piloting (Sammut et al., 92; Camacho, 95) • container cranes (Urbančič, 94) • production line scheduling (Kerr and Kibira, 94) • Reviews in Sammut (96), Bratko et al. (98)

  5. Learning problem • Execution traces are used as examples for ML to induce: • a control strategy (comprehensible, symbolic) • an automatic controller (criterion of success) • Operator’s execution trace: • a sequence of system states and the corresponding operator’s actions, logged to a file at a certain frequency • Reconstruction of human control skill: • Skill: “know how” at the subsymbolic level, operational • Strategy: explicitly described “know how” at the symbolic level

  6. Container crane • Used in ports for load transportation • Control forces: Fx, FL • State: X, dX, Φ, dΦ, L, dL • Based on previous work of Urbančič (94) • Control task: transport the load from the start to the goal position

  7. Learning problem, cont. An operator’s execution trace (control forces and state variables):
     Fx      FL   X      dX     Φ       dΦ      L      dL
     0       0    0.00   0.00   0.00    0.00    20.00  0.00
     2500    0    0.00   0.00   -0.00   -0.01   20.00  0.00
     6000    0    0.00   0.01   -0.01   -0.02   20.00  0.00
     10000   0    0.02   0.10   -0.07   -0.27   20.00  0.00
     14500   0    0.12   0.31   -0.32   -0.85   20.00  0.00
     14500   0    0.35   0.59   -0.95   -1.49   20.00  0.01
     …       …    …      …      …       …       …      …
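For concreteness, here is a minimal sketch (not from the thesis) of how such a logged trace could be turned into (state, action) pairs for learning; the column order follows the table above and the Python field names are assumptions:

    # Sketch: load a crane execution trace into (state, action) samples.
    # Column order per the table above: Fx FL X dX Phi dPhi L dL (names assumed).
    from collections import namedtuple

    State = namedtuple("State", ["X", "dX", "Phi", "dPhi", "L", "dL"])
    Action = namedtuple("Action", ["Fx", "FL"])

    def parse_trace(lines):
        samples = []
        for line in lines:
            vals = [float(v) for v in line.split()]
            samples.append((State(*vals[2:]), Action(*vals[:2])))
        return samples

    trace = parse_trace([
        "0 0 0.00 0.00 0.00 0.00 20.00 0.00",
        "2500 0 0.00 0.00 -0.00 -0.01 20.00 0.00",
    ])
    print(trace[1][1].Fx)   # the operator's control force Fx at the second sample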

  8. Problems of the original approach • Difficulties observed with the original approach: • No guarantee of inducing, with high probability, a successful clone (Urbančič and Bratko, 94) • Low robustness of clones • Poor comprehensibility of clones (hard to understand) • Michie (93, 95) suggests that a kind of problem decomposition could be helpful: “learning from exemplary performance requires more than mindless imitation” • Recent approaches to behavioural cloning (Stirling, 95; Bain and Sammut, 99; Camacho, 2000)

  9. Related work • Leech (86): probably the first goal-structured learning of control • CHURPs (Stirling, 95): separates control skills into planning and actuation phases; focuses on the planning component; assumes the goals are given • GRAIL (Bain and Sammut, 99): learning goals by decision trees and effects by abduction • Incremental Correction model (Camacho, 2000): homeostatic and achievable goals; parametrised decision trees to learn goals; wrapper approach

  10. Our approach Our goals: • transparency of the induced strategies • robust and successful controllers Ideas: • Learning problem decomposition: (a) learning of the constraints on operator’s trajectories, (b) learning of the system’s dynamics • Generalized trajectory as a continuous subgoal • Symbolic and qualitative constraints, use of domain knowledge Differences with related approaches: • continuous generalized trajectory • qualitative strategies

  11. Experimental domains • Container crane: we used execution traces from (Urbančič, 94) • Acrobot (DeJong, 95; Sutton, 96): a two-link pendulum in a gravitational field; swing-up task • Bicycle riding (Randløv and Alstrøm, 98): drive the bike from the start to the goal position; requires simultaneous balancing and goal-aiming • Simulators were used in all experiments • Measure of success: time to accomplish the task

  12. Operator’s trajectory • A sequence of the states from an execution trace • A path in the state space • Figure: operator’s trajectory of the trolley velocity (dX) in the space of X, Φ and dX

  13. Generalized trajectory • Induced constraints on the operator’s trajectory • The constraints can be represented as: • trees • equations • qualitative constraints

  14. Qualitative and quantitative strategy • Quantitative strategy: given with precise numerical values or numeric constraints (decision tree, equation) • Qualitative strategy may also use qualitative constraints. A qualitative strategy defines a set of quantitative strategies • We use qualitatively constrained functions (QCFs): monotonicity constraints as used in qualitative reasoning

  15. Qualitatively constrained functions • M+(x): an arbitrary monotonically increasing function of x • A QCF is a generalization of M+, similar to the qualitative proportionality predicates used in QPT (Forbus, 84) • Gas in a container: Pres = c Temp / Vol, c = n R > 0; QCF: Pres = M+,-(Temp, Vol)
     Temp=std & Vol↓ ⇒ Pres↑
     Temp↑ & Vol↓ ⇒ Pres↑
     Temp↓ & Vol↑ ⇒ Pres↓
     Temp↑ & Vol↑ ⇒ Pres?
     Temp↓ & Vol↓ ⇒ Pres?
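To make the sign-propagation reading of a QCF concrete, here is a small illustrative sketch (not the thesis code): the class change is determined only when all changing attributes push it in the same direction, and ambiguous otherwise.

    # Sketch: qualitative prediction with a QCF, e.g. Pres = M+,-(Temp, Vol).
    # Qualitative changes are encoded as +1 (increasing), -1 (decreasing), 0 (steady).

    def qcf_predict(qcf_signs, attr_changes):
        """qcf_signs: +1/-1 per attribute (M+ or M-); attr_changes: observed signs.
        Returns the predicted sign of the class change, 0, or None (ambiguous)."""
        effects = {s * c for s, c in zip(qcf_signs, attr_changes) if c != 0}
        if not effects:
            return 0                  # nothing changed
        if len(effects) == 1:
            return effects.pop()      # all changes push the class the same way
        return None                   # opposing influences: no prediction

    pres_qcf = (+1, -1)                       # Pres = M+,-(Temp, Vol)
    print(qcf_predict(pres_qcf, (0, -1)))     # Temp steady, Vol down -> 1 (Pres up)
    print(qcf_predict(pres_qcf, (+1, +1)))    # Temp up, Vol up -> None (ambiguous)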

  16. Problem decomposition

  17. Direct and indirect controllers • Indirect controllers: our approach; also CHURPs (Stirling, 95), GRAIL (Bain and Sammut, 99), ICM (Camacho, 2000) • Direct controllers: the original approach, BOXES, ASE/ACE

  18. Robustness of direct and indirect controllers against learning error • Experiment: modelling the learning of direct and indirect controllers with some learning error: • direct controllers: “correct action” + noise(σ) • indirect controllers: “correct trajectory” + noise(σ) • Two error models: • Gaussian noise • biased Gaussian noise (all errors in the same direction) • A simple, deterministic, discrete-time system: • control task: reach and maintain the goal value Xg • performance criterion: controller error in Xg
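The effect can be mimicked with a toy simulation; the integrator plant, gain and noise levels below are illustrative assumptions, not the system used in the thesis:

    # Sketch: direct vs indirect clones under learning error on a toy plant
    # x[t+1] = x[t] + u[t] with goal Xg (an assumed system, not the thesis one).
    import random

    def simulate(kind, noise, steps=200, xg=1.0, k=0.1, seed=0):
        rng, x, err = random.Random(seed), 0.0, 0.0
        for _ in range(steps):
            e = noise(rng)
            if kind == "direct":
                u = k * (xg - x) + e    # learned action = "correct action" + error
            else:
                x_des = xg + e          # learned trajectory = "correct trajectory" + error
                u = x_des - x           # action derived from the known dynamics
            x += u
            err += abs(xg - x)
        return err / steps              # mean controller error in Xg

    gauss = lambda rng: rng.gauss(0.0, 0.05)
    biased = lambda rng: abs(rng.gauss(0.0, 0.05))   # all errors in the same direction

    for name, noise in [("Gaussian", gauss), ("biased", biased)]:
        for kind in ("direct", "indirect"):
            print(name, kind, round(simulate(kind, noise), 3))

With the biased error model the direct clone settles at an offset roughly equal to the bias divided by the policy gain, while the indirect clone's error stays of the order of the bias itself, which is the qualitative effect reported on the next slide.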

  19. Robustness of direct and indirect controllers against learning error (2) Biased noise affects direct controllers much more

  20. Possible advantages of indirect controllers • Less sensitive to departures from the operator’s trajectory • More robust against changes in the system’s dynamics and small changes in the task • Generalizing the trajectory is often easier than generalizing the actions • A generalized trajectory is often easier to understand (fewer details)

  21. Symbolic and qualitative skill reconstruction • GoldHorn (Križman, 98) • LWR (Atkeson et al., 97) • Experiments in the crane and acrobot domains

  22. Experiments in the crane domain • GoldHorn induced the generalized trajectory of the trolley velocity: dXdes = 0.902 − 0.018 X² + 0.090 X + 0.050 Φ • Qualitative strategy: if X ≤ Xmid then dXdes = M+,+(X, Φ) else dXdes = M-,+(X, Φ)

  23. Transforming qualitative into quantitative strategies • By concretizing qualitative parameters into real, numeric values or real-valued functions • First experiment: using randomly generated functions satisfying qualitative constraints and additional domain knowledge: • maximal and minimal values of the state variables • the trolley starts towards goal • the trolley stops at goal • Second experiment: using additional domain knowledge
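One simple way to generate such random functions (an illustrative construction, not the thesis procedure) is to sum random monotone piecewise-linear terms, one per attribute, with signs taken from the QCF; the variable ranges below are placeholders:

    # Sketch: draw a random function satisfying a QCF such as dXdes = M+,+(X, Phi)
    # as a sum of random monotone piecewise-linear terms (ranges are placeholders).
    import random

    def random_monotone(lo, hi, out_range, sign=+1, knots=5, rng=random):
        xs = [lo + (hi - lo) * i / knots for i in range(knots + 1)]
        steps = [rng.random() for _ in range(knots)]
        total = sum(steps)
        ys = [0.0]
        for s in steps:
            ys.append(ys[-1] + s / total * out_range)   # non-decreasing values
        if sign < 0:
            ys = [out_range - y for y in ys]            # mirror for M- attributes
        def f(x):
            x = min(max(x, lo), hi)
            i = min(int((x - lo) / (hi - lo) * knots), knots - 1)
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
        return f

    f_X = random_monotone(0.0, 60.0, 0.8)        # monotone term in X
    f_Phi = random_monotone(-0.1, 0.1, 0.2)      # monotone term in Phi
    dX_des = lambda X, Phi: f_X(X) + f_Phi(Phi)  # satisfies M+,+(X, Phi)
    print(dX_des(10.0, 0.02))

Domain knowledge (variable ranges, start and stop conditions) then restricts which of these random concretizations are admissible.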

  24. Efficiency of the qualitative strategy • The results show that the qualitative strategy is: • general (the precise selection of qualitative parameters is not crucial) • successful: it leaves room for controller optimization • Similar experiments in the acrobot domain

  25. Qualitative induction • Motivation: our experiments with qualitative strategies (crane, acrobot) • The usual classification learning problem, but learning qualitative trees: • leaves contain qualitatively constrained functions (QCFs); QCFs constrain how the class changes in response to a change in the attributes • internal nodes (splits) define a partition of the state space into areas with common qualitative behaviour of the class variable

  26. Qualitatively constrained function (QCF) • M+(x): an arbitrary monotonically increasing function of x • A QCF is a generalization of M+, similar to the qualitative proportionality predicates used in QPT (Forbus, 84) • Gas in a container: Pres = c Temp / Vol, c = n R > 0; QCF: Pres = M+,-(Temp, Vol)
     Temp=std & Vol↓ ⇒ Pres↑
     Temp↑ & Vol↓ ⇒ Pres↑
     Temp↓ & Vol↑ ⇒ Pres↓
     Temp↑ & Vol↑ ⇒ Pres?
     Temp↓ & Vol↓ ⇒ Pres?

  27. Learning QCFs • Example: Pres = 2 Temp / Vol
     Temp     Vol     Pres
     315.00   56.00   11.25
     315.00   62.00   10.16
     330.00   50.00   13.20
     300.00   50.00   12.00
     300.00   55.00   10.90
• Learning of the “most consistent” QCF: • for each pair of examples form a qualitative change vector • select the QCF with minimal error-cost
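A compact sketch of this selection step (the error-cost below is a simple assumed combination of inconsistency and ambiguity counts, not QUIN's MDL-based cost):

    # Sketch: pick the "most consistent" QCF from qualitative change vectors.
    from itertools import combinations, product

    def sign(v, eps=1e-9):
        return 0 if abs(v) < eps else (1 if v > 0 else -1)

    def best_qcf(examples, n_attrs):
        # one qualitative change vector per pair of examples (class is last column)
        pairs = [tuple(sign(b[i] - a[i]) for i in range(n_attrs + 1))
                 for a, b in combinations(examples, 2)]
        best = None
        for signs in product((+1, -1), repeat=n_attrs):    # candidate QCFs M+/-,...
            incons = ambig = 0
            for *dattrs, dclass in pairs:
                effects = {s * d for s, d in zip(signs, dattrs) if d != 0}
                if len(effects) != 1:
                    ambig += 1                 # opposing (or no) influences
                elif dclass != 0 and dclass != effects.pop():
                    incons += 1                # QCF contradicts the observed change
            cost = incons + 0.5 * ambig        # assumed weighting
            if best is None or cost < best[0]:
                best = (cost, signs)
        return best

    # Gas example from the table above: rows are (Temp, Vol, Pres)
    data = [(315.0, 56.0, 11.25), (315.0, 62.0, 10.16), (330.0, 50.0, 13.20),
            (300.0, 50.0, 12.00), (300.0, 55.0, 10.90)]
    print(best_qcf(data, n_attrs=2))   # selects signs (1, -1): Pres = M+,-(Temp, Vol)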

  28. Learning QCFs • Each qualitative change vector is compared with each candidate QCF; the table cells list the indices of the inconsistent and ambiguous example pairs (e.g. one pair has qTemp=neg, qVol=neg, qPres=pos):
     QCF                  Incons.   Amb.
     M+(Temp)             3         1
     M-(Temp)             2,4       1
     M+(Vol)              1,2,3     /
     M-(Vol)              4         /
     M+,+(Temp,Vol)       1,3       2
     M+,-(Temp,Vol)       /         3,4
     M-,+(Temp,Vol)       1,2       3,4
     M-,-(Temp,Vol)       4         2
• Select the QCF with minimal QCF error-cost

  29. Learning a qualitative tree • For every possible split, split the examples into two subsets, find the “most consistent” QCF for each subset and select the split minimizing a tree-error cost (based on MDL) • The algorithm ep-QUIN uses every pair of examples • An improvement: the heuristic QUIN algorithm, which also considers the locality and consistency of qualitative change vectors
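A sketch of a one-level split search in this spirit (greedy; QUIN's actual tree-error cost is MDL-based, and ep-QUIN/QUIN differ in how the pairs are weighted):

    # Sketch: try every threshold on every attribute, score each side with a QCF
    # error-cost (e.g. the best_qcf sketch above), and keep the split only if it
    # beats a single leaf. Recursing on the two subsets grows the full tree.

    def best_split(examples, n_attrs, qcf_cost):
        best = (qcf_cost(examples), None)             # cost of one leaf, no split
        for a in range(n_attrs):
            values = sorted({row[a] for row in examples})
            for lo, hi in zip(values, values[1:]):
                t = (lo + hi) / 2.0
                left = [r for r in examples if r[a] <= t]
                right = [r for r in examples if r[a] > t]
                cost = qcf_cost(left) + qcf_cost(right)
                if cost < best[0]:
                    best = (cost, (a, t))             # (attribute index, threshold)
        return best

    # e.g., with the best_qcf sketch above:
    # best_split(data, 2, lambda subset: best_qcf(subset, 2)[0])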

  30. Experimental evaluation in artificial domains • On a set of artificial domains with uniformly distributed attributes; 2 irrelevant attributes • Results by QUIN better than ep-QUIN • In simple domains QUIN finds qualitative relations corresponding to our intuition

  31. QUIN in bicycle riding • Control task: drive a bike from the start to the goal position; the bike’s speed is assumed constant; difficult because balancing and goal-aiming must be performed simultaneously • Controlled by the torque applied to the handlebars • State: goalAngle, goalDist, θ, dθ, ω, dω • QUIN: θdes = f(State)

  32. Induced qualitative strategy (qualitative tree for θdes):
     goalAngle ≤ 0.015:
         goalAngle ≤ -0.027: θdes = M+,+,-(ω, dω, goalAngle)
         goalAngle > -0.027: θdes = M+,+(ω, dω)
     goalAngle > 0.015: θdes = M+,+,-(ω, dω, goalAngle)
     (the two M+,+,-(ω, dω, goalAngle) leaves contain the same QCF)

  33. Induced qualitative strategy • If goalAngle is near zero: θdes = M+,+(ω, dω) (balancing only) • Otherwise: θdes = M+,+,-(ω, dω, goalAngle) (balancing and goal-aiming) • Balancing: if the bike starts falling over, turn the front wheel in the direction of the fall • Goal-aiming: turn the front wheel away from the goal

  34. Transforming qualitative into quantitative strategies • Transform the QCFs into real-valued functions by using simple domain knowledge: • maximal front wheel deflection • drive straight if the bike is aiming at the goal: f(0,0,0)=0 • balancing is more important than aiming at the goal • 400 randomly generated quantitative strategies; 59.2% successful • Test of robustness: • change in the start state (58% successful) • random displacement of the bicyclist from the mass center (26% successful)
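For illustration, one quantitative controller consistent with the induced qualitative strategy and the domain knowledge above might look as follows; the gains, the "near zero" band and the maximal deflection are assumptions, whereas the experiments sampled such parameters randomly within the stated constraints:

    # Sketch: one quantitative instantiation of the induced qualitative strategy.
    THETA_MAX = 1.4            # assumed maximal front-wheel deflection (rad)
    NEAR_ZERO = 0.015          # goalAngle band where only balancing is active

    def theta_des(omega, d_omega, goal_angle, k_omega=9.0, k_domega=2.0, k_goal=0.2):
        # Balancing: increasing in omega and d_omega (turn into the fall).
        theta = k_omega * omega + k_domega * d_omega
        # Goal-aiming: decreasing in goalAngle (turn away from the goal),
        # with a small gain so that balancing dominates.
        if abs(goal_angle) > NEAR_ZERO:
            theta -= k_goal * goal_angle
        # Respect the maximal deflection; theta_des(0, 0, 0) = 0 by construction.
        return max(-THETA_MAX, min(THETA_MAX, theta))

    print(theta_des(0.0, 0.0, 0.0))    # 0.0: drive straight when aimed at the goal
    print(theta_des(0.05, 0.0, 0.5))   # balancing term outweighs goal-aiming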

  35. QUIN in the crane domain • Crane control requires trolley and rope control • Experiments with traces of 2 operators using different control styles • Rope control • QUIN: Ldes = f(X, dX, Φ, dΦ, dL) • Often a very simple strategy was induced: Ldes = M+(X), i.e. bring down the load as the trolley moves from the start to the goal position
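A minimal concretisation of Ldes = M+(X); the start rope length follows the trace above, the other endpoint values are placeholders:

    # Sketch: the simplest concretisation of Ldes = M+(X) is a monotone
    # interpolation between start and goal rope lengths (endpoints assumed).
    def L_des(X, X_start=0.0, X_goal=60.0, L_start=20.0, L_goal=32.0):
        t = min(max((X - X_start) / (X_goal - X_start), 0.0), 1.0)
        return L_start + t * (L_goal - L_start)    # non-decreasing in X

    print(L_des(30.0))   # halfway: the load is partly lowered as the trolley advances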

  36. Trolley control • QUIN: dXdes = f(X, Φ, dΦ) • More diversity in the induced strategies; enables reconstruction of individual differences in control styles • The induced qualitative trees (one per operator) split on X (X < 20.7, X < 29.3, X < 60.1) and on dΦ (dΦ < -0.02), with leaf QCFs such as M+(X), M+,+,-(X, Φ, dΦ), M-(X), M+(Φ) and M-,+(X, Φ)

  37. Role of human intervention • The approach facilitates the use of user knowledge • In our experiments the following types of human intervention were used: • selection of the dependent trajectory variable • disregarding some state variables • selection and analysis of the induced equations • using domain knowledge in transforming qualitative into quantitative strategies • Empirically, different (sensible) choices and uses of domain knowledge also give successful strategies

  38. Contributions of the thesis • A decomposition of the behavioural cloning problem into the learning of continuous generalized trajectory and system’s dynamics • Modelling of human skill with symbolic and qualitative constraints • QUIN algorithm for learning qualitative constraint trees • Applying QUIN to skill reconstruction • Experimental evaluation in several dynamic domains

  39. Further work • Applying QUIN in different domains where qualitative models are preferred; QUIN improvements • Qualitative simulation to generate possible explanations of a qualitative strategy • Reducing the space of admissible controllers by qualitative reasoning • Minimizing the trajectory-constraint error in all the state variables would remove the need to select the dependent trajectory variable
