
Presentation Transcript


  1. Active Imitation Learning via State Queries. Kshitij Judah, Alan Fern, Tom Dietterich. School of EECS, Oregon State University.

  2. Preliminaries • A Markov Decision Process (MDP) is a tuple $(S, A, T, R, s_0)$ where • $S$ is the set of states • $A$ is the set of actions • $T(s' \mid s, a)$ is the transition function denoting the probability of transitioning to state $s'$ after taking action $a$ in $s$ • $R(s)$ is the reward function giving the reward in state $s$ • $s_0$ is the initial state • A stationary policy $\pi$ is a mapping from states to actions • The H-horizon value $V_H(\pi)$ of a policy $\pi$ is the expected total reward of trajectories that start at $s_0$ and follow $\pi$ for H steps
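
A minimal sketch of these definitions in Python, just to make the objects concrete; the state/action types, the sampling interface, and the Monte Carlo estimator are illustrative assumptions, not anything from the slides.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MDP:
    states: List[str]                       # S: set of states
    actions: List[str]                      # A: set of actions
    transition: Callable[[str, str], str]   # samples s' ~ T(s' | s, a)
    reward: Callable[[str], float]          # R(s): reward in state s
    init_state: str                         # s_0: initial state

def h_horizon_value(mdp: MDP, policy: Callable[[str], str],
                    H: int, num_rollouts: int = 1000) -> float:
    """Monte Carlo estimate of the H-horizon value: the expected total reward
    of trajectories that start at s_0 and follow the policy for H steps."""
    total = 0.0
    for _ in range(num_rollouts):
        s = mdp.init_state
        for _ in range(H):
            total += mdp.reward(s)
            s = mdp.transition(s, policy(s))
    return total / num_rollouts
```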

  3. Passive Imitation Learning. The teacher provides trajectory data, which a supervised learning algorithm turns into a classifier that serves as the learner's policy. GOAL: To learn a policy whose H-horizon value is not much worse than that of the teacher's policy.
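
A sketch of this passive pipeline as behavioral cloning, reusing the MDP sketch above; the choice of scikit-learn's LogisticRegression and the `featurize` helper are assumptions for illustration, not the paper's learner.

```python
from sklearn.linear_model import LogisticRegression

def passive_imitation(mdp, teacher_policy, featurize, H, num_trajectories):
    """Collect teacher trajectories and fit a classifier from state features
    to teacher actions (the 'supervised learning algorithm' box above)."""
    X, y = [], []
    for _ in range(num_trajectories):
        s = mdp.init_state
        for _ in range(H):
            a = teacher_policy(s)          # the teacher labels every state it visits
            X.append(featurize(s))
            y.append(a)
            s = mdp.transition(s, a)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return lambda s: clf.predict([featurize(s)])[0]   # the learned policy
```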

  4. Passive Imitation Learning. DRAWBACK: Generating such trajectories can be tedious and may even be impractical, e.g., real-time low-level control of multiple game agents!!

  5. Active Imitation Learning via State Queries. The learner, which has access to a simulator, selects the best state to query; the teacher responds with "the correct action to take in state s is a"; the (s, a) pair is added to the current training data.
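
A sketch of this query loop; `teacher_respond`, `fit`, and `select_best_state_query` are placeholders for the pieces developed on the following slides (choosing the query well is the subject of the rest of the talk).

```python
import random

def active_imitation(candidate_states, teacher_respond, fit,
                     num_queries, select_best_state_query=None):
    """Repeatedly select a state to query, ask the teacher for the correct
    action there, and retrain on the accumulated (s, a) training data."""
    if select_best_state_query is None:
        # Naive stand-in: query uniformly at random until a smarter
        # criterion (e.g., IQBC, later slides) is plugged in.
        select_best_state_query = lambda states, data: random.choice(states)
    data, policy = [], None
    for _ in range(num_queries):
        s = select_best_state_query(candidate_states, data)
        a = teacher_respond(s)        # "the correct action to take in s is a"
        if a is not None:             # None models the bad-state response (next slide)
            data.append((s, a))
            policy = fit(data)        # retrain the classifier on all labeled pairs
    return policy
```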

  6. Active Imitation Learning via State Queries. The teacher may instead respond: "This is a bad state which I would never visit!! I choose not to suggest any action." The learner then records a bad-state response for the queried state rather than adding an (s, a) pair to the training data.

  7. Bad State Response. Example: a Wargus agent querying a Wargus expert may select a state the expert would never reach; the expert flags it as a bad state query instead of providing an action for the training data.

  8. Bad State Response. Example: a helicopter flying agent querying an expert pilot may select a flight state the pilot would never fly into; the pilot flags it as a bad state query instead of providing an action.

  9. Bad State Response. It is important to minimize bad state queries!! When the teacher does respond, it gives the correct action to take in the queried state. Challenge: how to combine action uncertainty and bad-state likelihood when selecting the best state query. We provide a principled approach based on noiseless Bayesian active learning.

  10. Relation to Passive Imitation Learning. It is possible to simulate passive imitation learning via state queries: query the current state, execute the teacher's returned action in the simulator, and query the resulting state, reproducing the N teacher trajectories that would be handed to the supervised learning algorithm.

  11. Relation to I.I.D. Active Learning. In standard i.i.d. active learning there is a single known target distribution over inputs from which queries are drawn; here the learner selects state queries via the simulator, and the teacher returns the correct action for each queried state, which is added to the current training data of (s, a) pairs.

  12. Relation to I.I.D. Active Learning. Applying i.i.d. active learning uniformly over the entire state space leads to poor performance: queries land in uncertain states that are also bad!!

  13. Noiseless Bayesian Active Learning (BAL). The setting consists of a set of hypotheses, a set of tests, and deterministic test outcomes. • Goal: identify the true hypothesis with as few tests as possible • We employ a form of generalized binary search (GBS) in this work
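
A sketch of noiseless GBS under a uniform prior, assuming each hypothesis deterministically predicts an outcome for every test (here a hypothesis is a dict mapping tests to outcomes); the greedy rule runs the test that splits the surviving hypotheses most evenly.

```python
from collections import Counter

def generalized_binary_search(hypotheses, tests, run_test):
    """Noiseless Bayesian active learning via GBS: keep the version space of
    hypotheses consistent with all observed outcomes, and repeatedly run the
    test whose predicted outcomes split that version space most evenly."""
    version_space = list(hypotheses)
    while len(version_space) > 1:
        def largest_block(t):
            # Size of the biggest group of hypotheses agreeing on t's outcome;
            # smaller means a more balanced (more informative) split.
            return max(Counter(h[t] for h in version_space).values())
        t = min(tests, key=largest_block)
        outcome = run_test(t)                       # ask the oracle / teacher
        version_space = [h for h in version_space if h[t] == outcome]
    return version_space[0]                         # the identified hypothesis
```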

  14–26. BAL for Deterministic MDPs (a sequence of figures stepping through the search; the figures are not preserved in this transcript). GOAL: Determine the path corresponding to the teacher's policy by performing tests from the state space that have outcomes (teacher responses).
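
A hedged illustration of how the GBS sketch above applies in the deterministic case: each hypothesis is the single path a candidate policy traces from the initial state, a test is a state query, and its outcome is either the teacher's action at that state or a bad-state response. The encoding below is an assumption for illustration.

```python
BAD_STATE = "bad"   # the teacher's "I would never visit this state" response

def path_to_hypothesis(path, all_states):
    """Encode a candidate path [(s0, a0), (s1, a1), ...] as an outcome table
    for generalized_binary_search: the policy's action at states on the path,
    and a bad-state outcome at every state the path never visits."""
    actions_on_path = dict(path)
    return {s: actions_on_path.get(s, BAD_STATE) for s in all_states}
```

Feeding such hypotheses, the set of states as tests, and the teacher as `run_test` into `generalized_binary_search` then identifies the teacher's path with few queries.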

  27. Imitation Query-by-Committee (IQBC) for Large MDPs. From the current labeled data of (s, a) pairs, draw K bootstrap samples (Bootstrap Sample 1 ... Bootstrap Sample K); train a supervised learner on each sample; execute each resulting policy in the simulator to obtain a path (Path 1 ... Path K); then apply generalized binary search over the committee's paths to choose the next state query.
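
A sketch of the committee construction in this diagram; `fit` (the supervised learner) and `simulate` (rolling a policy out in the simulator to collect the states it visits) are assumed interfaces.

```python
import random

def build_committee(labeled_data, fit, simulate, K=10):
    """IQBC committee: K bootstrap resamples of the labeled (s, a) pairs, one
    supervised learner per resample, and one simulator rollout per learner to
    obtain the path (set of visited states) of each committee member."""
    committee, paths = [], []
    for _ in range(K):
        sample = [random.choice(labeled_data) for _ in labeled_data]  # bootstrap
        policy = fit(sample)                      # train one committee member
        committee.append(policy)
        paths.append(set(simulate(policy)))       # states this member visits
    return committee, paths
```

The committee's paths play the role of the hypotheses in the GBS sketch earlier; the next slide gives the resulting query-selection score.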

  28. Action Uncertainty versus Bad-State Trade-off. The GBS query-selection score for a state s can be rewritten as: (the posterior probability mass of hypotheses that go through s, i.e., the posterior probability of the target policy visiting s) × (the entropy of the multinomial distribution over actions at s, i.e., the uncertainty over action choices at s), plus a small bonus term (see the paper for its exact form and when it is maximized).
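
A sketch of that decomposition using committee-based estimates: the fraction of committee paths passing through s stands in for the posterior probability of visiting s, and the entropy of the committee's action votes at s measures action uncertainty. The small bonus term is omitted here; its exact form is in the paper.

```python
import math
from collections import Counter

def query_score(s, committee, paths):
    """Score a candidate query state as (estimated probability that the target
    policy visits s) x (entropy of the action distribution at s). States that
    are both uncertain and likely to be visited score highest."""
    p_visit = sum(1 for path in paths if s in path) / len(paths)
    votes = Counter(policy(s) for policy in committee)
    total = sum(votes.values())
    entropy = -sum((c / total) * math.log(c / total) for c in votes.values())
    return p_visit * entropy   # bonus term from the paper omitted in this sketch
```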

  29. Stochastic MDPs • We use a Pegasus-style determinization approach to handle stochastic MDPs (Ng & Jordan, UAI 2000) • Details are in the paper!!
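
A sketch of the Pegasus idea (Ng & Jordan, 2000) as it might be used here: pre-draw the random numbers the stochastic simulator consumes, so every policy rollout sees identical randomness and the stochastic MDP behaves like a deterministic one. The `stochastic_step(s, a, u)` interface, which resolves a transition given a uniform random number u, is an assumption for illustration.

```python
import random

class DeterminizedSimulator:
    """Pegasus-style determinization: fix the per-step random numbers once, so
    repeated rollouts of any policy reuse the same randomness."""
    def __init__(self, stochastic_step, horizon, seed=0):
        rng = random.Random(seed)
        self.noise = [rng.random() for _ in range(horizon)]  # drawn in advance
        self.stochastic_step = stochastic_step               # (s, a, u) -> s'

    def rollout(self, policy, init_state):
        s, path = init_state, [init_state]
        for u in self.noise:
            s = self.stochastic_step(s, policy(s), u)        # same u every time
            path.append(s)
        return path
```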

  30. Experiments • We performed experiments in two domains: • A grid world with pits • Cart pole • We compared IQBC against the following baselines: • Random: selects states to query uniformly at random • Standard QBC (SQBC): treats all states as i.i.d. and applies standard uncertainty-based QBC • Passive imitation learning (Passive): simulates standard passive imitation learning • Confidence-based autonomy (CBA) (Chernova & Veloso, JAIR 2009): executes the policy until its confidence falls below an automatically adjusted threshold, at which point the learner queries the teacher for an action, updates its policy and threshold, and resumes execution (sketched below) • Performance can be quite sensitive to the threshold adjustment
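
A minimal sketch of the CBA baseline as described in the bullet above; the confidence measure and the threshold-adjustment rule are placeholders (the actual rules are given in Chernova & Veloso, JAIR 2009).

```python
def adjust_threshold(threshold, data):
    # Placeholder for CBA's automatic threshold adjustment; the real rule is
    # specified in Chernova & Veloso (JAIR 2009).
    return threshold

def cba_episode(policy, confidence, teacher, retrain, step, s0, H, threshold):
    """The agent acts autonomously while its confidence stays above the
    threshold; otherwise it queries the teacher, updates its policy and the
    threshold, and resumes execution."""
    data, s = [], s0
    for _ in range(H):
        if confidence(policy, s) < threshold:
            a = teacher(s)                                 # query the teacher
            data.append((s, a))
            policy = retrain(policy, data)                 # update the policy ...
            threshold = adjust_threshold(threshold, data)  # ... and the threshold
        else:
            a = policy(s)                                  # act autonomously
        s = step(s, a)
    return policy, threshold
```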

  31. Grid World With Pits (figure: a 30 × 30 grid containing several pits and a goal location).

  32. Teacher Types • Generous: always responds with an action • Strict: declares states far away from the states visited by the teacher as bad states

  33. Grid World With Pits: Results “Generous” teacher

  34. Grid World With Pits: Results “Strict” teacher

  35. Cart Pole • State = (cart position, cart velocity, pole angle, pole angular velocity) • Actions = left or right • Bounds on the cart position and pole angle are [-2.4, 2.4] and [-90°, 90°] respectively

  36. Cart Pole: Results “Generous” teacher

  37. Cart Pole: Results “Strict” teacher

  38. Future Work • Develop policy optimization algorithms that take teacher responses and other forms of teacher input into account • Query short sequences of states rather than single states • Consider more application areas like structured prediction and other RL domains • Conduct studies with human teachers
