An Introduction to Reinforcement Learning
Presenter: Verena Rieser, vrieser@coli.uni-sb.de
Course: Classification and Clustering, WS 2005
Contents
• Part 1: The main ideas of RL
• Part 2: The general framework of RL
• Part 3: Automatic Optimization of Dialogue Management (application)
Reinforcement Learning
[Diagram: RL at the intersection of Artificial Intelligence, Control Theory and Operations Research, Psychology, Neuroscience, and Artificial Neural Networks]
Part 1: The Idea of Reinforcement Learning
Learning from interaction
• with the environment
• to achieve some goal
• Example 1: a baby playing. No teacher; a sensorimotor connection to the environment.
• Cause and effect / actions and their consequences
• How to achieve some goal
• Example 2: learning to hold a conversation, etc.
• We find out the effects of our actions only later.
Supervised Learning
[Diagram: inputs feed a supervised learning system, which produces outputs. Training info = desired (target) outputs; error = (target output – actual output)]
Reinforcement Learning
[Diagram: inputs feed an RL system, which produces outputs (“actions”). Training info = evaluations (“rewards” / “penalties”)]
Objective: get as much reward as possible.
RL: How does it work?
Learning a mapping from situations to actions in order to maximize a scalar reward (reinforcement) signal.
• How?
• Try out actions to learn which produces the highest reward: trial-and-error search
• Actions affect the immediate reward and all subsequent rewards: delayed effects, delayed rewards
Exploration/Exploitation Trade-off
• High rewards come from repeating previously well-rewarded actions: EXPLOITATION (= greedy)
• BUT: which actions are best? We must also try actions not tried before: EXPLORATION (= ε)
• We must do both!
• The exploitation/exploration trade-off also depends on the lifetime of the agent.
ε-Greedy Methods on the 10-Armed Testbed [Sutton and Barto, 2002]
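A minimal sketch of ε-greedy action selection on a k-armed bandit, in the spirit of the Sutton and Barto testbed. The Gaussian arm values, unit-variance reward noise, and parameter settings below are illustrative assumptions, not taken from the slides.

```python
import random

def epsilon_greedy_bandit(k=10, epsilon=0.1, steps=1000, seed=0):
    """Run epsilon-greedy action selection on a k-armed bandit.

    Each arm's true value is drawn from a standard normal; pulls return
    that value plus unit Gaussian noise (illustrative assumptions).
    """
    rng = random.Random(seed)
    true_values = [rng.gauss(0.0, 1.0) for _ in range(k)]  # unknown to the agent
    estimates = [0.0] * k   # sample-average value estimate per arm
    counts = [0] * k        # how often each arm has been pulled
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:                      # EXPLORE: random arm
            action = rng.randrange(k)
        else:                                           # EXPLOIT: greedy arm
            action = max(range(k), key=lambda a: estimates[a])
        reward = rng.gauss(true_values[action], 1.0)
        counts[action] += 1
        # incremental sample-average update of the value estimate
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward

    return total_reward / steps

if __name__ == "__main__":
    print("average reward:", epsilon_greedy_bandit())
```

Larger ε explores more and finds the best arm more reliably; smaller ε exploits earlier but risks locking onto a suboptimal arm, which is the trade-off the testbed figure illustrates.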
Part 2: The Framework of RL
• Temporally situated
• Continual learning and planning
• The objective is to affect the environment
• The environment is stochastic and uncertain
[Diagram: the agent sends an action to the environment; the environment returns a state and a reward]
Elements of RL
• Policy: what to do
• Reward: what is good
• Value: what is good because it predicts reward
• Model of the environment: what follows what
General RL Algorithm
• Initialise the learner’s internal state
• Do forever (!?):
  • Observe the current state s
  • Choose an action a using some evaluation function
  • Execute action a
  • Let r be the immediate reward and s’ the new state
  • Update the internal state based on (s, a, r, s’)
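One concrete instantiation of this generic loop is tabular Q-learning, where the "internal state" is a table of action values and the update is the temporal-difference rule. This is only a sketch: the env object with reset(), step(action), and a list of discrete actions is an assumed, hypothetical interface, not a real library.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning as one instance of the generic RL loop.

    `env` is assumed to expose reset() -> state, step(action) -> (state, reward, done),
    and a list of discrete actions `env.actions` (illustrative interface).
    """
    rng = random.Random(seed)
    q = defaultdict(float)                      # internal state: Q(s, a) estimates

    for _ in range(episodes):
        s = env.reset()                         # observe current state s
        done = False
        while not done:
            if rng.random() < epsilon:          # choose a with an evaluation function
                a = rng.choice(env.actions)     # (here: epsilon-greedy over Q)
            else:
                a = max(env.actions, key=lambda x: q[(s, x)])
            s_next, r, done = env.step(a)       # execute a; observe r and s'
            # update the internal state based on (s, a, r, s')
            best_next = max(q[(s_next, x)] for x in env.actions)
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = s_next
    return q
```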
To solve the problem mathematically:
• Formulate it as a Markov Decision Process (MDP) or a Partially Observable Markov Decision Process (POMDP)
• Maximize the state-value and action-value functions using the Bellman optimality equation
• Solve or approximate the Bellman equation using dynamic programming, Monte Carlo methods, or temporal-difference learning
The Bellman Equation
• The Bellman optimality equation estimates “how good” it is to be in a state s.
• “What actions are available?”
• “How good are those actions?”
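The formula itself does not survive in the slide text; the standard form of the Bellman optimality equation for the state-value function (following Sutton and Barto) is:

```latex
V^{*}(s) \;=\; \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[\, R(s, a, s') + \gamma \, V^{*}(s') \,\bigr]
```

The max over actions addresses “what actions are available?”, and the bracketed term, the immediate reward plus the discounted value of the successor state, addresses “how good are those actions?”.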
Summary: Key Features of RL
• The learner is not told which actions to take
• Trial-and-error search
• Possibility of delayed reward (sacrifice short-term gains for greater long-term gains)
• The need to explore and exploit
• Considers the whole problem of a goal-directed agent interacting with an uncertain environment
Interactive Exercise
• Help me annotate the example “a dog catching a stick” with concepts from RL.
• Explain: how would an artificial dog learn to catch the stick using RL?
Part 3: Application for CoLi
Diane J. Litman, Michael S. Kearns, Satinder Singh, and Marilyn A. Walker: Automatic Optimization of Dialogue Management. In: Proceedings of the 18th International Conference on Computational Linguistics (COLING-2000), Saarbrücken, 2000.
Dialogue Management
Motivation:
• The agent wants to achieve some goal
• Non-trivial choices based on the internal state
• Usability should be guaranteed by iterative prototyping
DM is costly! Why not “simply” learn the optimal choices?
• Formulate dialogue as an MDP
• Represent the environment (= states)
• Define a set of possible dialogue strategies (= actions)
• Evaluate actions (= reward)
The NJFun System
• Represent a dialogue strategy as a mapping from a state space S to a set of dialogue acts
• Deploy an initial training system that generates exploratory training data w.r.t. S
• Construct an MDP model from the training data
• Use value iteration to learn the optimal strategy
• Evaluate the system against a hand-coded strategy
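As a rough sketch of this pipeline, one can estimate an MDP from logged (state, action, next state, reward) tuples and run value iteration on the estimate. The function and data-structure choices below are illustrative assumptions and do not reflect the actual NJFun implementation.

```python
from collections import defaultdict

def estimate_mdp(transitions):
    """Estimate transition probabilities and rewards from logged
    (state, action, next_state, reward) tuples (illustrative only)."""
    counts = defaultdict(lambda: defaultdict(int))
    rewards = {}
    for s, a, s_next, r in transitions:
        counts[(s, a)][s_next] += 1
        rewards[(s, a, s_next)] = r                  # last observed reward
    probs = {}
    for (s, a), successors in counts.items():
        total = sum(successors.values())
        probs[(s, a)] = {s2: n / total for s2, n in successors.items()}
    return probs, rewards

def value_iteration(states, actions, probs, rewards, gamma=0.95, tol=1e-6):
    """Compute state values and a greedy policy for the estimated MDP."""
    def q_value(s, a, v):
        return sum(p * (rewards[(s, a, s2)] + gamma * v.get(s2, 0.0))
                   for s2, p in probs[(s, a)].items())

    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            available = [a for a in actions if (s, a) in probs]
            if not available:
                continue                              # terminal or unobserved state
            new_v = max(q_value(s, a, v) for a in available)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            break

    policy = {s: max((a for a in actions if (s, a) in probs),
                     key=lambda a: q_value(s, a, v))
              for s in states if any((s, a) in probs for a in actions)}
    return v, policy
```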
NJFun: Action Space
• Initiative
  • User: the system asks open questions with an unrestricted grammar for recognition
  • System: the system uses directed prompts with restricted grammars
  • Mixed: the system uses directed prompts with non-restricted grammars
• Confirmation
  • Explicit: the system asks the user to verify an attribute
  • No confirmation: the system does not generate a confirmation prompt
NJFun: State Space
• {Greet}: whether the system has greeted the user (0, 1)
• {Attr}: which attribute the system is trying to obtain or verify (1 = activity, 2 = location, 3 = time, 4 = done)
• {Conf}: ASR confidence after obtaining a value for an attribute (0, 1, 2, 3, 4)
• {Val}: whether the system has obtained a value for the attribute (0, 1)
• {Times}: number of times the system has asked for the attribute
• {Gram}: type of grammar most recently used to obtain the attribute
• {Hist}: whether there was “trouble in the past”
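As a data-structure sketch, this state can be pictured as a small feature vector. The field names follow the slide, but the Python encoding below is purely illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NJFunState:
    """Illustrative encoding of the NJFun dialogue state described above."""
    greet: int   # 0/1: has the system greeted the user?
    attr: int    # 1=activity, 2=location, 3=time, 4=done
    conf: int    # ASR confidence bin (0-4)
    val: int     # 0/1: has a value been obtained for the attribute?
    times: int   # how often the system has asked for the attribute
    gram: int    # type of grammar most recently used
    hist: int    # 0/1: "trouble in the past"?

# Example: state after the greeting, while acquiring the first attribute
s = NJFunState(greet=1, attr=1, conf=2, val=1, times=0, gram=0, hist=0)
```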
Example
S1: Welcome to NJFun. How may I help you? s[greet=1] - a[user initiative]
U1: I’d like to find *um* wine tasting in Lambertville. s[conf=2, val=1]
• S2a: Did you say you are interested in wine tasting in Lambertville? s’[attr=(1,2), times=1] - a[explicit confirmation]
• S2b: At what time? s’[attr=3] - a[no confirmation]
NJFun: Optimizing the Strategy
• NJFun’s initial strategy, “Exploratory for Initiative and Confirmation” (EIC), chooses randomly between the possible actions in each state
• Data: 54 subjects for training, 21 for testing
• Binary reward function: 1 if the system queries the database with all specified attributes, 0 otherwise
• Results: a large, significant improvement for expert users and a non-significant degradation for novices
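For illustration only, the binary reward and the random exploratory (EIC) choice might look as follows; the helper names and the allowed_actions table are assumptions, not part of the published system.

```python
import random

def binary_task_reward(queried_db: bool, all_attributes_specified: bool) -> int:
    """1 if the system queried the database with all specified attributes, else 0."""
    return 1 if (queried_db and all_attributes_specified) else 0

def eic_policy(state, allowed_actions, rng=random):
    """Exploratory for Initiative and Confirmation: choose uniformly among the
    actions available in this state, so the training data covers the action space."""
    return rng.choice(allowed_actions[state])
```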
Discussion
• How general are the features? What about dialogues in other domains (e.g. information-seeking vs. tutorial dialogue)?
• What about the algorithm? Why can’t we use supervised learning?
• Do we really save costs?
  • Stochastic user models for training
  • “Bootstrap” an initial system from training data
Additional Slides
Simple Learning Taxonomy
• Supervised Learning
  • A “teacher” provides the required response to inputs. The desired behaviour is known.
• Unsupervised Learning
  • The learner looks for patterns in the input. There is no “right” answer.
• Reinforcement Learning
  • The learner is not told which actions to take, but gets reward/punishment from the environment and learns which action to pick the next time.
RL vs. SL
• The main problem facing an SL system is
  • to construct a mapping from situations to actions that mimics the correct actions specified by the environment,
  • and that generalizes correctly to new situations.
• An SL system cannot be said to learn to control its environment because
  • it follows, rather than influences, the instructive information it receives.
  • Instead of trying to make its environment behave in a certain way, it tries to make itself behave as instructed by its environment.
RL vs. Unsupervised Learning
• Unsupervised learning:
  • Make some decision *now* that satisfies the immediate constraints (e.g. clustering: clusters should not be smaller than n)
• RL:
  • Plan your decisions to achieve some goal in the future; delayed rewards
A More Formal Definition of the RL Framework...
• Policy: π(s, a) = P{a_t = a | s_t = s}. Given that the situation at time t is s, the policy gives the probability that the agent’s action will be a.
• Reward function: defines the goal, and immediate good or bad experience.
• Value function: an estimate of the total future long-term reward. (We want actions that lead to states of high value, not necessarily high immediate reward!)
• Model of the environment: maps states and actions onto states, S × A → S. If in state s1 we take action a2, the model predicts s2 (and sometimes the reward r2).
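In the same notation, the policy and the state-value function under a policy π can be written out explicitly; the discounted-return form of the value function below is the standard one and is assumed here rather than taken from the slide.

```latex
\pi(s, a) = P\{\, a_t = a \mid s_t = s \,\}
\qquad
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s \right]
```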
Markov Property
• A state signal that succeeds in retaining all relevant information is said to be Markov, or to have the Markov property.
• For example: the current position and velocity of a cannonball are all that matter for its future flight. It doesn't matter how that position and velocity came about.
• This is sometimes also referred to as an "independence of path" property, because all that matters is in the current state signal; its meaning is independent of the "path", or history, of signals that have led up to it.
MDPs vs. POMDPs
Major difference: how they represent uncertainty.
• In MDPs the state space is in general represented as vectors describing information slots, each associated with a discrete value.
• POMDPs explicitly model uncertainty by maintaining a belief state, a distribution over MDP states, in the absence of knowing the state exactly.
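A minimal sketch of the belief-state update a POMDP agent performs after taking action a and observing o; the dictionary-based transition and observation models are illustrative assumptions, not any particular dialogue system's model.

```python
def update_belief(belief, action, observation, transition, observe):
    """Bayesian belief update: b'(s') is proportional to
    O(o | s', a) * sum_s T(s' | s, a) * b(s).

    belief:      dict state -> probability
    transition:  dict (state, action) -> dict next_state -> probability
    observe:     dict (next_state, action) -> dict observation -> probability
    """
    new_belief = {}
    reachable = {s2 for (s, a), nxt in transition.items() if a == action for s2 in nxt}
    for s_next in reachable:
        predicted = sum(
            transition.get((s, action), {}).get(s_next, 0.0) * p
            for s, p in belief.items()
        )
        new_belief[s_next] = observe.get((s_next, action), {}).get(observation, 0.0) * predicted
    total = sum(new_belief.values())
    if total > 0:
        new_belief = {s: p / total for s, p in new_belief.items()}
    return new_belief
```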
Some Notable RL Applications
• TD-Gammon (Tesauro): world’s best backgammon program
• Elevator control (Crites & Barto): high-performance down-peak elevator controller
• Dynamic channel assignment (Singh & Bertsekas; Nie & Haykin): high-performance assignment of radio channels to mobile telephone calls
• In general, applicable to all (?) goal-oriented optimization tasks