
An Introduction to Reinforcement Learning


Presentation Transcript


  1. An Introduction to Reinforcement Learning
  Presenter: Verena Rieser, vrieser@coli.uni-sb.de
  Course: Classification and Clustering, WS 2005

  2. Contents
  • Part 1: The main ideas of RL
  • Part 2: The general framework of RL
  • Part 3: Automatic Optimization of Dialogue Management (application)

  3. Reinforcement Learning
  [Diagram: Reinforcement Learning (RL) at the intersection of Artificial Intelligence, Control Theory and Operations Research, Psychology, Neuroscience, and Artificial Neural Networks.]

  4. Part 1: The Idea of Reinforcement Learning
  Learning from interaction
  • with the environment
  • to achieve some goal
  • Example 1: A baby playing. No teacher; a sensorimotor connection to the environment.
  • Cause and effect / actions and their consequences
  • How to achieve some goal
  • Example 2: Learning to hold a conversation, etc.
  • We find out the effects of our actions only later.

  5. Supervised Learning
  [Diagram: Inputs → Supervised Learning System → Outputs. Training info = desired (target) outputs. Error = (target output – actual output).]

  6. Reinforcement Learning
  [Diagram: Inputs → RL System → Outputs (“actions”). Training info = evaluations (“rewards” / “penalties”). Objective: get as much reward as possible.]

  7. RL – How does it work?
  Learning a mapping from situations to actions in order to maximize a scalar reward/reinforcement signal.
  • How?
  • Try out actions to learn which ones produce the highest reward – trial-and-error search.
  • Actions affect the immediate reward and all subsequent rewards – delayed effects, delayed rewards.

  8. Exploration/Exploitation Trade-off
  • High rewards from repeating previously well-rewarded actions – EXPLOITATION (= greedy)
  • BUT: which actions are best? Must also try actions not tried before – EXPLORATION (= ε)
  • Must do both!
  • The exploitation/exploration trade-off also depends on the lifetime of the agent.

  9. ε-Greedy Methods on the 10-Armed Testbed [Sutton and Barto, 2002]
  [Figure: performance of ε-greedy action selection on the 10-armed bandit testbed.]
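  The behaviour plotted on this slide can be reproduced with a few lines of code. Below is a minimal Python sketch of ε-greedy action selection on a 10-armed bandit in the spirit of the Sutton and Barto testbed; the arm distributions, the ε value, and the number of steps are illustrative assumptions, not values taken from the slides.

```python
import random

# Minimal epsilon-greedy sketch on a 10-armed bandit (illustrative values only).
N_ARMS = 10
EPSILON = 0.1     # probability of exploring a random arm
STEPS = 1000

# True mean reward of each arm, unknown to the learner.
true_means = [random.gauss(0.0, 1.0) for _ in range(N_ARMS)]

estimates = [0.0] * N_ARMS   # learner's running reward estimates
counts = [0] * N_ARMS        # how often each arm was pulled

total_reward = 0.0
for _ in range(STEPS):
    if random.random() < EPSILON:
        arm = random.randrange(N_ARMS)                          # explore
    else:
        arm = max(range(N_ARMS), key=lambda a: estimates[a])    # exploit
    reward = random.gauss(true_means[arm], 1.0)                 # noisy payoff
    counts[arm] += 1
    # Incremental sample-average update of the estimate for this arm.
    estimates[arm] += (reward - estimates[arm]) / counts[arm]
    total_reward += reward

print(f"average reward over {STEPS} steps: {total_reward / STEPS:.3f}")
```

  Setting EPSILON to 0 gives the purely greedy learner, which tends to lock onto a suboptimal arm; a small positive ε trades a little immediate reward for better estimates, which is exactly the trade-off of the previous slide.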

  10. Part 2: The Framework of RL
  • Temporally situated
  • Continual learning and planning
  • Objective is to affect the environment
  • Environment is stochastic and uncertain
  [Diagram: the agent sends an action to the environment; the environment returns a state and a reward to the agent.]

  11. Elements of RL
  • Policy: what to do
  • Reward: what is good
  • Value: what is good because it predicts reward
  • Model: what follows what
  [Diagram: policy, reward, value, and model of the environment.]

  12. General RL Algorithm
  • Initialise the learner’s internal state
  • Do forever (!?):
    • Observe current state s
    • Choose action a using some evaluation function
    • Execute action a
    • Let r be the immediate reward and s’ the new state
    • Update the internal state based on (s, a, r, s’)
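  One common concrete instantiation of this loop is tabular Q-learning. The sketch below follows the steps listed above; the environment interface (reset, step, actions) and the learning-rate and discount parameters are illustrative assumptions, not part of the original slides.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning as one instantiation of the general RL loop.

    The environment is assumed to expose reset() -> state,
    step(action) -> (next_state, reward, done), and a list env.actions;
    these names are hypothetical, not from the slides.
    """
    Q = defaultdict(float)  # the learner's internal state: Q[(s, a)] estimates

    def choose_action(s):
        # Epsilon-greedy "evaluation function" over the current estimates.
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()                      # observe current state s
        done = False
        while not done:
            a = choose_action(s)             # choose action a
            s_next, r, done = env.step(a)    # execute a; get reward r and new state s'
            # Update the internal state based on (s, a, r, s').
            best_next = 0.0 if done else max(Q[(s_next, b)] for b in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```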

  13. To solve the problem mathematically:
  • Formulate it as a Markov Decision Process (MDP) or a Partially Observable Markov Decision Process (POMDP)
  • Maximize the state-value and action-value functions using the Bellman optimality equation
  • Use approximate solution methods for the Bellman equation, such as dynamic programming, Monte Carlo methods, and temporal-difference learning.

  14. The Bellman Equation
  • The Bellman optimality equation estimates “how good” it is to be in a state s.
  • “What actions are available?” “How good are those actions?”
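  The equation itself appears only as an image on the original slide. For reference, one standard form of the Bellman optimality equation for the state-value function (in the notation of Sutton and Barto) is:

```latex
V^{*}(s) \;=\; \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V^{*}(s') \bigr]
```

  Read aloud, it matches the two questions on the slide: the max ranges over the actions available in s, and the bracketed term scores how good each action is by combining the immediate reward with the discounted value of the state it leads to.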

  15. Summary: Key Features of RL
  • The learner is not told which actions to take
  • Trial-and-error search
  • Possibility of delayed reward (sacrifice short-term gains for greater long-term gains)
  • The need to explore and exploit
  • Considers the whole problem of a goal-directed agent interacting with an uncertain environment

  16. Interactive Exercise
  • Help me to annotate the example “a dog catching a stick” with concepts from RL.
  • Explain: how would an artificial dog learn to catch the stick using RL?

  17. Part 3: Application for CoLi
  Diane J. Litman, Michael S. Kearns, Satinder Singh, and Marilyn A. Walker: Automatic Optimization of Dialogue Management. In: Proceedings of the 18th International Conference on Computational Linguistics (COLING-2000), Saarbrücken, 2000.

  18. Dialogue Management
  Motivation:
  • The agent wants to achieve some goal
  • Non-trivial choices based on the internal state
  • Usability should be guaranteed by iterative prototyping
  Dialogue management is costly! Why not “simply” learn the optimal choices?
  • Formulate dialogue as an MDP
  • Represent the environment (= states)
  • Define a set of possible dialogue strategies (= actions)
  • Evaluate actions (= reward)

  19. The NJFun System
  • Represent a dialogue strategy as a mapping from a state space S to a set of dialogue acts
  • Deploy an initial training system which generates exploratory training data w.r.t. S
  • Construct an MDP model from the training data
  • Use value iteration to learn the optimal strategy
  • Evaluate the system w.r.t. a hand-coded strategy
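  For concreteness, plain value iteration over the estimated MDP can be sketched as below. The transition table P, reward table R, and per-state action lists are assumptions about how the estimated model might be stored, not NJFun's actual implementation.

```python
def value_iteration(states, actions, P, R, gamma=0.95, theta=1e-6):
    """Value iteration over an estimated MDP (hypothetical data layout).

    P[s][a] is a list of (probability, next_state) pairs, R[s][a] a scalar
    reward, and actions[s] the actions available in state s.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in actions[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:            # stop once the values have converged
            break
    # The learned strategy picks, in each state, the action that maximizes
    # the same one-step lookahead.
    policy = {
        s: max(actions[s],
               key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
        for s in states
    }
    return V, policy
```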

  20. NJFun: Action Space
  • Initiative
    • User: the system asks open questions with an unrestricted grammar for recognition
    • System: the system uses directed prompts with restricted grammars
    • Mixed: the system uses directed prompts with non-restricted grammars
  • Confirmation
    • Explicit: the system asks the user to verify an attribute
    • No confirmation: the system does not generate a confirmation prompt

  21. NJFun: State Space
  • {Greet}: whether the system has greeted the user or not (0,1)
  • {Attr}: which attribute the system is trying to obtain or verify (1=activity, 2=location, 3=time, 4=done)
  • {Conf}: ASR confidence after obtaining a value for an attribute (0,1,2,3,4)
  • {Val}: whether the system has obtained a value for an attribute (0,1)
  • {Times}: number of times the system has asked for the attribute
  • {Gram}: type of grammar most recently used to obtain the attribute
  • {Hist}: “trouble in the past”

  22. Example
  S1: Welcome to NJFun. How may I help you?  s[greet=1] – a[user initiative]
  U1: I’d like to find *um* wine tasting in Lambertville.  s[conf=2, val=1]
  • S2a: Did you say you are interested in wine tasting in Lambertville?  s’[attr=(1,2), times=1] – a[explicit confirmation]
  • S2b: At what time?  s’[attr=3] – a[no confirmation]

  23. NJFun: Optimizing the Strategy
  • NJFun’s initial strategy, “Exploratory for Initiative and Confirmation” (EIC), chooses randomly between the possible actions in each state
  • Data: 54 subjects for training, 21 for testing
  • Binary reward function: 1 if the system queries the database with all specified attributes, 0 otherwise
  • Results: a large and significant improvement for expert users and a non-significant degradation for novices

  24. Discussion
  • How general are the features? What about dialogues in other domains (e.g. information seeking vs. tutorial dialogue)?
  • What about the algorithm? Why can’t we use supervised learning?
  • Do we really save costs?
  • Stochastic user models for training
  • “Bootstrap” an initial system from training data

  25. Additional Slides

  26. Simple Learning Taxonomy
  • Supervised Learning
    • A “teacher” provides the required response to inputs. The desired behaviour is known.
  • Unsupervised Learning
    • The learner looks for patterns in the input. There is no “right” answer.
  • Reinforcement Learning
    • The learner is not told which actions to take, but gets reward/punishment from the environment and learns which action to pick the next time.

  27. RL vs. SL
  • The main problem facing an SL system is
    • to construct a mapping from situations to actions that mimics the correct actions specified by the environment
    • and that generalizes correctly to new situations.
  • An SL system cannot be said to learn to control its environment because
    • it follows, rather than influences, the instructive information it receives.
    • Instead of trying to make its environment behave in a certain way, it tries to make itself behave as instructed by its environment.

  28. RL vs. Unsupervised Learning
  • Unsupervised learning:
    • Make some decision *now* which satisfies the immediate constraints (e.g. clustering: clusters should be no smaller than n)
  • RL:
    • Plan your decisions to achieve some goal in the future; delayed rewards

  29. A More Formal Definition of the RL Framework…
  • Policy: π(s,a) = P{a_t = a | s_t = s}. Given that the situation at time t is s, the policy gives the probability that the agent’s action will be a.
  • Reward function: defines the goal, and immediate good or bad experience.
  • Value function: an estimate of the total future long-term reward. (We want actions that lead to states of high value, not necessarily high immediate reward!)
  • Model of the environment: maps state–action pairs to states, S × A → S. If in state s1 we take action a2, the model predicts s2 (and sometimes the reward r2).
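  Written out in the standard notation of Sutton and Barto, the policy and the state-value function above are (this restatement follows the textbook, not an equation shown on the slide):

```latex
\pi(s, a) \;=\; P\{\, a_t = a \mid s_t = s \,\}
\qquad
V^{\pi}(s) \;=\; E_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s \right]
```

  Here γ ∈ [0, 1) is the discount factor that weighs future rewards against immediate ones, which is what makes high value different from high immediate reward.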

  30. Markov Property
  • A state signal that succeeds in retaining all relevant information is said to be Markov, or to have the Markov property.
  • For example: the current position and velocity of a cannonball are all that matter for its future flight. It doesn’t matter how that position and velocity came about.
  • This is sometimes also referred to as an “independence of path” property, because all that matters is in the current state signal; its meaning is independent of the “path”, or history, of signals that have led up to it.

  31. MDPs vs. POMDPs
  Major difference: how they represent uncertainty.
  • In MDPs the state space is in general represented as vectors describing information slots, each associated with a discrete value.
  • POMDPs explicitly model uncertainty by maintaining a belief state – a distribution over MDP states – in the absence of knowing the state exactly.
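  To make the belief-state idea concrete: after executing action a and receiving observation o, the belief over the hidden states is updated by the standard POMDP rule below, where T and O denote the transition and observation models (textbook notation, not shown on the slide):

```latex
b'(s') \;=\; \eta \; O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s)
```

  Here η is a normalizing constant that makes the updated belief sum to one over all states.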

  32. Some Notable RL Applications
  • TD-Gammon (Tesauro): world’s best backgammon program
  • Elevator control (Crites & Barto): high-performance down-peak elevator controller
  • Dynamic channel assignment (Singh & Bertsekas; Nie & Haykin): high-performance assignment of radio channels to mobile telephone calls
  • In general, applicable to all (?) goal-oriented optimization tasks
