1 / 1

Gaussian Processes for Fast Policy Optimisation of POMDP-based Dialogue Managers

Gaussian Processes for Fast Policy Optimisation of POMDP-based Dialogue Managers. M. Gašić , F. Jurčíček, S. Keizer, F. Mairesse, B. Thomson, K. Yu, S. Young Cambridge University Engineering Department | {mg436, fj228 , sk561, farm2, brmt2, ky219, sjy}@eng.cam.ac.uk.

tangia
Télécharger la présentation

Gaussian Processes for Fast Policy Optimisation of POMDP-based Dialogue Managers

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Gaussian Processes for Fast Policy Optimisation of POMDP-based Dialogue Managers M. Gašić, F. Jurčíček, S. Keizer, F. Mairesse, B. Thomson, K. Yu, S. Young Cambridge University Engineering Department | {mg436, fj228, sk561, farm2, brmt2, ky219, sjy}@eng.cam.ac.uk POMDP-based Dialogue Management Kernel Function Real-world Problem: CamInfo Dialogue • Models the uncertainty about the dialogue state by maintaining a distribution over all possible dialogue states in every turn – belief state • Maximises the overall dialogue success by optimisingQ-function Q(a,b(s))– the highest expected long-term reward for action ataken in belief state b(s) • Problem: To ensure tractability of optimisation, standard methods discretise the belief space into a small number of points, causing the loss of information. • Solution: Model the continuous nature of the belief state and include the prior knowledge about the domain to speed up the learning process. • If the Q-function value was known in one belief state-action pair what is the Q-function value in another belief state for the same action? • Prior knowledge about Q-function correlations is incorporated in the kernel function. • Kernel hyper-parameters can be learned from the data labelled with belief states, actions and rewards. In that way the kernel captures correlations found in the data. • Hidden Information State Dialogue Manager • POMDP-based Dialogue Manager that can tractably maintain belief state for real-world problems • Optimises policy in a reduced summary space • CamInfo Domain • Tourist Information domain for Cambridge, UK Belief state Action Q-function value Would you like to save or delete the message? Results on CamInfo • Comparison: • GP-Sarsa with polynomial kernel • Active learning GP-Sarsa – during exploration selects the action that has the highest uncertainty estimated by the Gaussian Process • Grid-based Monte Carlo Control Algorithm Would you like to save or delete the message? Gaussian Processes in Reinforcement Learning • Gaussian Process (GP) – non-parametric Bayesian model for function approximation • For given prior function correlations and some noisy function observations, it estimates the posterior of any function value • GP-Sarsa is an online Reinforcement learning algorithm that models Q-function as a Gaussian Process Results on VoiceMail Toy problem: VoiceMail Dialogue • Comparison: • GP-Sarsawith various kernel functions • Grid-based Monte Carlo Control algorithm • Exact POMDP solution • In order to estimate the speed of convergence, the policy was evaluated after every training batch: The user asks the system to save or delete the message. The user input is corrupted with noise, so the true dialogue state is unknown. Which action leads to success? Standard approach: Q(a,b(s)) value GP-Sarsa: Distribution of Q(a,b(s)) value Action a Would you like to save or delete the message? a Conclusion action b(s) b’(s) Your message is deleted. • GP-Sarsa can obtain the optimal policy faster than standard methods provided an adequate choice of the kernel function • The measure of uncertainty that GP-Sarsa is estimating can be utilised in an Active Learning framework to additionally speed up the learning process r (immediate) reward Acknowledgements belief state This research was partly funded by the UK EPSRC under grant agreement EP/F013930/1 and by the EU FP7 Programme under grant agreement 216594 (CLASSiC project: www.classic-project.org). A Gaussian Process models every Q-function value Q(a,b(s)) as a Gaussian distributed random variable. The variance of the Gaussian distribution provides a measure of uncertainty about the approximation.

More Related