
Optimal Learning & Bayes-Adaptive MDPs

Presentation Transcript


  1. Optimal Learning & Bayes-Adaptive MDPs. An Overview. Slides on M. Duff’s Thesis, Ch. 1. SDM-RG, Mar-09. Slides prepared by Georgios Chalkiadakis

  2. Optimal Learning: Overview. Behaviour that maximizes expected total reward while interacting with an uncertain world. Behave well while learning; learn while behaving well.

  3. Optimal Learning: Overview • What does it mean to behave optimally under uncertainty? • Optimality is defined with respect to a distribution of environments. • Explore vs. exploit, given prior uncertainty regarding environments. • What is the “value of information”?

  4. Optimal Learning: Overview. Bayesian approach: maintain and evolve uncertainty about the unknown process parameters. These parameters define prior distributions over the world model (transitions/rewards); the distributions themselves constitute the information states.

  5. Optimal Learning: Overview. The sequential problem is described by a “hyperstate” MDP (the “Bayes-Adaptive MDP”): instead of just physical states → physical states + information states.

  6. Simple “stateless” example. Bernoulli process parameters θ1, θ2 describe the actual (but unknown) probabilities of success. Bayesian view: uncertainty about the parameters is described by conjugate prior distributions:
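
The prior distributions themselves appeared as a formula on the original slide, which the transcript did not capture. A plausible reconstruction, assuming the standard Beta prior for a Bernoulli success probability, is:

```latex
% Plausible reconstruction (the slide's formula is not in the transcript):
% a Beta prior over each unknown success probability \theta_i.
p(\theta_i) \;=\; \mathrm{Beta}(\theta_i \mid a_i, b_i)
            \;=\; \frac{\Gamma(a_i + b_i)}{\Gamma(a_i)\,\Gamma(b_i)}\,
                  \theta_i^{\,a_i - 1}\,(1 - \theta_i)^{\,b_i - 1},
            \qquad i = 1, 2.
```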

  7. Conjugate Priors. A prior is conjugate to a likelihood function if the posterior belongs to the same family as the prior: prior in the family, posterior in the family. A simple update of the hyperparameters is then enough to obtain the posterior!
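
As a concrete illustration (not from the original slides), here is a minimal sketch of the Beta-Bernoulli conjugate update: observing a success or a failure only increments one hyperparameter, and the predictive probability of success is the posterior mean.

```python
# Illustrative sketch (not from the original slides): Beta-Bernoulli conjugacy.
# The posterior after a Bernoulli observation is again a Beta distribution,
# so "updating the belief" reduces to incrementing one hyperparameter.

def update_beta(a: float, b: float, success: bool) -> tuple[float, float]:
    """Posterior hyperparameters of Beta(a, b) after one Bernoulli observation."""
    return (a + 1, b) if success else (a, b + 1)

def predictive_success_prob(a: float, b: float) -> float:
    """Posterior-mean probability of success under the current Beta(a, b) belief."""
    return a / (a + b)

# Start from a uniform prior Beta(1, 1) and observe success, failure, success.
a, b = 1.0, 1.0
for outcome in (True, False, True):
    a, b = update_beta(a, b, outcome)
print(a, b, predictive_success_prob(a, b))  # 3.0 2.0 0.6
```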

  8. Information-state transition diagram
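
The diagram itself is not in the transcript. Assuming the Beta-Bernoulli bandit of the previous slides, the structure it depicts would be: sampling process i moves the information state by incrementing exactly one of its counts, with transition probabilities given by the current posterior means.

```latex
% Hedged sketch of the transitions such a diagram depicts (the figure itself
% is not in the transcript): sampling process i changes only that arm's counts.
(a_i, b_i) \;\longrightarrow\;
\begin{cases}
  (a_i + 1,\; b_i) & \text{w.p. } \dfrac{a_i}{a_i + b_i} \quad (\text{success}),\\[8pt]
  (a_i,\; b_i + 1) & \text{w.p. } \dfrac{b_i}{a_i + b_i} \quad (\text{failure}).
\end{cases}
```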

  9. It simply becomes:

  10. Bellman optimality equation (with k steps to go)
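
The equation itself was not captured by the transcript. A hedged reconstruction for the two-armed Bernoulli bandit with hyperstate ((a1,b1),(a2,b2)), unit reward per success, and k pulls remaining:

```latex
% Hedged reconstruction (the slide's equation is missing from the transcript).
% Only the sampled arm's hyperparameters change inside V_{k-1}.
V_k\bigl((a_1,b_1),(a_2,b_2)\bigr)
  = \max_{i \in \{1,2\}}
    \Bigl[ \tfrac{a_i}{a_i + b_i}\,\bigl(1 + V_{k-1}(\ldots,(a_i{+}1,\,b_i),\ldots)\bigr)
         + \tfrac{b_i}{a_i + b_i}\;V_{k-1}(\ldots,(a_i,\,b_i{+}1),\ldots) \Bigr],
\qquad V_0 \equiv 0.
```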

  11. Enter physical states (MDPs). 2 physical states

  12. Enter physical states (MDPs). 2 physical states / 2 actions. Four Bernoulli processes: action 1 at state 1, action 2 at state 1, action 1 at state 2, action 2 at state 2. (a_1^1, b_1^1) are the hyperparameters of the beta distribution capturing the uncertainty about p^1_{11} (a transition probability from state 1 under action 1). The full hyperstate: Note: we now have to be in a specific physical state to sample the related process.
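
The hyperstate expression on the slide was not captured; under the notation above it presumably combines the current physical state with the Beta hyperparameters of all four state-action processes:

```latex
% Hedged sketch of the "full hyperstate" (the slide's expression is missing):
% current physical state i plus the hyperparameters of all four Beta beliefs.
\bigl(\, i;\ (a^1_1, b^1_1),\ (a^2_1, b^2_1),\ (a^1_2, b^1_2),\ (a^2_2, b^2_2) \,\bigr),
\qquad i \in \{1, 2\}.
```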

  13. Enter physical states (MDPs)

  14. Optimality equation
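
The equation is again missing from the transcript. A hedged reconstruction for the 2-state / 2-action case, writing u for the action, r^u_{ij} for the reward of the transition i→j under u, and x for the collection of Beta hyperparameters:

```latex
% Hedged reconstruction (the slide's equation is not in the transcript):
% value of hyperstate (i, x) with k steps to go; only the sampled pair
% (a^u_i, b^u_i) is updated by the observed transition.
V_k(i, x) = \max_{u \in \{1,2\}}
  \Bigl[ \tfrac{a^u_i}{a^u_i + b^u_i}\,
           \bigl( r^u_{i1} + V_{k-1}(1,\ x\ \text{with}\ a^u_i \!+\! 1) \bigr)
       + \tfrac{b^u_i}{a^u_i + b^u_i}\,
           \bigl( r^u_{i2} + V_{k-1}(2,\ x\ \text{with}\ b^u_i \!+\! 1) \bigr) \Bigr].
```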

  15. More than 2 physical states… What priors now? • Dirichlet – conjugate to multinomial sampling. • Sampling is now multinomial: from a state s there are many possible successor states s'. • We will see examples in future readings…
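
As an illustration (not from the original slides), a minimal sketch of the Dirichlet-multinomial update: keep one Dirichlet count vector per (state, action) pair; observing a transition just increments one count, and the predictive transition probabilities are the normalized counts.

```python
import numpy as np

# Illustrative sketch (not from the original slides): Dirichlet-multinomial conjugacy.
# One Dirichlet count vector per (state, action) pair; observing s --a--> s'
# increments the corresponding count.

n_states, n_actions = 3, 2
alpha = np.ones((n_states, n_actions, n_states))  # uniform Dirichlet(1, ..., 1) priors

def update(alpha, s, a, s_next):
    """Posterior Dirichlet counts after observing the transition (s, a) -> s_next."""
    alpha = alpha.copy()
    alpha[s, a, s_next] += 1
    return alpha

def predictive_transition_probs(alpha, s, a):
    """Mean of the Dirichlet posterior: predicted P(s' | s, a)."""
    return alpha[s, a] / alpha[s, a].sum()

alpha = update(alpha, s=0, a=1, s_next=2)
print(predictive_transition_probs(alpha, s=0, a=1))  # [0.25 0.25 0.5]
```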

  16. Certainty equivalence? • Truncate the horizon; compute terminal values using means; …and proceed with a receding-horizon approach: perform DP, take the first “optimal” action, shift the window forward, repeat. • Or, even simpler, consider a horizon of 1: compute DP “optimal” policies using the means of the current belief distributions; perform the action, observe the outcome, repeat. • Or, even more simply, use a myopic c-e approach: use the means of the current priors to compute DP-optimal policies; execute the “optimal” action, observe the transition; update the distributions; repeat.
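
A minimal sketch of the myopic certainty-equivalence loop described above (illustrative only; `solve_mdp` and `env_step` are hypothetical stand-ins, not from the slides):

```python
import numpy as np

# Illustrative sketch of the myopic certainty-equivalence loop described above.
# `solve_mdp` and `env_step` are hypothetical stand-ins, not from the slides:
# solve_mdp(P, R, gamma) would return a greedy policy for the *mean* model, and
# env_step(s, a) would return the next state observed in the real environment.

def myopic_ce_loop(alpha, R, gamma, s, env_step, solve_mdp, n_steps=100):
    """alpha: Dirichlet counts of shape (S, A, S); R: rewards of shape (S, A)."""
    for _ in range(n_steps):
        P_mean = alpha / alpha.sum(axis=2, keepdims=True)  # means of current beliefs
        policy = solve_mdp(P_mean, R, gamma)               # DP on the mean model
        a = policy[s]                                      # execute the "optimal" action
        s_next = env_step(s, a)                            # observe the transition
        alpha[s, a, s_next] += 1                           # update the distribution
        s = s_next                                         # ...and repeat
    return alpha
```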

  17. No, it’s not a good idea!... Actions/state transitions might be starved forever, …even if the initial prior is an accurate model of uncertainty!

  18. Example

  19. Example (cont.)

  20. So, we have to be properly Bayesian • If the prior is an accurate model of uncertainty, “important” actions/states will not be starved. • There exist Bayesian RL algorithms that do more than a decent job! (future readings) • However, if the prior provides a distorted picture of reality, then we can have no convergence guarantees… • …but “optimal learning” is still in place (assuming that other algorithms operate with the same prior knowledge).
