
Learning and Planning for POMDPs


Presentation Transcript


  1. Learning and Planning for POMDPs Eyal Even-Dar, Tel-Aviv University; Sham Kakade, University of Pennsylvania; Yishay Mansour, Tel-Aviv University

  2. Talk Outline • Bounded Rationality and Partially Observable MDPs • Mathematical Model of POMDPs • Learning in POMDPs • Planning in POMDPs • Tracking in POMDPs

  3. Bounded Rationality • Rationality: players with unlimited computational power • Bounded rationality: a computational limitation, e.g., a finite automaton • Challenge: play optimally against a finite automaton • Size of the automaton unknown

  4. Bounded Rationality and RL • Model: • Perform an action • See an observation • Receive either immediate or delayed reward • This is a POMDP • The unknown size is a serious challenge

  5. Classical Reinforcement Learning: Agent – Environment Interaction [Diagram: the agent sends an action to the environment; the environment returns a reward and the next state]

  6. Reinforcement Learning - Goal • Maximize the return • Discounted return: ∑_{t=1}^∞ γ^t r_t, with 0 < γ < 1 • Undiscounted return: (1/T) ∑_{t=1}^T r_t
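A minimal sketch of the two return criteria; the reward sequence and discount factor below are made-up illustrative values, not from the talk.

```python
rewards = [1.0, 0.0, 2.0, 1.0, 3.0]   # illustrative reward sequence r_1 ... r_T
gamma = 0.9                           # illustrative discount factor

# Discounted return: sum over t of gamma^t * r_t (indexing from t = 1, as on the slide).
discounted = sum(gamma**t * r for t, r in enumerate(rewards, start=1))

# Undiscounted (average) return: (1/T) * sum of the rewards.
average = sum(rewards) / len(rewards)

print(f"discounted return: {discounted:.3f}")
print(f"average return: {average:.3f}")
```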

  7. Reinforcement Learning Model: Policy • Policy Π: a mapping from states to distributions over actions • Optimal policy Π*: attains the optimal return from any start state • Theorem: there exists a stationary deterministic optimal policy

  8. Planning and Learning in MDPs • Planning: • Input: a complete model • Output: an optimal policy Π* • Learning: • Interaction with the environment • Achieve near-optimal return • For MDPs both planning and learning can be done efficiently • Polynomial in the number of states • Model represented in tabular form
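As an illustration of efficient MDP planning, here is a small value iteration sketch; the two-state, two-action tabular MDP (P, R) and the discount factor are invented for the example and are not part of the talk.

```python
import numpy as np

P = np.array([              # P[a, s, s'] = transition probability
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R = np.array([              # R[a, s] = expected immediate reward
    [0.0, 1.0],
    [0.5, 2.0],
])
gamma = 0.95

V = np.zeros(2)
for _ in range(10_000):
    Q = R + gamma * (P @ V)     # one-step backup: Q[a, s]
    V_new = Q.max(axis=0)       # greedy over actions
    done = np.max(np.abs(V_new - V)) < 1e-8
    V = V_new
    if done:
        break

policy = Q.argmax(axis=0)       # a stationary deterministic optimal policy
print("V* ≈", V, "  policy:", policy)
```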

  9. Partially Observable Agent – Environment Interaction [Diagram: the agent sends an action to the environment; the environment returns a reward and a signal correlated with the state]

  10. Partially Observable Markov Decision Process • S: the states • A: the actions • Psa(-): next-state distribution • R(s,a): reward distribution • O: the observations • O(s,a): observation distribution [Diagram: three states s1, s2, s3 with transition probabilities 0.3 and 0.7, per-state observation distributions over o1, o2, o3, and E[R(s3,a)] = 10]
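To make the model concrete, here is a minimal container plus a simulation step, as a hedged sketch; the dictionary-based encoding and the convention that the observation is drawn at the state just reached are my assumptions, not the talk's.

```python
import random
from dataclasses import dataclass

@dataclass
class POMDP:
    states: list
    actions: list
    observations: list
    P: dict   # P[(s, a)] -> {s_next: probability}
    R: dict   # R[(s, a)] -> expected immediate reward
    O: dict   # O[(s, a)] -> {observation: probability}

def sample(dist):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

def step(model, s, a):
    """Simulate one step; the agent only ever sees (observation, reward)."""
    s_next = sample(model.P[(s, a)])
    obs = sample(model.O[(s_next, a)])   # assumed convention: observe at the new state
    reward = model.R[(s, a)]
    return s_next, obs, reward
```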

  11. Partial Observability – Problems in Planning • The optimal policy is not stationary; furthermore, it is history dependent • Example: [figure not reproduced in the transcript]

  12. Partial Observability – Complexity • Hardness results [LGM01, L95]

  13. Learning in POMDPs – Difficulties • Suppose an agent knows its state initially; can it keep track of its state? • Easy given a completely accurate model • Inaccurate model: our new tracking result • How can the agent return to the same state? • What is the meaning of very long histories? • Do we really need to keep all the history?!

  14. Planning in POMDPs – Belief State Algorithm • A Bayesian setting • Prior over the initial state • Each action and observation defines a posterior • Belief state: a distribution over states • View the possible belief states as “states” • Infinite number of belief states • Also assumes a “perfect model”
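The posterior update behind the belief state view is the standard Bayes filter; a sketch, assuming the model is given as transition matrices T[a] and observation matrices Z[a] (my notation, and the toy numbers are invented).

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """Posterior belief after taking action a and seeing observation o.

    b: current belief over states, T[a][s, s'] = P(s'|s, a), Z[a][s', o] = P(o|s', a).
    """
    predicted = b @ T[a]                      # predict: sum_s b(s) P(s'|s, a)
    unnormalized = predicted * Z[a][:, o]     # correct: weight by P(o|s', a)
    return unnormalized / unnormalized.sum()  # normalize (assumes o has positive probability)

# Tiny usage example: 2 states, 1 action, 2 observations (made-up numbers).
T = {0: np.array([[0.7, 0.3], [0.4, 0.6]])}
Z = {0: np.array([[0.9, 0.1], [0.2, 0.8]])}
b0 = np.array([0.5, 0.5])
print(belief_update(b0, 0, 1, T, Z))
```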

  15. Learning in POMDPs – Popular Methods • Policy gradient methods: • Find a locally optimal policy in a restricted class of policies (parameterized policies) • Need to assume a reset to the start state! • Cannot guarantee asymptotic results • [Peshkin et al., Baxter & Bartlett, …]

  16. Learning in POMDPs • Trajectory trees [KMN]: • Assume a generative model • A strong RESET procedure • Find a “near best” policy in a restricted class of policies • Finite-horizon policies • Parameterized policies

  17. Trajectory Tree [KMN] [Diagram: a tree rooted at s0 that branches on actions a1, a2 and sampled observations o1–o4 at each level]
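A rough sketch of how one trajectory tree is built and used; `sample_model` stands in for the assumed generative model, and the history-dependent `policy` interface is my simplification of the method.

```python
def build_tree(state, actions, sample_model, depth):
    """Build one trajectory tree: at every node, each action is sampled once."""
    if depth == 0:
        return {}
    node = {}
    for a in actions:
        s_next, obs, reward = sample_model(state, a)   # one generative-model call
        node[a] = (obs, reward, build_tree(s_next, actions, sample_model, depth - 1))
    return node

def evaluate(tree, policy, history=()):
    """Follow the (possibly history-dependent) policy along its unique path in the tree."""
    if not tree:
        return 0.0
    a = policy(history)
    obs, reward, subtree = tree[a]
    return reward + evaluate(subtree, policy, history + (a, obs))
```

Averaging the evaluations over many independently built trees gives an unbiased estimate of each candidate policy's finite-horizon return, which is what allows searching a restricted policy class.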

  18. Our Setting • Return: average reward criterion • One long trajectory • No RESET • Connected environment (unichain POMDP) • Goal: achieve the optimal return (average reward) with probability 1

  19. Homing Strategies – POMDPs • A homing strategy is a strategy that identifies the state • It knows how to return “home” • Enables an “approximate reset” during a long trajectory

  20. Homing Strategies • Learning finite automata [Rivest & Schapire]: • Use a homing sequence to identify the state • The homing sequence is exact • It can lead to many states • Uses the finite automata learning of [Angluin 87] • Diversity-based learning [Rivest & Schapire]: • Similar to our setting • Major difference: deterministic transitions

  21. Homing Strategies – POMDPs Definition: H is an (ε,K)-homing strategy if for every two belief states x1 and x2, after K steps of following H, the expected belief states b1 and b2 are within ε distance.
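One way to read the definition operationally, as a sketch: averaging the posterior over observations recovers the predicted belief, so the expected belief after a fixed action sequence H evolves by the transition matrices alone. The L1 distance and the matrix encoding here are my assumptions.

```python
import numpy as np

def expected_belief(b, H, T):
    """Expected belief after following the action sequence H (K = len(H) steps).

    T[a] is the |S| x |S| transition matrix of action a; averaging the Bayes
    posterior over observations leaves only the prediction step.
    """
    for a in H:
        b = b @ T[a]
    return b

def is_homing(H, T, beliefs, eps):
    """Check the (eps, K)-homing property on a given set of starting beliefs."""
    ends = [expected_belief(b, H, T) for b in beliefs]
    return all(np.abs(x - y).sum() <= eps for x in ends for y in ends)
```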

  22. Homing Strategies – Random Walk • If the POMDP is strongly connected, then the Markov chain induced by the random walk is irreducible • Following the random walk ensures convergence to its steady state

  23. Homing Strategies – Random Walk • What if the Markov chain is periodic? • E.g., a cycle • Use a “stay” action to overcome periodicity problems
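A small numerical illustration of the periodicity fix on the induced Markov chain (the two-state cycle is an invented example): a deterministic cycle never mixes, but mixing the walk with a “stay” step makes it aperiodic.

```python
import numpy as np

cycle = np.array([[0.0, 1.0],
                  [1.0, 0.0]])            # 2-state cycle: period 2, never converges
lazy = 0.5 * np.eye(2) + 0.5 * cycle      # with prob 1/2 stay put, else follow the cycle

b1, b2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
for _ in range(50):
    b1, b2 = b1 @ lazy, b2 @ lazy         # both starting points forget where they began
print(b1, b2)                             # each is ≈ [0.5, 0.5], the stationary distribution
```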

  24. Homing Strategies – Amplifying Claim: If H is an (ε,K)-homing sequence, then repeating H for T times is an (ε^T, KT)-homing sequence
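One way to see the claim (a sketch, assuming the distance is total variation): the expected belief after the K steps of H is a linear image b ↦ b M_H of the starting belief, so the homing guarantee bounds the contraction (Dobrushin) coefficient of M_H, and these coefficients multiply under composition.

```latex
% Assuming total-variation distance; M_H maps a starting belief to the
% expected belief after the K steps of H.
\delta(M_H) \;=\; \max_{x_1,\,x_2}\ \bigl\| x_1 M_H - x_2 M_H \bigr\|_{TV} \;\le\; \epsilon,
\qquad
\delta\bigl(M_H^{\,T}\bigr) \;\le\; \delta(M_H)^{T} \;\le\; \epsilon^{T}.
```

Hence T repetitions, i.e. KT steps in total, give an (ε^T, KT)-homing strategy.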

  25. Reinforcement learning with homing • Usually algorithms should balance between exploration and exploitation • Now they should balance between exploration, exploitation and homing • Homing is performed in both exploration and exploitation

  26. Policy Testing Algorithm Theorem: For any connected POMDP, the policy testing algorithm obtains the optimal average reward with probability 1. After T time steps it competes with policies of horizon log log T.

  27. Policy testing • Enumerate the policies • Gradually increase horizon • Run in phases: • Test policy πk • Average runs, resetting between runs • Run the best policy so far • Ensures good average return • Again, reset between runs.
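A high-level sketch of the phase structure just described; the helpers (enumerate_policies, run_policy, run_homing) and the run-count schedules are hypothetical stand-ins, not the paper's exact procedure.

```python
def policy_testing(enumerate_policies, run_policy, run_homing,
                   num_test_runs, num_exploit_runs):
    """Alternate between testing the next policy and exploiting the best one so far."""
    best_policy, best_return = None, float("-inf")
    for k, (policy, horizon) in enumerate(enumerate_policies()):  # gradually longer horizons
        # Test phase: estimate the candidate's return, approximately resetting between runs.
        total = 0.0
        for _ in range(num_test_runs(k)):
            total += run_policy(policy, horizon)
            run_homing()                                          # approximate reset
        if total / num_test_runs(k) > best_return:
            best_return, best_policy = total / num_test_runs(k), policy
        # Exploit phase: run the best policy found so far to keep the average return high.
        for _ in range(num_exploit_runs(k)):
            run_policy(best_policy, horizon)
            run_homing()                                          # approximate reset
```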

  28. Model Based Algorithm Theorem: For any connected POMDP, the model based algorithm obtains the optimal average reward with probability 1. After T time steps it competes with policies of horizon log T.

  29. Model Based Algorithm • For t = 1 to ∞: • Exploration: • For K1(t) times do: • Run a random walk for t steps and update the empirical model • Use the homing sequence to approximately reset • Compute the optimal policy on the empirical model • Exploitation: • For K2(t) times do: • Run the empirical optimal policy for t steps • Use the homing sequence to approximately reset
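The same loop written out as a sketch; all helper names (random_walk, update_model, solve_empirical_model, run_policy, run_homing) and the schedules K1, K2 are placeholders for components the slides leave abstract.

```python
def model_based(K1, K2, random_walk, update_model, solve_empirical_model,
                run_policy, run_homing, max_horizon):
    model = None
    for t in range(1, max_horizon + 1):
        # Exploration: grow an empirical model from random t-step trajectories.
        for _ in range(K1(t)):
            trajectory = random_walk(t)                 # t random actions, observations, rewards
            model = update_model(model, trajectory)
            run_homing()                                # approximate reset
        # Plan on the empirical model, then exploit the resulting policy.
        policy = solve_empirical_model(model, horizon=t)
        for _ in range(K2(t)):
            run_policy(policy, t)
            run_homing()                                # approximate reset
```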

  30. Model Based Algorithm [Diagram: the empirical model as a tree rooted at the approximate-reset state s0, branching on actions a1, a2 and observations o1, o2]

  31. Model Based Algorithm – Computing the Optimal Policy • Bounding the error in the model: • Significant nodes • Sampling • Approximate reset • Insignificant nodes • Compute an ε-optimal t-horizon policy at each step

  32. Model Based Algorithm – Convergence w.p. 1 Proof • Proof idea: • At any stage, K1(t) is large enough so that we compute an ε_t-optimal t-horizon policy • K2(t) is large enough such that the influence of all previous phases is bounded by ε_t • For a large enough horizon, the homing sequence's influence is also bounded

  33. Model Based Algorithm – Convergence Rate • The model based algorithm produces an ε-optimal policy with probability 1 - δ in time polynomial in 1/ε, |A|, |O|, log(1/δ), and the homing sequence length, and exponential in the horizon time of the optimal policy • Note: the algorithm does not depend on |S|

  34. Planning in POMDPs • Unfortunately, not today … • Basic results: • Tight connections with multiplicity automata • A well-established theory starting in the 60's • Rank of the Hankel matrix • Similar to PSRs • Always at most the number of states • Planning algorithm: • Exponential in the rank of the Hankel matrix
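A toy numerical illustration of the rank bound, for a small hidden-Markov model with no actions (the parameters are invented): entries of the Hankel matrix are probabilities of observation strings, indexed by prefix and suffix, and the rank never exceeds the number of hidden states.

```python
import itertools
import numpy as np

T = np.array([[0.7, 0.3], [0.2, 0.8]])   # hidden-state transitions (2 states)
Z = np.array([[0.9, 0.1], [0.3, 0.7]])   # Z[s, o] = P(observe o in state s)
init = np.array([0.6, 0.4])

def string_prob(obs_seq):
    """P(o_1 ... o_k) under the toy HMM, via the forward recursion."""
    alpha = init.copy()
    for o in obs_seq:
        alpha = (alpha * Z[:, o]) @ T
    return alpha.sum()

# Hankel block over all observation strings of length <= 2.
strings = [seq for k in range(3) for seq in itertools.product([0, 1], repeat=k)]
H = np.array([[string_prob(prefix + suffix) for suffix in strings] for prefix in strings])
print("rank of the Hankel block:", np.linalg.matrix_rank(H), "(number of states = 2)")
```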

  35. Tracking in POMDPs • Belief state algorithm: • Assumes perfect tracking • A perfect model • With an imperfect model, tracking may be impossible • For example: no observations • New results: • “Informative observations” imply efficient tracking • Towards a spectrum of “partially” …
