Learning and Planning for POMDPs Eyal Even-Dar, Tel-Aviv University Sham Kakade, University of Pennsylvania Yishay Mansour, Tel-Aviv University
Talk Outline • Bounded Rationality and Partially Observable MDPs • Mathematical Model of POMDPs • Learning in POMDPs • Planning in POMDPs • Tracking in POMDPs
Bounded Rationality • Rationality: • Players with unlimited computational power • Bounded Rationality: • Computational limitations • Modeled as finite automata • Challenge: play optimally against a finite automaton • Size of the automaton is unknown
Bounded Rationality and RL • Model: • Perform an action • See an observation • Either immediate rewards or delayed reward • This is a POMDP • The unknown size is a serious challenge
Classical Reinforcement Learning – Agent–Environment Interaction • [diagram: the agent performs an action; the environment returns a reward and the next state]
Reinforcement Learning - Goal • Maximize the return. • Discounted return: ∑_{t=1}^∞ γ^t r_t, with 0 < γ < 1 • Undiscounted return: (1/T) ∑_{t=1}^T r_t
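A minimal sketch of the two return criteria on a finite reward sequence (the function names, and truncating the discounted sum to a finite prefix, are illustrative choices, not part of the talk):

```python
def discounted_return(rewards, gamma):
    """Discounted return: sum_t gamma^t * r_t over a finite prefix, 0 < gamma < 1."""
    return sum(gamma ** t * r for t, r in enumerate(rewards, start=1))


def average_return(rewards):
    """Undiscounted return: (1/T) * sum_t r_t."""
    return sum(rewards) / len(rewards)


rewards = [1.0, 0.0, 2.0, 1.0]
print(discounted_return(rewards, gamma=0.9))  # 0.9*1 + 0.81*0 + 0.729*2 + 0.6561*1
print(average_return(rewards))                # 1.0
```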
Reinforcement Learning Model – Policy • Policy Π: a mapping from states to distributions over actions • Optimal policy Π*: attains the optimal return from any start state • Theorem: there exists a stationary deterministic optimal policy
Planning and Learning in MDPs • Planning: • Input: a complete model • Output: an optimal policy Π* • Learning: • Interaction with the environment • Achieve near-optimal return • For MDPs, both planning and learning can be done efficiently • Polynomial in the number of states • Model represented in tabular form
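For the tabular MDP case, a minimal value iteration sketch illustrates why planning is polynomial in the number of states (the two-state MDP and all of its numbers are made-up illustrations, not from the talk):

```python
import numpy as np


def value_iteration(P, R, gamma=0.9, iters=1000):
    """Tabular value iteration: P[a][s, s'] transition probabilities, R[s, a] rewards."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = np.array([[R[s, a] + gamma * P[a][s] @ V
                       for a in range(n_actions)] for s in range(n_states)])
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V  # stationary deterministic policy and its values


# Toy 2-state, 2-action MDP (illustrative numbers only).
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),   # action 0
     np.array([[0.1, 0.9], [0.6, 0.4]])]   # action 1
R = np.array([[1.0, 0.0], [0.0, 2.0]])
policy, V = value_iteration(P, R)
print(policy, V)
```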
Partially Observable Agent–Environment Interaction • [diagram: the agent performs an action; the environment returns a reward and a signal correlated with the state]
Partially Observable Markov Decision Process • S: the states • A: actions • P_sa(·): next-state distribution • R(s,a): reward distribution • O: observations • O(s,a): observation distribution • [diagram: three-state example with transition probabilities 0.3/0.7, per-state observation distributions over O1, O2, O3, and E[R(s3,a)] = 10]
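One way to hold this tuple in code, as a minimal sketch used by the later snippets (the field names and array layout are my own assumptions):

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class POMDP:
    """Tabular POMDP with integer-indexed states, actions, and observations."""
    P: np.ndarray  # P[s, a, s']  next-state distribution P_sa(.)
    R: np.ndarray  # R[s, a]      expected reward
    O: np.ndarray  # O[s, a, o]   observation distribution O(s, a)

    @property
    def n_states(self) -> int:
        return self.P.shape[0]
```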
Partial Observables – Problems in Planning • The optimal policy is not stationary; furthermore, it is history dependent • Example: when two states emit the same observation, the optimal behavior may have to alternate actions based on the history, which no memoryless policy can do
Partial Observables – Complexity • Hardness results [LGM01, L95]
Learning in POMDPs – Difficulties • Suppose an agent knows its state initially; can it keep track of its state? • Easy given a completely accurate model • Inaccurate model: our new tracking result • How can the agent return to the same state? • What is the meaning of very long histories? • Do we really need to keep all the history?!
Planning in POMDPs – Belief State Algorithm • A Bayesian setting • Prior over the initial state • Each action and observation defines a posterior • Belief state: a distribution over states • View the possible belief states as "states" • Infinite number of states • Also assumes a "perfect model"
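A minimal sketch of the Bayesian belief-state update behind this algorithm, assuming a perfect model in the tabular representation sketched above:

```python
import numpy as np


def belief_update(pomdp, belief, action, obs):
    """Posterior over states after taking `action` and observing `obs`:
    b'(s') is proportional to O(s', action, obs) * sum_s b(s) * P(s, action, s')."""
    predicted = belief @ pomdp.P[:, action, :]       # predictive distribution over s'
    posterior = predicted * pomdp.O[:, action, obs]  # reweight by observation likelihood
    return posterior / posterior.sum()               # normalize (assumes obs has nonzero probability)
```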
Learning in POMDPs – Popular methods • Policy gradient methods: • Find a locally optimal policy in a restricted class of policies (parameterized policies) • Need to assume a reset to the start state! • Cannot guarantee asymptotic results • [Peshkin et al., Baxter & Bartlett, …]
Learning in POMDPs • Trajectory trees [KMN]: • Assume a generative model • A strong RESET procedure • Find a "near best" policy in a restricted class of policies • Finite-horizon policies • Parameterized policies
Trajectory tree [KMN] • [diagram: a tree rooted at s0, branching on actions a1, a2 and sampled observations o1–o4 at each level]
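A rough sketch of how such a tree could be grown from a generative model with a reset; the `generative_model(state, action)` callable and the one-sample-per-action branching are assumptions for illustration, not the exact [KMN] construction:

```python
def build_trajectory_tree(generative_model, state, actions, depth):
    """Branch on every action; sample one (next state, observation, reward) per action.
    Returns a nested dict keyed by (action, observation)."""
    if depth == 0:
        return {}
    tree = {}
    for a in actions:
        next_state, obs, reward = generative_model(state, a)  # one sampled outcome
        tree[(a, obs)] = {
            "reward": reward,
            "subtree": build_trajectory_tree(generative_model, next_state, actions, depth - 1),
        }
    return tree
```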
Our setting • Return: average reward criterion • One long trajectory • No RESET • Connected environment (unichain POMDP) • Goal: achieve the optimal return (average reward) with probability 1
Homing strategies - POMDPs • A homing strategy is a strategy that identifies the state • The agent knows how to return "home" • Enables an "approximate reset" during a long trajectory
Homing strategies • Learning finite automata [Rivest & Schapire]: • Use a homing sequence to identify the state • The homing sequence is exact • It can lead to many states • Uses the finite automata learning algorithm of [Angluin 87] • Diversity-based learning [Rivest & Schapire]: • Similar to our setting • Major difference: deterministic transitions
Homing strategies - POMDPs • Definition: H is an (ε,K)-homing strategy if for every two belief states x1 and x2, after K steps of following H, the expected belief states b1 and b2 are within distance ε.
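A simulation sketch of this definition for an open-loop homing sequence: since averaging the posterior over observations collapses to the transition dynamics, the expected belief is obtained by pushing the belief through the transition matrices (the helper names and the L1 distance choice are assumptions):

```python
import numpy as np


def expected_belief_after(pomdp, belief, homing_actions):
    """Expected belief state after following a fixed sequence of homing actions."""
    b = belief.copy()
    for a in homing_actions:
        b = b @ pomdp.P[:, a, :]  # expectation over observations = transition push-forward
    return b


def is_eps_homing(pomdp, homing_actions, b1, b2, eps):
    e1 = expected_belief_after(pomdp, b1, homing_actions)
    e2 = expected_belief_after(pomdp, b2, homing_actions)
    return np.abs(e1 - e2).sum() <= eps  # distance between the expected belief states
```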
Homing strategies – Random Walk • If the POMDP is strongly connected, then the random walk Markov chain is irreducible • Following the random walk ensures convergence to the steady state
Homing strategies – Random Walk • What if the Markov chain is periodic? • For example, a cycle • Use a "stay" action to overcome periodicity problems
Homing strategies – Amplifying • Claim: if H is an (ε,K)-homing strategy, then repeating H T times is an (ε^T, KT)-homing strategy
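A sketch of why the errors multiply, under the (stated) assumption that one pass of H contracts the distance between the two expected belief states by a factor ε; here b_1^{(i)}, b_2^{(i)} denote the expected beliefs after i repetitions of H, i.e. after iK steps:

```latex
d\!\left(b_1^{(i+1)}, b_2^{(i+1)}\right) \le \epsilon\, d\!\left(b_1^{(i)}, b_2^{(i)}\right)
\quad\Longrightarrow\quad
d\!\left(b_1^{(T)}, b_2^{(T)}\right) \le \epsilon^{T}\, d\!\left(b_1^{(0)}, b_2^{(0)}\right) \le \epsilon^{T}
```

taking the initial distance to be at most 1 (e.g. total variation distance between the two belief states).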
Reinforcement learning with homing • Usually, algorithms must balance exploration and exploitation • Here they must balance exploration, exploitation, and homing • Homing is performed during both exploration and exploitation
Policy testing algorithm • Theorem: for any connected POMDP, the policy testing algorithm obtains the optimal average reward with probability 1 • After T time steps it competes with policies of horizon log log T
Policy testing • Enumerate the policies • Gradually increase the horizon • Run in phases: • Test policy πk • Average over several runs, resetting between runs • Run the best policy so far • Ensures a good average return • Again, reset between runs
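A schematic of this phase structure in code, as a rough sketch; the helper callables, the (horizon, policy) enumeration, and the run counts are placeholders, not the paper's exact schedule:

```python
def policy_testing(enumerate_policies, run_policy, approximate_reset,
                   n_test_runs, n_exploit_runs):
    """Alternate between testing the next candidate policy and re-running
    the best policy found so far, with a homing-based reset between runs."""
    best_policy, best_estimate = None, float("-inf")
    for horizon, policy in enumerate_policies():      # gradually increasing horizons
        # Test phase: estimate the candidate's return by averaging several runs.
        total = 0.0
        for _ in range(n_test_runs):
            total += run_policy(policy, horizon)
            approximate_reset()
        if total / n_test_runs > best_estimate:
            best_policy, best_estimate = policy, total / n_test_runs
        # Exploitation phase: keep the overall average return close to the best so far.
        for _ in range(n_exploit_runs):
            run_policy(best_policy, horizon)
            approximate_reset()
```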
Model based algorithm • Theorem: for any connected POMDP, the model based algorithm obtains the optimal average reward with probability 1 • After T time steps it competes with policies of horizon log T
Model based algorithm • For t = 1 to ∞: • Exploration: for K1(t) times do • Run a random policy for t steps and build an empirical model • Use the homing sequence to approximate a reset • Compute the optimal policy on the empirical model • Exploitation: for K2(t) times do • Run the empirical optimal policy for t steps • Use the homing sequence to approximate a reset
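The same schedule written as a rough sketch; K1, K2, and every helper callable are placeholders standing in for the paper's estimators and homing sequence:

```python
import itertools


def model_based_learning(random_walk, update_empirical_model, plan_on_model,
                         run_policy, approximate_reset, K1, K2):
    """Phase t: explore with random t-step runs to build an empirical model,
    plan on that model, then exploit the planned policy; home between runs."""
    for t in itertools.count(1):
        # Exploration: K1(t) random runs feed the empirical model.
        for _ in range(K1(t)):
            trajectory = random_walk(t)          # t random steps
            update_empirical_model(trajectory)
            approximate_reset()                  # homing sequence as approximate reset
        policy = plan_on_model(t)                # optimal t-horizon policy of the empirical model
        # Exploitation: K2(t) runs of the empirical optimal policy.
        for _ in range(K2(t)):
            run_policy(policy, t)
            approximate_reset()
```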
Model based algorithm • [diagram: the empirical model as a tree rooted at the approximate-reset state s̃0, branching on actions a1, a2 and observations o1, o2]
Model based algorithm – Computing the optimal policy • Bounding the error in the model • Significant nodes • Sampling • Approximate reset • Insignificant nodes • Compute an ε-optimal t-horizon policy at each step
Model Based algorithm – Convergence w.p. 1 proof • Proof idea: • At each stage, K1(t) is large enough that we compute an εt-optimal t-horizon policy • K2(t) is large enough that the influence of all earlier phases is bounded by εt • For a large enough horizon, the influence of the homing sequence is also bounded
Model Based algorithm – Convergence rate • The model based algorithm produces an ε-optimal policy with probability 1 − δ in time polynomial in 1/ε, |A|, |O|, log(1/δ), and the homing sequence length, and exponential in the horizon time of the optimal policy • Note: the algorithm does not depend on |S|
Planning in POMDP • Unfortunately, not today … • Basic results: • Tight connections with multiplicity automata • A well-established theory starting in the 60's • Rank of the Hankel matrix • Similar to PSRs • Always less than the number of states • Planning algorithm: • Exponential in the rank of the Hankel matrix
Tracking in POMDPs • The belief state algorithm • Assumes perfect tracking • Perfect model • With an imperfect model, tracking can be impossible • For example: no observables • New results: • "Informative observables" imply efficient tracking • Towards a spectrum of "partially" …