
Partially Observable Markov Decision Process (POMDP)






Presentation Transcript


  1. Douglas Aberdeen, National ICT Australia, 2003. Partially Observable Markov Decision Process (POMDP), by Ye Fang, Department of Computer Science, Rice University

  2. Overview • Recap POMDP and the exact solution • Heuristic methods • Heuristics for exact methods • Grid methods • Factored belief states • Simulation • Methods for continuous state and action spaces • Solution

  3. Learning with a Model • The agent knows the model: the transition function T(s' | s, a), the observation function O(o | s', a), and the reward function R(s, a) • Observation/action history: the sequence of actions taken and observations received so far • Belief state: a probability distribution b(s) over world states [Figure: example belief states over a small grid world with a goal cell, e.g. 1/3, 1/3, 1/3 over three candidate cells, 1/2, 1/2 over two, and 1 when the state is known]

  4. Learning with a Model • Update beliefs: b'(s') = O(o | s', a) Σ_s T(s' | s, a) b(s) / Pr(o | b, a) • Long-term value of a belief state • Define: V(b) = max_a [ Σ_s b(s) R(s, a) + γ Σ_o Pr(o | b, a) V(b') ]
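For concreteness, here is a minimal NumPy sketch of the belief update above; the array layout (T[a, s, s'] for the transition model, O[a, s', o] for the observation model) is an assumption made for this example rather than something given on the slides.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Exact belief-state monitoring for one (action, observation) step.

    b : current belief over world states, shape (S,)
    a : index of the action taken
    o : index of the observation received
    T : transition model, T[a, s, s_next] = Pr(s_next | s, a)   (assumed layout)
    O : observation model, O[a, s_next, o] = Pr(o | s_next, a)  (assumed layout)
    """
    predicted = b @ T[a]                       # predict: sum_s T(s'|s,a) b(s)
    unnormalised = O[a, :, o] * predicted      # correct: weight by Pr(o|s',a)
    return unnormalised / unnormalised.sum()   # normalise by Pr(o|b,a)
```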

  5. Complexity of Exact Methods • Exponential number of states in the state variables: • Updating the belief state is expensive. • Belief-state monitoring is hard. • Exponential number of belief states: • PSPACE-hard for the simplified finite-horizon POMDP. • NP-hard to find a policy.

  6. How to make POMDP feasible? • It is almost impossible to find an exact solution for the POMDP model • Where does the complexity of the exact solution come from? • Infinitely many belief states • Updating belief states and their value functions • Introduce heuristic methods for exact methods

  7. How to make POMDP feasible? • Why can heuristics work? • Simplify the representation of the value function by assuming the system is an MDP. • Replace the belief state b with a real world state

  8. Heuristic for Exact Methods • The intuition behind these heuristics is to treat the system as an MDP by finding an approximate projection from belief states to world states.

  9. Heuristic for Exact Methods • Goal: • Find a good approximation of the projection from belief states to world states. • Find a good policy for each belief state.

  10. Heuristic for Exact Methods • MLS (most likely state) heuristic • Voting heuristic • QMDP heuristic • Heuristics using the uncertainty of the belief state

  11. MLS Heuristic • We assume the system is in the most likely world state (MLS) i at time t. The action executed is the one with the largest Q-value at state i.

  12. MLS Heuristic • This method neglects all possible world states except the MLS at belief state b. • Ex: given the optimal actions in a world with three states and two actions, u(s0) = a0, u(s1) = a0, u(s2) = a1, and b = [0.3, 0.3, 0.4], the MLS is s2, so a1 is chosen even though 60% of the belief mass lies on states whose optimal action is a0.
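A tiny sketch of the MLS rule applied to the example above; the array encoding of the policy and belief is assumed for illustration.

```python
import numpy as np

# The slide's example, encoded as arrays (the encoding itself is an assumption):
# optimal MDP actions u(s0) = a0, u(s1) = a0, u(s2) = a1.
mdp_policy = np.array([0, 0, 1])   # action index chosen in each world state
b = np.array([0.3, 0.3, 0.4])      # current belief

def mls_action(b, mdp_policy):
    """Act as if the single most likely state were the true state."""
    return mdp_policy[np.argmax(b)]

print(mls_action(b, mdp_policy))   # 1, i.e. a1, although 60% of the
                                   # belief mass prefers a0
```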

  13. Voting Heuristic • The voting heuristic assigns a probability distribution over the actions instead of over the states. • Given the optimal MDP action u(s) for each world state, each state s casts a vote of weight b(s) for its optimal action. • The action chosen for belief state b is the one with the largest total vote.

  14. Voting Heuristic • Ex: given the optimal actions in a world with three states and two actions, u(s0) = a0, u(s1) = a0, u(s2) = a1, b = [0.3, 0.3, 0.4], and V(s0, a0) = 5, V(s0, a1) = 4, V(s1, a0) = 5, V(s1, a1) = 4, V(s2, a0) = 0, V(s2, a1) = 10. Voting picks a0 (total vote 0.6 vs 0.4), yet under belief b the expected reward of a0 is 3 while that of a1 is 6.4.
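The same example in code, contrasting the vote with the expected rewards under the belief (the array encodings are assumed for illustration):

```python
import numpy as np

b = np.array([0.3, 0.3, 0.4])
mdp_policy = np.array([0, 0, 1])      # u(s0)=a0, u(s1)=a0, u(s2)=a1
V = np.array([[5.0,  4.0],            # V(s0,a0), V(s0,a1)
              [5.0,  4.0],            # V(s1,a0), V(s1,a1)
              [0.0, 10.0]])           # V(s2,a0), V(s2,a1)

def voting_action(b, mdp_policy, num_actions):
    """Each state votes for its optimal action, weighted by its belief mass."""
    votes = np.zeros(num_actions)
    for s, mass in enumerate(b):
        votes[mdp_policy[s]] += mass
    return int(np.argmax(votes))

print(voting_action(b, mdp_policy, 2))   # 0: a0 wins the vote, 0.6 vs 0.4
print(b @ V)                             # [3.0, 6.4]: a1 has the higher
                                         # expected reward under b
```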

  15. Voting Heuristic • This method does not take the reward of an action into account. • This motivates QMDP, which emphasizes the Q-function of the optimal policy rather than the policy itself.

  16. QMDP Heuristic • QMDP weights the MDP Q-values by the belief, Q(b, a) = Σ_s b(s) Q_MDP(s, a), which amounts to assuming the state uncertainty disappears after the first step. • If an action does little to disambiguate the state, this method cannot improve the action over time: it never chooses actions purely to gather information.
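A minimal sketch of the QMDP rule, assuming the Q-function of the underlying MDP is stored as an (S, A) array:

```python
import numpy as np

def qmdp_action(b, Q_mdp):
    """QMDP heuristic: choose argmax_a sum_s b(s) * Q_MDP(s, a).

    Q_mdp : Q-function of the underlying fully observable MDP, shape (S, A).
    The rule implicitly assumes all uncertainty vanishes after one step,
    so it never selects an action purely to gather information.
    """
    return int(np.argmax(b @ Q_mdp))
```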

  17. Shortcomings of the Heuristics • What if the belief state is close to uniform? • Ex: a robot trying to reach the other end of a featureless desert. From its observations, it holds almost the same belief of being anywhere. • What if there is a lot of uncertainty in the information state? • Take the uncertainty into account when choosing actions

  18. Formal Measurement of Uncertainty • Entropy measures how spiked or spread out the probability mass of a distribution is, capturing the amount of uncertainty in a single number: H(f) = -Σ_x f(x) log f(x), where f(·) is a discrete probability mass function.
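A small sketch of the entropy measure, normalised to [0, 1] so that "entropy near 1" on the later slides is well defined (the normalisation by log |S| is an assumption for this sketch):

```python
import numpy as np

def normalised_entropy(b, eps=1e-12):
    """Entropy of a belief vector, scaled to lie in [0, 1].

    Returns 0 when all mass sits on a single state (no uncertainty)
    and 1 when the belief is uniform (maximal uncertainty).
    """
    h = -np.sum(b * np.log(b + eps))
    return float(h / np.log(len(b)))
```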

  19. Two Objectives • When choosing actions, we want: • To take actions that will yield the highest rewards. • To reduce the entropy of the information state.

  20. Weighted Entropy Control • Intuition: relate the entropy to the rewards to obtain a rough measure of the value of information.

  21. Weighted Entropy Control • When the normalised entropy of the belief is near 1, the belief is nearly uniform and the environment is effectively unobservable. • When it is near 0, the state is almost certain and the model behaves almost like an MDP.

  22. Weighted Entropy Control • Define V_L to be a lower bound on the POMDP value function. • The value at each belief state combines V_L with an entropy-based weight. • The control strategy acts greedily with respect to this combined value.
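The slide's exact formulas are not preserved in this transcript, so the sketch below is only one plausible reading: blend a lower-bound value with the MDP value using the normalised entropy as the weight, then act greedily. The function name and the specific blending rule are assumptions, not the method as presented.

```python
import numpy as np

def weighted_entropy_action(b, Q_mdp, Q_lower):
    """One plausible form of entropy-weighted control (hypothetical).

    Q_mdp   : optimistic Q-values of the underlying MDP, shape (S, A)
    Q_lower : Q-values of a pessimistic lower bound V_L, shape (S, A)
    The normalised entropy of the belief decides how much weight the
    pessimistic bound receives: near-uniform beliefs lean on V_L,
    near-certain beliefs lean on the MDP values.
    """
    w = -np.sum(b * np.log(b + 1e-12)) / np.log(len(b))   # normalised entropy
    blended = (1.0 - w) * (b @ Q_mdp) + w * (b @ Q_lower)
    return int(np.argmax(blended))
```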

  23. Other Heuristics for POMDP • Grid method • Factored belief method • Simulation

  24. Grid Method • Instead of collapsing the belief state to a single world state, grid methods evaluate the value function at a chosen set of grid points in belief space. • How to choose the set of grid points (an interesting region of belief space)? • How to interpolate between them?

  25. How to choose grid points? • Use simulation to find useful points • Add points where the value differs a lot even though the observations are similar.

  26. How to interpolate? • Maintain the convex nature of the value function. • f(g, u) is the value of grid point g under action u. • Examples: nearest neighbours, linear interpolation, etc.
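A small sketch of interpolation over grid points using nearest neighbours, one of the schemes named on the slide (the function signature is an assumption):

```python
import numpy as np

def grid_value(b, grid_points, grid_values, k=3):
    """Approximate V(b) from values stored at a finite set of grid points.

    grid_points : belief vectors used as the grid, shape (G, S)
    grid_values : estimated value at each grid point, shape (G,)
    This is the simple k-nearest-neighbour variant; a linear
    (convex-combination) interpolation would instead express b as a
    convex mixture of grid points and take the same mixture of their
    values, preserving the convexity of the value function.
    """
    dists = np.linalg.norm(grid_points - b, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(grid_values[nearest].mean())
```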

  27. Factored Belief State • Intuition: learn the dependencies between state variables. • Ex: if "raining" is true at time t, then "ground is dry" is unlikely to be true at time t+1.

  28. Factored Belief State • We can use a subset of the state variables to construct a Bayes network (BN). • Belief-state projections can be searched to find a suitable BN for a specific problem (belief monitoring); this amounts to learning the parameters ϕ of the belief network. • Factored linear value function: a weighted linear combination of polynomial basis functions.

  29. Simulation and Belief State • Concentrate learning effort on the states that are most likely to be encountered. • In terms of Q-learning, we can simulate a trajectory through the POMDP and perform value-function updates only on the belief states monitored along the way. • Does not scale to POMDPs with more than a few hundred states.
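A schematic sketch of the idea; every callable here is a placeholder, since the slides do not specify an interface.

```python
def simulate_and_learn(env, policy, update_value, belief_update, b0, horizon=100):
    """Focus learning on the belief states actually visited in simulation.

    env           : simulator with step(action) -> (observation, reward)
    policy        : maps a belief to an action
    update_value  : performs one value/Q update at the visited belief
    belief_update : belief monitoring, e.g. the exact update sketched earlier
    All of these callables are placeholders; the slides name no concrete API.
    """
    b = b0
    for _ in range(horizon):
        a = policy(b)
        o, r = env.step(a)
        update_value(b, a, r)        # learning effort spent only on visited beliefs
        b = belief_update(b, a, o)   # monitor the current belief along the path
```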

  30. Simulation and Belief State • Learn a Q-function that generalizes to all belief states. • An artificial neural network can also be used to approximate the value function over the full belief space.

  31. Continuous State and Action Spaces • Sample belief states. • Use particle filters to update the belief state. • The value function is approximated using the average of the k nearest neighbours.
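A particle-filter sketch of the sampled belief update; the model callables are placeholders, not an API from the slides.

```python
import numpy as np

def particle_filter_update(particles, a, o, sample_next_state, obs_likelihood,
                           rng=None):
    """Approximate belief update with a particle filter.

    particles         : array of sampled world states representing the belief
    sample_next_state : draws s' ~ T(. | s, a) for every particle
    obs_likelihood    : returns Pr(o | s', a) for every propagated particle
    Both callables are placeholders; the slides name no concrete model API.
    """
    if rng is None:
        rng = np.random.default_rng()
    propagated = sample_next_state(particles, a)     # predict
    weights = obs_likelihood(propagated, a, o)       # correct
    weights = weights / weights.sum()
    idx = rng.choice(len(propagated), size=len(propagated), p=weights)
    return propagated[idx]                           # resample
```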

  32. Policy Search vs. Value Search • It is simpler to determine how to act than to estimate the value of acting. • Approximate value-function methods usually produce deterministic policies. • The heuristic methods are approximate projections from belief states to world states. • It is better to introduce randomness into the policy.

  33. Policy Search vs. Value Search • Policy search can be very difficult. • Value search can be better for small POMDPs. • Value search imposes the Bellman equations as constraints.

  34. Policy Search • Policy search can be implemented by policy iteration. • Step 1: Evaluate the current policy. • Step 2: Improve the policy.
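A generic sketch of the two-step loop, with the evaluation and improvement operators left abstract (names and stopping rule are assumptions):

```python
def policy_iteration(evaluate, improve, policy, max_iters=100):
    """Generic policy-iteration loop following the two steps on the slide.

    evaluate : returns a value function for the current policy (Step 1)
    improve  : returns a policy that is greedy w.r.t. that value function (Step 2)
    Both callables and the stopping test are placeholders for illustration.
    """
    values = None
    for _ in range(max_iters):
        values = evaluate(policy)       # Step 1: evaluate current policy
        new_policy = improve(values)    # Step 2: improve policy
        if new_policy == policy:        # stop once the policy is stable
            break
        policy = new_policy
    return policy, values
```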

  35. Recap • Different heuristics • Projecting a belief state to a world state • Evaluating the values of belief states • Finding a good policy
