Reinforcement Learning: Partially Observable Markov Decision Processes (POMDP)



Presentation Transcript


  1. Reinforcement Learning: Partially Observable Markov Decision Processes (POMDP). Presenter: 虞台文, Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University

  2. Content • Introduction • Value Iteration for MDP • Belief States & Infinite-State MDP • Value Function of POMDP • The PWLC Property of the Value Function

  3. Reinforcement Learning: Partially Observable Markov Decision Processes (POMDP). Introduction. Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University

  4. Definition: MDP • A Markov decision process is a tuple ⟨S, A, T, R⟩ • S: a finite set of states of the world • A: a finite set of actions • T: S × A → Π(S), the state-transition function • R: S × A → ℝ, the reward function
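A minimal sketch of how this tuple can be held in code, assuming NumPy arrays; the class and field names (MDP, T, R, gamma) are illustrative and not taken from the slides:

    import numpy as np

    class MDP:
        """Container for the tuple <S, A, T, R> (illustrative sketch)."""
        def __init__(self, T, R, gamma=0.95):
            # T[s, a, s2] = Pr(s2 | s, a); each T[s, a, :] sums to 1
            # R[s, a]     = immediate reward for taking action a in state s
            assert np.allclose(T.sum(axis=2), 1.0), "each T[s, a, :] must be a distribution"
            self.T, self.R, self.gamma = T, R, gamma
            self.n_states, self.n_actions = R.shape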

  5. Complete Observability • Solution procedures for MDPs give values or policies for each state. • Using these solutions requires that the agent can detect the state it is currently in with complete reliability. • Such a process is therefore called a CO-MDP (completely observable MDP).

  6. Partial Observability • Instead of directly measuring the current state, the agent makes an observation to get a hint about which state it is in. • How does it get that hint (i.e., guess the state)? • It takes an action and then makes an observation. • The observation is probabilistic, i.e., it provides only a hint. • The "state" will therefore be defined in a probabilistic sense.

  7. Observation Model • Ω: a finite set of observations the agent can experience of its world. • O(s', a, o) = Pr(o | a, s'): the probability of getting observation o given that the agent took action a and landed in state s'.

  8. Definition: POMDP • A POMDP is a tuple ⟨S, A, T, R, Ω, O⟩, where ⟨S, A, T, R⟩ describes an MDP, Ω is the finite set of observations, and O: S × A → Π(Ω) is the observation function. • How do we find an optimal policy in such an environment?

  9. Reinforcement Learning: Partially Observable Markov Decision Processes (POMDP). Value Iteration for MDP. Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University

  10. Acting Optimally • Finite-Horizon Model: maximize the expected total reward over the next k steps. • Infinite-Horizon Discounted Model: maximize the expected discounted total reward. • Is there any difference in the nature of their optimal policies?

  11. Stationary vs. Non-Stationary Policies • Finite-Horizon Model: the optimal policy depends on the number of time steps remaining, so a non-stationary policy is used. • Infinite-Horizon Discounted Model: the optimal policy is independent of the number of time steps remaining, so a stationary policy is used.

  12. Stationary vs. Non-Stationary Policies • Finite-Horizon Model: the optimal policy depends on the number of time steps remaining (the policy is indexed by the remaining time steps), so a non-stationary policy is used. • Infinite-Horizon Discounted Model: the optimal policy is independent of the number of time steps remaining, so a stationary policy is used.

  13. Value Functions • Finite-Horizon Model: the value function of a non-stationary policy. • Infinite-Horizon Discounted Model: the value function of a stationary policy. (The standard forms are written out below.)
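For reference, the two value functions in conventional notation (a sketch using the usual conventions; r_k denotes the reward received k steps in the future and γ the discount factor):

    % Finite-horizon model: value of a non-stationary policy pi = (pi_t, ..., pi_1)
    V_t^{\pi}(s) = \mathbb{E}\left[ \sum_{k=0}^{t-1} r_k \,\middle|\, s_0 = s,\ \pi \right]

    % Infinite-horizon discounted model: value of a stationary policy pi
    V^{\pi}(s) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^k r_k \,\middle|\, s_0 = s,\ \pi \right]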

  14-15. Optimal Policies • Finite-Horizon Model: the optimal policy is non-stationary. • Infinite-Horizon Discounted Model: the optimal policy is stationary.
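One conventional way to write the two optimal policies (a sketch; the finite-horizon form is shown undiscounted, matching the finite-horizon model above):

    % Finite-horizon: greedy one-step lookahead on the remaining-steps value function
    \pi_t^{*}(s) = \arg\max_{a} \Big[ R(s,a) + \sum_{s'} T(s,a,s')\, V_{t-1}^{*}(s') \Big]

    % Infinite-horizon discounted: greedy lookahead on the fixed point V^{*}
    \pi^{*}(s) = \arg\max_{a} \Big[ R(s,a) + \gamma \sum_{s'} T(s,a,s')\, V^{*}(s') \Big]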

  16. Optimal Policies • Finite-Horizon Model: non-stationary policy. • To find an optimal policy, do we need to spend infinite time? • What happens as t → ∞? • What if V_t(s) ≈ V_{t-1}(s) for all s? • What about π_t if V_t(s) ≈ V_{t-1}(s) for all s?

  17. Value Iteration • The MDP has a finite number of states, so the value function can be stored in a table. • Repeat the Bellman backup V_t(s) = max_a [ R(s, a) + γ Σ_s' T(s, a, s') V_{t-1}(s') ] until the values stop changing.
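A compact NumPy sketch of value iteration for such a finite-state MDP (function and parameter names are illustrative; the stopping threshold eps is an assumption, not from the slides):

    import numpy as np

    def value_iteration(T, R, gamma=0.95, eps=1e-6):
        # T[s, a, s2] = Pr(s2 | s, a), R[s, a] = immediate reward
        # Repeats V_t(s) = max_a [ R(s,a) + gamma * sum_s2 T(s,a,s2) V_{t-1}(s2) ]
        # until the value table stops changing (sup-norm below eps).
        n_states, n_actions = R.shape
        V = np.zeros(n_states)
        while True:
            Q = R + gamma * np.einsum("sap,p->sa", T, V)   # backup for every (s, a)
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < eps:
                return V_new, Q.argmax(axis=1)             # values and greedy policy
            V = V_new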

  18. Reinforcement Learning: Partially Observable Markov Decision Processes (POMDP). Belief States & Infinite-State MDP. Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University

  19. The POMDP Framework • The agent acts on the world (an MDP) and receives an observation in return. • A state estimator (SE) inside the agent converts each action-observation pair into a belief state b, which is what the agent uses to choose its next action.

  20. Belief States • A belief state b is a probability distribution over S: b(s) ≥ 0 and Σ_s b(s) = 1. • There are uncountably infinitely many belief states.

  21. State Space • For a 2-state POMDP the belief space is the interval [0, 1]; for a 3-state POMDP it is a triangle (the probability simplex). • There are uncountably infinitely many belief states.

  22. State Estimation • There are uncountably infinitely many belief states. • State estimation: given b_t, a_t, and o_{t+1}, compute b_{t+1} = ?

  23. State Estimation • Expanding b_{t+1}(s') = Pr(s' | b_t, a_t, o_{t+1}) by Bayes' rule gives a numerator that combines the observation and transition models with b_t, divided by a normalization factor Pr(o_{t+1} | a_t, b_t).
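Written out in the usual notation, the state estimator and its normalization factor are (a sketch in standard form; the slide's own symbols may differ slightly):

    b_{t+1}(s') = \Pr(s' \mid b_t, a_t, o_{t+1})
                = \frac{\Pr(o_{t+1} \mid a_t, s') \sum_{s} T(s, a_t, s')\, b_t(s)}
                       {\Pr(o_{t+1} \mid a_t, b_t)},
    \qquad
    \Pr(o \mid a, b) = \sum_{s'} \Pr(o \mid a, s') \sum_{s} T(s, a, s')\, b(s)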

  24. State Estimation • Remember these quantities, in particular the normalization factor Pr(o_{t+1} | a_t, b_t); they reappear when the belief-MDP transition and value functions are built.

  25. State Estimation • The update is linear w.r.t. b_t, apart from the division by the normalization factor Pr(o_{t+1} | a_t, b_t).
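The same state estimator as a NumPy sketch (names are illustrative; O[s2, a, o] stands for Pr(o | a, s') from the observation model above):

    import numpy as np

    def belief_update(b, a, o, T, O):
        # b[s]         = current belief b_t
        # T[s, a, s2]  = Pr(s2 | s, a)
        # O[s2, a, o]  = Pr(o | a, s2)
        unnormalized = O[:, a, o] * (b @ T[:, a, :])  # linear in b_t
        pr_o = unnormalized.sum()                     # normalization factor Pr(o | a, b)
        return unnormalized / pr_o, pr_o              # next belief b_{t+1} and Pr(o | a, b)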

  26. State Transition Function • Taking action a moves the agent from belief state b to a new belief state b'; for each observation, the mapping is linear w.r.t. b_t.

  27. State Transition Function • Define the belief-transition function τ(b, a, b') = Pr(b' | b, a); it is built from pieces that are linear w.r.t. b_t. • Suppose that Pr(b' | b, a, o) is 1 when the state estimator maps (b, a, o) to b', and 0 otherwise.
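In standard form, the belief-space transition function is built from the state estimator SE and the normalization factor (a sketch of the usual construction):

    \tau(b, a, b') = \Pr(b' \mid b, a)
                   = \sum_{o \in \Omega} \Pr(b' \mid b, a, o)\, \Pr(o \mid a, b),
    \qquad
    \Pr(b' \mid b, a, o) =
    \begin{cases}
      1 & \text{if } SE(b, a, o) = b' \\
      0 & \text{otherwise}
    \end{cases}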

  28. POMDP = Infinite-State MDP • A POMDP is an MDP with tuple ⟨B, A, τ, ρ⟩ • B: the set of belief states • A: the finite set of actions (the same as in the original MDP) • τ: B × A → Π(B), the state-transition function • ρ: B × A → ℝ, the reward function • What is the reward function?

  29. Reward Function • ρ(b, a) = Σ_s b(s) R(s, a): the reward function of the original MDP, averaged under the belief. • Good news: it is linear in b.

  30. Reinforcement Learning: Partially Observable Markov Decision Processes (POMDP). Value Function of POMDP. Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University

  31. Value Function over Belief Space • Consider a 2-state POMDP: the belief space is the interval b ∈ [0, 1] and the value function is a curve V(b) over it. • How do we obtain the value function in belief space? Can we use the table-based method?

  32. Finding the Optimal Policy • POMDP = infinite-state MDP. • The general MDP method: determine the value function, then perform policy improvement. • Two value functions are used: the state value function V(b) and the action value function Q(b, a).

  33. Review: Value Iteration • Value iteration is based on the finite-horizon value function. • What does it find on each iteration? The horizon-t value function V_t.

  34. The V_t(b) and Q_t(b, a): Immediate Reward • At horizon 1 there is no future, so the action value is just the immediate reward: Q_1(b, a) = ρ(b, a).

  35. The V_t(b) and Q_t(b, a) • Consider a 2-state POMDP with two actions (a1, a2) and three observations (o1, o2, o3). • [Figure: the horizon-1 values of a1 and a2 plotted as two line segments over the belief b ∈ [0, 1].]

  36. Horizon-1 Policy Trees • Consider a 2-state POMDP with two actions (a1, a2) and three observations (o1, o2, o3). • A horizon-1 policy tree P1 is just a single action; each choice of action gives one line segment over b ∈ [0, 1]. • [Figure: the segments for a1 and a2 over the belief interval.]

  37. Horizon-1 Policy Trees • For the same 2-state POMDP with actions (a1, a2) and observations (o1, o2, o3), the horizon-1 value function is the upper surface of the segments for a1 and a2. • It is piecewise linear and convex (PWLC). • [Figure: V_1(b) as the maximum of the two segments over b ∈ [0, 1].]
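In symbols, the horizon-1 value function takes the standard form

    V_1(b) = \max_{a \in A} \rho(b, a) = \max_{a \in A} \sum_{s \in S} b(s)\, R(s, a)

For a 2-state POMDP each action contributes one line segment over b ∈ [0, 1], and taking the maximum of the segments gives exactly the piecewise-linear, convex upper surface described on the slide.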

  38. The V_t(b) and Q_t(b, a) • How about a 3-state POMDP and more? • The value function is still PWLC, now made of hyperplane segments over the belief simplex. • What is the policy? • [Figure: the belief simplex with corners at the pure beliefs for s1 and s2.]

  39. The V_t(b) and Q_t(b, a) • How about a 3-state POMDP and more? • What is the policy? The segment that attains the maximum at a given belief determines the action to take there.

  40-41. The PWLC • A piecewise-linear function consists of linear (hyperplane) segments. • Linear function over beliefs: f(b) = Σ_s α(s) b(s) = α · b. • kth linear segment: α_k · b. • The α-vector: α_k = (α_k(s_1), …, α_k(s_N)). • Each segment can therefore be represented by its α-vector, and a PWLC function by the set of α-vectors.
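A small NumPy sketch of how a PWLC value function is evaluated from its α-vectors (names are illustrative):

    import numpy as np

    def pwlc_value(b, alphas):
        # alphas: (K, n_states) array, one alpha-vector per linear segment
        # V(b) = max_k alpha_k . b; the argmax identifies the segment
        # (and hence the action choice) that is optimal at belief b.
        values = alphas @ b
        k = int(np.argmax(values))
        return values[k], k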

  42. The V_t(b) and Q_t(b, a) • The action value combines three pieces: the immediate reward; the probability of observation o when action a is taken in the current belief b; and the value obtained after observation o when action a is taken in the current belief b.

  43. The V_t(b) and Q_t(b, a): PWLC • Same decomposition: immediate reward, the probability of observation o for action a in the current belief b, and the value of observation o for action a in the current belief b. • Is it PWLC? Yes, it is, but the proof is deferred.
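A sketch of the recursion these callouts describe, in the usual discounted notation (here b^{a,o} = SE(b, a, o) is the updated belief):

    Q_t(b, a) = \underbrace{\rho(b, a)}_{\text{immediate reward}}
              + \gamma \sum_{o \in \Omega}
                \underbrace{\Pr(o \mid a, b)}_{\text{prob. of observation } o}\,
                \underbrace{V_{t-1}\big(b^{a,o}\big)}_{\text{value after observing } o},
    \qquad
    V_t(b) = \max_{a \in A} Q_t(b, a)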

  44. The V_t(b) and Q_t(b, a) • V_t(b) = max_a Q_t(b, a): the state value is the best action value at that belief.

  45. The V_t(b) and Q_t(b, a) • Compute b': from the current belief b, taking an action (e.g., a1) and receiving each possible observation (o1, o2, o3) leads to a different successor belief b'. • [Figure: the current belief b and the successor beliefs b' on the belief interval, with the value segments for a1 and a2.]

  46. The V_t(b) and Q_t(b, a) • What action will you take if the observation is o_i after a1 is taken? • Each observation o_i leads to its own successor belief b', and the best action there is the one whose segment is highest at that b'. • [Figure: same construction as the previous slide.]

  47. The V_t(b) and Q_t(b, a) • Consider an individual observation o after action a is taken. • Define the corresponding per-observation (transformed) value function; a standard form is sketched below.
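One common way to define this per-observation quantity (an assumption about the slide's exact definition; some treatments also fold a 1/|Ω| share of the immediate reward into it):

    V_t^{a,o}(b) = \Pr(o \mid a, b)\, V_{t-1}\big(SE(b, a, o)\big)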

  48. Transformed Value Function • The value function is transformed for each (action, observation) pair: each segment is carried through the belief update, so the result is still piecewise linear. • [Figure: the original value function over b and the transformed value function for a fixed action and observation.]

  49. The V_t(b) and Q_t(b, a) • [Figure: transformed value functions for action a1 over the belief interval.]

  50. The V_t(b) and Q_t(b, a) • [Figure: the transformed value functions for action a1 and each observation o1, o2, o3 over the belief interval.]
