## Reinforcement Learning: Partially Observable Markov Decision Processes (POMDP)

Presenter: Tai-Wen Yu (虞台文), Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University

**Content**

- Introduction
- Value iteration for MDPs
- Belief states and the infinite-state MDP
- The value function of a POMDP
- The PWLC property of the value function

## Introduction

**Definition: MDP**

A Markov decision process is a tuple ⟨S, A, T, R⟩, where

- S is a finite set of states of the world;
- A is a finite set of actions;
- T: S × A → Π(S) is the state-transition function, with T(s, a, s′) the probability of reaching state s′ after taking action a in state s;
- R: S × A → ℝ is the reward function.

**Complete Observability**

- Solution procedures for MDPs give values or policies for each state.
- Using these solutions requires that the agent can detect its current state with complete reliability.
- Such a process is therefore called a CO-MDP (completely observable MDP).

**Partial Observability**

- Instead of directly measuring the current state, the agent makes an observation to get a hint about which state it is in.
- How does it get this hint (guess the state)? By taking an action and then making an observation.
- The observation can be probabilistic; it provides a hint only.
- The "state" will therefore be defined in a probabilistic sense.

**Observation Model**

- Ω is a finite set of observations the agent can experience of its world.
- O(s′, a, o) is the probability of getting observation o given that the agent took action a and landed in state s′.

**Definition: POMDP**

A POMDP is a tuple ⟨S, A, T, R, Ω, O⟩, where ⟨S, A, T, R⟩ describes an MDP and O: S × A → Π(Ω) is the observation function. How do we find an optimal policy in such an environment?

## Value Iteration for MDPs

**Acting Optimally**

Is there any difference in the nature of the optimal policies for these two models?

- Finite-horizon model: maximize the expected total reward over the next k steps.
- Infinite-horizon discounted model: maximize the expected discounted total reward.

**Stationary vs. Non-Stationary Policies**

- Finite-horizon model: the optimal policy depends on the number of time steps remaining, so a non-stationary policy is used.
- Infinite-horizon discounted model: the optimal policy is independent of the number of time steps remaining, so a stationary policy is used.
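For a CO-MDP, the finite-horizon optimal values can be computed by repeatedly backing up the Bellman equation. A minimal NumPy sketch (the function name and array layout are my own, not from the slides):

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6, max_iters=1000):
    """Value iteration for a CO-MDP.

    T: transition probabilities, shape (|S|, |A|, |S|), T[s, a, s2] = T(s, a, s2).
    R: rewards, shape (|S|, |A|), R[s, a] = R(s, a).
    Iterates V_t(s) = max_a [ R(s, a) + gamma * sum_s2 T(s, a, s2) V_{t-1}(s2) ]
    starting from V_0 = 0, stopping when the values stop changing (within tol).
    """
    n_states, _, _ = T.shape
    V = np.zeros(n_states)                               # V_0(s) = 0
    for _ in range(max_iters):
        Q = R + gamma * np.einsum("saj,j->sa", T, V)     # Q_t(s, a)
        V_new = Q.max(axis=1)                            # V_t(s) = max_a Q_t(s, a)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    # Greedy (stationary) policy with respect to the converged values.
    policy = (R + gamma * np.einsum("saj,j->sa", T, V)).argmax(axis=1)
    return V, policy
```

This also answers the stopping question raised later in the slides: once max_s |V_t(s) − V_{t−1}(s)| falls below a tolerance, further iterations change nothing of practical consequence.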
**Value Functions**

- Finite-horizon model (non-stationary policy): V_t^π(s) = E[ Σ_{i=0}^{t−1} r_i | π, s_0 = s ], where t is the number of time steps remaining.
- Infinite-horizon discounted model (stationary policy): V^π(s) = E[ Σ_{t=0}^{∞} γ^t r_t | π, s_0 = s ].

**Optimal Policies**

- Finite-horizon model (non-stationary policy): π_t*(s) = argmax_a [ R(s, a) + γ Σ_{s′} T(s, a, s′) V_{t−1}*(s′) ].
- Infinite-horizon discounted model (stationary policy): π*(s) = argmax_a [ R(s, a) + γ Σ_{s′} T(s, a, s′) V*(s′) ].
- To find an optimal policy, do we need to pay infinite time? What happens as t → ∞? Does V_t(s) − V_{t−1}(s) → 0 for every s? Can we stop at the first t for which V_t(s) ≈ V_{t−1}(s) for every s?

**Value Iteration**

The MDP has a finite number of states, so the optimal value functions can be computed by iterating V_t(s) = max_a [ R(s, a) + γ Σ_{s′} T(s, a, s′) V_{t−1}(s′) ] from V_0(s) = 0 until the values change by less than a tolerance.

## Belief States and the Infinite-State MDP

**POMDP Framework**

(Figure: the agent interacts with the world, an MDP, through actions and observations.) Inside the agent, a state estimator (SE) turns each action-observation pair into a belief state b, and the policy maps belief states to actions.

**Belief States**

A belief state b is a probability distribution over S: b(s) ≥ 0 for every s and Σ_s b(s) = 1. There are uncountably infinitely many belief states.

**State Space**

(Figure: for a 2-state POMDP the belief space is the interval [0, 1]; for a 3-state POMDP it is a triangle, the probability simplex.) In either case there are uncountably infinitely many belief states.

**State Estimation**

How is the next belief state b_{t+1} obtained?
Given b_t, a_t, and o_{t+1}, the new belief assigns to each state s′ the probability

b_{t+1}(s′) = Pr(s′ | b_t, a_t, o_{t+1}) = O(s′, a_t, o_{t+1}) Σ_s T(s, a_t, s′) b_t(s) / Pr(o_{t+1} | b_t, a_t).

**Normalization Factor**

The denominator is the normalization factor

Pr(o | b, a) = Σ_{s′} O(s′, a, o) Σ_s T(s, a, s′) b(s).

Remember these quantities; they recur below. Note that the numerator, and hence the normalization factor, is linear with respect to b_t. Write b′ = SE(b, a, o) for this state-estimation update.

**State Transition Function**

In belief space the state-transition function is

τ(b, a, b′) = Pr(b′ | b, a) = Σ_o Pr(b′ | b, a, o) Pr(o | b, a),

where Pr(b′ | b, a, o) = 1 if SE(b, a, o) = b′ and 0 otherwise.

**POMDP = Infinite-State MDP**

What is the reward function? A POMDP is an MDP with tuple ⟨B, A, τ, ρ⟩, where

- B is the set of belief states;
- A is the finite set of actions (the same as in the original MDP);
- τ: B × A → Π(B) is the state-transition function;
- ρ: B × A → ℝ is the reward function.

**Reward Function**

ρ(b, a) = Σ_s b(s) R(s, a), where R is the reward function of the original MDP. Good news: it is linear in b.

## Value Function of a POMDP

**Value Function over Belief Space**

Consider a 2-state POMDP, whose belief space is the interval b ∈ [0, 1]. How do we obtain the value function V(b) over belief space? Can we use the table-based method? No: the belief space is continuous, so V cannot be tabulated state by state.

**Finding Optimal Policy**

- A POMDP is an infinite-state MDP.
- The general method for MDPs: determine the value function, then perform policy improvement.
- Value functions: the state value function V(b) and the action value function Q(b, a).

**Review: Value Iteration**

Value iteration finds V_t on each iteration, based on the finite-horizon value function:

V_t(b) = max_a [ ρ(b, a) + γ Σ_o Pr(o | b, a) V_{t−1}(SE(b, a, o)) ].

**Immediate Reward**

Consider a 2-state POMDP with two actions (a1, a2) and three observations (o1, o2, o3). The immediate reward of each action is linear in b, ρ(b, a) = Σ_s b(s) R(s, a), giving one line over b ∈ [0, 1] per action.

**Horizon-1 Policy Trees**

A horizon-1 policy tree P1 is a single action. The horizon-1 value function V_1(b) = max_a ρ(b, a) is the upper surface of the actions' reward lines: it is piecewise linear and convex (PWLC).

**The V and Q Functions**

How about 3-state POMDPs and more? (Figure: for a 3-state POMDP with states s1, s2, s3, the belief space is a triangular simplex and V_1 is the upper surface of planes over it.) The value function is still PWLC.
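The state estimator SE(b, a, o) described above is a direct Bayes update. A minimal NumPy sketch (function name and array layout are my own):

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """State estimator SE(b, a, o): Bayes update of the belief state.

    b: current belief over states, shape (|S|,).
    T[s, a, s2]: transition probabilities, shape (|S|, |A|, |S|).
    O[s2, a, o]: observation probabilities, shape (|S|, |A|, |Omega|).
    Returns b' with b'(s2) proportional to O(s2, a, o) * sum_s T(s, a, s2) b(s).
    """
    predicted = b @ T[:, a, :]          # sum_s b(s) T(s, a, s2): linear in b
    unnorm = O[:, a, o] * predicted     # numerator of the belief update
    norm = unnorm.sum()                 # Pr(o | b, a), the normalization factor
    if norm == 0.0:
        raise ValueError("observation o has zero probability under (b, a)")
    return unnorm / norm
```

The returned vector always sums to one, so it is again a valid belief state; only the normalization by Pr(o | b, a) makes the update nonlinear in b.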
What is the policy? Each linear segment of the value function is associated with an action; the policy takes that action over the region of belief space where the segment dominates.

**The PWLC Property**

- A piecewise-linear function consists of linear, or hyperplane, segments.
- Linear function: V(b) = Σ_s α(s) b(s) = α · b.
- The k-th linear segment has its own α-vector α_k, so each segment can be represented as α_k · b.
- A PWLC value function is therefore V(b) = max_k α_k · b, the upper surface of a finite set of hyperplanes over the belief simplex.

**The V and Q Functions**

Q_t(b, a) = ρ(b, a) + γ Σ_o Pr(o | b, a) V_{t−1}(b′_{a,o}), where b′_{a,o} = SE(b, a, o).

Here ρ(b, a) is the immediate reward, Pr(o | b, a) is the probability of observation o after doing action a in the current belief b, and V_{t−1}(b′_{a,o}) is the value of observation o for doing action a in b. Is Q_t PWLC? Yes, it is. But the proof is deferred.

**Computing b′**

(Figure: in the 2-state POMDP, taking a1 in belief b and receiving o1, o2, or o3 maps b to a new belief b′.) What action will you take if the observation is o_i after a1 is taken?

Consider an individual observation o after action a is taken, and define the transformed value function

V^{a,o}_{t−1}(b) = Pr(o | b, a) V_{t−1}(SE(b, a, o)).

(Figures: the transformed value functions for action a1 and observations o1, o2, o3 over b ∈ [0, 1]; each is again piecewise linear and convex.)
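Because a PWLC value function is fully described by its α-vectors, evaluating V(b) and reading off the policy reduces to a max over dot products. A minimal sketch (the per-segment action array is my own illustrative addition):

```python
import numpy as np

def pwlc_value(b, alphas, actions):
    """Evaluate a PWLC value function V(b) = max_k alpha_k . b.

    b: belief state, shape (|S|,).
    alphas: alpha-vectors, shape (K, |S|), one row per linear segment.
    actions: length-K sequence giving the action associated with each segment.
    Returns (value, action): the value at b and the dominating segment's action,
    which is the policy's choice at b.
    """
    values = alphas @ b            # alpha_k . b for every segment k
    k = int(np.argmax(values))     # index of the dominating hyperplane at b
    return float(values[k]), actions[k]
```

This is why the PWLC property matters: an uncountable belief space collapses to a finite set of vectors, and the greedy policy is piecewise constant over the regions where each α-vector dominates.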