  1. ECE-517: Reinforcement Learning in Artificial Intelligence, Lecture 15: Partially Observable Markov Decision Processes (POMDPs). October 27, 2011. Dr. Itamar Arel, College of Engineering, Electrical Engineering and Computer Science Department, The University of Tennessee. Fall 2011

  2. Outline • Why use POMDPs? • Formal definition • Belief state • Value function

  3. Partially Observable Markov Decision Problems (POMDPs) • To introduce POMDPs let us consider an example where an agent learns to drive a car in New York City • The agent can look forward, backward, left or right • It can’t change speed but it can steer into the lane it is looking at • The different types of observations are • the direction in which the agent's gaze is directed • the closest object in the agent's gaze • whether the object is looming or receding • the color of the object • whether a horn is sounding • To drive safely the agent must steer out of its lane to avoid slow cars ahead and fast cars behind

  4. POMDP Example • The agent is in control of the middle car • The car behind is fast and will not slow down • The car ahead is slower • To avoid a crash, the agent must steer right • However, when the agent is gazing to the right, there is no immediate observation that tells it about the impending crash

  5. POMDP Example (cont.) • This is not easy when the agent has no explicit goals beyond “performing well” • There are no explicit training patterns such as “if there is a car ahead and left, steer right” • However, a scalar reward is provided to the agent as a performance indicator (just like in MDPs) • The agent is penalized for colliding with other cars or the road shoulder • The only goal hard-wired into the agent is that it must maximize a long-term measure of the reward

  6. POMDP Example (cont.) • Two significant problems make it difficult to learn under these conditions • Temporal credit assignment: • If our agent hits another car and is consequently penalized, how does the agent reason about which sequence of actions should not be repeated, and in what circumstances? • Generally the same as in MDPs • Partial observability: • If the agent is about to hit the car ahead of it, and there is a car to the left, then circumstances dictate that the agent should steer right • However, when it looks to the right it has no sensory information regarding what goes on elsewhere • To solve the latter, the agent needs memory, which builds up knowledge of the state of the world around it

  7. Forms of Partial Observability • Partial Observability coarsely pertains to either • Lack of important state information in observations – must be compensated using memory • Extraneous information in observations – needs to learn to avoid • In our example: • Color of the car in its gaze is extraneous (unless red cars really drive faster) • It needs to build a memory-based model of the world in order to accurately predict what will happen • Creates “belief state” information (we’ll see later) • If the agent has access to the complete state, such as a chess playing machine that can view the entire board: • It can choose optimal actions without memory • Markov property holds – i.e. future state of the world is simply a function of the current state and action

  8. Modeling the world as a POMDP • Our setting is that of an agent taking actions in a world according to its policy • The agent still receives feedback about its performance through a scalar reward received at each time step • Formally stated, a POMDP consists of: • |S| states S = {1, 2, …, |S|} of the world • |U| actions (or controls) U = {1, 2, …, |U|} available to the policy • |Y| observations Y = {1, 2, …, |Y|} • a (possibly stochastic) reward r(i) for each state i in S
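As a concrete sketch, the four elements above can be bundled into a small model container. The array layout and names below (`T[a, s, s']`, `O[a, s', o]`) are an illustrative convention, not the lecture's notation:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class POMDPModel:
    """The POMDP elements listed above (assumed array layout):
    T[a, s, s'] transition probabilities, O[a, s', o] observation
    probabilities, r[s] per-state rewards."""
    T: np.ndarray
    O: np.ndarray
    r: np.ndarray

    def validate(self) -> None:
        # Every (action, state) row of T and O must be a probability distribution
        assert np.allclose(self.T.sum(axis=2), 1.0)
        assert np.allclose(self.O.sum(axis=2), 1.0)
```

A concrete instance (e.g. the tiger problem later in the lecture) then just fills in the three arrays and calls `validate()`.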

  9. Modeling the world as a POMDP (cont.)

  10. MDPs vs. POMDPs • In an MDP: one observation for each state • Concepts of observation and state are interchangeable • Memoryless policy that does not make use of internal state • In POMDPs, different states may have similar probability distributions over observations • Different states may look the same to the agent • For this reason, POMDPs are said to have hidden state • Two hallways may look the same to a robot’s sensors • Optimal action for the first → take left • Optimal action for the second → take right • A memoryless policy can’t distinguish between the two

  11. MDPs vs. POMDPs (cont.) • Noise can create ambiguity in state inference • Agent’s sensors are always limited in the amount of information they can pick up • One way of overcoming this is to add sensors • Specific sensors that help it to “disambiguate” hallways • Only when possible, affordable or desirable • In general, we’re now considering agents that need to be proactive (also called “anticipatory”) • Not only react to environmental stimuli • Self-create context using memory • POMDP problems are harder to solve, but represent realistic scenarios

  12. POMDP solution techniques – model-based methods • If an exact model of the environment is available, POMDPs can (in theory) be solved • i.e. an optimal policy can be found • Like model-based MDPs, it’s not so much a learning problem • No real “learning”, or trial and error, taking place • No exploration/exploitation dilemma • Rather, a probabilistic planning problem → find the optimal policy • In POMDPs the above is broken into two elements: • Belief state computation, and • Value function computation based on belief states

  13. The belief state • Instead of maintaining the complete action/observation history, we maintain a belief state b • The belief state is a probability distribution over the states, given the actions taken and observations seen so far • Dim(b) = |S|−1 • The belief space is the entire probability simplex over the states • We’ll use a two-state POMDP as a running example • Probability of being in state one = p; probability of being in state two = 1−p • Therefore, the entire space of belief states can be represented as a line segment

  14. The belief space • Here is a representation of the belief space when we have two states (s0,s1)

  15. The belief space (cont.) • The belief space is continuous, but we only visit a countable number of belief points • Assumptions: • Finite action set • Finite observation set • Next belief state b′ = f(b, a, o), where b is the current belief state, a the action and o the observation

  16. The Tiger Problem • Standing in front of two closed doors • World is in one of two states: the tiger is behind the left door or the right door • Three actions: open left door, open right door, listen • Listening is not free, and not accurate (may give wrong information) • Reward: open the door with the tiger behind it and get eaten (large negative reward); open the other door and get a prize (small positive reward)

  17. Tiger Problem: POMDP Formulation • Two states: SL and SR (tiger is really behind the left or right door) • Three actions: LEFT, RIGHT, LISTEN • Transition probabilities: • Listening does not change the tiger’s position (next state = current state) • Opening either door ends the episode: each episode is a “Reset”

  18. Tiger Problem: POMDP Formulation (cont.) • Observations: TL (tiger left) or TR (tiger right) • Observation probabilities: listening reports the tiger’s true side with high probability and the wrong side otherwise; the observation depends on the next state • Rewards: • R(SL, Listen) = R(SR, Listen) = −1 • R(SL, Left) = R(SR, Right) = −100 • R(SL, Right) = R(SR, Left) = +10
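The formulation above can be written down directly as arrays. The slide's probability tables did not survive transcription, so the 0.85 listening accuracy below is an assumed value taken from the classic tiger-problem literature:

```python
import numpy as np

# States: 0 = SL (tiger left), 1 = SR (tiger right)
# Actions: 0 = LISTEN, 1 = LEFT (open left), 2 = RIGHT (open right)
# Observations: 0 = TL (hear tiger left), 1 = TR (hear tiger right)

# T[a, s, s']: listening keeps the tiger in place;
# opening a door resets the episode (tiger re-placed uniformly).
T = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # LISTEN
    [[0.5, 0.5], [0.5, 0.5]],   # LEFT
    [[0.5, 0.5], [0.5, 0.5]],   # RIGHT
])

# O[a, s', o]: listening hears the correct side with probability 0.85
# (assumed value); opening a door yields an uninformative observation.
O = np.array([
    [[0.85, 0.15], [0.15, 0.85]],  # LISTEN
    [[0.5, 0.5], [0.5, 0.5]],      # LEFT
    [[0.5, 0.5], [0.5, 0.5]],      # RIGHT
])

# R[s, a], matching the slide's reward list
R = np.array([
    [-1.0, -100.0,   10.0],   # tiger left:  listen, open left, open right
    [-1.0,   10.0, -100.0],   # tiger right: listen, open left, open right
])
```

Each row of `T` and `O` sums to one, so the arrays can be dropped straight into a belief-update or planning routine.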

  19. POMDP Policy Tree (Fake Policy) • Starting belief state (tiger-left probability 0.3): Listen • On “tiger roar left”: new belief state (0.6) → Listen … • On “tiger roar right”: new belief state (0.15) → Open left door • Deeper nodes reach belief states such as 0.9, where the policy again chooses Listen or Open left door …

  20. POMDP Policy Tree (cont’) • [Figure: generic policy tree: root action A1 branches on observations o1, o2, o3 to actions A2, A3, A4, which branch on o4, o5, o6 to A5, …, A8, and so on]

  21. How many POMDP policies are possible? • Consider policy trees with |A| actions, |O| observations and horizon T: level i of a tree has |O|^i nodes (1 at the root, |O| below it, |O|^2 below that, …) • Number of nodes in a tree: N = Σi=0..T−1 |O|^i = (|O|^T − 1) / (|O| − 1) • Number of distinct trees: |A|^N
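These counts are easy to check numerically. A minimal sketch (assumes |O| ≥ 2, so the geometric-series formula applies):

```python
def num_nodes(n_obs: int, horizon: int) -> int:
    """N = sum_{i=0}^{T-1} |O|^i = (|O|^T - 1) / (|O| - 1), for |O| >= 2."""
    return (n_obs ** horizon - 1) // (n_obs - 1)


def num_policy_trees(n_actions: int, n_obs: int, horizon: int) -> int:
    """Each of the N nodes independently picks one of |A| actions: |A|^N trees."""
    return n_actions ** num_nodes(n_obs, horizon)
```

For the tiger problem (|A| = 3, |O| = 2), a horizon of only 3 already gives N = 7 nodes and 3^7 = 2187 candidate trees, which is why exhaustive enumeration scales so badly with the horizon.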

  22. Belief State • Overall formula: b′(s′) = P(o | s′, a) · Σs P(s′ | s, a) b(s) / P(o | a, b) • The belief state is updated proportionally to: • the prob. of seeing the current observation given state s′, • and the prob. of arriving at state s′ given the action and our previous belief state b • The above quantities are all given by the model
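A direct implementation of this update (a sketch; the `T[a, s, s']` / `O[a, s', o]` array layout is an assumed convention):

```python
import numpy as np


def belief_update(b: np.ndarray, a: int, o: int,
                  T: np.ndarray, O: np.ndarray) -> np.ndarray:
    """b'(s') is proportional to P(o | s', a) * sum_s P(s' | s, a) * b(s)."""
    predicted = T[a].T @ b                 # sum_s P(s' | s, a) b(s), for each s'
    unnormalized = O[a][:, o] * predicted  # weight by observation likelihood
    return unnormalized / unnormalized.sum()  # divide by P(o | a, b)
```

In the tiger problem with an (assumed) 0.85-accurate listen action, a uniform belief [0.5, 0.5] becomes [0.85, 0.15] after hearing the tiger on the left.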

  23. Belief State (cont.) • Let’s look at an example: • Consider a robot that is initially completely uncertain about its location • Seeing a door may, as specified by the model, occur in three different locations • Suppose that the robot takes an action and observes a T-junction • It may be that, given the action, only one of the three states could have led to an observation of a T-junction • The agent now knows with certainty which state it is in • The uncertainty does not always disappear so neatly

  24. Finding an optimal policy • The policy component of a POMDP agent must map the current belief state into an action • It turns out that the belief state is a sufficient statistic for the history (i.e. the belief-state process is Markovian) • We can’t do better even if we remember the entire history of observations and actions • We have now transformed the POMDP into an MDP • Good news: we have ways of solving those (GPI algorithms) • Bad news: the belief state space is continuous !!

  25. Value function • The belief state is the input to the second component of the method: the value function computation • The belief state is a point in a continuous space of |S|−1 dimensions! • The value function must be defined over this infinite space • Naive application of dynamic programming techniques → infeasible

  26. Value function (cont.) • Let’s assume only two states: S1 and S2 • Belief state [0.25 0.75] indicates b(s1) = 0.25, b(s2) = 0.75 • With two states, b(s1) is sufficient to indicate the belief state: b(s2) = 1 − b(s1) • [Figure: V(b) plotted over the belief segment from S1 = [1, 0] through [0.5, 0.5] to S2 = [0, 1]]

  27. Piecewise Linear and Convex (PWLC) • It turns out that the value function is, or can be accurately approximated by, a piecewise linear and convex function • Intuition on convexity: being certain of a state yields high value, whereas uncertainty lowers the value • [Figure: a convex, piecewise linear V(b) over the belief segment from S1 = [1, 0] to S2 = [0, 1], dipping near [0.5, 0.5]]

  28. Why does PWLC help? • We can directly work with regions (intervals) of belief space! • The vectors are policies, and each indicates the right action to take in its region of the space • [Figure: three linear functions Vp1, Vp2, Vp3, each dominating in region1, region2, region3 of the belief segment from S1 = [1, 0] to S2 = [0, 1]; V(b) is their upper surface]
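This region structure is conventionally stored as a set of α-vectors, one linear function per region; the value at any belief is the maximum of their inner products with b. A minimal sketch (the example vectors in the usage note are made up):

```python
import numpy as np


def pwlc_value(b: np.ndarray, alphas: list) -> float:
    """V(b) = max_k alpha_k . b, the upper surface of the linear pieces."""
    return float(max(alpha @ b for alpha in alphas))


def dominating_vector(b: np.ndarray, alphas: list) -> int:
    """Index of the alpha-vector (region) that is best at b; its associated
    action is the one to take in that region of belief space."""
    return int(np.argmax([alpha @ b for alpha in alphas]))
```

With alphas = [[1, 0], [0, 1], [0.6, 0.6]], the corner beliefs pick the first two vectors while [0.5, 0.5] is dominated by the third: evaluating a belief costs only a handful of dot products rather than a search over an infinite space.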

  29. Summary • POMDPs → model realistic scenarios more accurately • Rely on belief states that are derived from observations and actions • Can be transformed into a (belief-state) MDP, with PWLC value function approximation • What if we don’t have a model? • Next class: (recurrent) neural networks come to the rescue …
