Partially-Observable Markov Decision Processes Tom Dietterich MCAI 2013
Markov Decision Process as a Decision Diagram Note: We observe before we choose. All states, actions, and rewards are observed
What If We Can't Directly Observe the State? Note: We observe before we choose. Only the observations are observed, not the underlying states
POMDPs are Hard to Solve • Tradeoff between taking actions to gain information and taking actions to change the world • Some actions can do both
Optimal Management of Difficult-to-Observe Invasive Species [Regan et al., 2011] • Branched Broomrape (Orobanche ramosa) • Annual parasitic plant • Attaches to root system of host plant • Results in 75-90% reduction in host biomass • Each plant makes ~50,000 seeds • Seeds are viable for 12 years
Quarantine Area in S. Australia • 375 farms; 70 km x 70 km area (image: Google Maps)
Formulation as a POMDP: Single Farm • States: {Empty, Seeds, Plants & Seeds} • Actions: {Nothing, Host Denial, Fumigation} • Observations: {Absent, Present}, with imperfect detection probability • Rewards: Cost(Nothing) < Cost(Host Denial) < Cost(Fumigation) • Objective: 20-year discounted reward (discount = 0.96) [State diagram]
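As a rough sketch of what this specification looks like in code, the single-farm model can be written down as explicit transition, observation, and cost arrays. The numbers below are illustrative placeholders, not the calibrated values from Regan et al. (2011).

```python
import numpy as np

# Illustrative single-farm POMDP; all probabilities and costs are placeholders.
STATES = ["Empty", "Seeds", "Plants&Seeds"]
ACTIONS = ["Nothing", "HostDenial", "Fumigation"]
OBSERVATIONS = ["Absent", "Present"]

# T[a][s, s'] = P(s' | s, a)
T = {
    "Nothing": np.array([[1.0, 0.0, 0.0],     # empty fields stay empty
                         [0.0, 0.3, 0.7],     # seeds may germinate into plants
                         [0.0, 0.0, 1.0]]),   # plants persist and reseed
    "HostDenial": np.array([[1.0, 0.0, 0.0],
                            [0.1, 0.9, 0.0],  # seed bank slowly decays
                            [0.0, 1.0, 0.0]]),# plants die back to seeds
    "Fumigation": np.array([[1.0, 0.0, 0.0],
                            [0.5, 0.5, 0.0],  # kills part of the seed bank
                            [0.3, 0.7, 0.0]]),# kills plants, some seeds remain
}

# Z[s', o] = P(o | s'): plants are detected imperfectly, seeds never.
detect = 0.8
Z = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [1.0 - detect, detect]])

# Per-step costs satisfying Cost(Nothing) < Cost(HostDenial) < Cost(Fumigation)
COST = {"Nothing": 0.0, "HostDenial": 10.0, "Fumigation": 30.0}
GAMMA = 0.96  # discount factor for the 20-year objective
```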
Optimal MDP Policy • If plant is detected, Fumigate; else Do Nothing • Assumes perfect detection (image: www.grdc.com.au)
Optimal POMDP Policy for … • Same as the Optimal MDP Policy [Policy FSM diagram: two decision states (0, 1); Do Nothing while the weed is observed ABSENT, Fumigate when it is observed PRESENT]
Optimal Policy for … [Policy FSM diagram: Fumigate on PRESENT; otherwise step through a chain of Deny Host states (0, 1, 2, …, 16) before switching to Nothing] • Deny Host for 15 years before switching to Nothing • For … Deny Host for 17 years before switching to Nothing
Probability of Eradication [plot]
Discussion • POMDP is exactly solvable because the state space is very small • Real problem is more complex • Each farm can have many fields, each with its own hidden state • There are 375 farms in the quarantine area • 3^375 states if we treat each farm as a single unit • Exact solution of large POMDPs is beyond the state of the art • Notice that there is no tradeoff between acting to gather information and acting to change the world: none of the actions gain information
Ways to Avoid a POMDP (1) • State Estimation and State Tracking • In many problems, we have (or can acquire) enough sensors so that we can estimate the state quite well • The belief P(s | observations) has low uncertainty • Let ŝ be the most likely hidden state • In such problems, we can pretend that we have an MDP and that we can directly observe ŝ • We do not need to take actions to gain information, so we do not face this difficult tradeoff
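A minimal sketch of this shortcut, assuming we already track a belief and have an MDP policy defined over the hidden states (both names below are illustrative, not from the talk):

```python
import numpy as np

def most_likely_state(belief):
    """Index of the MAP state under the current belief."""
    return int(np.argmax(belief))

def act_as_if_observed(belief, mdp_policy):
    """Pretend the MAP state is the true state and apply an MDP policy.

    Only reasonable when the belief is sharply peaked (low uncertainty)."""
    return mdp_policy[most_likely_state(belief)]

# A nearly certain belief over {Empty, Seeds, Plants & Seeds}
belief = np.array([0.02, 0.03, 0.95])
mdp_policy = {0: "Nothing", 1: "Host Denial", 2: "Fumigation"}
print(act_as_if_observed(belief, mdp_policy))  # -> "Fumigation"
```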
Ways to Avoid a POMDP (2) • Pure Information-Gathering POMDPs • Consider a medical diagnosis case for a specific disease where there are n tests T_1, …, T_n that can be performed. Our goal is to decide whether the patient has the disease by choosing tests to perform • Each test has two possible outcomes, 1 (positive) and 0 (negative) • Each test T_i has a cost c_i • Given any subset of the outcomes, we can compute the probability that the patient has the disease • There is a "false positive" cost C_FP for incorrectly saying that the patient has the disease and a "false negative" cost C_FN for incorrectly saying that the patient does not
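One concrete way to compute the probability of disease from any subset of outcomes is Bayes' rule. The sketch below additionally assumes that the tests are conditionally independent given the disease status, which the slide does not state:

```python
def posterior_disease(prior, sensitivities, specificities, outcomes):
    """P(disease | observed test outcomes), assuming the tests are
    conditionally independent given disease status (a naive-Bayes assumption).

    outcomes maps a test index to 1 (positive) or 0 (negative)."""
    p_d, p_nd = prior, 1.0 - prior
    for i, result in outcomes.items():
        sens, spec = sensitivities[i], specificities[i]
        if result == 1:
            p_d *= sens            # P(positive | disease)
            p_nd *= 1.0 - spec     # P(positive | no disease)
        else:
            p_d *= 1.0 - sens      # P(negative | disease)
            p_nd *= spec           # P(negative | no disease)
    return p_d / (p_d + p_nd)

# Two of three tests run so far: test 0 came back positive, test 2 negative.
print(posterior_disease(0.1, [0.9, 0.8, 0.7], [0.95, 0.9, 0.85], {0: 1, 2: 0}))
```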
Formulation as an MDP • States: vectors recording the outcome of each test so far (0, 1, or ? for not yet performed); the starting state is (?, ?, …, ?) • Actions: • n actions are the medical tests • a "declare negative" action says "the patient does not have the disease" and terminates with cost 0 if correct and cost C_FN if incorrect • a "declare positive" action says "the patient has the disease" and terminates with cost 0 if correct and cost C_FP if incorrect • State Transitions: • When we perform test T_i in state s, the resulting state sets the i-th entry of the state to 0 or 1 according to P(T_i | s) • When we perform a "declare" action, the problem transitions to a terminal state with probability 1 • If there aren't too many tests and we know the joint distribution of the tests and the disease, we can enumerate the states and solve this via standard MDP methods
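A sketch of that enumeration, under the same conditional-independence assumption as above and with made-up costs: states are tuples of per-test entries ('?', '1', '0'), and a simple recursion over this finite space gives the optimal expected cost of testing versus declaring.

```python
from functools import lru_cache

# Illustrative pure information-gathering MDP with 3 tests.
PRIOR = 0.1
SENS = (0.9, 0.8, 0.7)    # P(test = 1 | disease)
SPEC = (0.95, 0.9, 0.85)  # P(test = 0 | no disease)
TEST_COST = (1.0, 1.0, 2.0)
C_FP, C_FN = 50.0, 200.0  # costs of a wrong "positive"/"negative" declaration
N = 3

def likelihoods(state):
    """P(recorded outcomes, disease) and P(recorded outcomes, no disease)."""
    p_d, p_nd = PRIOR, 1.0 - PRIOR
    for i, v in enumerate(state):
        if v == '1':
            p_d, p_nd = p_d * SENS[i], p_nd * (1 - SPEC[i])
        elif v == '0':
            p_d, p_nd = p_d * (1 - SENS[i]), p_nd * SPEC[i]
    return p_d, p_nd

@lru_cache(maxsize=None)
def value(state):
    """Minimum expected remaining cost from a state such as ('1', '?', '0')."""
    p_d, p_nd = likelihoods(state)
    p_disease = p_d / (p_d + p_nd)
    # Declare actions terminate immediately; only a wrong declaration costs anything.
    best = min(C_FP * (1 - p_disease),  # declare "has the disease"
               C_FN * p_disease)        # declare "does not have the disease"
    for i, v in enumerate(state):
        if v != '?':
            continue  # test already performed
        p_pos = p_disease * SENS[i] + (1 - p_disease) * (1 - SPEC[i])
        s_pos = state[:i] + ('1',) + state[i + 1:]
        s_neg = state[:i] + ('0',) + state[i + 1:]
        best = min(best, TEST_COST[i] + p_pos * value(s_pos) + (1 - p_pos) * value(s_neg))
    return best

print(value(('?',) * N))  # expected cost of the optimal test-and-declare policy
```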
Belief States • In general, we can think of a POMDP as being an MDP over a Belief State • In the medical diagnosis case, the belief states have the form (0,1,?,?,0,?) • In the Broomrape case, the belief state is a probability distribution over the 3 states (Empty, Seeds, Plants & Seeds) [belief simplex diagram]
Belief State Reasoning • Each observation updates the belief state • Example: observing the presence of weeds means weeds are present and seeds might also be present [belief simplex diagrams: before and after observing "present"]
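A sketch of the Bayes update behind this example, using a small illustrative observation model (plants detected with probability 0.8; bare ground and buried seeds always look absent):

```python
import numpy as np

def observation_update(belief, Z, obs):
    """Bayes update: b'(s) is proportional to P(obs | s) * b(s)."""
    updated = Z[:, obs] * belief
    return updated / updated.sum()

# States: [Empty, Seeds, Plants & Seeds]; observations: [Absent, Present].
Z = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.2, 0.8]])
belief = np.array([0.5, 0.3, 0.2])
print(observation_update(belief, Z, obs=1))  # observing "Present" -> [0., 0., 1.]
```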
Taking Actions • Each action updates the belief state • Example: fumigate [belief simplex diagrams: before and after fumigation]
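The corresponding action update pushes the belief through the transition matrix; the fumigation matrix below is an illustrative assumption, not the calibrated model:

```python
import numpy as np

def action_update(belief, T_a):
    """Predict the next belief: b'(s') = sum_s b(s) * T_a[s, s']."""
    return belief @ T_a

# States: [Empty, Seeds, Plants & Seeds].
# Illustrative fumigation model: kills plants but leaves some viable seeds.
T_fumigate = np.array([[1.0, 0.0, 0.0],
                       [0.5, 0.5, 0.0],
                       [0.3, 0.7, 0.0]])
belief = np.array([0.0, 0.0, 1.0])        # certain the weed is present
print(action_update(belief, T_fumigate))  # -> [0.3, 0.7, 0.0]
```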
Belief MDP • State space: all reachable belief states • Action space: same actions as the POMDP • Reward function: expected rewards derived from the underlying states • Transition function: moves in belief space • Problem: Belief space is continuous and there can be an immense number of reachable states
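Putting the two updates together gives the belief-MDP transition and reward: the sketch below predicts with the action model, samples an observation from the predicted belief, and Bayes-corrects, again using illustrative single-farm matrices.

```python
import numpy as np

def expected_reward(belief, R_a):
    """Belief-MDP reward: expected underlying reward under the belief."""
    return float(belief @ R_a)

def belief_step(belief, T_a, Z, rng):
    """One belief-MDP transition: predict with T_a, sample an observation
    from the predicted belief, then Bayes-correct on that observation."""
    predicted = belief @ T_a
    obs = rng.choice(Z.shape[1], p=predicted @ Z)
    corrected = Z[:, obs] * predicted
    return corrected / corrected.sum(), int(obs)

# Illustrative arrays for the 3-state farm model (Empty, Seeds, Plants & Seeds).
T_nothing = np.array([[1.0, 0.0, 0.0],
                      [0.0, 0.3, 0.7],
                      [0.0, 0.0, 1.0]])
Z = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.2, 0.8]])
R_nothing = np.array([0.0, 0.0, -5.0])  # e.g. crop loss while plants are present

rng = np.random.default_rng(0)
b = np.array([0.0, 1.0, 0.0])  # currently certain there are only seeds
print(expected_reward(b, R_nothing))
print(belief_step(b, T_nothing, Z, rng))
```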
Monte Carlo Policy Evaluation • Key Insight: It is just as easy to evaluate a policy via Monte Carlo trials in a POMDP as it is in an MDP! • Approach: • Define a space of policies • Evaluate them by Monte Carlo trials • Pick the best one
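A sketch of this insight, assuming only that we can simulate one step of the hidden dynamics and that a policy maps the action-observation history to an action (the `evaluate_policy` and `simulate_step` interfaces here are assumptions, not from the talk):

```python
import numpy as np

def evaluate_policy(policy, simulate_step, initial_state, horizon, gamma, n_trials, rng):
    """Estimate a policy's expected discounted return by Monte Carlo rollouts.

    policy(history) -> action, where history is the list of (action, observation)
    pairs seen so far; simulate_step(state, action, rng) -> (next_state, obs, reward)."""
    returns = []
    for _ in range(n_trials):
        state, history = initial_state, []
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            action = policy(history)
            state, obs, reward = simulate_step(state, action, rng)
            history.append((action, obs))
            total += discount * reward
            discount *= gamma
        returns.append(total)
    return float(np.mean(returns))
```

Given a set of candidate policies, estimate each one's return this way and keep the best; the hidden states are used only inside the simulator, never by the policy.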
Finite State Machine Policies • In many POMDPs (and MDPs), a policy can be represented as a finite state machine • We can design a set of FSM policies and then evaluate them • There are algorithms for incrementally improving FSM policies [Policy FSM diagram from the Broomrape example: Fumigate on PRESENT, then a chain of Deny Host states before switching to Nothing]
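A sketch of what such a finite-state-machine policy might look like in code: each node carries an action, and the observation selects the next node. The node actions and transitions below are illustrative, not the optimal Broomrape policy.

```python
class FSMPolicy:
    """A policy represented as a finite state machine over observations."""

    def __init__(self, node_actions, transitions, start=0):
        self.node_actions = node_actions  # node -> action
        self.transitions = transitions    # (node, observation) -> next node
        self.node = start

    def act(self, observation=None):
        """Advance on the latest observation (if any) and return the node's action."""
        if observation is not None:
            self.node = self.transitions[(self.node, observation)]
        return self.node_actions[self.node]

# Illustrative 3-node machine in the spirit of the diagram:
# fumigate on detection, deny the host for a while, then do nothing.
policy = FSMPolicy(
    node_actions={0: "Nothing", 1: "Fumigate", 2: "DenyHost"},
    transitions={(0, "ABSENT"): 0, (0, "PRESENT"): 1,
                 (1, "ABSENT"): 2, (1, "PRESENT"): 1,
                 (2, "ABSENT"): 0, (2, "PRESENT"): 1},
)
print(policy.act())            # initial action -> "Nothing"
print(policy.act("PRESENT"))   # after detecting the weed -> "Fumigate"
```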
Summary • Many problems in AI can be formulated as POMDPs • Formulating a problem as a POMDP doesn't help much by itself, because POMDPs are so hard to solve (PSPACE-hard for finite horizon; undecidable for infinite horizon) • Can we do state estimation and pretend the most likely state ŝ is the true state? • Are we performing pure observation actions? • Can the policy be divided into a pure observation phase and a pure action phase? • If so, we can use MDP methods instead • Unfortunately, many problems in ecosystem management are "essential" POMDPs that mix information gathering and world-changing actions • Monte Carlo methods (based on policy space search) are among the most practical ways of finding good POMDP solutions