
Active Learning in POMDPs

Active Learning in POMDPs. Robin JAULMES Supervisors: Doina PRECUP and Joelle PINEAU McGill University rjaulm@cs.mcgill.ca. Outline. 1) Partially Observable Markov Decision Processes (POMDPs) 2) Active Learning in POMDPs 3) The MEDUSA algorithm. Markov Decision Processes(MDPs).


Presentation Transcript


  1. Active Learning in POMDPs Robin JAULMES Supervisors: Doina PRECUP and Joelle PINEAU McGill University rjaulm@cs.mcgill.ca

  2. Outline • 1) Partially Observable Markov Decision Processes (POMDPs) • 2) Active Learning in POMDPs • 3) The MEDUSA algorithm.

  3. Markov Decision Processes (MDPs) • Markov Decision Processes: • States S • Actions A • Probabilistic transitions P(s’|s,a) • Immediate rewards R(s,a) • A discount factor γ • The current state is always perfectly observed.

  4. Partially Observable Markov Decision Processes (POMDPs) • A POMDP has: • States S • Actions A • Probabilistic transitions • Immediate Rewards • A discount factor • Observations Z • Observation probabilities • An initial belief b0

  5. Applications of POMDPs • The ability to model environments in which the state is not fully observed enables applications in: • Dialogue management • Vision • Robot navigation • High-level control of robots • Medical diagnosis • Network maintenance

  6. A POMDP example: The Tiger Problem

  7. The Tiger Problem • Description: • 2 states: Tiger_Left, Tiger_Right • 3 actions: Listen, Open_Left, Open_Right • 2 observations: Hear_Left, Hear_Right • Rewards are: • -1 for the Listen action • -100 for the Open_Left in the Tiger_Left state • +10 for the Open_Right in the Tiger_Left state

  8. The Tiger Problem • Furthermore: • The Listen action does not change the state. • Either open action then places the tiger behind each door with 50% probability. • Either open action yields a useless observation (50% Hear_Left, 50% Hear_Right). • The Listen action gives the correct information 85% of the time.
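
The description on slides 7 and 8 can be written down as three small functions. This is only a sketch: the function and constant names are illustrative, and the rewards for the Tiger_Right state are an assumption (mirrored from the Tiger_Left values, which are the only ones the slide lists).

```python
# Tiger problem as plain Python functions (illustrative names, not from the slides).
STATES = ["Tiger_Left", "Tiger_Right"]
ACTIONS = ["Listen", "Open_Left", "Open_Right"]
OBSERVATIONS = ["Hear_Left", "Hear_Right"]


def transition(s, a, s_next):
    """P(s_next | s, a)."""
    if a == "Listen":
        return 1.0 if s_next == s else 0.0  # listening leaves the tiger in place
    return 0.5  # opening a door puts the tiger behind either door


def observation(s_next, a, z):
    """P(z | s_next, a)."""
    if a == "Listen":
        correct = "Hear_Left" if s_next == "Tiger_Left" else "Hear_Right"
        return 0.85 if z == correct else 0.15
    return 0.5  # opening a door yields an uninformative observation


def reward(s, a):
    """R(s, a)."""
    if a == "Listen":
        return -1.0
    # Assumption: the Tiger_Right rewards mirror the Tiger_Left ones.
    tiger_door = "Open_Left" if s == "Tiger_Left" else "Open_Right"
    return -100.0 if a == tiger_door else 10.0


# Sanity check: every conditional distribution sums to one.
for s in STATES:
    for a in ACTIONS:
        assert abs(sum(transition(s, a, s2) for s2 in STATES) - 1.0) < 1e-12
        assert abs(sum(observation(s, a, z) for z in OBSERVATIONS) - 1.0) < 1e-12
```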

  9. Solving a POMDP • To solve a POMDP is to find, for any action/observation history, the action that maximizes the expected discounted reward E[Σ_t γ^t r_t], where r_t is the reward received at time step t.

  10. The belief state • Instead of maintaining the complete action/observation history, we maintain a belief state b. • The belief is a probability distribution over the states. Dim(b) = |S|-1

  11. The belief space • Figure: the belief space for two states (s0, s1), a line segment.

  12. The belief space • Figure: the belief space for three states (s0, s1, s2), a triangle.

  13. The belief space • Figure: the belief space for four states (s0, s1, s2, s3), a tetrahedron.

  14. The belief space • The belief space is continuous but we only visit a countable number of belief points.

  15. The Bayesian update
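
The equation on this slide did not survive in the transcript. The standard POMDP belief update, after taking action a and receiving observation z, is b'(s’) = O(s’,a,z) Σ_s T(s,a,s’) b(s) / P(z | b, a). The sketch below implements that standard formula. The array layout (`T[a][s][s2]`, `O[a][s2][z]`) and the function name are assumptions made for the example.

```python
def belief_update(b, a, z, T, O):
    """Return b'(s') proportional to O[a][s'][z] * sum_s T[a][s][s'] * b[s]."""
    n = len(b)
    unnormalized = [
        O[a][s2][z] * sum(T[a][s][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    total = sum(unnormalized)  # this is P(z | b, a)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [p / total for p in unnormalized]


# Tiger problem, Listen action only (action index 0).
# States: 0 = Tiger_Left, 1 = Tiger_Right. Observations: 0 = Hear_Left, 1 = Hear_Right.
T = [[[1.0, 0.0], [0.0, 1.0]]]
O = [[[0.85, 0.15], [0.15, 0.85]]]

b0 = [0.5, 0.5]
b1 = belief_update(b0, 0, 0, T, O)  # hear the tiger on the left once
b2 = belief_update(b1, 0, 0, T, O)  # and a second time
```

Starting from a uniform belief, one Hear_Left observation moves the belief in Tiger_Left to 0.85, and a second one moves it to 0.7225 / 0.745, roughly 0.97.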

  16. Value Function in POMDPs • We will compute the value function over the belief space. • This is hard because the belief space is continuous. • But we can use a property of the optimal value function for a finite horizon: it is piecewise-linear and convex. • We can represent any finite-horizon solution by a finite set of alpha-vectors. • V(b) = max_α[Σ_s α(s)b(s)]

  17. Alpha-Vectors • They are a set of hyperplanes which define the value function. At each belief point, the value function is equal to the value of the highest hyperplane at that point.
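
As a small illustration of V(b) = max_α[Σ_s α(s)b(s)], the sketch below evaluates a set of alpha-vectors at a few belief points. The vectors used are the horizon-1 vectors of the Tiger problem (one per action, α_a(s) = R(s,a)). The Tiger_Right rewards are assumed to mirror the Tiger_Left ones, and the function name is illustrative.

```python
def value(b, alpha_vectors):
    """V(b) = max over alpha of sum_s alpha[s] * b[s]."""
    return max(
        sum(a_s * b_s for a_s, b_s in zip(alpha, b))
        for alpha in alpha_vectors
    )


# States: 0 = Tiger_Left, 1 = Tiger_Right.
alphas = [
    [-1.0, -1.0],    # Listen
    [-100.0, 10.0],  # Open_Left
    [10.0, -100.0],  # Open_Right (assumed symmetric to Open_Left)
]

v_uncertain = value([0.5, 0.5], alphas)  # Listen is best when nothing is known
v_certain = value([0.0, 1.0], alphas)    # Open_Left is best when the tiger is on the right
```

Because the value is a maximum of linear functions of b, it is piecewise-linear and convex, which is the property the previous slide relies on.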

  18. Value Iteration in POMDPs • Value iteration: • Initialize the value function (horizon-1 value): • V(b) = max_a Σ_s R(s,a) b(s). This produces one alpha-vector per action. • Compute the value function at the next iteration using Bellman’s equation: • V(b) = max_a [Σ_s R(s,a) b(s) + γ Σ_z max_α Σ_s Σ_s’ T(s,a,s’) O(s’,a,z) α(s’) b(s)]

  19. PBVI: Point-Based Value Iteration • Always keep a bounded number of alpha vectors. • Use value iteration starting from belief points on a grid to produce new sets of alpha vectors. • Stop after n steps (finite horizon). • The solution is approximate but found in a reasonable amount of time and memory. • Good tradeoff between computation time and quality. See [Pineau et al., 2003].
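
The core operation of point-based methods is a Bellman backup at a single belief point, which yields one new alpha-vector. The sketch below shows the standard form of that backup. It is not the authors' implementation. The array layout (`T[a][s][s2]`, `O[a][s2][z]`, `R[s][a]`), the discount of 0.95, and the mirrored Tiger_Right rewards are assumptions made for the example.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def point_backup(b, alphas, T, O, R, gamma):
    """One Bellman backup at belief b. Returns (new alpha-vector, best action)."""
    n_s, n_a, n_z = len(b), len(R[0]), len(O[0][0])
    best = None
    for a in range(n_a):
        new_alpha = [R[s][a] for s in range(n_s)]
        for z in range(n_z):
            # Project each alpha-vector back through (a, z), keep the best one at b.
            projected = [
                [sum(T[a][s][s2] * O[a][s2][z] * alpha[s2] for s2 in range(n_s))
                 for s in range(n_s)]
                for alpha in alphas
            ]
            g = max(projected, key=lambda vec: dot(vec, b))
            new_alpha = [x + gamma * y for x, y in zip(new_alpha, g)]
        if best is None or dot(new_alpha, b) > dot(best[0], b):
            best = (new_alpha, a)
    return best


# Tiger problem. Actions: 0 = Listen, 1 = Open_Left, 2 = Open_Right.
half = [[0.5, 0.5], [0.5, 0.5]]
T = [[[1.0, 0.0], [0.0, 1.0]], half, half]
O = [[[0.85, 0.15], [0.15, 0.85]], half, half]
R = [[-1.0, -100.0, 10.0], [-1.0, 10.0, -100.0]]  # second row is assumed by symmetry
horizon1 = [[-1.0, -1.0], [-100.0, 10.0], [10.0, -100.0]]

alpha_mid, action_mid = point_backup([0.5, 0.5], horizon1, T, O, R, gamma=0.95)
alpha_sure, action_sure = point_backup([0.0, 1.0], horizon1, T, O, R, gamma=0.95)
```

At the uniform belief the backup selects Listen, and at a belief that is certain of Tiger_Right it selects Open_Left. A full point-based solver would repeat this backup over a set of belief points and keep the resulting vectors.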

  20. Learning a POMDP • What happens if we do not know the model of the POMDP with certainty? • We have to learn it. • The two solutions in the literature are: • EM-based approaches (prone to local minima) • History-based approaches (which require on the order of 1,000,000 samples for 2-state problems) [Singh et al. 2003]

  21. Active Learning • In an active learning problem, the learner has the ability to influence its training data. • The learner asks for the data that is most useful given its current knowledge. • Methods for finding the most useful query were given by Cohn et al. (95)

  22. Active Learning (Cohn et al. 95) • Their method, used for function approximation tasks, is based on finding the query that will minimize the estimated variance of the learner. • They showed how this could be done exactly: • For a mixture of Gaussians model. • For locally weighted regression.

  23. Applying Active Learning to POMDPs • We will suppose in this work that we have an oracle to determine the hidden state of a system on request. • However, this action is costly and we want to use it as little as possible. • In this setting, the active learning query will be to ask for the hidden state.

  24. Applying Active Learning to POMDPs • We propose two solutions: • Integrate the model uncertainty and the query possibility inside the POMDP framework to take advantage of existing algorithms. • The MEDUSA algorithm. It uses a Dirichlet distribution over possible models to determine which actions to take and which queries to ask.

  25. Decision-Theoretic Model Learning • We want to integrate in the POMDP model the fact that: • We have only a rough estimate of its parameters. • The agent can query the hidden state. • These queries should not be used too often, and only used to learn.

  26. Decision-Theoretic Model Learning • So we modify our POMDP: • For each uncertain parameter we introduce an additional state feature, discretized into n levels. • The initial belief is uniform over the resulting n groups of states, and the system stays in its initial group as transitions occur. • We introduce a query action that returns the hidden state. • This action carries a negative reward Rq. • Then we solve this new POMDP using the usual methods.

  27. Decision-Theoretic Model Learning

  28. D-T Planning: Results

  29. D-T Planning: Conclusions • Theoretically sound, but: • The results are very sensitive to the value of the query penalty, which is therefore difficult to set. • The number of states grows exponentially with the number of uncertain parameters, which greatly increases the complexity of the problem. • With MEDUSA, we give up the theoretical guarantees of optimality to get a tractable algorithm.

  30. MEDUSA: The main ideas • Markovian Exploration with Decision based on the Use of Samples Algorithm • Use Dirichlet distributions to represent current knowledge about the parameters of the POMDP model. • Sample models from the distribution. • Use models to take actions that could be good. • Use queries to improve current knowledge.

  31. Dirichlet distributions • Let X ∈ {1, 2, ..., N}. X is drawn from a multinomial distribution with parameters (θ1, ..., θN) iff p(X=i) = θi. • The Dirichlet distribution is a distribution over multinomial parameters, that is, over tuples (θ1, ..., θN) such that θi > 0 and Σ θi = 1.

  32. Dirichlet distributions • Dirichlet distributions have parameters <α1… αN> s.t. αi>0. • We can sample from Dirichlet distributions by using Gamma distributions. • The most likely parameters in a Dirichlet distribution are the following:

  33. Dirichlet distributions • We can also compute the probability of multinomial distribution parameters according to the Dirichlet.
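
The three operations named on slides 32 and 33 (sampling through Gamma distributions, the most likely parameters, and the probability of a parameter tuple) can be sketched with the standard library alone. The formulas on the slides were lost in the transcript, so the code uses the textbook results: the mode is (αi - 1) / (Σ_j αj - N) when every αi > 1, and the density is Γ(Σ αi) / Π Γ(αi) · Π θi^(αi - 1). The slide may have intended a different estimator, such as the mean αi / Σ_j αj. Function names are illustrative.

```python
import math
import random


def sample_dirichlet(alpha, rng=random):
    """Draw theta ~ Dirichlet(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def dirichlet_mode(alpha):
    """Most likely theta. This closed form requires every alpha_i > 1."""
    if any(a <= 1.0 for a in alpha):
        raise ValueError("mode formula needs all alpha_i > 1")
    denom = sum(alpha) - len(alpha)
    return [(a - 1.0) / denom for a in alpha]


def dirichlet_log_pdf(theta, alpha):
    """Log-density of theta under Dirichlet(alpha)."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))


rng = random.Random(0)
theta = sample_dirichlet([2.0, 3.0, 5.0], rng)
mode = dirichlet_mode([2.0, 3.0, 5.0])
```

Working with the log-density rather than the density avoids underflow when many parameters are multiplied together, which matters when whole models are scored.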

  34. The MEDUSA algorithm • Step 1: initialize the Dirichlet distribution. • Step 2: sample k(=20) POMDPs from the Dirichlet distribution and compute their probabilities according to the Dirichlet. Normalize them to get the weights. • Step 3: solve the k POMDPs with an approximate method (PBVI, finite horizon)
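
Steps 1 and 2 can be sketched as follows, under assumptions: a model is represented as a dictionary from a parameter name to a probability vector, the Dirichlet is a dictionary of alpha lists with the same keys, and a model's probability is the product of the Dirichlet densities of its parameters. These representations and names are hypothetical. Step 3 (solving each sampled POMDP) is not shown.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def log_pdf(theta, alpha):
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))


def sample_models(counts, k, rng):
    """Sample k models and return them with their normalized weights."""
    models = [
        {name: sample_dirichlet(alpha, rng) for name, alpha in counts.items()}
        for _ in range(k)
    ]
    # Log-density of each model, summed over its independent parameters.
    logs = [sum(log_pdf(m[name], counts[name]) for name in counts) for m in models]
    top = max(logs)  # subtract the maximum before exponentiating, for stability
    unnormalized = [math.exp(value - top) for value in logs]
    total = sum(unnormalized)
    return models, [u / total for u in unnormalized]


# Step 1: a hypothetical prior over one unknown distribution,
# the Listen observation given the true tiger position, as (correct, wrong) counts.
counts = {"listen_obs": [5.0, 3.0]}

# Step 2: sample k = 20 models and weight them.
models, weights = sample_models(counts, k=20, rng=random.Random(0))
```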

  35. The MEDUSA algorithm • Step 4: run the experiment… At each time step: • Compute the optimal actions for all POMDPs. • Execute one of them. • Update the belief for each POMDP. • If some conditions are met, do a state query. Update the Dirichlet parameters according to this query.
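
One simple way to update the Dirichlet parameters after a query is to increment the counts of the transition and observation that were just revealed. The sketch below assumes that the previous state is also known (for example from a query at the previous step) and that counts are stored per (state, action) pair. The `learning_rate` parameter and all names are hypothetical.

```python
def query_update(trans_counts, obs_counts, s, a, s_next, z, learning_rate=1.0):
    """Increment the alpha parameters of the revealed transition and observation."""
    trans_counts[(s, a)][s_next] += learning_rate
    obs_counts[(s_next, a)][z] += learning_rate


# Uniform prior for action 0 of a problem with two states and two observations.
trans_counts = {(s, 0): [1.0, 1.0] for s in range(2)}
obs_counts = {(s, 0): [1.0, 1.0] for s in range(2)}

# The queries reveal state 0 before and after the action, and observation 0 was received.
query_update(trans_counts, obs_counts, s=0, a=0, s_next=0, z=0)
```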

  36. The MEDUSA algorithm • At each time step: • Recompute the POMDP weights • At fixed intervals, erase the POMDP with the lowest weight and redraw another according to the current Dirichlet distribution. • Compute the belief of the new POMDP according to the action-observation history until current time.
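
The last bullet means that a freshly drawn model has to catch up with the others by replaying the recorded history. A minimal sketch follows, assuming that each model stores its arrays as `model["T"][a][s][s2]` and `model["O"][a][s2][z]` and that `draw_model` is a caller-supplied function that samples from the current Dirichlet. All of these names are hypothetical.

```python
def belief_update(b, a, z, T, O):
    n = len(b)
    unnormalized = [
        O[a][s2][z] * sum(T[a][s][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    total = sum(unnormalized)
    # Design choice for this sketch: keep the belief if the observation is impossible.
    return [p / total for p in unnormalized] if total > 0.0 else list(b)


def replace_weakest(models, weights, beliefs, history, b0, draw_model):
    """Swap out the lowest-weight model and rebuild its belief from the history."""
    i = min(range(len(weights)), key=lambda j: weights[j])
    models[i] = draw_model()
    b = list(b0)
    for a, z in history:  # replay every (action, observation) pair so far
        b = belief_update(b, a, z, models[i]["T"], models[i]["O"])
    beliefs[i] = b
    return i


# Toy example with a single action (Listen) and two identical models.
listen_only = {"T": [[[1.0, 0.0], [0.0, 1.0]]], "O": [[[0.85, 0.15], [0.15, 0.85]]]}
models = [dict(listen_only), dict(listen_only)]
weights = [0.9, 0.1]
beliefs = [[0.5, 0.5], [0.5, 0.5]]
history = [(0, 0), (0, 0)]  # Listen twice, Hear_Left twice

replaced = replace_weakest(models, weights, beliefs, history, [0.5, 0.5], lambda: listen_only)
```

After a replacement the weights are stale and would be recomputed at the next time step, as the first bullet of the slide describes.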

  37. Theoretical analysis • We can compute the policy to which MEDUSA converges with an infinite number of models using integrals over the whole space of models. • Under some assumptions over the POMDP, we can prove that MEDUSA converges to the true model.

  38. MEDUSA on Tiger • Figure: evolution of the mean discounted reward over time steps (query at every step).

  39. Diminishing the complexity • The algorithm is flexible: it accepts a wide variety of priors. • Some parameters may be known with certainty. Parameters can also be tied, by using the same alpha parameters for different distributions. • So if we have additional information about the POMDP’s dynamics, we can reduce the number of alpha-parameters.

  40. Diminishing the complexity • On the Tiger problem, if we know that: • The Listen action does not change the state • The problem is symmetric • Opening a door brings an uninformative observation and places the tiger behind each door with probability 0.5 • then we can reduce the number of alpha-parameters from 24 to 2.
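
The count of 24 is consistent with one alpha-parameter per transition probability and per observation probability. The arithmetic is spelled out below. Reading the two remaining parameters as the (correct, wrong) counts of the Listen observation is an interpretation, not something the slide states.

```python
n_states, n_actions, n_observations = 2, 3, 2

# One alpha per (s, a, s') triple and one per (s', a, z) triple.
transition_alphas = n_states * n_actions * n_states
observation_alphas = n_states * n_actions * n_observations
full_count = transition_alphas + observation_alphas

# With the three pieces of prior knowledge, every distribution is fixed
# except the Listen observation, and symmetry ties its two rows together.
simplified_count = 2  # interpreted as (correct, wrong) counts for Listen

print(full_count, simplified_count)
```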

  41. MEDUSA on simplified Tiger • Figure: evolution of the mean discounted reward over time steps (query at every step). Blue: normal problem. Black: simplified problem.

  42. Learning without query • The alternate belief β keeps track of the knowledge brought by the last query. • The non-query update updates each parameter proportionally to the probability a query would have of updating it, given the alternate belief and the last action/observation.
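
One possible reading of the non-query update is sketched below: compute the posterior over the pair (previous state, next state) from the alternate belief, the action and the observation, then spread a fractional count over the parameters a query could have incremented. Which model supplies `T` and `O` is not specified on the slide. A single model is assumed here, and the layout, names and `learning_rate` are hypothetical.

```python
def non_query_update(trans_counts, obs_counts, beta, a, z, T, O, learning_rate=0.1):
    """Fractional count update weighted by P(s, s' | beta, a, z)."""
    n = len(beta)
    joint = [
        [beta[s] * T[a][s][s2] * O[a][s2][z] for s2 in range(n)]
        for s in range(n)
    ]
    total = sum(sum(row) for row in joint)
    if total == 0.0:
        return  # the observation is impossible under this model, nothing to learn
    for s in range(n):
        for s2 in range(n):
            p = joint[s][s2] / total
            trans_counts[(s, a)][s2] += learning_rate * p
            obs_counts[(s2, a)][z] += learning_rate * p


# Tiger problem, Listen action (index 0), observation Hear_Left (index 0).
T = [[[1.0, 0.0], [0.0, 1.0]]]
O = [[[0.85, 0.15], [0.15, 0.85]]]
trans_counts = {(s, 0): [1.0, 1.0] for s in range(2)}
obs_counts = {(s, 0): [1.0, 1.0] for s in range(2)}

non_query_update(trans_counts, obs_counts, beta=[0.85, 0.15], a=0, z=0, T=T, O=O)
```

The total mass added to the transition counts equals `learning_rate`, so a non-query step never counts for more than a fraction of a query step in this sketch.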

  43. Learning without query • Non-query learning: • Has high variance: learning rate needs to be lower, therefore more time steps are needed. • Is prone to local minima. Convergence to the correct values is not guaranteed. • Can converge to the right solution if the initial prior is “good enough”. • MEDUSA should use non-query learning when it has done “enough” query learning.

  44. Choosing when to query • There are different heuristics to choose when to do a query. • Always (up to a certain number of queries). • When models disagree. • When value functions for the models are different. • When the beliefs in the different models differ. • When information from last query has been lost (?). • Not when a query would bring no information.

  45. Choosing when to query • There are different heuristics to choose when to do a query: • Always (up to a certain number of queries) • When models disagree. • When value functions for the models differ. • When the beliefs in the different models differ. • When information from last query has been lost. • Not when a query would bring no information.

  46. Non-Query Learning on Tiger • Figures: mean discounted reward, and number of queries. Blue: query learning. Black: non-query (NQ) learning.

  47. Picking actions during learning • Take one model and do its best action. • Consider every model, every action, do the action with highest overall value. • Compute the mean value of every action, probabilistically take one of them according to the Boltzmann method.

  48. Picking actions during learning • Take one model and do its best action. • Consider every model, every action, do the action with highest overall value. • Compute the mean value of every action, probabilistically take one of them according to the Boltzmann method.
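
The third option can be sketched as a softmax over the mean action values. Two details are assumptions of this example rather than statements from the slides: the mean is weighted by the model weights, and a `temperature` parameter controls how greedy the choice is.

```python
import math
import random


def boltzmann_action(model_values, weights, temperature, rng):
    """model_values[m][a] is the value of action a under model m at its current belief."""
    n_actions = len(model_values[0])
    mean = [
        sum(w * values[a] for w, values in zip(weights, model_values))
        for a in range(n_actions)
    ]
    top = max(mean)  # subtract the maximum before exponentiating, for stability
    preferences = [math.exp((v - top) / temperature) for v in mean]
    total = sum(preferences)
    probabilities = [p / total for p in preferences]
    action = rng.choices(range(n_actions), weights=probabilities)[0]
    return action, probabilities


# Two models, three actions. The weighted mean values are [1.0, 3.0, 0.0].
model_values = [[1.0, 2.0, 0.0], [1.0, 4.0, 0.0]]
weights = [0.5, 0.5]

action, probabilities = boltzmann_action(model_values, weights, 0.01, random.Random(0))
_, soft_probabilities = boltzmann_action(model_values, weights, 1.0, random.Random(0))
```

A low temperature makes the choice almost greedy, while a higher temperature gives the other actions a real chance of being explored.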

  49. Different action-pickings on Tiger • Figure: evolution of the mean discounted reward over time steps (query at every step). Blue: highest overall value. Black: pick one model.

  50. Learning with non-stationary models • If the parameters of the model unpredictably change with time: • At every time step, decay the alpha parameters by some factor so that new experience weighs more than old experience. • If the parameters do not vary too much, non-query learning is sufficient to keep track of their evolution.
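
The effect of the decay can be seen in a short simulation. With a decay factor d and one unit of new evidence per step, the sum of the alpha parameters converges to 1 / (1 - d), so old evidence can never dominate. The sketch assumes the decay is applied before the new count is added. The name `decay` and the value 0.9 are illustrative.

```python
def decay_then_update(alpha, observed_index, decay):
    """Shrink all alpha parameters, then count the new observation."""
    alpha = [decay * a for a in alpha]
    alpha[observed_index] += 1.0
    return alpha


alpha = [1.0, 1.0]
for step in range(200):
    # Alternate the observed outcome just to feed the update something.
    alpha = decay_then_update(alpha, observed_index=step % 2, decay=0.9)

effective_sample_size = sum(alpha)  # approaches 1 / (1 - 0.9) = 10
```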
