CS 416 Artificial Intelligence

  1. CS 416 Artificial Intelligence, Lecture 19: Making Complex Decisions (Chapter 17)

  2. Skip Chapter 16 and move to 17 • Making Complex Decisions • Consider making a sequence of decisions

  3. Complex decisions • Obvious answer: (up, up, right, right, right) • Insert unreliability of motor control • T (s, a, s’) is the probability of reaching state s’ from state s after executing action a • Markovian assumption

  4. How about other search techniques • Genetic Algorithms • Let each “gene” be a sequence of L, R, U, D • Length unknown • Poor feedback • Simulated annealing?

  5. Building a policy • How might we acquire and store a solution? • Is this a search problem? • Isn’t everything? • Avoid local mins • Avoid dead ends • Avoid needless repetition • Key observation: if the number of states is small, consider evaluating states rather than evaluating action sequences

  6. Complex decisions • Reward at each state: R(s) • Two goal states that terminate the game (+1 and -1) • Each non-goal state has a reward of -0.04 • Total utility is the sum of state rewards • So reach (4, 3) quickly

  7. Markov Decision Processes (MDP) • Initial State • S0 • Transition Model • T (s, a, s’) • How does Markov apply here? • Uncertainty is possible • Reward Function • R(s) • For each state
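
To make these pieces concrete, here is a minimal Python sketch of the 4x3 world as an MDP. It assumes the book's motion model, which this slide does not spell out: the intended move succeeds with probability 0.8, slips to each perpendicular direction with probability 0.1, and a move into a wall leaves the agent in place. All names (STATES, ACTIONS, T, R, TERMINALS) are illustrative; later sketches reuse them.

    ACTIONS = {'U': (0, 1), 'D': (0, -1), 'L': (-1, 0), 'R': (1, 0)}
    TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}    # the +1 and -1 goal states
    WALL = (2, 2)
    STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != WALL]

    def R(s):
        """Reward function: +1 / -1 at the terminals, -0.04 everywhere else."""
        return TERMINALS.get(s, -0.04)

    def T(s, a):
        """Transition model: list of (probability, next_state) pairs for action a in state s."""
        if s in TERMINALS:
            return [(1.0, s)]                       # terminals are absorbing
        def go(direction):
            dx, dy = ACTIONS[direction]
            nxt = (s[0] + dx, s[1] + dy)
            return nxt if nxt in STATES else s      # bumping a wall: stay put
        slips = {'U': 'LR', 'D': 'LR', 'L': 'UD', 'R': 'UD'}[a]
        return [(0.8, go(a))] + [(0.1, go(p)) for p in slips]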

  8. Maximizing utility in MDPs • Fixed strategies won’t work • [up, up, right, right, right] won’t always end in the upper-right corner • Instead, develop a policy (π) for each state • If the first action fails to behave as intended, the policy for the unintended state will make the most of the situation • With a complete policy, no matter the outcome of an action, the agent will know what to do next

  9. Building a policy • Specify a solution for any initial state • Construct a policy that outputs the best action for any state • policy = π • policy in state s = π(s) • A complete policy covers all potential input states • The optimal policy, π*, yields the highest expected utility • Why expected? • Transitions are stochastic

  10. Using a policy • An agent in state s • s is the percept available to the agent • π*(s) outputs an action that maximizes expected utility • The policy is a description of a simple reflex
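
As a sketch of that reflex loop, reusing T, R, and TERMINALS from the MDP sketch above (the function name and the use of random sampling are illustrative, not from the slides):

    import random

    def run_episode(policy, s=(1, 1)):
        """Follow a fixed policy: the current state is the only percept needed."""
        total = 0.0
        while s not in TERMINALS:
            total += R(s)
            outcomes = T(s, policy[s])                 # [(probability, next_state), ...]
            s = random.choices([s2 for _, s2 in outcomes],
                               weights=[p for p, _ in outcomes])[0]
        return total + R(s)                            # undiscounted sum of rewards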

  11.-12. Example solutions [figure: optimal policies for the 4x3 world with R(s) = -0.04; labels include “Move quickly and aim for goal,” “Die!,” “Ultra conservative,” and “Stayin’ Alive”; the slide notes typos in the book]

  13. Striking a balance • Different policies demonstrate balance between risk and reward • Only interesting in stochastic environments (not deterministic) • Characteristic of many real-world problems • Building the optimal policy is the hard part!

  14. Attributes of optimality • We wish to find the policy that maximizes the utility of the agent during its lifetime • Maximize U([s0, s1, s2, …, sn]) • But is the length of the lifetime known? • Finite horizon – the number of state transitions is known • After timestep N, nothing matters • U([s0, s1, s2, …, sn]) = U([s0, s1, s2, …, sn, sn+1, …, sn+k]) for all k > 0 • Infinite horizon – there is always an opportunity for more state transitions

  15. Time horizon • Consider spot (3, 1) • Let horizon = 3 • Let horizon = 8 • Let horizon = 20 • Let horizon = ∞ • Does π* change? • With a finite horizon the optimal policy is nonstationary: it changes as a function of how many steps remain in the lifetime

  16. Evaluating state sequences • Assumption • If I say I will prefer state a to state b tomorrow, I must also say I prefer state a to state b today • State preferences are stationary • Additive Rewards • U([a, b, c, …]) = R(a) + R(b) + R(c) + … • Discounted Rewards • U([a, b, c, …]) = R(a) + γR(b) + γ²R(c) + … • γ is the discount factor, between 0 and 1 • What does this mean?
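
A tiny illustrative sketch of the two reward models (function names are mine, not the book's):

    def additive_utility(rewards):
        """U([s0, s1, ...]) with no discounting: just the sum of rewards."""
        return sum(rewards)

    def discounted_utility(rewards, gamma):
        """U([s0, s1, ...]) = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ..."""
        return sum(gamma ** t * r for t, r in enumerate(rewards))

    # e.g. discounted_utility([-0.04, -0.04, -0.04, 1.0], gamma=0.9) is about 0.62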

  17. Evaluating infinite horizons • How can we compute the sum over an infinite horizon? • U([a, b, c, …]) = R(a) + R(b) + R(c) + … • If the discount factor γ is less than 1, the discounted sum is bounded: U([s0, s1, s2, …]) = Σt γ^t R(st) ≤ Rmax / (1 - γ) • Note Rmax is finite by definition of the MDP

  18. Evaluating infinite horizons • How can we compute the sum over an infinite horizon? • If the agent is guaranteed to end up in a terminal state eventually • We’ll never actually have to compare infinite strings of states • We can allow γ to be 1 and still compute the sum

  19. Evaluating a policy • Each policy, π, generates multiple state sequences • Uncertainty in transitions according to T(s, a, s’) • The value of a policy is the expected sum of discounted rewards, taken over all possible state sequences

  20. Value Iteration • Building an optimal policy • Calculate the utility of each state • Use the state utilities to select an optimal action in each state • Your policy is simple – go to the state with the best utility • Your state utilities must be accurate • Through an iterative process, you assign correct values to the state utilities

  21. Utility of states • The utility of a state s is… • the expected utility of the state sequences that might follow it • The subsequent state sequence is a function of π(s) • The utility of a state given policy π is: U^π(s) = E[ Σt γ^t R(st) | π, s0 = s ]

  22. Example • Let γ = 1 and R(s) = -0.04 • Notice: utilities are higher near the goal, reflecting fewer -0.04 steps in the sum

  23. Restating the policy • I had said you go to the state with the highest utility • Actually… • Go to the state with maximum expected utility • The reachable state with the highest utility may have a low probability of being reached • A function of: available actions, transition function, resulting states
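
In symbols, this is the standard maximum-expected-utility action selection (one-step lookahead over the transition model):

  π*(s) = argmax_a Σs’ T(s, a, s’) U(s’)

Each action is scored by its expected next-state utility under T, not by the single best state it might happen to reach.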

  24. Putting the pieces together • We said the utility of a state was the expected discounted reward of the state sequences that follow it • The policy picks the action with maximum expected utility • Therefore, the utility of a state is its immediate reward plus the discounted expected utility of the next state under the best action: U(s) = R(s) + γ max_a Σs’ T(s, a, s’) U(s’)

  25. What a deal • The one-step equation above is much cheaper to evaluate than the original definition, which sums discounted rewards over every possible state sequence • Richard Bellman formulated this recursive equation • The Bellman equation (1957)

  26. Example of the Bellman Equation • Revisit the 4x3 example • Utility at cell (1, 1) • Consider all outcomes of every available action, pick the best action, and its expected next-state utility is what enters the Bellman equation
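
Written out in full for cell (1, 1), assuming the 0.8/0.1/0.1 motion model (a move into a wall leaves the agent in place), the four candidate actions give:

  U(1, 1) = -0.04 + γ max[ 0.8 U(1, 2) + 0.1 U(2, 1) + 0.1 U(1, 1),      (Up)
                           0.9 U(1, 1) + 0.1 U(1, 2),                    (Left)
                           0.9 U(1, 1) + 0.1 U(2, 1),                    (Down)
                           0.8 U(2, 1) + 0.1 U(1, 2) + 0.1 U(1, 1) ]     (Right)

With the utilities of the solved 4x3 world, Up comes out best at this cell.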

  27. Using Bellman Equations to solve MDPs • Consider a particular MDP • n possible states • n Bellman equations (one for each state) • n equations have n unknowns (U(s) for each state) • n equations and n unknowns… I can solve this, right? • No, because the max( ) operator makes the equations nonlinear • We’ll use an iterative technique

  28. Iterative solution of Bellman equations • Start with arbitrary initial values for state utilities • Update the utility of each state as a function of its neighbors • Repeat this process until an equilibrium is reached

  29. Bellman Update • Iterative updates look like this: Ui+1(s) ← R(s) + γ max_a Σs’ T(s, a, s’) Ui(s’) • After infinitely many Bellman updates, we are guaranteed to reach an equilibrium that solves the Bellman equations • The solution is unique • The corresponding policy is optimal • Sanity check… utilities for states near the goal will settle quickly, and their neighbors will settle in turn • Information is propagated through the state space via local updates
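
A compact sketch of value iteration over the earlier MDP sketch (reusing STATES, ACTIONS, T, R, TERMINALS). The slides' example uses γ = 1, but this sketch assumes γ < 1 so the stopping test below is valid; with γ = 1 you would stop on a plain small-change test instead.

    def value_iteration(states, T, R, terminals, gamma=0.95, eps=1e-4):
        """Repeat the Bellman update
             U_{i+1}(s) <- R(s) + gamma * max_a sum_{s'} T(s, a, s') U_i(s')
           until the largest per-sweep change is small enough."""
        U = {s: 0.0 for s in states}                    # arbitrary initial utilities
        while True:
            U_new, delta = {}, 0.0
            for s in states:
                if s in terminals:                      # a terminal's utility is just its reward
                    U_new[s] = R(s)
                else:
                    best = max(sum(p * U[s2] for p, s2 in T(s, a)) for a in ACTIONS)
                    U_new[s] = R(s) + gamma * best
                delta = max(delta, abs(U_new[s] - U[s]))
            U = U_new
            if delta <= eps * (1 - gamma) / gamma:      # stopping test from the convergence slides
                return U

    # e.g. U = value_iteration(STATES, T, R, TERMINALS, gamma=0.95)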

  30. Convergence of value iteration

  31. Convergence of value iteration • How close to the optimal policy am I after i Bellman updates? • The book shows how to calculate the error at time i as a function of the error at time i - 1 and the discount factor γ • The analysis is mathematically rigorous because the Bellman update is a contraction
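
Concretely (a sketch of the standard result, not a proof): each Bellman update contracts the error by γ in the max norm,

  || Ui+1 - U* || ≤ γ || Ui - U* ||

so the error shrinks geometrically, and a practical stopping test follows: if || Ui+1 - Ui || < ε(1 - γ)/γ, then || Ui+1 - U* || < ε.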

  32. Policy Iteration • Imagine someone gave you a policy • How good is it? • Assume we know γ and R • Eyeball it? • Try a few paths and see how it works? • Let’s be more precise…

  33. Policy iteration • Checking a policy • Just for kicks, let’s compute a utility (at this particular iteration of the policy, i) for each state according to the Bellman equation with the action fixed by the policy: Ui(s) = R(s) + γ Σs’ T(s, πi(s), s’) Ui(s’)

  34. Policy iteration • Checking a policy • But we don’t know Ui(s’) • No problem • n Bellman equations • n unknowns • The equations are linear (the policy fixes the action, so there is no max) • We can solve for the n unknowns in O(n³) time using standard linear algebra methods
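
A sketch of that exact evaluation step with NumPy, reusing STATES, T, R, and TERMINALS from the MDP sketch earlier; it builds the n-by-n linear system (I - γ T_π) U = R and solves it directly:

    import numpy as np

    def evaluate_policy(policy, states, T, R, terminals, gamma=0.95):
        """Exact policy evaluation: solve the n x n linear system U = R + gamma * T_pi U."""
        index = {s: i for i, s in enumerate(states)}
        n = len(states)
        P = np.zeros((n, n))            # P[i, j] = Pr(next state j | state i, policy's action)
        for s in states:
            if s in terminals:
                continue                # no future reward is collected after a terminal state
            for p, s2 in T(s, policy[s]):
                P[index[s], index[s2]] += p
        r = np.array([R(s) for s in states])
        U = np.linalg.solve(np.eye(n) - gamma * P, r)
        return {s: U[index[s]] for s in states}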

  35. Policy iteration • Checking a policy • Now we know U(s) for all s • For each s, compute argmax_a Σs’ T(s, a, s’) U(s’) • This is the best action • If this action is different from the policy’s action, update the policy
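
Putting evaluation and improvement together, a sketch reusing evaluate_policy and ACTIONS from above (the starting policy and tie-breaking are arbitrary choices, not from the slides):

    def policy_iteration(states, T, R, terminals, gamma=0.95):
        """Alternate exact evaluation with greedy improvement until the policy is stable."""
        policy = {s: 'U' for s in states if s not in terminals}    # arbitrary starting policy
        while True:
            U = evaluate_policy(policy, states, T, R, terminals, gamma)
            stable = True
            for s in policy:
                expected = lambda a: sum(p * U[s2] for p, s2 in T(s, a))
                best = max(ACTIONS, key=expected)
                if expected(best) > expected(policy[s]):           # strictly better: update
                    policy[s], stable = best, False
            if stable:
                return policy, U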

  36. Policy Iteration • Often the most efficient approach • The state space must be small, because exact evaluation is O(n³) in time • Approximations are possible • Rather than solve for U exactly, approximate with a speedy iterative technique (see the sketch below) • Explore (update the policy of) only a subset of the total state space • Don’t bother updating parts you think are bad
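
One way to realize the speedy iterative technique mentioned above, often called modified policy iteration, is to run a few cheap Bellman updates with the policy's action fixed instead of the exact linear solve (a sketch, assuming the same T, R, and terminal handling as the earlier code):

    def approximate_evaluation(policy, U, T, R, terminals, gamma=0.95, k=20):
        """Run k simplified Bellman updates in place of the exact O(n^3) solve."""
        for _ in range(k):
            U = {s: R(s) if s in terminals
                 else R(s) + gamma * sum(p * U[s2] for p, s2 in T(s, policy[s]))
                 for s in U}
        return U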
