
CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty


Presentation Transcript


  1. CPSC 7373: Artificial Intelligence, Lecture 10: Planning with Uncertainty. Jiang Bian, Fall 2012, University of Arkansas at Little Rock

  2. Planning under Uncertainty [Diagram relating planning, uncertainty, and learning: MDPs and POMDPs address planning under uncertainty; reinforcement learning (RL) combines planning, uncertainty, and learning.]

  3. Planning Agent Tasks Characteristics • An environment is stochastic if the outcome of an action is somewhat random, whereas in a deterministic environment the outcome of an action is predictable and always the same. • An environment is fully observable if you can see its state, i.e., if you can make all decisions based on the momentary sensory input; if you need memory, it is partially observable.

  4. Markov Decision Process (MDP) [Diagram: a finite state machine over states S1, S2, S3 with actions a1 and a2; adding randomness to the action outcomes (e.g., taking a1 in S1 leads to S2 or S3 with 50% probability each) turns the finite state machine into a Markov Decision Process.]

  5. Markov Decision Process (MDP) States: S1…Sn; Actions: a1…ak; State transition matrix: T(S, a, S') = P(S'|a, S); Reward function: R(S). [Diagram: the same state machine, with a1 in S1 leading to S2 or S3 with 50% probability each.]
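To make these definitions concrete, here is a minimal sketch (not from the slides) of how such an MDP could be held in plain Python data structures; the variable names and the transitions not shown on the diagram are assumptions for illustration.

```python
# Minimal MDP representation (illustrative sketch).
# T[(s, a)] lists (next_state, probability) pairs, i.e., P(S'|a, S);
# R[s] is the reward received in state s.

states = ["S1", "S2", "S3"]
actions = ["a1", "a2"]

T = {
    ("S1", "a1"): [("S2", 0.5), ("S3", 0.5)],  # stochastic outcome shown on the slide
    ("S1", "a2"): [("S1", 1.0)],               # assumed for illustration
    ("S2", "a1"): [("S1", 1.0)],               # assumed
    ("S2", "a2"): [("S3", 1.0)],               # assumed
    ("S3", "a1"): [("S2", 1.0)],               # assumed
    ("S3", "a2"): [("S3", 1.0)],               # assumed
}

R = {"S1": 0.0, "S2": 0.0, "S3": 0.0}
```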

  6. MDP Grid World [Grid: columns 1-4, rows a-c; the absorbing states are in column 4.] Stochastic actions: each action moves in the intended direction with probability 80% and veers to either side with probability 10% each. Policy: π(s) -> A. The planning problem we have becomes one of finding the optimal policy.
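A small sketch (not from the slides) of this 80/10/10 motion model on a 3x4 grid; the coordinate scheme, the stay-in-place handling at the grid edge, and the function names are assumptions for illustration.

```python
# Stochastic grid-world motion model: the intended move succeeds with
# probability 0.8, and the agent veers 90 degrees to either side with
# probability 0.1 each. A move that would leave the grid keeps the agent
# in place. States are (row, column) pairs, rows 'a'..'c', columns 1..4.

ROWS, COLS = "abc", [1, 2, 3, 4]
MOVES = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}
VEER = {"N": ("W", "E"), "S": ("E", "W"), "W": ("S", "N"), "E": ("N", "S")}

def step(state, direction):
    """Deterministic single step; stay in place if the move leaves the grid."""
    r, c = ROWS.index(state[0]), state[1]
    dr, dc = MOVES[direction]
    nr, nc = r + dr, c + dc
    if 0 <= nr < len(ROWS) and nc in COLS:
        return (ROWS[nr], nc)
    return state

def transition(state, action):
    """Possible outcomes of an action as (next_state, probability) pairs."""
    left, right = VEER[action]
    return [(step(state, action), 0.8),
            (step(state, left), 0.1),
            (step(state, right), 0.1)]
```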

  7. Stochastic Environments – Conventional Planning [Search tree: from state c1 the actions N, S, W, E branch into their possible outcome states, e.g., b1, c1, c2.] • Problems: • Branching factor: 4 action choices with 3 outcomes each gives at least 12 branches to follow at every node. • Depth of the search tree (loops can make it very deep). • Many states are visited more than once (i.e., states may re-occur), whereas in A* we ensure we visit each state only once.

  8. Policy [Grid: columns 1-4, rows a-c.] Goal: find an optimal policy for all these states that, with maximum probability, leads to the +100 absorbing state. Quiz: What is the optimal action? a1: N, S, W, E ??? c1: N, S, W, E ??? c4: N, S, W, E ??? b3: N, S, W, E ???

  9. MDP and Costs Reward function R(s): +100 in a4, -100 in b4, -3 in all other states (i.e., the step cost gives us an incentive to shorten our action sequence). Objective of the MDP: maximize the expected sum of discounted future rewards, E[ Σt γ^t R(St) ], where γ is the discount factor, e.g., γ = 0.9 (i.e., future rewards decay).

  10. Value Function [Grid: columns 1-4, rows a-c.] Value function: for each state s, the value Vπ(s) is the expected sum of future discounted rewards, provided that we start in state s and execute policy π: Vπ(s) = E[ Σt γ^t R(St) | S0 = s, π ]. Planning = iteratively calculating value functions.

  11. Value Iteration [Two grids: the grid world's values before and after running value iteration through convergence.]

  12. Value Iteration - 2 Back-up equation: V(s) <- R(s) + γ max_a Σ_s' P(s'|a, s) V(s'); if s is a terminal (absorbing) state, V(s) <- R(s). After convergence the Bellman equality holds, V(s) = R(s) + γ max_a Σ_s' P(s'|a, s) V(s'), which expresses the optimal future cost/reward trade-off that you can achieve if you act optimally in any given state.
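A minimal value-iteration sketch in Python for this back-up equation, reusing the transition() helper and grid assumed in the earlier sketch together with the rewards from slide 9 (+100 in a4, -100 in b4, -3 elsewhere); the convergence threshold and the assumption that no cells are blocked are illustrative choices, not something the transcript specifies.

```python
# Value iteration for the grid-world sketch above (illustrative).
# Back-up: V(s) <- R(s) + gamma * max_a sum_s' P(s'|a, s) V(s');
# terminal states keep V(s) = R(s).

GAMMA = 1.0
TERMINAL = {("a", 4): 100.0, ("b", 4): -100.0}   # absorbing states from slide 9
LIVING_REWARD = -3.0

def reward(s):
    return TERMINAL.get(s, LIVING_REWARD)

def value_iteration(epsilon=1e-6):
    states = [(r, c) for r in ROWS for c in COLS]
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s in TERMINAL:
                new_v = reward(s)                  # terminal: V(s) = R(s)
            else:
                new_v = reward(s) + GAMMA * max(   # back-up equation
                    sum(p * V[s2] for s2, p in transition(s, a))
                    for a in MOVES)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < epsilon:
            return V
```

Under these assumptions the iteration converges because the -3 step cost pushes every non-terminal state toward an absorbing state; the exact converged numbers depend on grid details (such as blocked cells) that the transcript does not specify.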

  13. Quiz – DETERMINISTIC [Grid: columns 1-4, rows a-c.] DETERMINISTIC, γ = 1, R(s) = -3. V(a3) = ???

  14. Quiz - 1 DETERMINISTIC, γ = 1, R(s) = -3. V(a3) = 97. V(b3) = ???

  15. Quiz - 1 DETERMINISTIC, γ = 1, R(s) = -3. V(a3) = 97, V(b3) = 94. V(c1) = ???

  16. Quiz - 1 DETERMINISTIC, γ = 1, R(s) = -3. V(a3) = 97, V(b3) = 94, V(c1) = 85.
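As a worked check (not from the slides): in the deterministic case each step along the shortest path to the +100 state simply adds the -3 step cost, so the quiz values follow from the path lengths; the particular path shown for c1 is an assumption about the grid layout.

```latex
% Deterministic case, gamma = 1, R(s) = -3 per non-terminal step:
\begin{align*}
V(a_3) &= 100 - 1 \cdot 3 = 97 && \text{one step: } a_3 \to a_4 \\
V(b_3) &= 100 - 2 \cdot 3 = 94 && \text{two steps: } b_3 \to a_3 \to a_4 \\
V(c_1) &= 100 - 5 \cdot 3 = 85 && \text{five steps, e.g. } c_1 \to b_1 \to a_1 \to a_2 \to a_3 \to a_4
\end{align*}
```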

  17. Quiz – STOCHASTIC [Grid: columns 1-4, rows a-c.] STOCHASTIC, γ = 1, R(s) = -3, P = 0.8. V(a3) = ???

  18. Quiz – STOCHASTIC STOCHASTIC, γ = 1, R(s) = -3, P = 0.8. V(a3) = 77. V(b3) = ???

  19. Quiz – STOCHASTIC STOCHASTIC, γ = 1, R(s) = -3, P = 0.8. V(a3) = 77, V(b3) = 48.6. Going North from b3: 0.8 * 77 + 0.1 * (-100) + 0.1 * 0 - 3 = 48.6. Going West from b3: 0.1 * 77 + 0.8 * 0 + 0.1 * 0 - 3 = 4.7. North is the better action, so V(b3) = 48.6.
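A quick numeric check of this back-up in Python (illustrative; it assumes, as the slide does, that a3 has value 77, b4 has value -100, and all other neighbors of b3 are still at 0):

```python
# One-step back-ups for b3 with the neighbor values from the slide.
gamma, living_reward = 1.0, -3.0

v_north = living_reward + gamma * (0.8 * 77 + 0.1 * (-100) + 0.1 * 0)
v_west = living_reward + gamma * (0.1 * 77 + 0.8 * 0 + 0.1 * 0)

print(round(v_north, 1), round(v_west, 1))  # 48.6 4.7 -> North is better, so V(b3) = 48.6
```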

  20. Value Iteration and Policy - 1 What is the optimal policy?

  21. Value Iteration and Policy - 2 STOCHASTIC, γ = 1, R(s) = -3, P = 0.8. What is the optimal policy? [Grids: the converged values and the corresponding policy.] This is a situation where the risk of falling into the -100 state is balanced against the time spent going around it.

  22. Value Iteration and Policy - 3 STOCHASTIC, γ = 1, R(s) = 0, P = 0.8. What is the optimal policy? [Grids: the converged values and the corresponding policy.] With no step cost, long detours are free, so the policy can afford to avoid any risk of entering the -100 state.

  23. Value Iteration and Policy - 4 STOCHASTIC, γ = 1, R(s) = -100, P = 0.8. What is the optimal policy? [Grids: the converged values and the corresponding policy.] With a step cost this large, every extra step hurts as much as the -100 state, so the policy heads for the nearest absorbing state as quickly as possible.

  24. Markov Decision Processes • Fully observable: states S1, …, Sn; actions a1, …, ak • Stochastic: P(S'|a, S) • Reward: R(S) • Objective: maximize E[ Σt γ^t R(St) ] • Value iteration: V(S) • After convergence: π(S) = argmax_a Σ_S' P(S'|a, S) V(S')
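A sketch of that last step, extracting the greedy policy from a converged value function; it assumes the TERMINAL, MOVES, transition(), and value_iteration() names from the earlier grid-world sketches.

```python
def extract_policy(V):
    """Greedy policy from a converged value function:
    pi(s) = argmax_a sum_s' P(s'|a, s) V(s')."""
    policy = {}
    for s in V:
        if s in TERMINAL:
            continue  # no action is needed in an absorbing state
        policy[s] = max(MOVES,
                        key=lambda a: sum(p * V[s2]
                                          for s2, p in transition(s, a)))
    return policy

# Usage with the earlier sketches:
# V = value_iteration()
# pi = extract_policy(V)
```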

  25. Partial Observability [Maze with a +100 exit and a -100 exit.] • Fully observable, deterministic • Fully observable, stochastic

  26. Partial Observability [Maze where it is unknown which exit is +100 and which is -100 ('?'), but a SIGN in the maze provides that information.] • Partially observable, stochastic. MDP vs. POMDP: optimal exploration versus exploitation, where some of the actions might be information-gathering actions, whereas others might be goal-driven actions.

  27. Partial Observability [Two copies of the maze, one for each possible location of the +100 exit, each with probability 50%: the problem is solved in the information space (belief space), where the state captures what the agent believes about the world.]
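Planning in belief space means maintaining a probability distribution over the true state and updating it after every action and observation. A generic Bayesian belief-update sketch (not from the slides; the function and argument names are illustrative):

```python
# Generic POMDP belief update (illustrative sketch):
# after taking action a and observing o,
#   b'(s') is proportional to P(o | s') * sum_s P(s' | a, s) * b(s)

def update_belief(belief, action, observation, transition_prob, obs_prob):
    """belief: dict mapping state -> probability;
    transition_prob(s, a, s2) = P(s2 | a, s); obs_prob(o, s2) = P(o | s2)."""
    new_belief = {}
    for s2 in belief:
        predicted = sum(transition_prob(s, action, s2) * p
                        for s, p in belief.items())
        new_belief[s2] = obs_prob(observation, s2) * predicted
    total = sum(new_belief.values())
    return {s: v / total for s, v in new_belief.items()} if total > 0 else new_belief
```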
