This project explores the effects of state abstraction and of errors within the Markov Decision Process (MDP) and Partially Observable MDP (POMDP) frameworks, in a maze domain. It evaluates how well lookahead search policies perform in the presence of these errors, including errors in the state transition function and in value function approximations. Through a series of experiments with varied state abstractions and with artificial neural network value approximations, the study aims to identify the conditions under which effective policies can still be generated and potential pitfalls mitigated. The findings inform future research directions in the field.
CMPUT 551
Analyzing abstraction and approximation within an MDP/POMDP environment
Magdalena Jankowska (M.Sc. - Algorithms)
Ilya Levner (M.Sc. - AI/ML)
OUTLINE
• Project Overview
• Analytical Results
• Maze Domain
• Experiments
• Results
• Conclusions and Future Work
MDP environment: Maze domain
• states
• actions
• transitions between states
• immediate rewards
• Markov property
• optimal value V* of each state
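A minimal sketch of such a maze MDP, with V* computed by value iteration, may make these ingredients concrete. The 48x48 grid matches the experimental setup later in the deck, but the corner goal, deterministic moves, reward values, and discount factor are our assumptions; the slides do not specify them.

```python
# A minimal maze-MDP sketch (hypothetical layout: empty 48x48 grid,
# corner goal, deterministic moves, goal reward 1, discount 0.95).
import numpy as np

N = 48                                         # maze is an N x N grid
GOAL = (N - 1, N - 1)                          # goal state (assumed)
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]   # right, left, down, up
GAMMA = 0.95                                   # discount factor (assumed)

def step(s, a):
    """Deterministic transition: move if inside the maze, else stay put."""
    ns = (s[0] + a[0], s[1] + a[1])
    return ns if 0 <= ns[0] < N and 0 <= ns[1] < N else s

def reward(s):
    """Immediate reward: 1 for reaching the goal, 0 elsewhere (assumed)."""
    return 1.0 if s == GOAL else 0.0

def value_iteration(eps=1e-6):
    """Compute the optimal value V* of each state."""
    V = np.zeros((N, N))
    while True:
        V_new = V.copy()
        for s in np.ndindex(N, N):
            if s == GOAL:
                continue                       # goal is absorbing
            V_new[s] = max(reward(step(s, a)) + GAMMA * V[step(s, a)]
                           for a in ACTIONS)
        if np.max(np.abs(V_new - V)) < eps:
            return V_new
        V = V_new
```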
Project Overview
• Analyze an MDP/POMDP domain in the presence of:
  • state abstraction
  • errors in the state transition function
  • errors in the V* function due to state abstraction and machine learning
• Evaluate the effectiveness of a lookahead search policy in the presence of these errors.
Questions
• When is the problem an MDP?
• If it is not an MDP: can we restore the Markov property?
• Limited lookahead: does it help?
No state abstraction, imperfect value function V
• still an MDP
• V can now be used as a heuristic
• limited lookahead: usually requires an admissible heuristic function
• combining lookahead with learning:
  • Learning Real-Time A* (LRTA*)
  • Real-Time Dynamic Programming (RTDP)
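A sketch of the depth-limited lookahead used throughout, reusing the `step`, `reward`, `ACTIONS`, and `GAMMA` helpers from the MDP sketch above; treating V as a callable state-to-value estimate is our convention, not the slides'.

```python
def lookahead_value(s, depth, V, gamma=GAMMA):
    """Best depth-step return from s, backed up from the estimate V
    at the search frontier."""
    if depth == 0:
        return V(s)                            # heuristic at the frontier
    return max(reward(step(s, a)) +
               gamma * lookahead_value(step(s, a), depth - 1, V, gamma)
               for a in ACTIONS)

def lookahead_policy(s, depth, V, gamma=GAMMA):
    """Greedy action under depth-limited lookahead (depth >= 1)."""
    return max(ACTIONS,
               key=lambda a: reward(step(s, a)) +
                   gamma * lookahead_value(step(s, a), depth - 1, V, gamma))

# With a perfect value function the heuristic is just a table lookup:
# V_fn = lambda s: V_star[s]
```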
“Abstracted” value function
• We know where we are, but the value function is the same for all states within an abstracted state.
“Abstracted” value function
• In a given abstracted state, the value is the average of V* over all states in that abstracted state (see the sketch below)
• this heuristic is not admissible
• lookahead may help the agent get outside the abstraction boundary
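A sketch of this averaged value function under a k x k tiling abstraction; the tiling matches the experimental setup later in the deck, and the helper names are ours.

```python
def abstract_state(s, k):
    """Map a maze cell to the k x k tile that contains it."""
    return (s[0] // k, s[1] // k)

def abstracted_V(V, k):
    """Value of each tile = average of V* over the cells inside it."""
    tiles = {}
    for s in np.ndindex(N, N):
        tiles.setdefault(abstract_state(s, k), []).append(V[s])
    return {t: sum(vs) / len(vs) for t, vs in tiles.items()}
```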
Does lookahead always help?
[Maze figures: goal G, lookahead search at depth 1 and at depth 3]
State abstraction
• not Markovian
• a special case of a POMDP
• transitions between abstracted states, and rewards, depend on the history
• in some special cases it is Markovian
How to restore the Markov property?
• If we know the underlying MDP: update a belief over states, giving a fully observable MDP in belief space
  • solve the belief MDP
  • use V* of the underlying states as a heuristic
  • Real-Time Dynamic Programming in belief space
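A minimal sketch of the belief update, under our reading that the observation is the abstracted state containing the true state; it reuses `step` and `abstract_state` from the earlier sketches, and the deterministic transition model is again an assumption.

```python
def belief_update(b, a, obs_tile, k):
    """Posterior belief over underlying states after taking action a
    and observing the abstracted state obs_tile.
    b is a dict mapping state -> probability."""
    b_new = {}
    for s, p in b.items():
        ns = step(s, a)                        # deterministic model (assumed)
        if abstract_state(ns, k) == obs_tile:  # keep only consistent states
            b_new[ns] = b_new.get(ns, 0.0) + p
    total = sum(b_new.values())
    return {s: p / total for s, p in b_new.items()} if total else b_new
```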
How to restore the Markov property?
• If we do not know the underlying MDP: use the history as part of the state description
• How long a history do we need? In general: the whole history; in special cases: only part of it.
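A sketch of folding a fixed-length history into the state; the horizon H is a free parameter, and as the slide notes, in general the whole history may be needed.

```python
from collections import deque

class HistoryState:
    """Augment the observed (abstracted) state with the last H
    action-observation pairs to approximate the Markov property."""
    def __init__(self, horizon):
        self.h = deque(maxlen=horizon)

    def update(self, action, obs):
        self.h.append((action, obs))

    def key(self):
        return tuple(self.h)        # hashable state for planning/learning
```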
Error in the transition function
• can be crucial
• the agent can easily be trapped in loops
Error in the transition function
Example: no state abstraction, perfect V*
[Figure: a corridor of states 1-10 with the goal G at the right end; two actions, left and right]
• real transitions: deterministic (each action succeeds 100% of the time)
• what we think: the actions are stochastic (35% / 65% mixtures of moving left and right)
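The looping failure mode is easy to reproduce. The exact believed percentages in the figure are ambiguous in this rendering, so the sketch below invents a different mismatch (a single state whose action effects the model gets backwards) purely to show the mechanism.

```python
CORRIDOR = range(1, 11)                  # states 1..10, goal G at 10
V = {s: float(s) for s in CORRIDOR}      # perfect V*: higher = closer to G

def real_step(s, a):
    """True dynamics: deterministic move along the corridor."""
    return min(max(s + a, 1), 10)

def believed_step(s, a):
    """Wrong model: in state 5 the agent believes actions are reversed."""
    return real_step(s, -a if s == 5 else a)

def greedy(s):
    """One-step lookahead under the believed (wrong) model."""
    return max((+1, -1), key=lambda a: V[believed_step(s, a)])

s, trace = 3, []
for _ in range(8):
    trace.append(s)
    s = real_step(s, greedy(s))
print(trace)   # [3, 4, 5, 4, 5, 4, 5, 4]: trapped in a loop, never reaches G
```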
Experimental Setup
• 48x48 cell maze
• 3 experiments:
  • state abstraction
  • machine learning (ANN)
  • state abstraction and machine learning
• Error measurements:
  • relative score (global policy error)
  • distance to goal (sample score error)
State Abstraction Error(s)
• abstraction tile size varied: k = 1, 2, 3, 4, 6, 8, 12, 24, 48
• ply depth 1-7, 10 games per ply depth (see the sweep sketch below)
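A hypothetical sketch of this sweep, reusing the earlier helpers; the episode logic, start state, and step cap are our assumptions (the slides specify only the tile sizes, ply depths, and game counts, and with deterministic dynamics the repeated games would need varied start states, another detail the slides leave open).

```python
TILE_SIZES = [1, 2, 3, 4, 6, 8, 12, 24, 48]

def run_episode(V, depth, start=(0, 0), max_steps=500):
    """Walk greedily under depth-limited lookahead; return steps taken,
    a distance-to-goal style score (assumed scoring)."""
    s = start
    for t in range(max_steps):
        if s == GOAL:
            return t
        s = step(s, lookahead_policy(s, depth, V))
    return max_steps                     # episode failed to reach the goal

V_star = value_iteration()
for k in TILE_SIZES:
    V_abs = abstracted_V(V_star, k)
    V_tilde = lambda s, V_abs=V_abs, k=k: V_abs[abstract_state(s, k)]
    for depth in range(1, 8):            # ply depth 1-7
        scores = [run_episode(V_tilde, depth) for _ in range(10)]
```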
Machine Learning Error
• 2-h-100 ANN: inputs (x, y), output V*(s)
• error varied by changing the number of hidden nodes h in the network (1-20)
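A minimal sketch of this value-function approximation, with scikit-learn's MLPRegressor standing in for the original network; the training hyperparameters are assumptions, since the slides specify only the inputs, output, and hidden-node range.

```python
from sklearn.neural_network import MLPRegressor

V_star = value_iteration()
X = np.array(list(np.ndindex(N, N)), dtype=float)       # inputs (x, y)
y = np.array([V_star[s] for s in np.ndindex(N, N)])     # targets V*(s)

for h in range(1, 21):                   # vary hidden nodes, as in the slide
    net = MLPRegressor(hidden_layer_sizes=(h,), max_iter=2000, random_state=0)
    net.fit(X, y)
    V_hat = lambda s, net=net: float(net.predict([list(map(float, s))])[0])
    # V_hat can be plugged into lookahead_policy as the (learned) heuristic
```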
Conclusion
Most important results:
• analysis of lookahead with an “abstracted” value function, both analytically and, especially, experimentally
• demonstration of the possible adverse effects of errors in the transition function
• answers to the questions about the Markov property, and an investigation of ways to restore it
Future Work
• Improve policy error evaluation measures
• Further analytical work on lookahead