RL Successes and Challenges in High-Dimensional Games

Presentation Transcript


  1. RL Successes and Challenges in High-Dimensional Games Gerry Tesauro IBM T.J.Watson Research Center

  2. Outline • Overview/Definition of “Games” • Why Study Games? • Commonalities of RL successes • RL in Classic Board Games • TD-Gammon, KnightCap, TD-Chinook, RLGO • RL in Robotics Games • Attacker/Defender Robots • Robocup Soccer • RL in Video/Online Games • AI Fighters • Open Discussion / Lessons Learned

  3. What Do We Mean by “Games” ?? • Some Definitions of “Game” • A structured activity, usually undertaken for enjoyment (Wikipedia) • Activity among decision-makers in seeking to achieve objectives in a limiting context (Clark Abt) • A form of play with goals and structure (Kevin Maroney) • Single-Player Game = “Puzzle” • “Competition” if players can’t interfere with other players’ performance • Olympic Hockey vs. Olympic Figure Skating • Common Ingredients: Players, Rules, Objective • But: Games with modifiable rules, no clear objective (e.g., MOOs)

  4. Why Use Games for RL/AI ?? • Clean, Idealized Models of Reality • Rules are clear and known (Samuel: not true in economically important problems) • Can build very good simulators • Clear Metric to Measure Progress • Tournament results, Elo ratings, etc. • Danger: Metric takes on a life of its own • Competition spurs progress • DARPA Grand Challenge, Netflix competition • Man vs. Machine Competition • “adds spice to the study” (Samuel) • “provides a convincing demonstration for those who do not believe that machines can learn” (Samuel)

  5. How Games Extend “Classic RL” [Diagram: games extend classic RL (backgammon, chess, etc. near the origin) along several dimensions: complex motivation (“motivated” RL), multi-agent game strategy, and lifelike environments, plus a fourth dimension, non-stationarity. Example games placed along these axes include Poker, Chicken, Robocup Soccer, and AI Fighters.]

  6. Ingredients for RL success • Several commonalities: • Problems are more-or-less MDPs (near full observability, little history dependence) • |S| is enormous → can’t do DP • State-space representation critical: use of “features” based on domain knowledge • Train in a simulator! Need lots of experience, but still << |S| • Smooth function approximation (linear or NN) → very aggressive generalization/extrapolation • Only visit plausible states; only generalize to plausible states

  7. RL + Gradient Parameter Training • Recall incremental Bellman updates (TD(0)): V(s) ← V(s) + α [ r + γ V(s') − V(s) ] • If instead V(s) = V_θ(s), adjust θ to reduce the MSE (R − V_θ(s))^2 by gradient descent: Δθ = α (R − V_θ(s)) ∇_θ V_θ(s)
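A minimal sketch of this slide's update for a linear value function, not from the talk itself; the function name td0_update and the features helper are illustrative, and the (R − V) term appears here as the usual TD error r + γV(s') − V(s).

    import numpy as np

    def td0_update(theta, features, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
        """One TD(0) step for a linear value function V_theta(s) = theta . features(s)."""
        x = features(s)
        v = theta @ x
        v_next = 0.0 if terminal else theta @ features(s_next)
        td_error = r + gamma * v_next - v       # plays the role of (R - V_theta(s))
        return theta + alpha * td_error * x     # for a linear V, grad_theta V(s) = features(s)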

  8. TD(λ) training of neural networks (episodic; γ = 1 and intermediate r = 0): Δw_t = α (V_{t+1} − V_t) Σ_{k=1}^{t} λ^{t−k} ∇_w V_k (with the final target V_{f+1} replaced by the game outcome)
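A sketch of this training scheme for a tiny sigmoid network, in the spirit of TD-Gammon but with illustrative sizes and names (TDLambdaNet and train_episode are not from the talk); eligibility traces accumulate the gradients Σ λ^{t−k} ∇_w V_k, and the final update uses the game outcome z as the target.

    import numpy as np

    class TDLambdaNet:
        """1-hidden-layer net trained with episodic TD(lambda), gamma = 1, intermediate r = 0."""

        def __init__(self, n_in, n_hidden, alpha=0.1, lam=0.7, seed=0):
            rng = np.random.default_rng(seed)
            self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
            self.W2 = rng.normal(0.0, 0.1, n_hidden)
            self.alpha, self.lam = alpha, lam

        def value(self, x):
            h = 1.0 / (1.0 + np.exp(-self.W1 @ x))      # hidden layer (sigmoid)
            v = 1.0 / (1.0 + np.exp(-self.W2 @ h))      # output: estimated P(win)
            return v, h

        def grads(self, x, v, h):
            # gradients of the scalar output v with respect to W2 and W1
            dv = v * (1.0 - v)
            gW2 = dv * h
            gW1 = np.outer(dv * self.W2 * h * (1.0 - h), x)
            return gW1, gW2

        def train_episode(self, positions, z):
            """positions: board vectors x_1..x_f from one game; z: final outcome (0 or 1)."""
            e1, e2 = np.zeros_like(self.W1), np.zeros_like(self.W2)
            v_prev = None
            for x in positions:
                v, h = self.value(x)
                if v_prev is not None:
                    # dw_t = alpha * (V_t - V_{t-1}) * sum_k lam^{t-k} grad_w V_k
                    self.W1 += self.alpha * (v - v_prev) * e1
                    self.W2 += self.alpha * (v - v_prev) * e2
                g1, g2 = self.grads(x, v, h)
                e1 = self.lam * e1 + g1                  # eligibility traces
                e2 = self.lam * e2 + g2
                v_prev = v
            # terminal step: the target for the last position is the actual outcome z
            self.W1 += self.alpha * (z - v_prev) * e1
            self.W2 += self.alpha * (z - v_prev) * e2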

  9. RL in Classic Board Games

  10. [Backgammon board diagram: locations numbered 0–25, with the White bar (Wbar) and Black bar (Bbar)]

  11. Learning backgammon using TD(λ) • Neural net observes a sequence of input patterns x_1, x_2, x_3, …, x_f : sequence of board positions occurring during a game • Representation: Raw board description (# of White or Black checkers at each location) using simple truncated unary encoding (“hand-crafted features” added in later versions) • 1-D geometry → 28 board locations → 200 “raw” input units → 300 input units incl. features • Train neural net using gradient version of TD(λ) • Trained NN output V_t = V(x_t, w) should estimate prob(White wins | x_t)
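As an illustration of the "truncated unary" raw encoding, here is a sketch of the commonly described 198-unit variant (4 units per point per color, plus bar, borne-off, and side-to-move units); the exact unit counts on the slide (200/300) belong to other versions, and the function names are illustrative.

    import numpy as np

    def encode_point(n):
        """Truncated unary code for n checkers of one color on one point (4 units)."""
        return [float(n >= 1), float(n >= 2), float(n >= 3),
                (n - 3) / 2.0 if n > 3 else 0.0]

    def encode_board(white, black, w_bar, b_bar, w_off, b_off, white_to_move):
        """white/black: length-24 sequences of checker counts per point."""
        units = []
        for n in list(white) + list(black):
            units += encode_point(n)
        units += [w_bar / 2.0, b_bar / 2.0, w_off / 15.0, b_off / 15.0]   # scaled counts
        units += [1.0, 0.0] if white_to_move else [0.0, 1.0]              # side to move
        return np.array(units)   # 24*4*2 + 6 = 198 units in this variant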

  12. TD-Gammon can teach itself by playing games against itself and learning from the outcome • Works even starting from random initial play and zero initial expert knowledge (surprising!) → achieves strong intermediate play • add hand-crafted features: advanced level of play (1991) • 2-ply search: strong master play (1993) • 3-ply search: superhuman play (1998)

  13. New TD-Gammon Results! (Tesauro, 1992)

  14. Extending TD(λ) to TDLeaf • Checkers and Chess: 2-D geometry, 64 board locations, dozens to thousands (Deep Blue) of features, linear function approximation • Samuel had the basic idea: train the value of the current state to match the minimax backed-up value • Proper mathematical formulation proposed by Beal & Smith and by Baxter et al. • Baxter’s chess program KnightCap showed rapid learning in play vs. humans: 1650 → 2150 Elo in only 300 games! • Schaeffer et al. retrained the weights of the checkers program Chinook using TDLeaf + self-play; as strong as the manually tuned weights (a 5-year effort)
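A sketch of a TDLeaf(λ)-style update for a linear evaluation function, under the assumption that a separate search routine has already returned the feature vector of the principal-variation leaf for each position of a game; the name tdleaf_update and the batch-per-game form are illustrative, not the talk's exact formulation.

    import numpy as np

    def tdleaf_update(theta, leaf_features, alpha=0.01, lam=0.7):
        """TDLeaf(lambda)-style update for V(s) = theta . phi(s).

        leaf_features: list of feature vectors phi(l_1)..phi(l_N), where l_t is the
        leaf of the principal variation found by minimax search from position t.
        """
        values = [theta @ phi for phi in leaf_features]
        n = len(values)
        for t in range(n - 1):
            # temporal differences d_j = V(l_{j+1}) - V(l_j) for j >= t, discounted by lam
            weighted = sum(lam ** (j - t) * (values[j + 1] - values[j])
                           for j in range(t, n - 1))
            theta = theta + alpha * weighted * leaf_features[t]   # grad of a linear V is phi
        return theta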

  15. RL in Computer Go • Go: 2-D geometry, 361 board locations, hundreds to millions (RLGO) of features, linear or NN function approximation • NeuroGo (M. Enzenberger, 1996; 2003) • Multiple reward signals: single-point eyes, connections and live points • Rating ~1880 in 9x9 Go using 3-ply α-β search • RLGO (D. Silver, 2008) uses only primitive local features and a linear value function. Can do live on-the-fly training for each new position encountered in a Go game! • Rating ~2130 in 9x9 Go using α-β search (avg. depth ~6): strongest program not based on Monte-Carlo Tree Search

  16. RL in Robotics Games

  17. Robot Air Hockey • video at: http://www.cns.atr.jp/~dbent/mpeg/hockeyfullsmall.avi • D. Bentivegna & C. Atkeson, ICRA 2001 • 2-D spatial problem • 30 degree-of-freedom arm, 420 decisions/sec • hand-built primitives, supervised learning + RL

  18. WoLF in Adversarial Robot Learning • Gra-WoLF (Bowling & Veloso): Combines WoLF (“Win or Learn Fast”) principle with policy gradient RL (Sutton et al., 2000) • again 2-D spatial geometry, 7 input features, 16 CMAC tiles • video at: http://webdocs.cs.ualberta.ca/~bowling/videos/AdversarialRobotLearning.mp4

  19. RL in Robocup Soccer • Once again, 2-D spatial geometry • Much good work by Peter Stone et al. • TPOT-RL: Learned advanced team strategies given limited observability – key to CMUnited victories in late 90s • Fast Gait for Sony Aibo dogs • Ball Acquisition for Sony Aibo dogs • Keepaway in Robocup simulation league

  20. Robocup “Keepaway” Game (Stone et al.) • Uses the Robocup simulator, not real robots • Task: one team (“keepers”) tries to maintain possession of the ball as long as possible; the other team (“takers”) tries to take it away • Keepers are trained using a continuous-time, semi-Markov version of the Sarsa algorithm • Represent Q(s,a) using CMAC (coarse tile coding) function approximation • State representation: a small # of distances and angles between teammates, opponents, and the ball • Reward = time of possession • Results: learned policies do much better than either random or hand-coded policies, e.g. on a 25x25 field: • learned TOP 15.0 sec, hand-coded 8.0 sec, random 6.4 sec
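A simplified, discrete-time sketch of a Sarsa update with tile-coded linear features (the talk describes a continuous-time, semi-Markov version); active_tiles indexing, the weight-vector size, and the step size are all illustrative assumptions.

    import numpy as np

    N_WEIGHTS = 4096                      # size of the tile-coded weight vector (illustrative)
    w = np.zeros(N_WEIGHTS)

    def q_value(w, tiles):
        """Q(s,a) = sum of the weights of the tiles active for (s,a)."""
        return w[tiles].sum()

    def sarsa_step(w, tiles, reward, next_tiles, alpha=0.5, gamma=1.0, terminal=False):
        """One Sarsa update; 'tiles' is the index list from a CMAC/tile coding of the
        distances and angles in the state, and 'reward' is the time elapsed between
        decisions (so the return is the keepers' time of possession)."""
        q = q_value(w, tiles)
        q_next = 0.0 if terminal else q_value(w, next_tiles)
        delta = reward + gamma * q_next - q
        w[tiles] += (alpha / len(tiles)) * delta    # spread the step over the active tiles
        return w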

  21. RL in Video Games

  22. AI Fighters • Graepel, Herbrich & Gold, 2004: used the commercial game platform Tao Feng (runs on Xbox), a real-time simulator (3D!) • basic feature set + SARSA + linear value function • multiple challenges of the environment (real time, concurrency, …): • opponent state not known exactly • agent state and reward not known exactly • due to game animation, legal moves are not known

  23. Links to AI Fighters videos: before training: http://research.microsoft.com/en-us/projects/mlgames2008/taofengearlyaggressive.wmv after training: http://research.microsoft.com/en-us/projects/mlgames2008/taofenglateaggressive.wmv

  24. Discussion / Lessons Learned ?? • Winning formula: hand-designed features (fairly small number) + aggressive smooth function approximation • Researchers should try raw-input comparisons and nonlinear function approximation • Many/most state variables in real problems seem pretty irrelevant • Opportunity to try recent linear and/or nonlinear dimensionality-reduction algorithms • Sparsity constraints (L1 regularization, etc.) are also promising • Brain/retina architecture is impressively well suited to 2-D spatial problems • More studies using convolutional neural nets, etc.
