Deep RL in Games Research @IBM Gerry Tesauro Principal Research Staff Member IBM T.J.Watson Research Center <gtesauro AT us DOT ibm DOT com> http://researcher.watson.ibm.com/researcher/view.php?person=us-gtesauro Joint work with: Janusz Marecki (IBM, Google DeepMind) Joe Bigus (IBM) Ban Kawas (IBM) Kamil Rocki (IBM)
History of Games @ IBM: backgammon, chess, checkers, Go, Jeopardy!
New TD-Gammon Results! (Tesauro, 1992)
Towards Vision-Based Maze Navigation
• Use one of the earliest FPS ("First-Person Shooter") computer (DOS) games: Wolfenstein 3D (1992)
• Display shows a live 3-dimensional visual depiction of the environment from the "first-person" perspective
• By contrast, the depiction of Atari game state is "flat"
• Game consists of a series of mazes or "levels": the goal is to exit each level while defeating enemies and picking up useful supplies (ammunition, food, medical kits, etc.)
Why Study 3D Maze Games?
• Instances of challenging POMDPs
  • Each maze has an unknown layout
  • The player cannot infer the full game state from the current visual frame, so a history of past observations must be maintained
• Clear metrics to measure progress
  • Point scores, time to clear each level
• High-quality simulation model
  • Training in simulation is usually more effective than live training
• Potential competition with expert humans
  • "adds spice to the study" (Samuel)
  • "provides a convincing demonstration for those who do not believe that machines can learn" (Samuel)
Highly Simplified Initial Task
• Eliminate objects, weapons, enemies
• Only goal is to find the exit ("First-Person Non-Shooter")
• Create simplified maze, colors, textures
  • (Ambiguity increases the challenge)
• Simplify legal actions
  • Three discrete actions: (1) slight move forward; (2) slight turn left; (3) slight turn right
Interface Learner to Game Engine
• Wolf3D exe (run in DOSBox): driven by keystrokes; a screen capture writes frames to file
• Shell script: sends keystrokes to the game, reads the chosen action from file
• Python NN: loads the captured image, writes its action to file (see the sketch below)
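A minimal sketch of the Python side of this file-based loop, assuming the shell script handles capture and keystrokes as described above. The file names, polling logic, and consumed-frame signaling are illustrative assumptions, not details from the slides.

# Hypothetical Python side of the learner <-> game-engine interface.
import os
import time

from PIL import Image  # pillow, for loading the captured frame

FRAME_PATH = "frame.png"    # written by the screen-capture side (assumed name)
ACTION_PATH = "action.txt"  # read by the shell script, which sends keystrokes

def read_latest_frame():
    """Poll until the capture side writes a frame, then load and consume it."""
    while not os.path.exists(FRAME_PATH):
        time.sleep(0.01)
    img = Image.open(FRAME_PATH)
    frame = img.copy()      # copy so the file handle can be released
    img.close()
    os.remove(FRAME_PATH)   # deleting the file signals that it was consumed
    return frame

def write_action(action_id):
    """Write the chosen action; the shell script turns it into keystrokes."""
    with open(ACTION_PATH, "w") as f:
        f.write(str(action_id))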
QNN Learner Architecture
• Inputs: current observation o_t plus previous observations (frames) and actions: a_{t-1}, o_{t-1}; ...; a_{t-n}, o_{t-n}
• First hidden layer (H1) is a (previously trained) autoencoder layer (RBM)
• Second hidden layer (H2) feeds the outputs (Q-values)
• Recurrent LSTM variant just implemented
• (Diagram: input → H1 → H2 → Q-value outputs; see the sketch below)
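A minimal PyTorch sketch of the feedforward QNN just described (not the LSTM variant). Layer sizes, the history length n, and how the pretrained RBM/autoencoder weights get loaded into H1 are assumptions; the slide specifies only the overall structure.

# Sketch of the QNN: pretrained H1 encodes the current frame; H2 combines
# it with the (action, observation) history; outputs are Q-values.
import torch
import torch.nn as nn

class QNN(nn.Module):
    def __init__(self, obs_dim, n_actions=3, history=4, hidden=256):
        super().__init__()
        # H1: weights would be copied from a previously trained RBM/autoencoder
        self.h1 = nn.Linear(obs_dim, hidden)
        # H2 input: encoded current frame plus flattened past actions/frames
        h2_in = hidden + history * (n_actions + obs_dim)
        self.h2 = nn.Linear(h2_in, hidden)
        self.out = nn.Linear(hidden, n_actions)  # one Q-value per action

    def forward(self, o_t, past_actions, past_obs):
        # o_t: (B, obs_dim); past_actions: (B, history*n_actions) one-hot;
        # past_obs: (B, history*obs_dim)
        enc = torch.relu(self.h1(o_t))
        x = torch.cat([enc, past_actions, past_obs], dim=1)
        x = torch.relu(self.h2(x))
        return self.out(x)  # Q-values for the 3 discrete actions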
Results of Maximal "Purist" Approach
• No knowledge of 2-D or 3-D vision, no knowledge of 2-D topology of pixels, no knowledge of 2-D layout of maze
• Learner only gets two types of rewards (sketched below):
  (1) Reward = +1 if goal is reached (more than 15% of pixels are red)
  (2) Reward = -0.002 per time step
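A sketch of these two reward signals; the exact per-channel thresholds for "red" are illustrative assumptions, only the 15% fraction and the step penalty come from the slide.

# Reward: +1 when the red exit fills >15% of the frame, else a time penalty.
import numpy as np

def reward(frame_rgb):
    """frame_rgb: uint8 array of shape (H, W, 3) from the screen capture."""
    r, g, b = frame_rgb[..., 0], frame_rgb[..., 1], frame_rgb[..., 2]
    red = (r > 150) & (g < 80) & (b < 80)  # channel cutoffs are assumptions
    if red.mean() > 0.15:
        return 1.0       # goal reached
    return -0.002        # per-time-step penalty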
Minimal Knowledge to Add?
• Try adding a penalty if the agent is detected to be in a "stuck" state
  • Makes the learner avoid going forward: disaster
• Add "partial credit" reward ~0.1 if the goal (red pixels) is visible and gets closer (increase in red pixels): helps finish the epoch
• Add a fourth "U-turn" action: randomized turn of 180° ± 70°
  • Immediately cures the stuck state
  • Highly randomizing if explored frequently
• Hard-wired constraint on use of U-turn (sketched below):
  • U-turn is disabled if the agent is not stuck
  • U-turn is mandatory if the agent is stuck
  • Hope that this will eventually be learnable
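A hedged sketch of the hard-wired U-turn constraint and the partial-credit shaping. How "stuck" is detected, the greedy action choice, and the exact bonus condition are assumptions beyond what the slide states.

# U-turn gating: forbidden when not stuck, mandatory when stuck.
import random

N_BASIC_ACTIONS = 3   # forward, turn left, turn right
U_TURN = 3            # fourth action: randomized 180-degree turn

def select_action(q_values, is_stuck):
    if is_stuck:
        return U_TURN                       # mandatory when stuck
    basic = range(N_BASIC_ACTIONS)          # U-turn disabled otherwise
    return max(basic, key=lambda a: q_values[a])

def u_turn_angle():
    return 180 + random.uniform(-70, 70)    # 180 degrees +/- 70 degrees

def shaping_bonus(red_frac, prev_red_frac):
    # ~0.1 partial credit when the goal is visible and getting closer
    if red_frac > 0 and red_frac > prev_red_frac:
        return 0.1
    return 0.0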
Initial Results with U-turn etc.
• At the beginning of learning, 100% random exploration; still takes a long time to stumble upon the goal state
Training Results with U-turn
• Basic wall-following behavior: an unanticipated strategy to maximize cumulative reward
RL for Non-Player Characters in Virtual Worlds
• Massively Multiplayer Online Games:
  • World of Warcraft (~10 million users)
• Open-Ended Virtual Worlds:
  • users create/add their own environment (terrain, buildings, objects, even laws of physics!)
  • Second Life
  • Active Worlds
Games Could Drive RL toward Strong AI
• Text-Based Adventure Games (e.g., the Zork series)
  • puzzle-solving, qualitative physics, commonsense reasoning
  • room descriptions, actions, etc. all communicated through a natural-language interface
  • need an implicit sense of making progress
Learning backgammon using TD(λ)
• Neural net observes a sequence of input patterns x_1, x_2, x_3, ..., x_f: the sequence of board positions occurring during a game
• Representation: raw board description (# of White or Black checkers at each location) using a simple truncated unary encoding ("hand-crafted features" added in later versions)
• At the final position x_f, a reward signal z is given:
  • z = 1 if White wins
  • z = 0 if Black wins
• Train the neural net using the gradient version of TD(λ) (a sketch of the update appears below)
• Trained NN output V_t = V(x_t, w) should estimate P(White wins | x_t)
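A minimal NumPy sketch of the TD(λ) weight update, e_t = λ e_{t-1} + ∇_w V(x_t, w) and w ← w + α (V_{t+1} - V_t) e_t, with the terminal reward z as the final target. A single sigmoid unit stands in for the full neural net; α and λ values are illustrative assumptions.

# TD(lambda) over one self-play game, using an eligibility trace.
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def td_lambda_episode(positions, z, w, alpha=0.1, lam=0.7):
    """positions: list of board encodings x_t; z: final reward (1 or 0)."""
    e = np.zeros_like(w)                    # eligibility trace
    for t in range(len(positions)):
        x_t = positions[t]
        v_t = sigmoid(w @ x_t)
        grad = v_t * (1.0 - v_t) * x_t      # dV/dw for a sigmoid unit
        e = lam * e + grad
        if t + 1 < len(positions):
            target = sigmoid(w @ positions[t + 1])  # V_{t+1}
        else:
            target = z                      # terminal reward replaces V_{t+1}
        w += alpha * (target - v_t) * e
    return w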
Q: Who makes the moves?
• A: Let the neural net make the moves itself, using its current evaluator: score all legal moves, and pick max V_t for White or min V_t for Black (see the sketch below)
• Hopelessly non-theoretical and crazy:
  • Training V on a non-stationary target (no convergence proof)
  • Training V with a nonlinear function approximator (no convergence proof)
  • Random initial weights → random initial play! An extremely long sequence of random moves with a random outcome; learning seems hopeless to a human observer
• But what the heck, let's just try and see what happens...
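A sketch of this self-play move selection. `legal_moves` and `value` are assumed helpers standing in for the move generator and the trained network; they are not named in the slides.

# Score every legal successor with the current evaluator;
# White maximizes V (estimated P(White wins)), Black minimizes it.
def choose_move(position, white_to_move, value, legal_moves):
    successors = legal_moves(position)       # all positions reachable this turn
    scores = [value(s) for s in successors]  # V_t for each candidate move
    best = max if white_to_move else min
    return successors[scores.index(best(scores))]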
• TD-Gammon can teach itself by playing games against itself and learning from the outcome
• Works even starting from random initial play and zero initial expert knowledge (surprising!) → achieves strong intermediate play
• Add hand-crafted features: advanced level of play (1991)
• 2-ply search: strong master play (1993)
• 3-ply search: superhuman play (1998)