
Deep RL in Games Research @IBM


Presentation Transcript


  1. Deep RL in Games Research @IBM Gerry Tesauro Principal Research Staff Member IBM T.J.Watson Research Center <gtesauro AT us DOT ibm DOT com> http://researcher.watson.ibm.com/researcher/view.php?person=us-gtesauro Joint work with: Janusz Marecki (IBM, Google DeepMind) Joe Bigus (IBM) Ban Kawas (IBM) Kamil Rocki (IBM)

  2. History of Games @ IBM: backgammon, chess, checkers, Go, Jeopardy!

  3. New TD-Gammon Results! (Tesauro, 1992)

  4. Towards Vision-Based Maze Navigation
  • Use one of the earliest FPS (“First Person Shooter”) computer (DOS) games – Wolfenstein3D (1992)
  • Display shows a live 3-dimensional visual depiction of the environment from the “first-person” perspective
  • By contrast, the depiction of Atari game state is “flat”
  • Game consists of a series of mazes or “levels” – the goal is to exit each level while defeating enemies and picking up useful supplies (ammunition, food, medical items, etc.)

  5. Wolfenstein3D Demo

  6. Why Study 3D Maze Games?
  • Instance of challenging POMDPs
  – Each maze has an unknown layout
  – Player cannot infer the full game state from the current visual frame – need to maintain a history of past observations
  • Clear Metrics to Measure Progress
  – Point scores, time to clear each level
  • High-Quality Simulation Model
  – Training in simulation is usually more effective than live training
  • Potential Competition with Expert Humans
  – “adds spice to the study” (Samuel)
  – “provides a convincing demonstration for those who do not believe that machines can learn” (Samuel)

  7. Highly Simplified Initial Task
  • Eliminate objects, weapons, enemies
  • Only goal is to find the exit (“First Person Non-Shooter”)
  • Create simplified maze, colors, textures
  – (Ambiguity increases the challenge)
  • Simplify legal actions
  – Three discrete actions: (1) slight move forward; (2) slight turn left; (3) slight turn right
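A minimal sketch of this three-action interface; the names and integer codes are illustrative assumptions, not the original key bindings:

```python
from enum import IntEnum

class Action(IntEnum):
    """Three discrete actions of the simplified maze task."""
    MOVE_FORWARD = 0  # slight move forward
    TURN_LEFT = 1     # slight turn left
    TURN_RIGHT = 2    # slight turn right
```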

  8. Interface Learner to Game Engine
  • Wolf3D exe (run in dosbox)
  • Shell script: screen capture, write frames to file; read action from file, send keystrokes to the game
  • Python NN: load image; write action to file
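The slide flattens a block diagram of this interface. A rough sketch of the Python side of such a file-based loop, under assumed file names (frame.png, action.txt), a placeholder random policy, and assumed timing; the actual scripts and formats are not shown in the talk:

```python
import time
import numpy as np
from PIL import Image

FRAME_FILE = "frame.png"    # written by the shell script after each screen capture
ACTION_FILE = "action.txt"  # read by the shell script and converted to keystrokes

def observe():
    """Load the most recent captured frame as a flat grayscale vector in [0, 1]."""
    img = Image.open(FRAME_FILE).convert("L")
    return np.asarray(img, dtype=np.float32).ravel() / 255.0

def choose_action(obs):
    """Placeholder policy: uniform random over the three discrete actions."""
    return np.random.randint(3)

def act(action_id):
    """Write the chosen discrete action for the shell script to pick up."""
    with open(ACTION_FILE, "w") as f:
        f.write(str(int(action_id)))

for _ in range(1000):
    obs = observe()
    act(choose_action(obs))
    time.sleep(0.05)  # crude synchronization with the capture script (assumed interval)
```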

  9. QNN Learner Architecture
  • Inputs: current observation ot plus previous observations and actions (ot-1, at-1), …, (ot-n, at-n)
  • First hidden layer H1 is a (previously trained) autoencoder layer (RBM)
  • Second hidden layer H2
  • Outputs: Q-values
  • Recurrent LSTM variant just implemented
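A rough numpy sketch of the feedforward variant just described: the input concatenates the current frame with a short history of past frames and one-hot past actions, H1 uses weights taken from the previously trained autoencoder (frozen; random stand-ins here), and the output layer produces one Q-value per action. All layer sizes are illustrative assumptions:

```python
import numpy as np

N_PIXELS, N_ACTIONS, HISTORY = 64 * 64, 3, 4            # illustrative sizes
IN_DIM = N_PIXELS * (HISTORY + 1) + N_ACTIONS * HISTORY  # frames + one-hot past actions
H1_DIM, H2_DIM = 256, 128

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# W1/b1 would come from the pretrained autoencoder (RBM) and stay fixed;
# random stand-ins are used here.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.01, (H1_DIM, IN_DIM)), np.zeros(H1_DIM)
W2, b2 = rng.normal(0, 0.01, (H2_DIM, H1_DIM)), np.zeros(H2_DIM)
W3, b3 = rng.normal(0, 0.01, (N_ACTIONS, H2_DIM)), np.zeros(N_ACTIONS)

def q_values(x):
    """x: concatenation of current frame, past frames, and one-hot past actions."""
    h1 = sigmoid(W1 @ x + b1)   # frozen pretrained autoencoder layer
    h2 = sigmoid(W2 @ h1 + b2)
    return W3 @ h2 + b3         # one Q-value per discrete action

print(q_values(rng.random(IN_DIM)))  # dummy input for illustration
```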

  10. Results of Maximal “Purist” Approach
  • No knowledge of 2-D or 3-D vision, no knowledge of 2-D topology of pixels, no knowledge of 2-D layout of maze
  • Learner only gets two types of rewards:
  (1) Reward = +1 if the goal is reached (more than 15% of pixels are red)
  (2) Reward = -0.002 per time step
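A sketch of this two-part reward signal computed from a captured RGB frame; the red-pixel test uses assumed thresholds, since the original detector is not specified:

```python
import numpy as np

GOAL_RED_FRACTION = 0.15   # goal reached when more than 15% of pixels are "red"
STEP_PENALTY = -0.002      # small cost per time step

def red_fraction(frame_rgb):
    """Fraction of pixels judged red; frame_rgb is an HxWx3 uint8 array.
    The channel thresholds are illustrative assumptions."""
    r, g, b = frame_rgb[..., 0], frame_rgb[..., 1], frame_rgb[..., 2]
    return ((r > 150) & (g < 80) & (b < 80)).mean()

def reward(frame_rgb):
    if red_fraction(frame_rgb) > GOAL_RED_FRACTION:
        return 1.0           # goal (exit) reached
    return STEP_PENALTY      # otherwise, per-step penalty
```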

  11. Minimal Knowledge to Add?
  • Try adding a penalty if the agent is detected to be in a “stuck” state: makes the learner avoid going forward (disaster)
  • Add a “partial credit” reward of ~0.1 if the goal (red pixels) is visible and gets closer (increase in red pixels): helps finish the epoch
  • Add a fourth “U-turn” action: a randomized turn of 180° ± 70°
  – Immediately cures the stuck state
  – Highly randomizing if explored frequently
  • Hard-wired constraint on use of U-turn: U-turn is disabled if the agent is not stuck and mandatory if the agent is stuck
  – Hope that this will eventually be learnable
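Read as code, these additions might look roughly like the sketch below; the exact partial-credit condition, action names, and reward magnitudes are assumptions (the red-fraction values would come from a detector like the one sketched earlier):

```python
import random

def shaped_reward(prev_red_fraction, cur_red_fraction, goal_reached):
    """Base reward plus the 'partial credit' term for a visible, growing red (goal) area.
    (A direct stuck-state penalty was also tried but, per the slide, backfired.)"""
    r = 1.0 if goal_reached else -0.002
    if 0.0 < prev_red_fraction < cur_red_fraction:
        r += 0.1   # goal visible and getting closer (exact condition is an assumption)
    return r

def legal_actions(stuck):
    """Hard-wired U-turn rule: U-turn is mandatory when stuck, disabled otherwise."""
    return ["u_turn"] if stuck else ["forward", "turn_left", "turn_right"]

def u_turn_angle():
    """Randomized U-turn: 180 degrees plus or minus up to 70 degrees."""
    return 180.0 + random.uniform(-70.0, 70.0)
```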

  12. Initial Results with U-turn etc.
  • At beginning of learning, 100% random exploration; still takes a long time to stumble upon the goal state

  13. Training Results with U-turn
  • Basic wall-following behavior
  • Unanticipated strategy to maximize cumulative reward

  14. Improvement over Initial Random Policy

  15. Future of RL in Games ??

  16. RL for Non-Player Characters in Virtual Worlds
  • Massive Multi-player Online Games:
  – World of Warcraft (~10 million users)
  • Open-Ended Virtual Worlds:
  – users create/add their own environment (terrain, buildings, objects, even laws of physics!)
  – Second Life
  – Active Worlds

  17. Games Could Drive RL toward Strong AI
  • Text-Based Adventure Games (e.g. Zork series)
  – puzzle-solving, qualitative physics, commonsense reasoning
  – room descriptions, actions etc. all communicated by a natural language interface
  – need an implicit sense of making progress

  18. Backup Slides


  20. Learning backgammon using TD(λ)
  • Neural net observes a sequence of input patterns x1, x2, x3, …, xf : the sequence of board positions occurring during a game
  • Representation: raw board description (# of White or Black checkers at each location) using a simple truncated unary encoding. (“Hand-crafted features” added in later versions)
  • At the final position xf, a reward signal z is given:
  – z = 1 if White wins
  – z = 0 if Black wins
  • Train the neural net using the gradient version of TD(λ)
  • Trained NN output Vt = V(xt, w) should estimate Prob(White wins | xt)
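A minimal sketch of this training rule over one game, with a linear value function standing in for the neural net (TD-Gammon backpropagates through the network to get the gradient instead); ALPHA and LAM are illustrative values:

```python
import numpy as np

ALPHA, LAM = 0.1, 0.7   # learning rate and lambda (illustrative values)

def td_lambda_episode(features, z, w):
    """One game: features[t] encodes board position x_t, z is the final outcome
    (1 if White wins, 0 if Black wins). For a linear V(x, w) = w . x the gradient
    of V_t is simply x_t; a neural net would substitute backprop gradients."""
    e = np.zeros_like(w)                       # eligibility trace: sum_k lambda^(t-k) grad V_k
    for t in range(len(features)):
        x_t = features[t]
        v_t = w @ x_t
        e = LAM * e + x_t                      # accumulate the gradient with decay lambda
        v_next = z if t == len(features) - 1 else w @ features[t + 1]
        w = w + ALPHA * (v_next - v_t) * e     # w <- w + alpha * (V_{t+1} - V_t) * trace
    return w
```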


  22. Q: Who makes the moves??
  • A: Let the neural net make the moves itself, using its current evaluator: score all legal moves, and pick max Vt for White, or min Vt for Black.
  • Hopelessly non-theoretical and crazy:
  – Training V using a non-stationary policy (no convergence proof)
  – Training V using nonlinear function approximation (no convergence proof)
  – Random initial weights → random initial play! Extremely long sequence of random moves and a random outcome → learning seems hopeless to a human observer
  • But what the heck, let’s just try and see what happens...
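A sketch of this self-play move selection, assuming hypothetical legal_moves(position, player) and value(position) callables (the latter being the current evaluator's estimate of Prob(White wins)):

```python
def choose_move(position, player, value, legal_moves):
    """Self-play move selection with the current evaluator: score the position
    resulting from every legal move; White picks the move with the maximum
    estimated Prob(White wins), Black picks the minimum."""
    candidates = list(legal_moves(position, player))   # (move, resulting_position) pairs
    pick = max if player == "white" else min
    best_move, _ = pick(candidates, key=lambda mp: value(mp[1]))
    return best_move
```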

  23. TD-Gammon can teach itself by playing games against itself and learning from the outcome
  • Works even starting from random initial play and zero initial expert knowledge (surprising!) → achieves strong intermediate play
  • Add hand-crafted features: advanced level of play (1991)
  • 2-ply search: strong master play (1993)
  • 3-ply search: superhuman play (1998)
