Reinforcement Learning
Svetlana Lockwood, Washington State University, CptS 540, Fall 2010
Background
• Dates back to the early days of cybernetics
• Goal: to program agents by reward and punishment without needing to specify how the task is to be achieved
• BUT
  • Generally, an RL agent faces a Markov Decision Problem (MDP)
  • Formidable computational obstacles
Introduction
• Informal definition: an agent must learn behavior through trial-and-error interactions with a dynamic environment
• Two main approaches:
  • Search in the space of behaviors
    • This approach has mainly been taken in genetic algorithms and genetic programming
  • Use statistical techniques and dynamic programming methods to estimate the utility of taking actions in states of the world
Reinforcement Learning (RL)
• Differs from supervised learning:
  • no input/output pairs
  • the agent is told the immediate reward and the subsequent state, but is not told which action would be best in its long-term interests
  • to act optimally, the agent must gather experience about possible system states, actions, transitions and rewards
• Another important difference:
  • system evaluation is often concurrent with learning
  • does not require predefined state-action transitions
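Formal Overview
[Figure: agent-environment loop. The environment (Mars) is in state s; function I turns s into the agent's input i, and function R produces the reward r; the agent (Spirit) responds with action a.]
Function R defines the reward r; function I defines how the agent sees the world, i.e. with full or partial observability.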
Formal Definition
• Formally, the model consists of
  • a discrete set of environment states, S
  • a discrete set of agent actions, A
  • a set of scalar reinforcement signals (r), typically {0, 1} but possibly real numbers
The agent's job is to find a policy π, mapping states to actions, that maximizes some long-run measure of reinforcement.
The environment is generally non-deterministic, i.e. the same action taken in the same state at different times may lead to different next states.
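The pieces of this definition can be sketched in a few lines of Python. Everything concrete here is an assumption made for illustration: the two states `s0` and `s1`, the actions `stay` and `go`, the transition probabilities, and the helper names `step` and `run_episode`. The point is only to show a policy as a state-to-action mapping acting in a non-deterministic environment with rewards in {0, 1}.

```python
import random
from typing import Dict, List, Tuple

State = str
Action = str

# Hypothetical two-state world: each (state, action) pair maps to a list of
# (probability, next_state, reward) outcomes. Rewards are drawn from {0, 1}.
TRANSITIONS: Dict[Tuple[State, Action], List[Tuple[float, State, float]]] = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"):   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 1.0)],
    ("s1", "go"):   [(1.0, "s0", 0.0)],
}


def step(state: State, action: Action, rng: random.Random) -> Tuple[State, float]:
    """Sample one transition; the same (state, action) may give different outcomes."""
    outcomes = TRANSITIONS[(state, action)]
    weights = [prob for prob, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward


def run_episode(policy: Dict[State, Action], start: State, steps: int, seed: int = 0) -> float:
    """Follow a fixed policy (a state -> action mapping) and sum the rewards."""
    rng = random.Random(seed)
    state, total = start, 0.0
    for _ in range(steps):
        state, reward = step(state, policy[state], rng)
        total += reward
    return total
```

For example, `run_episode({"s0": "go", "s1": "stay"}, "s0", 10)` follows a policy that tries to reach `s1` and then stays there; the total it returns depends on how many attempts the noisy `go` action needs.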
Models of Optimal Behavior
• Three major models:
  • finite-horizon: optimize the expected reward over the next h steps
  • infinite-horizon discounted: takes long-term rewards into account, but discounts them geometrically by a factor γ, 0<γ<1; mathematically tractable
  • average-reward: optimize the long-run average reward per step
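The three criteria are usually written as follows. This is the standard form from the reinforcement learning literature, with r_t assumed to denote the scalar reward received at step t; the symbols h and γ are the horizon and discount factor named above.

```latex
\begin{align*}
\text{finite horizon:} \quad
  & E\!\left[\sum_{t=0}^{h} r_t\right] \\
\text{infinite-horizon discounted:} \quad
  & E\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right], \qquad 0 < \gamma < 1 \\
\text{average reward:} \quad
  & \lim_{h \to \infty} E\!\left[\frac{1}{h} \sum_{t=0}^{h} r_t\right]
\end{align*}
```

Because γ < 1, the discounted sum stays finite whenever the rewards are bounded, which is one reason that model is the most tractable of the three.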
Exploitation versus Exploration: The Single-State Case
k-armed bandit problem: The agent is in a room with a collection of k gambling machines and is permitted a fixed number of pulls, h. Any arm may be pulled; each pull pays off 1 or 0 according to some unknown probability distribution. There is no penalty for pulling an arm; the only cost is in wasting a pull. What should the agent's strategy be?
Approaches:
• Dynamic-Programming Approach
• Gittins Allocation Indices
• Learning Automata
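Dynamic-Programming Approach
• If the agent lives h steps, we can use Bayesian reasoning; this requires a prior joint probability distribution over the arms' payoff probabilities
• {n1, w1, …, nk, wk}: the current belief state, where ni is the number of times arm i has been pulled and wi the number of payoffs it has produced
• V*(n1, w1, …, nk, wk): the maximum expected remaining reward from that belief state
If n1 + … + nk = h, no pulls remain, so the remaining reward is 0: V*(n1, w1, …, nk, wk) = 0. This is the basis for a recursive definition: if we know the value of V* for all belief states with t pulls remaining, we can compute the value of any belief state with t+1 pulls remaining.
Expense: linear in the number of belief states times the number of actions (S×A), thus exponential in the horizon.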
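The recursion can be sketched as a memoized function over belief states. The slide does not say which prior to use, so this sketch assumes an independent uniform Beta(1, 1) prior on each arm, which makes the posterior probability of a payoff from arm i equal to (wi + 1) / (ni + 2). The function name `optimal_value` and the tuple encoding of the belief state are illustrative choices, not part of the original material.

```python
from fractions import Fraction
from functools import lru_cache
from typing import Tuple

# Belief state: ((n_1, w_1), ..., (n_k, w_k)), pulls and payoffs per arm.
Belief = Tuple[Tuple[int, int], ...]


def optimal_value(k: int, h: int) -> Fraction:
    """Expected total payoff of the best strategy for k arms and h pulls."""

    @lru_cache(maxsize=None)
    def v_star(belief: Belief) -> Fraction:
        pulls_used = sum(n for n, _ in belief)
        if pulls_used == h:
            # Base case: no pulls remain, so the remaining reward is 0.
            return Fraction(0)
        best = Fraction(0)
        for i, (n, w) in enumerate(belief):
            # Assumption: uniform Beta(1, 1) prior, so the posterior mean
            # payoff probability of arm i is (w + 1) / (n + 2).
            p_win = Fraction(w + 1, n + 2)
            win = belief[:i] + ((n + 1, w + 1),) + belief[i + 1:]
            lose = belief[:i] + ((n + 1, w),) + belief[i + 1:]
            value = p_win * (1 + v_star(win)) + (1 - p_win) * v_star(lose)
            best = max(best, value)
        return best

    return v_star(tuple((0, 0) for _ in range(k)))
```

The number of distinct belief states grows quickly with h, which is the cost noted above; the memoization only avoids recomputing states that are reached by more than one pull order.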
Other approaches to k-bandit
• Greedy approach: always choose the action with the highest estimated payoff.
• Randomized strategy: by default take the action with the best estimated expected reward, but with probability p choose an action at random. Start with a large p to encourage initial exploration, then decrease it slowly.
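The randomized strategy can be sketched as below. The geometric decay schedule (`p *= decay`), the default values, and the function name `run_bandit` are assumptions for illustration; the slide only says that p starts large and is slowly decreased. Setting `p_start=0.0` reduces the sketch to the purely greedy approach.

```python
import random
from typing import List


def run_bandit(payoff_probs: List[float], pulls: int, p_start: float = 1.0,
               decay: float = 0.99, seed: int = 0) -> List[int]:
    """Play a simulated k-armed bandit and return how often each arm was pulled."""
    rng = random.Random(seed)
    k = len(payoff_probs)
    counts = [0] * k   # pulls per arm
    wins = [0] * k     # payoffs of 1 received per arm
    p = p_start
    for _ in range(pulls):
        if rng.random() < p:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(k)
        else:
            # Exploit: pick the arm with the best estimated expected reward.
            # Untried arms are given an estimate of 0 (an arbitrary choice).
            estimates = [wins[i] / counts[i] if counts[i] else 0.0 for i in range(k)]
            arm = max(range(k), key=lambda i: estimates[i])
        reward = 1 if rng.random() < payoff_probs[arm] else 0
        counts[arm] += 1
        wins[arm] += reward
        p *= decay     # assumed schedule: shrink exploration geometrically
    return counts
```

Calling `run_bandit([0.2, 0.8], 2000)` would typically show most pulls going to the second arm once p has decayed, since the estimates separate during the early exploratory pulls.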
RL: general case
• In the general case of the reinforcement learning problem, with multiple states, the agent's actions determine not only its immediate reward but also the next state of the environment.
• Such environments can be thought of as networks of k-bandit problems.
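Q-Learning
• Value iteration approach
• Goal: to learn a state-action value function Q(s, a) that estimates the expected return of taking action a in state s; acting greedily with respect to Q then maximizes the expected return
• Complexity (per value-iteration sweep): quadratic in the number of states |S| and linear in the number of actions |A|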
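A minimal tabular sketch of Q-learning follows, using the standard one-step update Q(s, a) ← Q(s, a) + α (r + γ max over a' of Q(s', a') − Q(s, a)). The five-state corridor, the deterministic `step` function, the ε-greedy exploration, and the values of `alpha`, `gamma` and `epsilon` are all assumptions chosen to keep the example small; none of them come from the slides.

```python
import random
from typing import Dict, List, Tuple

N_STATES = 5                    # hypothetical corridor: states 0..4
ACTIONS = ("left", "right")     # reaching state 4 pays 1 and ends the episode


def step(state: int, action: str) -> Tuple[int, float, bool]:
    """Toy deterministic dynamics: move one cell, with a wall at state 0."""
    nxt = max(0, state - 1) if action == "left" else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def q_learning(episodes: int = 500, alpha: float = 0.5, gamma: float = 0.9,
               epsilon: float = 0.2, seed: int = 0) -> Dict[Tuple[int, str], float]:
    """Learn Q(s, a) from sampled transitions with the one-step update."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)            # explore
            else:
                # Exploit, breaking ties at random so early episodes still move.
                best = max(q[(state, a)] for a in ACTIONS)
                action = rng.choice([a for a in ACTIONS if q[(state, a)] == best])
            nxt, reward, done = step(state, action)
            # Target: immediate reward plus discounted best value of the next state.
            future = 0.0 if done else gamma * max(q[(nxt, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + future - q[(state, action)])
            state = nxt
    return q


def greedy_policy(q: Dict[Tuple[int, str], float]) -> List[str]:
    """Best action in each non-terminal state according to the learned Q."""
    return [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
```

In this corridor the learned values approach 1 for moving right from state 3 and shrink by a factor of γ for each step further from the goal, so `greedy_policy(q_learning())` selects `right` in every non-terminal state.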
Applications
• Active area of research
• So far only the tip of the iceberg has been explored
• Some applications include:
  • Military
  • Robotics
  • Visual image and speech processing
  • Etc.
David is 11 years old. He weighs 60 pounds. He is 4 feet, 6 inches tall. He has brown hair. His love is real. But he is not. A Steven Spielberg Film Artificial Intelligence