
Reinforcement Learning In Rars



  1. Reinforcement Learning In Rars Thomas Pittman

  2. Contents • Introduction to Rars • Current Existing Robots • State Representation • Actions • Reward Functions • Exploration • Algorithms Used • Sarsa Lambda • Sarsa Lambda With Tile Coding • Prioritized Sweeping • Least Squares Policy Iteration • Generalizing The Track • Demos • Future Work

  3. Introduction To Rars Robotic Auto Racing Simulator, available for download at http://rars.sourceforge.net. A racing simulator designed for artificial intelligence research. Why is it interesting to us? • Complex, high-dimensional problem. • Continuous state and action spaces. • Requires approx. 200000 correct actions to be chosen to complete a race. • Extremely sensitive to which actions are chosen. • Other researchers have tried to solve the problem using reinforcement learning and failed. • Able to create an infinite number of environments to work in. • Has a set of existing robots to benchmark against.

  4. The Competition None of the current champion robots use any sort of actual learning algorithm. They are based on either a path optimization algorithm or a complex track heuristic. The path optimization drivers typically perform better than the heuristic drivers, except in situations that require deviating from the optimal path (passing and such). The reigning world champion driver, K1999, uses a gradient descent path optimizing algorithm and never deviates from its optimal path. Also, a fair number of the robots rely on pre-built tables that either describe what to do on each individual track or guide the heuristics; since these tables do not transfer to random tracks, those robots perform poorly there.

  5. State Representation We have approximately 30 state variables visible to our robot's code. These include information about our car itself (its velocity, position, fuel, damage, etc.), information about the next 3 track segments (segment length, segment radius/angle), information about the other cars (their positions and velocities relative to us), and information about the race itself (length of the race, lap times, and such).

  6. Actions There are 5 variables given in Rars to control the action of your car: • alpha – the desired turning angle • vc – the desired velocity • request_pit – a flag to indicate that you want to take a pit stop • fuel_amount – the amount of fuel to acquire at the next pit stop • repair_amount – the amount of damage to be repaired at the next pit It is important to note that the alpha and vc parameters are mere requests for what you want to do; the laws of physics still govern your car's motion. So setting vc to 1000 will merely accelerate you as much as possible in your given situation, not set your speed to 1000 mph.
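As a rough illustration, a RARS robot is a C++ function that fills in these control fields each simulation step. The sketch below assumes the con_vec / situation types and field names from the RARS car.h header (to_lft, to_rgt, damage) and a particular steering sign convention; treat all of these as assumptions to be checked against the actual distribution.

```cpp
#include "car.h"   // RARS header declaring situation and con_vec (assumed name)

// Minimal driver sketch: steer toward the middle of the track and
// request a modest speed.  alpha and vc are only requests; the
// simulated physics decide what the car actually does.
con_vec simple_driver(situation &s)
{
    con_vec result;

    // Steer proportionally to how far off-centre we are.
    // s.to_lft and s.to_rgt are the distances to the two track edges.
    double offset = (s.to_lft - s.to_rgt) / (s.to_lft + s.to_rgt);
    result.alpha = 0.5 * offset;      // desired turning angle (assumes positive alpha steers left; flip if not)

    result.vc = 60.0;                 // desired speed; a request, not a command
    result.request_pit = 0;           // no pit stop for now
    result.fuel_amount = 150.0;       // fuel to take on if we do pit
    result.repair_amount = s.damage;  // repair all accumulated damage at the pit

    return result;
}
```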

  7. Actions To simplify things, I use a set of 9 actions for my drivers: every combination of accelerate / decelerate / hold speed with drive straight / turn left / turn right. These are capable of navigating any of the Rars tracks. Using a continuous action space might improve things a bit, but likely isn't worth it. Each time step lasts only 0.055 simulated seconds, so fine control is achieved by combining lots of small actions into larger ones (e.g. a sharp turn is achieved by a series of small turns; actually requesting that sharp turn right away will toss you into an undesirable skid). I handle pitting manually, as it is typically only required once per race and would likely be extremely difficult to learn a proper strategy for, though it would be interesting to make the driver learn a refueling and repair strategy on its own.
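Concretely, the 3 × 3 discretization might be decoded like this. A sketch only: the particular steering and throttle increments are illustrative placeholders, not the values my drivers use.

```cpp
// Map a discrete action index (0..8) onto a steering request and a
// speed request.  Rows: turn left / straight / turn right.
// Columns: decelerate / hold speed / accelerate.
struct Action { double dalpha; double dvc; };

Action decode_action(int a)
{
    static const double STEER[3]    = { +0.03, 0.0, -0.03 };  // radians per step (illustrative)
    static const double THROTTLE[3] = { -5.0,  0.0, +5.0 };   // speed change per step (illustrative)
    return Action{ STEER[a / 3], THROTTLE[a % 3] };
}

// Usage inside the driver: accumulate small per-step adjustments, so
// a sharp turn is built up from many small turns rather than one request.
// alpha += decode_action(a).dalpha;
// vc    += decode_action(a).dvc;
```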

  8. Reward Functions Crafting a suitable reward function for Rars is not a simple task. You need to find a function that keeps the car moving along the track, avoids crashing, and is capable of guiding it along at increasingly faster speeds.

  9. Reward Functions The straightforward reward function that comes to mind is to assign a positive reward for completing a lap and a negative reward for crashing. Unfortunately, this would require hundreds or perhaps thousands of correct actions to be chosen before a lap can be completed. This is highly unlikely to happen, particularly on the more complicated tracks. Furthermore, there is no initial incentive for choosing an optimal path through the track.

  10. Reward Functions Ultimately, I ended up using the following system for calculating reward: • -1 reward on all time steps • 0.25 reward for every 10 feet of forward distance traveled • (-1 / (1 – gamma)) reward for crashing or going backwards Since standing still earns -1 per step, whose discounted sum is (-1 / (1 – gamma)), crashing is no better than standing still, and there is an incentive to move forward. Similarly, since 0.25 reward is awarded per 10 feet of progress, there is an incentive not to crash into the walls, as more reward can be obtained by progressing onwards. There is also an almost immediate benefit from increasing one's speed, which guides the car along to higher speeds.
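In code, this reward scheme might look something like the following sketch; GAMMA, the crash/backwards test, and the progress bookkeeping are stand-ins rather than the exact values used in the drivers.

```cpp
const double GAMMA = 0.99;                        // assumed discount factor
const double CRASH_REWARD = -1.0 / (1.0 - GAMMA); // equals the return of standing still forever

// distance_gained: feet of forward progress since the last time step.
double reward(bool crashed_or_backwards, double distance_gained)
{
    if (crashed_or_backwards)
        return CRASH_REWARD;
    // -1 per step, plus 0.25 per 10 feet of forward progress.
    return -1.0 + 0.25 * (distance_gained / 10.0);
}
```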

  11. Exploration Exploration in the RARS world is extremely dangerous: a single random action can send you flying off the track or into a skid. Typically I use an epsilon value of around 0.0025 or so. To further encourage exploration, I also optimistically initialize the value function to 0 for every state (since returns under this reward scheme are negative, 0 is an optimistic estimate).
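For reference, ε-greedy action selection with such a small ε is just the following (a sketch; q holds the current action-value estimates for the 9 discrete actions):

```cpp
#include <cstdlib>

// Pick an action epsilon-greedily over the 9 discrete actions.
// With epsilon = 0.0025, a random action is taken roughly once every
// 400 steps, i.e. about once every 22 simulated seconds at 0.055 s/step.
int select_action(const double q[9], double epsilon = 0.0025)
{
    if ((double)std::rand() / RAND_MAX < epsilon)
        return std::rand() % 9;                 // explore

    int best = 0;
    for (int a = 1; a < 9; ++a)                 // exploit: greedy action
        if (q[a] > q[best]) best = a;
    return best;
}
```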

  12. Algorithms Used Sarsa Lambda Sarsa Lambda With Tile Coding Prioritized Sweeping Least Squares Policy Iteration (LSPI)

  13. Sarsa Lambda Just using a standard sarsa lambda algorithm without a function approximator failed horribly. After running for 7000 races, it never managed to get around the first turn of the Oval track. On average, it survived for about 200 time steps = 11 seconds of simulated time.

  14. Sarsa Lambda With Tile Coding However, if you combine sarsa lambda with a function approximator, such as tile coding, sarsa lambda actually works. A straight table-based lookup simply does not work, as shown by the failure of standard sarsa lambda and prioritized sweeping.
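For reference, here is a sketch of one linear Sarsa(λ) update with binary tile-coded features and replacing traces. The tile coder itself is omitted, and the step-size, discount, and trace-decay values are placeholders rather than the ones actually used.

```cpp
#include <vector>

// One Sarsa(lambda) update with binary tile-coded features.
// theta: weight vector (one entry per tile), e: eligibility traces.
// active / active_next: indices of the tiles active for (s,a) and (s',a').
void sarsa_lambda_update(std::vector<double> &theta,
                         std::vector<double> &e,
                         const std::vector<int> &active,
                         const std::vector<int> &active_next,
                         double reward, bool terminal,
                         double alpha = 0.1, double gamma = 0.99,
                         double lambda = 0.9)
{
    double q = 0.0, q_next = 0.0;
    for (int i : active)      q      += theta[i];
    for (int i : active_next) q_next += theta[i];

    // TD error for the transition (s,a) -> r -> (s',a').
    double delta = reward + (terminal ? 0.0 : gamma * q_next) - q;

    for (int i : active) e[i] = 1.0;                 // replacing traces

    for (std::size_t i = 0; i < theta.size(); ++i) { // update weights, decay traces
        theta[i] += alpha * delta * e[i];
        e[i] *= gamma * lambda;
    }
}
```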

  15. LSPI LSPI is another viable technique for approaching Rars. It's able to achieve quality results in far fewer episodes than sarsa lambda. However, LSPI requires orders of magnitude more CPU time and memory. For example, on the 4th episode on the oval2 track, the LSPI driver managed 47 laps, while even after 10000 episodes the sarsa lambda driver never once reached 40 laps.
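The heart of LSPI is the LSTDQ solve, which is where the CPU and memory cost comes from: it builds and solves a k × k system over the feature dimension. A sketch using Eigen (the feature vectors, sample storage, and discount are placeholders):

```cpp
#include <Eigen/Dense>
#include <vector>

struct Sample {
    Eigen::VectorXd phi;       // features of (s, a)
    Eigen::VectorXd phi_next;  // features of (s', pi(s')) under the current policy
    double reward;
};

// One LSTDQ evaluation step: solve A w = b for the weights of the
// current policy's Q-function.  LSPI repeats this, re-selecting
// pi(s') greedily with the new weights, until w stops changing.
Eigen::VectorXd lstdq(const std::vector<Sample> &samples, int k, double gamma = 0.99)
{
    Eigen::MatrixXd A = Eigen::MatrixXd::Zero(k, k);
    Eigen::VectorXd b = Eigen::VectorXd::Zero(k);

    for (const Sample &s : samples) {
        A += s.phi * (s.phi - gamma * s.phi_next).transpose();
        b += s.phi * s.reward;
    }
    return A.colPivHouseholderQr().solve(b);
}
```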

  16. [Plot] Sarsa Lambda With 10000 Episodes of Training on Oval2 • [Plot] LSPI With 300 Episodes of Training on Oval2

  17. Generalizing Track Knowledge Generalizing the Knowledge from One Track to Another: Although we can create a driver that performs well on one track, we ultimately want to learn how to drive well on any track, even ones we have never seen before. To do this, we need to find a state representation that is not only generic enough to be useful on multiple tracks, but also complex enough to allow the driver to plan ahead to find an optimal path. Without sufficient knowledge of what future portions of the track look like, the driver is forced to drive at a slow enough speed (approx. 55 mph) to react to any sudden unforeseen changes in the track.

  18. Generalizing Track Knowledge In order to do this, we need to remove from our state representation any information specific to a particular track (ex. an arrangement of track segments) or even to a set of tracks (ex. track widths). Track Width: The differing widths of the various tracks pose a problem for generalization. If we don't take the width into consideration, we will ultimately learn to drive as if the track were always the nearest track available. We can't just add a width variable to the state, since we want the driver to learn that a 15 degree left turn calls for the same strategy on any size of track. To solve the width problem, I began representing the distance from the center line of the track as a percentage instead of an absolute distance.
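The normalization itself is a one-liner; the sketch below assumes the RARS to_lft / to_rgt distances to the two track edges.

```cpp
// Signed distance from the centre line as a fraction of the half-width:
// -1.0 at the left edge, 0.0 on the centre line, +1.0 at the right edge.
// The same 15-degree corner then looks identical to the learner on wide
// and narrow tracks.
double centerline_fraction(double to_lft, double to_rgt)
{
    double half_width = (to_lft + to_rgt) / 2.0;
    return (to_lft - half_width) / half_width;
}
```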

  19. Generalizing Track Knowledge Track Segments: The straightforward way of representing your position on the track is to describe the current track segment in terms of its length and radius. However, this alone fails us, as the driver will never be able to plan ahead with it. So, we also need to know about the next track segment. But this also fails us, as we don't particularly care about the next segment until we get close enough to it, so we add in the distance remaining to be traveled along the current segment. Problem: We now have 5 variables (current length & radius, next length & radius, remaining distance on current segment) that are highly specific to an individual track. So not only is learning going to be slow because of the much increased dimensionality, it is also not going to be very generic at all.

  20. Generalizing Track Knowledge We can solve part of this problem by representing a segment in terms of its curvature instead of its length and radius: curvature = length / ((length * radius) / velocity / delta_time), which simplifies to (velocity * delta_time) / radius, i.e. how far around the arc the car travels in one time step. This reduces our track position to a set of 3 variables: the current segment's curvature, the next segment's curvature, and our distance to the end of the current segment. While this representation is still reasonably specific to a given track, it generalizes extremely well onto other tracks.
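As a sketch, the per-step curvature feature (a radius of 0, which, if I recall correctly, RARS uses for straight segments, is mapped to zero curvature):

```cpp
const double DELTA_T = 0.055;   // seconds per simulation step (from the slides)

// Curvature feature: how much the track turns during one time step
// at the current speed.  Equivalent to length / ((length*radius)/velocity/delta_time).
double step_curvature(double radius, double velocity)
{
    if (radius == 0.0) return 0.0;         // straight segment
    return velocity * DELTA_T / radius;    // signed: left and right turns differ in sign
}
```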

  21. Generalizing Track Knowledge Now, what if we dropped the distance to the end of the current segment from our state representation? We still need to know something about what we are heading into, so let's scale the next curvature by the fraction p of the current segment already traveled: nextCurvature = ((1 - p) * curvature) + (p * nextCurvature). This roughly tells us how much the future track curves in relation to our current position, and it crunches the track segment representation down to 2 variables. This has other advantages, such as allowing the driver to ignore the next curve until it gets close enough to matter (where "close" is also implicitly defined by the car's speed, since the curvature equation depends on it). It is also extremely generalizable, as the car learns how to deal with its current segment and how to transition into the next segment.
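And the corresponding blend (a sketch; fraction_done stands for the fraction of the current segment already traversed):

```cpp
// Blend the current and next segments' curvatures by how far through
// the current segment we are, so the upcoming corner only "fades in"
// as the car approaches it.
double scaled_next_curvature(double cur_curvature, double next_curvature,
                             double fraction_done)   // 0.0 .. 1.0
{
    return (1.0 - fraction_done) * cur_curvature + fraction_done * next_curvature;
}
```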

  22. Generalizing Track Knowledge So how does the scaled curvature method compare to the non-scaled curvature method? The scaled curvature method generalizes much faster to different tracks than the non-scaled method. However, its state representation is not as "good" as the non-scaled method's, so performance ultimately suffers in the end, particularly when trying to learn how to deal with sharp corners. As an example, the next 2 slides show the scaled curvature path with ~100000 episodes of training behind it, and the non-scaled curvature path with ~10000 episodes of training behind it. The non-scaled curvature path is definitely better, and took only 1/10th of the training.

  23. Scaled Curvature Path

  24. Non-Scaled Curvature Path

  25. Future Work Racing the Competition Cars While being able to learn to race around a track by ourselves is an accomplishment, that knowledge does not generalize into a competition with other cars. Paying no attention to the other cars tends to lead to disaster. Interestingly enough, most of the competition cars either have no evasion algorithms or only crude attempts at evasion. The cars that do have sophisticated evasion algorithms tend to lose the race, as they are forced onto sub-optimal paths by the other, non-evasive cars. However, none of my attempts at teaching evasive behavior have worked so far. The added state variables for representing even one other car's relative position and velocity add an extra 4 dimensions.

  26. Future Work Different Algorithms: • Pegasus • Gaussian Processes • Retrying Prioritized Sweeping Different Actions/State Spaces: • More lookahead • More discretized actions

  27. Links If you are interested in reading more about reinforcement learning attempts in RARS, I've collected a couple of papers on my website: http://www.ualberta.ca/~tpittman/rlai/rars.html RARS itself can be downloaded at: http://rars.sourceforge.net These slides will also be made available on my website if you want a copy. If you want to talk to me further, drop by my desk in the AICML office, CSC room 2-05.
