1. Genetic Programming on General Purpose Graphics Processing Units (GPGPGPU)
2. Overview Graphics Processing Units (GPUs) are no longer limited to graphics:
High degree of programmability
Fast floating point operations
GPUs are now GPGPUs
Genetic programming is a computationally intensive methodology, so it is a prime candidate for GPUs.
3. Outline Genetic Programming
Genetic Programming Resource Demands
GPU Programming
Genetic Programming on GPU
Automatically Defined Functions
4. Genetic Programming (GP) Evolutionary algorithm-based methodology
To optimize a population of computer programs
Tree-based representation
Example:
Many problems from various fields can be interpreted as the problem of discovering an appropriate computer program that maps some input to some output:
Optimal control
Planning
Sequence induction
Symbolic regression
Empirical discovery
Decision tree induction
Evolution of emergent behaviour
5. GP Resource Demands GP is notoriously resource-consuming
CPU cycles
Memory
Standard GP system, 1 µs per node
Binary trees, depth 17: 131 ms per tree
Fitness cases: 1,000
Population size: 1,000
Generations: 1,000
Number of runs: 100
Runtime: 10 Gs ≈ 317 years
Standard GP system, 1 ns per node
Runtime: 116 days
Limits to what we can approach with GP
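The runtime estimate follows from straightforward multiplication (using the figures above, and rounding a full depth-17 binary tree, 2^17 - 1 = 131,071 nodes, to roughly 10^5 nodes):
10^5 nodes/tree × 1,000 fitness cases × 1,000 individuals × 1,000 generations × 100 runs = 10^16 node evaluations.
At 1 µs per node that is 10^10 s = 10 Gs ≈ 317 years; at 1 ns per node it is about 10^7 s ≈ 116 days.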
6. Sources of Speed-up
7. General Purpose Computation on GPU The rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability, has made graphics hardware a compelling platform for computationally demanding tasks in a wide variety of application domains.
8. Why is the GPU faster than the CPU?
9. GPU Programming APIs There are a number of toolkits available for programming GPUs.
CUDA
MS Accelerator
RapidMind
Shader programming
So far, researchers in GP have not converged on one platform
10. CUDA Programming Massive number (>10,000) of lightweight threads.
Threads are grouped into thread blocks; each thread has a unique id within its block.
Thread blocks are grouped into a grid; each block has a unique id within the grid.
Grids are executed on the device (i.e. the GPU).
11. CUDA Memory Model Both the host (CPU) and the device (GPU) manage their own memory: host memory and device memory.
Data can be copied between them.
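A minimal host-side sketch of this model (illustrative only; the array name, size, and omitted error handling are assumptions, not taken from the slides):

#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1024;                                   // assumed array size
    float *h_data = (float *)malloc(n * sizeof(float));  // host memory
    float *d_data = NULL;
    cudaMalloc((void **)&d_data, n * sizeof(float));     // device memory
    cudaMemcpy(d_data, h_data, n * sizeof(float),
               cudaMemcpyHostToDevice);                  // copy host -> device
    /* ... launch a kernel that works on d_data here ... */
    cudaMemcpy(h_data, d_data, n * sizeof(float),
               cudaMemcpyDeviceToHost);                  // copy device -> host
    cudaFree(d_data);
    free(h_data);
    return 0;
}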
12. CUDA Programming Model The GPU should not be regarded as a general-purpose computer. Instead it appears best to leave some (low-overhead) operations to the CPU of the host computer.
A function compiled for the device is called a kernel.
The kernel is executed on the device by many different threads.
14. Stop Thinking About What to Do and Start Doing It! Memory transfer time is expensive.
Computation is cheap.
No longer calculate and store in memory
Just recalculate
Built-in variables
threadIdx
blockIdx
gridDim
blockDim
15. Example: Increment Array Elements
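The slide's figure is not reproduced in this scrape; the kernel below is a minimal sketch of the usual increment-array example (the name incrementArray and the launch configuration are assumptions). Each thread uses the built-in variables listed above to compute its global index and increments one element.

__global__ void incrementArray(float *a, int n)
{
    // global index of this thread, built from the CUDA built-in variables
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)                       // guard threads that fall past the end
        a[idx] = a[idx] + 1.0f;
}

// illustrative launch: enough 256-thread blocks to cover all n elements
// incrementArray<<<(n + 255) / 256, 256>>>(d_data, n);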
16. Example: Matrix Addition
17. Example: Matrix Addition
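The matrix-addition figures are also missing from the scrape; a minimal sketch of such a kernel with 2D indexing might look as follows (the names, the 16×16 block size, and the row-major layout are assumptions):

__global__ void matrixAdd(const float *a, const float *b, float *c,
                          int width, int height)
{
    // each thread adds one element, located by a 2D thread/block index
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        int i = row * width + col;     // row-major element index
        c[i] = a[i] + b[i];
    }
}

// illustrative launch covering the whole matrix with 16x16 thread blocks:
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// matrixAdd<<<grid, block>>>(d_a, d_b, d_c, width, height);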
18. Parallel Genetic Programming While most GP work is conducted on sequential computers, the following computationally intensive features make it well suited to parallel hardware:
Individuals are run on multiple independent training examples.
The fitness of each individual could be calculated on independent hardware in parallel.
Multiple independent runs of the GP are needed for statistical confidence, given the stochastic element of the result.
19. A Many-Threaded CUDA Interpreter for Genetic Programming Running tree GP on the GPU
8,692 times faster than a PC without a GPU
Solved the 20-bit Multiplexor
2^20 = 1,048,576 fitness cases
Has never been solved by tree GP before
Previously estimated time: more than 4 years
The GPU has consistently done it in less than an hour
Solved the 37-bit Multiplexor
2^37 = 137,438,953,472 fitness cases
Has never been attempted before
The GPU solves it in under a day
The GPU is used only for fitness evaluation.
20. Boolean Multiplexor
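For reference, the k-input Boolean multiplexor used as the target function outputs the data bit selected by its address bits: the 6-mux has 2 address + 4 data bits, the 20-mux 4 + 16, and the 37-mux 5 + 32. The sketch below is a plain host-side statement of that function (the bit ordering is an assumption):

// Returns the data bit selected by the address bits (illustrative only).
int multiplexor(const int bits[], int addrBits)
{
    int address = 0;
    for (int i = 0; i < addrBits; ++i)        // decode the address bits
        address = (address << 1) | bits[i];
    return bits[addrBits + address];          // output the selected data bit
}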
21. Genetic Programming Parameters for Solving the 20- and 37-Multiplexors The GPU is used only for fitness evaluation.
No 0 or 1 in the terminal set.
22. AND, OR, NAND, NOR
23. Evolution of 20-Mux and 37-Mux In descriptive statistics, a quartile is one of the three points that divide a data set into four equal groups, each representing a fourth of the sampled population.
24. 6-Mux Tree I
25. 6-Mux Tree II
26. 6-Mux Tree III Three different trees for 6-Mux indicate that GP is a stochastic process, so it is not guaranteed to arrive at the same solution.
27. Ideal 6-Mux Tree
28. Automatically Defined Functions (ADFs) Genetic programming trees often have repeated patterns.
Repeated subtrees can be treated as subroutines.
ADFs are a methodology to automatically select and implement modularity in GP.
This modularity can:
Reduce the size of the GP tree
Improve readability
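As a loose illustration of the idea (not taken from the slides, and ignoring how the ADF itself is evolved): an ADF acts like a subroutine, so a subtree that would otherwise appear several times in the main tree is defined once and called where needed.

// Hypothetical evolved subroutine: x AND (NOT y)
int adf0(int x, int y) { return x && !y; }

// Hypothetical main tree that reuses the same pattern twice via the ADF
int mainTree(int a1, int d0, int d1)
{
    return adf0(d0, a1) || adf0(d1, a1);
}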
29. Langdon's CUDA Interpreter with ADFs ADFs slow the interpreter down
20-Mux taking 9 hours instead of less than an hour
37-Mux taking more than 3 days instead of less than a day
Improved ADFs Implementation
Previously used one thread per GP program
Now using one thread block per GP program
Increased level of parallelism
Reduced divergence
20-Mux taking 8 to 15 minutes
37-Mux taking 7 to 10 hours
Divergence:
Multiprocessors are SIMD devices, meaning their inner 8 stream processors execute the same instruction at every time step. Nonetheless, alternative and loop structures can be implemented: for example, in the case of an if instruction where some of the stream processors must follow the then path while the others follow the else path, both execution paths are serialized, and stream processors not concerned by an instruction are simply put into an idle mode. Putting some stream processors into an idle mode to allow control structures is called divergence, and of course this causes a loss of efficiency.
At the level of multiprocessors the G80 GPU works in SPMD mode (Single Program, Multiple Data): every multiprocessor must run the same program, but they do not need to execute the same instruction at the same time step (as opposed to their internal stream processors), because each owns its private program counter. So there is no divergence between multiprocessors.
[Genetic programming on GPUs - Denis Robilliard]
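A minimal sketch of the effect described above (an assumed example, not from the cited paper): when threads of the same SIMD group take different sides of a branch, the hardware serialises both paths and idles the threads not on the current path.

__global__ void divergentKernel(float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    // even- and odd-numbered threads in the same group follow different
    // branches, so the two paths below are executed one after the other
    if (idx % 2 == 0)
        out[idx] = idx * 0.5f;
    else
        out[idx] = idx * 2.0f;
}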
30. ThreadGP Scheme Every GP program is interpreted by its own thread.
All fitness cases for a program evaluation are computed on the same stream processor.
As several threads interpreting different programs run on each multiprocessor, a higher level of divergence may be expected than with the BlockGP scheme.
31. BlockGP Scheme Every GP program is interpreted by all threads running on a given multiprocessor.
No divergence due to differences between GP programs, since multiprocessors are independent.
However, divergence can still occur between stream processors on the same multiprocessor, when:
an if structure resolves into the execution of different branches within the set of fitness cases that are processed in parallel.
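The two schemes differ mainly in how programs and fitness cases are mapped onto CUDA indices; the commented sketch below is an assumed simplification (interpret() stands in for the real interpreter, which is more involved):

// ThreadGP: one thread per GP program; each thread loops over every
// fitness case, so neighbouring threads interpret different programs
// and may diverge heavily.
//   int program = blockIdx.x * blockDim.x + threadIdx.x;
//   for (int fc = 0; fc < numFitnessCases; ++fc)
//       interpret(program, fc);

// BlockGP: one thread block per GP program; all threads of a block
// interpret the same program, each on its own slice of the fitness
// cases, so differing programs no longer cause divergence within a block.
//   int program = blockIdx.x;
//   for (int fc = threadIdx.x; fc < numFitnessCases; fc += blockDim.x)
//       interpret(program, fc);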
32. 6-Mux with ADF
33. 6-Mux with ADF
34. 6-Mux with ADF
35. Conclusion 1: GP Powerful machine learning algorithm
Capable of searching through trillions of states to find the solution
Evolved trees often have repeated patterns and can be compacted by ADFs
But computationally expensive
36. Conclusion 2: GPU Computationally fast
Relatively low cost
Needs a new programming paradigm, which is practical.
Accelerates processing by up to 3,000 times for computationally intensive problems.
But not well suited to memory-intensive problems.
37. Acknowledgement Dr Will Browne and Dr Mengjie Zhang for Supervision.
Kevin Buckley for Technical Support.
Eric for Help with CUDA Compilation.
Victoria University of Wellington for Awarding the Victoria PhD Scholarship.
All of You for Coming.