Future of GPU/CPU Computing and Programming

CK, July 16, 2012

GPU Background Overview
• Key Note
• Original Purpose and History of GPU
• Why Develop Programs for GPU
• Architecture Difference Between CPU and GPU
• When to Develop Programs for GPU
• Side-Step
• Future of CPU / GPU Computing
• CPU / GPU Computing Hurdles
  • Hardware
  • Parallel Programming Concepts
  • CPU versus GPU Battle
  • Complexity of Current GPU Program APIs & Architecture
  • GPU Program Portability
  • Hybrid/Heterogeneous Computing
• Future Computing Challenges Tree
• Possible Future & Solutions Summary Table
• Breakdown of Parallelism
• Conclusion
• References

Key Note
• This presentation covers GPU and CPU computing and programming.
• Kernel programming controls the GPU and CPU.
• Hardware abstraction: in higher-level languages, this kernel programming is typically handled by the language developers and hidden from high-level language programmers.
[Diagram labels: Hardware Components; Hardware Programmers & Program Language Developers; High Level Language Programmers; Mathematicians; End Software Customer]

Original Purpose and History of GPUs
• GPU = Graphics Processing Unit
• Initially designed to accelerate the memory-intensive work of texture mapping and rendering polygons, which would then be displayed on the user's computer screen. [1][2]
• Modern GPUs use most of their transistors to do calculations related to computer graphics. [1]
• In 2006 Nvidia introduced CUDA, which gave programmers access to GPU computing capabilities. [20]
• This evolution has continued to add flexibility to GPU usage. With this new and now fairly accessible computing capability, many engineers and scientists are starting to look into using the GPU for non-graphical calculations.
[Figure: Texture Mapping [3]]

Why Develop Programs for GPUs
• Speed (probably the sole reason)
  • As graphics, animation, and GUIs become an everyday occurrence in software, the software becomes more and more compute intensive, which makes the user experience slow and arduous.
  • R&D becomes more and more compute intensive.
  • In many machines, the GPU sits idle while the CPU does all the work.
  • GPUs are more efficient than CPUs for certain processes and programs that can take advantage of parallel programming.
  • Once GPU programming languages came along, people began to offload to the GPU work they had once forced the CPU to process.
• CAUTION: Exact Speed Difference: Comparing Apples and Oranges
  • People (companies) have gone to extreme measures to determine which is better and faster: the GPU or the CPU.
  • Unfortunately, this is a very unfair comparison because each has a different purpose.
  • CPU: much broader use; achieves good performance on a wide variety of workloads; CPU cores (the things you run a thread on) are much faster than GPU cores [6]
  • GPU: very specific use, so the architecture can be maximized for that one use; has dozens of cores compared to the CPU's 4-8 cores
  • Processes ideal for the GPU have been measured to run from only 2.5x faster (Intel) to 100x faster (Nvidia) [4][5]

Architecture Difference Between CPU and GPU: Take Away
• The GPU is a supplement, NOT a replacement, for the CPU
• Our goal as programmers should be to:
  • Make wise decisions as to when to take advantage of the GPU's power
  • Help the CPU & GPU work together as efficiently as possible

When to Develop Programs for GPUs
• Converting a program to take advantage of the GPU is not a simple or cheap task. We therefore need to determine which code would be most efficient on the CPU and which would be more efficient if processed by the GPU.
• Graphics rendering
• Problems expressed as data-parallel computations with high arithmetic intensity (a high ratio of arithmetic operations to memory operations) [7]
• Computationally intensive tasks, ideal for GPU processing:
  • Many scientific computing problems
  • Engineering computing problems
  • Simple structured grid PDE methods in computational finance
  • Physical simulations
  • Matrix algebra
  • Image & volume processing
  • Global illumination
  • Ray tracing, photon mapping, radiosity
  • Non-grid streams (which can be mapped to grids)
  • XML parsing
  • Medical imaging
  • Photography
  • Grid computing
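The "high arithmetic intensity" criterion can be made concrete with a rough operation count. The sketch below is only an illustration: the function names are made up for this example, and the counts assume idealized kernels in which every input and output element moves to or from memory exactly once (no cache effects).

```python
def vector_add_intensity(n):
    """Rough arithmetic intensity of c[i] = a[i] + b[i] over n elements."""
    arithmetic_ops = n          # one addition per element
    memory_ops = 3 * n          # two loads and one store per element
    return arithmetic_ops / memory_ops


def matmul_intensity(n):
    """Rough arithmetic intensity of a naive n x n matrix multiply,
    assuming each input/output element moves to or from memory once."""
    arithmetic_ops = 2 * n ** 3  # one multiply and one add per inner step
    memory_ops = 3 * n ** 2      # read A and B once, write C once
    return arithmetic_ops / memory_ops


for size in (16, 256, 4096):
    print(size, round(vector_add_intensity(size), 3), round(matmul_intensity(size), 1))
```

Under these assumptions, vector addition stays at a fixed low ratio no matter how large the input is, while the ratio for matrix multiplication grows with the matrix size, which is one reason matrix algebra appears in the list of GPU-friendly workloads.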

Side-Step: Task-Parallel Versus Data-Parallel
• Data parallelism (loop-level parallelism) (SIMD): distribute the data across different parallel computing nodes and perform the same task on different pieces of the distributed data.
• Task (function/control) parallelism: each processor executes a different thread (or process) on the same or different data. The threads can be the same or different code. [8]
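The two styles can be sketched side by side with Python's standard `concurrent.futures` module. This is a minimal illustration of the structure only: the helper names are hypothetical, and in CPython a thread pool running pure-Python arithmetic shows the shape of each style rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x * x


def data_parallel(values, workers=4):
    """Same task (square) applied to different pieces of the data."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(square, values))


def task_parallel(values, workers=2):
    """Different tasks (sum and max) run concurrently on the same data."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total = pool.submit(sum, values)
        largest = pool.submit(max, values)
        return total.result(), largest.result()


data = [1, 2, 3, 4, 5]
squares = data_parallel(data)
total, largest = task_parallel(data)
print(squares, total, largest)  # [1, 4, 9, 16, 25] 15 5
```

The data-parallel form is the one that maps naturally onto a GPU, where the same operation runs across many elements at once.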

Who Is Already Developing Programs for GPUs
GPUs are finding their way into the following fields: [9][10]
• Databases
• Oil exploration
• Web search engines
• Medical imaging
• Pharmaceutical design
• Financial modeling
• Advanced graphics
• Networked video technology
• Collaborative work environments

Future of CPU / GPU Computing
• Heterogeneous/hybrid computing
  • Tasks split between GPU and CPU
  • Parallel CPU/GPU processing will become the norm in all programs [11]

Do We Really Need to Switch to Heterogeneous Computing?
• Previously (1990s to early 2000s), hardware technology advances allowed performance to increase without the immediate need for change or fundamental restructuring.
• Hardware is starting to hit a quantum wall and a thermal/power wall. Tasks need to be spread out over several processors.
• Different processor architectures excel in different areas. Why make one architecture style do everything?
• Currently there is a lot of wasted processor time. The CPU sits idle while the GPU does its task. The GPU sits idle while the CPU burns itself out trying to do almost everything. [11]
• In the end, GPUs provide a low-cost platform for accelerating high-performance computations. [13]

The ‘Future Computing Challenges’ Tree

CPU / GPU Computing Hurdles
• Hardware: GPU-to-CPU data transfer bottleneck
  • The limitation of the heterogeneous computation model is the significant overhead of memory transfers between the host CPU and the GPU [12]
• Parallel programming concepts
  • Multi-processor chip hardware requires dauntingly complex software that breaks up computing chores into simultaneously processed chunks of code. [21]
• CPU versus GPU battle
• Complexity of current GPU programming languages [13]
  • Fairly complex and error prone at times
  • Optimizing an algorithm for a specific GPU is a time-consuming task that currently requires thorough knowledge of both the algorithm and the hardware [13]
  • Programmers should not have to concern themselves with intricate details of the hardware.
• Portability of current GPU programming languages [13]
  • GPU code lacks portability because code written for one GPU may not run as efficiently (or at all) on non-native GPU hardware.
  • Much GPU code cannot even be ported efficiently across different generations and/or models of the same GPU brand.
  • There is also a desire for GPU code to be able to fall back and run on CPUs if a GPU is not available; this feature is seen in only a very few GPU APIs.
• Complexity of hybrid optimization
  • An entire thesis has been done on CPU/GPU communication optimization.
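The transfer bottleneck can be illustrated with a back-of-envelope model: offloading only pays off when the GPU compute time plus the copy time is still smaller than the CPU time. The function name and every number below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def offload_speedup(cpu_seconds, gpu_seconds, bytes_moved, bytes_per_second):
    """Effective speedup of offloading once host<->device copies are counted."""
    transfer_seconds = bytes_moved / bytes_per_second
    return cpu_seconds / (gpu_seconds + transfer_seconds)


# Hypothetical workload: a kernel that is 20x faster on the GPU,
# but needs 4 GB copied over a link that sustains 8 GB/s.
raw = offload_speedup(10.0, 0.5, 0, 8e9)
effective = offload_speedup(10.0, 0.5, 4e9, 8e9)
print(f"kernel-only speedup: {raw:.1f}x, with transfers: {effective:.1f}x")
```

In this made-up case the copies take as long as the kernel itself, so half of the headline speedup disappears before any result reaches the host.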

Conquering CPU / GPU Computing Hurdles: Hardware [18]
• CPU & GPU hardware constraints: Moore's Law continues & Heisenberg Uncertainty Principle altered
  • In February 2012, physicists created a working transistor consisting of a single atom (transistors are the components that hold bits, making memory and information storage possible) [15]
  • After single-atom transistors, photo transistors will be next, replacing traces on circuit boards with optical signals
  • In 2010 IBM and Intel joined forces, investing $4.4 billion in chip technology [19]
• GPU-to-CPU data transfer bottleneck (hardware): optical guides (IBM and Intel)
  • The limitation of the heterogeneous computation model is the significant overhead of memory transfers between the host CPU and the GPU [12]
  • Both IBM and Intel are investing money and time in photon data transfer technologies [17][18]
  • The plan is to replace copper cables and backplanes. Photon data transfer would significantly reduce CPU-to-GPU communication overhead and hopefully bring transfer times down to negligible levels.

Conquering CPU / GPU Computing Hurdles: Parallel Programming Concepts
• Currently, programs written for an architecture with n processors require a rewrite when migrated to an m-processor architecture in order to benefit from the additional resources. [22]
• Compiler-based parallelization techniques try to automatically find and use partial orders in sequential code, but often fail to match manual optimization.
• Where various techniques fall short
  • POSIX: requires the programmer to specify the partial order between program operations in terms of constructs such as threads, locks, and semaphores
  • OpenMP: requires the programmer to mark the code they believe would perform better with parallel processing
  • OpenCL and CUDA: require the user to know the computational platform, learn the libraries, and learn how to implement them
• Solution: automatic and portable parallel programming
  • TripleP: uses synthesis at compile time to generate parallel binaries from declarative programs. It abstracts the execution order of the program away from the developer and allows for explicit parallelism without requiring architecture-specific annotations and structures (it determines the best way to parallelize the code) [22]
  • DARPA challenged companies and institutes to develop new parallel languages and programming tools back in 2001. [23]
  • PPmodel: helps separate the sequential and parallel parts of a program into blocks without modifying code. Also supports CUDA. (Identifies hotspots.) [24]
  • MARPLE: helps businesses automatically migrate their legacy software systems to a data-parallel platform like the Nvidia CUDA GPU [25]
[Diagram labels: parallel program design; software design; Software Developers having to think about parallel breakdown of program; Software Developers; Language Developers; parallel program]
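The n-versus-m processor problem shows up even in very small programs: code that hard-codes its worker count has to be edited when the machine changes. The sketch below shows the simplest counter-pattern, sizing the work split from the machine at run time. It is only an illustration with hypothetical helper names; it is not how TripleP, PPmodel, or MARPLE work, and it covers just the trivially divisible case.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split(items, parts):
    """Split items into `parts` nearly equal contiguous chunks."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def parallel_sum(items, workers=None):
    """Sum items using however many workers the machine offers."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum, split(items, workers)))


numbers = list(range(1, 101))
print(parallel_sum(numbers, workers=2), parallel_sum(numbers, workers=7))
```

The same call gives the same answer with 2 or 7 workers, but the programmer still had to find the parallel decomposition by hand, which is the burden the tools listed above try to remove.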

Conquering CPU / GPU Computing Hurdles: CPU versus GPU Battle
• Market demands as well as global demands will encourage the progress of technology.
• Mergers (AMD & ATI, 2006)
• Partially unbiased middle person
  • Vendors such as IBM, Dell, and HP realize they need both GPU and CPU, and help facilitate the creation of heterogeneous systems.
• Government
  • Nvidia and Intel worked with DARPA on an exascale computing project in 2010 [30]
  • Nvidia, Intel, AMD, and Whamcloud work with the Department of Energy on the FastForward exascale computing program, July 2012. [26]
• Truly heterogeneous machines may be achievable without intimate relationship & sharing of proprietary information between CPU and GPU companies
• Conclusion:
  • We should not expect or hope for the separate companies to play "friendly". There will always be lawsuits and fighting. Our main concern is that their bickering does not impede the overall progress of computing technology, but instead encourages growth.
  • No one disputes the need for heterogeneous computing. The disputes are over who should do what.

Conquering CPU / GPU Computing Hurdles: Complexity of Current GPU Program APIs & Architecture
• Solutions to: fairly complex and error prone due to parallel programming
  • Improve ease of parallel programming (see parallel programming solutions)
  • Program readability still needs work. Parallel code is more difficult for humans to conceptualize, since it is more natural to think in series.
  • Work on creating a higher-level programming abstraction similar to the stream programming model [13]
  • Far from max efficiency when programming in object-oriented programming languages (C++ is good; Java and everything else are not as close to max efficiency)
• GPU hardware and code learning: ???
[Cartoon. Request: "I need a fast structural analysis tool." Reply: "Okay, that will be about a two-year wait; we have to learn the latest GPU hardware and libraries and write code for that specific GPU (which you must also purchase along with our software). When you upgrade hardware, you must update the software to take maximum advantage of the hardware."]

Conquering CPU / GPU Computing Hurdles: GPU Program Portability (between GPU brands as well as GPU or CPU)
• OpenCL (Khronos; initially Apple, 2008) [28]
  • Khronos: ATI Technologies, Discreet, Evans & Sutherland, Intel Corporation, NVIDIA, Silicon Graphics (SGI), and Sun Microsystems. Today the Khronos Group has roughly 100 member companies, over 30 adopters, and twenty-four conforming members
  • Can be implemented on a number of platforms (including cell phones)
  • When GPU hardware is not present, it can fall back on the CPU to perform the specified work * [28]
  • Supports synchronization over multiple devices
  • Easy to learn
  • Open standard & collaborative effort
  • Shares resources with OpenGL
  • GPUs: Nvidia, ATI, Ivy Bridge & others
• DirectCompute (2009? Microsoft)
• C++ AMP: builds on DirectCompute (2011, Microsoft)
  • GPUs: Nvidia & ATI
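The CPU fallback idea can be sketched at the application level as a simple dispatch. Everything here is hypothetical: `gpu_available` and `saxpy_gpu` are stand-ins invented for this example, and none of it is OpenCL's actual API, where the runtime's own device selection plays this role.

```python
def gpu_available():
    """Hypothetical probe; a real program would query its GPU runtime here."""
    return False


def saxpy_cpu(a, xs, ys):
    """Reference CPU implementation of a*x + y, element by element."""
    return [a * x + y for x, y in zip(xs, ys)]


def saxpy_gpu(a, xs, ys):
    """Placeholder for a device implementation (not provided in this sketch)."""
    raise RuntimeError("no GPU backend in this sketch")


def saxpy(a, xs, ys):
    """Dispatch to the GPU when present, otherwise fall back to the CPU."""
    if gpu_available():
        try:
            return saxpy_gpu(a, xs, ys)
        except RuntimeError:
            pass  # device path failed; use the CPU path below
    return saxpy_cpu(a, xs, ys)


result = saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(result)  # [12.0, 24.0, 36.0]
```

The caller sees one function and one result either way, which is the property that makes a fallback-capable API attractive for shipping software to machines with unknown hardware.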

Conquering CPU / GPU Computing Hurdles: GPU Program Portability (between GPU brands as well as GPU or CPU)
• Sponge: a compilation framework for Nvidia GPUs using synchronous data flow streaming languages
  • Abstracts hardware details [13]
  • Creates write-once optimized CUDA code for a variety of GPU targets
  • Takes care of the GPU-to-host and host-to-GPU communication
  • Also determines which parts of your code (a StreamIt program) are better suited to the GPU and which are better suited to the CPU [13]
  • Improved performance by 3.2x compared to the GPU baseline benchmarks, which come from the StreamIt suite

Conquering CPU / GPU Computing Hurdles: Hybrid/Heterogeneous Computing
• Software that can support hybrid computing: OpenCL, C++ AMP
• Parallel analyzers to aid in process distribution amongst CPU/GPU
• All software mentioned in the pages above

Conclusion: Revisit of the ‘Future Computing Challenges’ Tree
Conclusion: We will not get high-quality, dependable, and reliable software that can survive in this environment (the new frontier) until many of these challenges have been confronted and their complexities reduced.

Conclusion Possible Future & Solutions Summary Table NOTE: Solutions in red highlight are also part of the computing challenges

Conclusion: Breakdown of Parallelism
Conclusion: We need to decide where parallelism belongs and how to abstract the process (for the software programmer) as much as possible.

Conclusion: Breakdown of Parallelism
[Flowchart, shown for two kinds of code: "Task or Project Specific" and "Mathematically Oriented, Generic & Reusable". Decision nodes from START: "Is this a data-parallel process?", "Is this a task-parallel process?", "Is this a sequential process?", "Is it possible to parallel process this algorithm?", "Are 'x-many' or more threads possible?". The YES/NO branches end at CPU or GPU.]
Conclusion: We need better models, guidelines, and programs to help determine where (on which processor) and how processes run most efficiently.
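The flowchart's questions can be written down as a small routing function. This is one possible reading only: the exact branch order of the original chart is not recoverable from the text, `choose_processor` is a made-up name, and the `x_many` threshold of 1000 is an assumed placeholder for the chart's "x-many" threads.

```python
def choose_processor(data_parallel, task_parallel, possible_threads, x_many=1000):
    """One possible reading of the flowchart; x_many is an assumed threshold."""
    if not (data_parallel or task_parallel):
        return "CPU"  # sequential work stays on the CPU
    if data_parallel and possible_threads >= x_many:
        return "GPU"  # wide data-parallel work suits the GPU
    return "CPU"      # task-parallel or narrow work runs on CPU cores


print(choose_processor(True, False, 50_000))   # GPU
print(choose_processor(False, True, 8))        # CPU
print(choose_processor(False, False, 1))       # CPU
```

Even this toy version shows why better models are needed: the right value for the thread threshold depends on the hardware and on transfer costs, neither of which the function can see.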

Conclusion
• It would be in our best interest to pursue hybrid computing in order to keep up with market demands.
• The future of research depends heavily on computing power:
  • Space: predicting the future of the planet, the solar system, and the universe
  • Medical: techniques to find cures for cancer and other diseases are being taken out of the lab and designed into computer software
  • Environmental: collecting data on environmental and weather patterns and creating more eco-compatible human habitats
  • Science: aiding in solving complex mathematical computations to make further strides in scientific discovery
• Information overload needs to be dealt with [31]
  • Increase available space for information
  • Increase focus on massive organization of information

References (fully referenced in report)
[1] http://en.wikipedia.org/wiki/Graphics_processing_unit
[2] http://en.wikipedia.org/wiki/Texture_mapping
[3] http://www.siggraph.org/education/materials/HyperGraph/mapping/r_wolfe/r_wolfe_mapping_1.htm
[4] p451-lee.pdf (only 2.5x (Intel) faster)
[5] http://blogs.nvidia.com/2010/06/gpus-are-only-up-to-14-times-faster-than-cpus-says-intel/
[6] http://stackoverflow.com/questions/28147/feasability-of-gpu-as-a-cpu
[7] http://wiki.accelereyes.com/wiki/index.php/Introduction_to_GPU_Computing
[8] http://software.intel.com/en-us/articles/choose-the-right-threading-model-task-parallel-or-data-parallel-threading/
[9] the_future_of_Massively_parallel_and_GPU_Computing (pdf)
[10] https://computing.llnl.gov/tutorials/parallel_comp/
[11] interact-16-paper-5.pdf
[12] http://wiki.accelereyes.com/wiki/index.php/Introduction_to_GPU_Computing
[13] Sponge_Portable_Stream_Programming_on_Graphics_Engines.pdf
[14] http://www.nature.com/nphys/journal/vaop/ncurrent/full/nphys1734.html (The uncertainty principle in the presence of quantum memory, Nature Physics)
[15] http://www.sciencedaily.com/releases/2012/02/120219191244.htm
[16] http://www.sciencedaily.com/releases/2007/08/070826162731.htm
[17] http://www.intel.com/pressroom/archive/releases/2010/20100727comp_sm.htm
[18] ibm+opcb+roadmap+and+tech+-+jeff+kash.pdf
[19] http://news.cnet.com/8301-13924_3-20112553-64/ibm-intel-group-to-invest-$4.4-billion-in-chip-tech/
[20] http://www.youtube.com/watch?v=Cmh1EHXjJsk
[21] ManyCore121707.pdf
[22] p1922-zaraket.pdf
[23] http://www.economist.com/node/18750706
[24] p138-jacob.pdf
[25] p131-sarkar.pdf
[26] http://www.theverge.com/2012/7/14/3157985/nvidia-intel-amd-department-of-energy-fastforward
[27] http://www.digitaltrends.com/computing/how-nvidias-kepler-chips-could-end-pcs-and-tablets-as-we-know-them/
[28] 0112acij09.pdf
[29] p91-song.pdf
[30] http://www.informationweek.com/news/government/enterprise-architecture/226700040
