
Programming Models, Languages, and Compilation for Accelerator-Based Architectures






Presentation Transcript


  1. Programming Models, Languages, and Compilation for Accelerator-Based Architectures. R. Govindarajan, SERC, IISc, govind@serc.iisc.ernet.in. ATIP 1st Workshop on HPC in India @ SC-09.

  2. Current Trend in HPC Systems
  • Top500 systems have hundreds of thousands (100,000s) of cores; performance scaling in such large HPC systems is a major challenge.
  • The number of cores per processor/node keeps increasing: 4-6 cores per processor, 16-24 cores per node, so there is parallelism even at the node level.
  • Top systems use accelerators (GPUs and Cell BEs), with 1000s of processing elements in a single GPU!

  3. HPC Design Using Accelerators
  • Accelerators deliver a high level of performance.
  • A variety of general-purpose hardware accelerators exists: GPUs (NVIDIA, ATI), accelerators such as ClearSpeed and the Cell BE, and programmable accelerators, e.g., FPGA-based HPC.
  • A plethora of instruction sets, even just for SIMD.
  • HPC design using accelerators must exploit instruction-level parallelism, data-level parallelism on SIMD units, and thread-level parallelism on multiple units/multi-cores.
  • Challenges: portability across different generations and platforms, and the ability to exploit different types of parallelism.

  4. Accelerators – Cell BE

  5. Accelerators - 8800 GPU

  6. The Challenge: a proliferation of device-specific programming interfaces (OpenCL, AMD CAL, CUDA, ARM Neon, SSE, AltiVec).

  7. Programming in Accelerator-Based Architectures
  Develop a framework that:
  • is programmed in a higher-level language, and is efficient;
  • can exploit different types of parallelism on different hardware, including parallelism across heterogeneous functional units;
  • is portable across platforms, not device-specific!

  8. Existing Approaches
  [Figure: existing compilation flows. CUDA/OpenCL code is compiled by nvcc/JIT to PTX or ATI CAL IL and runs on GPUs; Brook code is compiled by the Brook compiler to ATI CAL IL for GPUs; C/C++ code goes through an auto-vectorizing compiler to SSE/AltiVec on CPUs.]

  9. Existing Approaches (contd.)
  [Figure: further compilation flows. Accelerator programs go through a standard compiler and a runtime that targets GPUs via DirectX; OpenMP programs use a standard compiler targeting CPUs and the Cell BE; StreamIt programs use the StreamIt compiler targeting CPUs, RAW, and GPUs.]

  10. What is needed?
  [Figure: many source programming models (array languages such as Matlab, OpenMP, MPI, parallel languages, streaming languages, CUDA/OpenCL) feeding a common compiler/runtime system that enables synergistic execution on multiple heterogeneous cores: multicores with SSE, GPUs, the Cell BE, and other accelerators.]

  11. What is needed?
  [Figure: the same stack, now with the compiler lowering all source models to PLASMA, a high-level IR, and a runtime system beneath it driving synergistic execution on multicores with SSE, GPUs, the Cell BE, and other accelerators.]

  12. Stream Programming Model
  • A higher-level programming model in which nodes represent computation and channels represent communication (producer/consumer relations) between them.
  • Exposes pipelined parallelism, task-level parallelism, and temporal streaming of data.
  • Examples: Synchronous Data Flow (SDF), stream flow graphs, StreamIt, Brook, …
  • Compilation techniques exist for achieving rate-optimal, buffer-optimal, software-pipelined schedules.
  • Mapping such applications to accelerators such as GPUs and the Cell BE.

  13. The StreamIt Language
  StreamIt programs are a hierarchical composition of three basic constructs:
  • Pipeline
  • SplitJoin, with a round-robin or duplicate splitter
  • Feedback Loop
  Filters can be stateful and can peek at values beyond those they pop.
  [Figure: a pipeline of filters; a splitjoin with splitter and joiner around parallel streams; a feedback loop with joiner, body, and splitter.]

  14. Why StreamIt on GPUs?
  • More "natural" than frameworks like CUDA or CTM, with an easier learning curve than CUDA: no need to think in terms of "threads" or blocks.
  • StreamIt programs are easier to verify.
  • The schedule can be determined statically.

  15. Issues in Mapping StreamIt to GPUs
  • Work distribution across multiprocessors: GPUs have hundreds of processing pipes! Exploit task-level and data-level parallelism and schedule work across the multiprocessors.
  • Multiple concurrent threads per SM exploit DLP; the execution configuration determines task granularity and concurrency.
  • Lack of synchronization between the processors of the GPU.
  • Managing CPU-GPU memory bandwidth.

  16. Stream Graph Execution
  [Figure: a stream graph with filters A, B, C, D and its software-pipelined execution across SMs SM1-SM4, illustrating task parallelism (independent filters on different SMs), pipeline parallelism (filters of successive iterations overlapped), and data parallelism (multiple instances of a filter, e.g. A1-A4, running concurrently).]

  17. Our Approach for GPUs
  • Code for SAXPY in StreamIt:

  float->float filter saxpy {
    float a = 2.5f;
    work pop 2 push 1 {
      float x = pop();
      float y = pop();
      float s = a * x + y;
      push(s);
    }
  }
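  A minimal sketch, not the framework's actual generated code, of one plausible CUDA kernel for this filter: each thread runs one data-parallel instance of saxpy, popping two values from the input buffer and pushing one to the output. The kernel name, buffer layout, and launch configuration are illustrative assumptions.

  __global__ void saxpy_filter(const float *in, float *out, int n) {
      int tid = blockIdx.x * blockDim.x + threadIdx.x;   // one filter instance per thread
      if (tid < n) {
          float a = 2.5f;
          float x = in[2 * tid];        // pop()
          float y = in[2 * tid + 1];    // pop()
          out[tid] = a * x + y;         // push(s)
      }
  }

  // Illustrative launch: the execution configuration (blocks x threads) is exactly what
  // the compiler must choose; 128 threads per block is an assumed value here.
  // saxpy_filter<<<(n + 127) / 128, 128>>>(d_in, d_out, n);

  Note that the simple in[2 * tid] indexing above is the uncoalesced layout; slide 20 describes the interleaved layout used to obtain coalesced accesses.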

  18. Our Approach (contd.)
  • Multithreading: identify a good execution configuration to exploit the right amount of data parallelism.
  • Memory: an efficient buffer layout scheme ensures that all accesses to GPU memory are coalesced.
  • Task partitioning between GPU and CPU cores: a work scheduling and processor (SM) assignment problem that takes communication bandwidth restrictions into account.

  19. Execution Configuration
  [Figure: 128 data-parallel instances A0-A127 and B0-B127 mapped onto SMs under two different thread counts.]
  • More threads exploit more data-level parallelism: the execution time of a macro node drops from 32 to 16.
  • Total execution time on 2 SMs = MII = 64 / 2 = 32 (see the sketch below).
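  A small illustrative helper, an assumption about how the slide's numbers arise rather than code from the framework: for a software-pipelined schedule, the initiation interval cannot be smaller than the total work divided by the number of SMs available.

  // Hypothetical helper: resource-constrained lower bound on the initiation interval.
  int mii_lower_bound(int total_work, int num_sms) {
      return (total_work + num_sms - 1) / num_sms;   // ceiling division
  }
  // e.g. mii_lower_bound(64, 2) == 32, matching the slide's 64 / 2 = 32.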

  20. Coalesced Memory Access
  [Figure: stream buffer elements d0-d7 for filter instances B0-B3, first laid out as one contiguous chunk per thread and then interleaved so that threads 0-3 access consecutive addresses.]
  • GPUs have a banked memory architecture with a very wide memory channel.
  • Accesses by the threads in an SM have to be coalesced (see the sketch below).
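  To make the layout idea concrete, here is a hedged CUDA sketch (not the framework's generated code; buffer names and shapes are assumptions). In the first kernel each thread walks its own contiguous chunk, so simultaneous accesses by neighbouring threads are scattered (uncoalesced); in the second, the buffer is interleaved so that in every loop iteration consecutive threads read consecutive addresses, which coalesces.

  // Uncoalesced: thread tid reads its own contiguous chunk buf[tid*itemsPerThread ...].
  __global__ void consume_chunked(const float *buf, float *out, int itemsPerThread) {
      int tid = blockIdx.x * blockDim.x + threadIdx.x;
      float acc = 0.0f;
      for (int i = 0; i < itemsPerThread; ++i)
          acc += buf[tid * itemsPerThread + i];
      out[tid] = acc;
  }

  // Coalesced: in iteration i, threads 0..nThreads-1 read consecutive addresses.
  __global__ void consume_interleaved(const float *buf, float *out, int itemsPerThread) {
      int nThreads = gridDim.x * blockDim.x;   // total threads launched
      int tid = blockIdx.x * blockDim.x + threadIdx.x;
      float acc = 0.0f;
      for (int i = 0; i < itemsPerThread; ++i)
          acc += buf[i * nThreads + tid];
      out[tid] = acc;
  }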

  21. Execution on CPU and GPU
  • Problem: partition the work across the CPU and the GPU.
  • Data transfer between GPU and host memory is required, depending on the partition.
  • A coalesced buffer layout is efficient for the GPU but harmful for the CPU, so data must be transformed before it is moved from/to GPU memory (a sketch follows below).
  • Goal: reduce the overall execution time, taking memory transfer and transform delays into account.
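  A minimal host-side sketch of the transform-then-transfer step, under the layout assumption from the previous sketch (the function and buffer names are hypothetical): the CPU-friendly chunked buffer is re-interleaved into a staging buffer and then copied to the device.

  #include <cuda_runtime.h>

  void transform_and_upload(const float *host_chunked, float *host_staging,
                            float *dev_buf, int itemsPerThread, int nThreads) {
      // Re-interleave: element i of thread t moves to position i*nThreads + t,
      // so that GPU threads reading buf[i*nThreads + tid] get coalesced accesses.
      for (int t = 0; t < nThreads; ++t)
          for (int i = 0; i < itemsPerThread; ++i)
              host_staging[i * nThreads + t] = host_chunked[t * itemsPerThread + i];
      cudaMemcpy(dev_buf, host_staging,
                 sizeof(float) * itemsPerThread * nThreads, cudaMemcpyHostToDevice);
  }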

  22. Scheduling and Mapping
  [Figure: the initial StreamIt graph with per-filter CPU and GPU execution costs and edge transfer costs, and the partitioned graph after each filter A-E has been assigned to either the CPU or the GPU.]
  • Resulting loads: CPU 45, GPU 40, DMA 40; MII = 45 (the maximum of the three, as sketched below).
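  A tiny illustrative helper (an assumption about how the slide's MII is obtained, not framework code): once the partition is fixed, the initiation interval of the software-pipelined schedule is bounded by the most heavily loaded resource.

  // Hypothetical helper: MII is the maximum of the per-resource loads.
  int mii_from_loads(int cpu_load, int gpu_load, int dma_load) {
      int m = cpu_load > gpu_load ? cpu_load : gpu_load;
      return m > dma_load ? m : dma_load;   // e.g. max(45, 40, 40) = 45, as on the slide
  }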

  23. Scheduling and Mapping
  [Figure: the steady-state software-pipelined schedule across the CPU, the DMA channel, and the GPU. Different iterations of filters A-E execute concurrently (e.g. An on the GPU while Bn-2, Dn-6, and En-7 run on the CPU), with DMA transfers overlapped with computation.]

  24. Compiler Framework
  StreamIt Program → Generate Code for Profiling → Execute Profile Runs → Configuration Selection → Partitioning (an ILP partitioner or a heuristic partitioner, each performing instance partitioning and task partitioning) → Modulo Scheduling → Code Generation → CUDA Code + C Code.

  25. Experimental Results on Tesla
  [Figure: speedup chart. Synergistic execution yields significant speedups, exceeding 52x, 32x, and 65x on individual benchmarks.]

  26. What is needed?
  [Figure (recap): the same stack as before; source programming models are compiled through the PLASMA high-level IR and a runtime system onto multicores with SSE, GPUs, the Cell BE, and other accelerators.]

  27. IR: What should a solution provide?
  • Rich abstractions for functionality.
  • Independence from any single architecture; portability without compromising efficiency.
  • Scale-up and scale-down: from a single-core embedded processor to a multi-core workstation.
  • The ability to take advantage of accelerators (GPU, Cell, …).
  • Transparent distributed memory.
  • PLASMA: Portable Programming for PLASTIC SIMD Accelerators.

  28. PLASMA IR
  [Figure: the IR graph for matrix-vector multiply, with Slice nodes over the matrix M and vector V feeding a Par Mul node and a Reduce Add node.]
  Matrix-vector multiply, one row per iteration:

  par mul, temp, A[i*n : i*n+n : 1], X
  reduce add, Y[i : i+1 : 1], temp
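  A plain C sketch of what these two IR instructions compute for row i (a hypothetical reference implementation, not PLASMA output): an elementwise multiply of the row slice A[i*n .. i*n+n-1] with X, followed by an add-reduction into Y[i].

  void mv_row(const float *A, const float *X, float *Y, int i, int n) {
      float acc = 0.0f;                 // accumulator for the reduce-add
      for (int j = 0; j < n; ++j)
          acc += A[i * n + j] * X[j];   // par mul of the row slice with X, then reduce add
      Y[i] = acc;
  }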

  29. Our Framework
  • "CPLASM", a prototype high-level assembly language.
  • A prototype PLASMA IR compiler.
  • Currently supported targets: C (scalar), SSE3, CUDA (NVIDIA GPUs).
  • Future targets: Cell, ATI, ARM Neon, …
  • Compiler optimizations for this "vector" IR.

  30. Our Framework (contd.) • Plenty of optimization opportunities!

  31. PLASMA IR Performance • Normalized execution time comparable to that of a hand-tuned library!

  32. Ongoing Work
  [Figure: the same compiler / PLASMA high-level IR / runtime stack targeting multicores with SSE, GPUs, the Cell BE, and other accelerators.]
  • Look at other high-level languages!
  • Target other accelerators.

  33. Compiling OpenMP / MPI / X10
  • Mapping the semantics.
  • Exploiting data parallelism and task parallelism.
  • Communication and synchronization across CPU / GPU / multiple nodes.
  • Accelerator-specific optimization: memory layout, memory transfer, …
  • Performance and scaling.

  34. Thank You !! Acknowledgements • My students! • IISc and SERC • Microsoft and Nvidia • ATIP, NSF, all Sponsors • ONR
