Enhancing GPU Performance with Dynamic Task Parallelism and Work-Stealing Runtime System

Dynamic Task Parallelism with a GPU Work-Stealing Runtime System
Max Grossman
Advisor: Dr. Vivek Sarkar
Rice University

Background
• The GPU is a promising example of heterogeneous hardware
  • Hundreds of simultaneous threads
  • High memory bandwidth
• NVIDIA’s CUDA makes general-purpose programming on GPUs possible, but not easy for the average programmer

CPUs and GPUs have fundamentally different design philosophies
[Figure: block diagrams of a single CPU core (Control, Cache, ALUs, DRAM) and of multiple GPU processors (Streaming Multiprocessors, DRAM)]
• Figure source: David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 2010.

Motivation & Approach
• The CUDA programming model launches a batch of data-parallel threads
• Can we do better with dynamic task parallelism?
• Our approach
  • Manage task execution across multiple streaming multiprocessors (SMs) in a GPU device by introducing a hybrid work-stealing/work-sharing runtime system
  • Manage multiple CUDA devices for the user
  • Hide device memory allocation and communication from the user
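The slides do not show the runtime's internals, so the following is only a minimal, single-threaded Python sketch of the general idea of combining work sharing with work stealing. It is not the authors' GPU implementation. Each simulated worker stands in for an SM and owns a double-ended queue. The names `run_work_stealing`, `expand`, and `binary_tree`, the round-robin initial distribution, and the "steal from the longest queue" victim policy are all assumptions made for illustration.

```python
from collections import deque


def run_work_stealing(num_workers, root_tasks, expand):
    """Simulate round-based work stealing (illustrative only).

    expand(task) returns the list of child tasks that `task` spawns.
    Returns how many tasks each worker executed.
    """
    queues = [deque() for _ in range(num_workers)]
    # Work sharing (assumed policy): hand out the initial tasks round-robin.
    for i, task in enumerate(root_tasks):
        queues[i % num_workers].append(task)

    executed = [0] * num_workers
    while any(queues):  # loop until every queue is empty
        for w in range(num_workers):
            if queues[w]:
                # The owner takes the newest task from the tail of its own queue.
                task = queues[w].pop()
            else:
                # Work stealing (assumed policy): pick the longest queue as victim.
                victim = max(range(num_workers), key=lambda v: len(queues[v]))
                if not queues[victim]:
                    continue  # nothing to steal in this round
                # The thief takes the oldest task from the head of the victim's queue.
                task = queues[victim].popleft()
            executed[w] += 1
            queues[w].extend(expand(task))  # spawned children stay local
    return executed


def binary_tree(depth):
    """Hypothetical workload: each task spawns two children until depth 0."""
    return [depth - 1, depth - 1] if depth > 0 else []


counts = run_work_stealing(4, [4], binary_tree)
print(counts, sum(counts))
```

All work starts on one worker here, and the other three only obtain tasks by stealing. A real GPU runtime would additionally need atomic operations or locks on the queues, which this sequential simulation sidesteps.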

Load Balance Results
• NQueens(14)
• Worst-case load imbalance is 9.8x with static subtree assignment vs. 1.17x with dynamic work stealing
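The slide does not define its imbalance metric or say how subtrees were assigned, so the sketch below is only one plausible way to quantify imbalance under static assignment. It assumes that imbalance means the busiest worker's load divided by the mean load, that load is the number of search-tree nodes visited, and that first-row subtrees are dealt out round-robin. The function names `count_nodes` and `static_imbalance` are hypothetical, and the code is not expected to reproduce the 9.8x or 1.17x figures.

```python
def count_nodes(n, cols=()):
    """Count N-Queens search-tree nodes at and below a partial placement.

    cols[r] is the column of the queen already placed in row r.
    """
    row = len(cols)
    total = 1  # count this node itself
    if row == n:
        return total
    for c in range(n):
        # A queen at (row, c) is legal if it shares no column or diagonal
        # with any queen placed in an earlier row.
        if all(c != pc and abs(c - pc) != row - pr
               for pr, pc in enumerate(cols)):
            total += count_nodes(n, cols + (c,))
    return total


def static_imbalance(n, num_workers):
    """Assign first-row subtrees round-robin and return max load / mean load."""
    loads = [0] * num_workers
    for c in range(n):
        loads[c % num_workers] += count_nodes(n, (c,))
    return max(loads) / (sum(loads) / num_workers)


print(static_imbalance(8, 4))
```

A value of 1.0 would mean perfectly even loads. Because subtree sizes are unknown until they are explored, a static split can only guess at them, which is the motivation for redistributing work dynamically.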

Performance Results: NQueens

Performance Results: Crypt

Conclusions
• A GPU work-stealing runtime that supports dynamic task parallelism on hardware intended for data parallelism
• Showed the effectiveness of work-stealing queues in dynamically distributing work among SMs
• Future work:
  • Fully integrate this runtime with the CnC-HC data flow coordination language being developed at Rice University
