Enhancing GPU Performance with Dynamic Task Parallelism and Work-Stealing Runtime System

DESCRIPTION

This presentation explores GPUs as heterogeneous computing hardware through dynamic task parallelism. NVIDIA’s CUDA makes general-purpose, data-parallel programming on GPUs possible, but it is not easy for the average programmer. The presentation introduces a hybrid work-stealing/work-sharing runtime system that manages task execution across the streaming multiprocessors (SMs) of a GPU, manages multiple CUDA devices for the user, and hides device memory allocation and communication. On NQueens(14), the worst-case load imbalance is 9.8x with static subtree assignment versus 1.17x with dynamic work-stealing. Future work is to fully integrate this runtime with the CnC-HC data flow coordination language being developed at Rice University.

Presentation Transcript

  1. Dynamic Task Parallelism with a GPU Work-Stealing Runtime System • Max Grossman • Advisor: Dr. Vivek Sarkar • Rice University

  2. Background • GPU is a promising example of heterogeneous hardware • Hundreds of simultaneous threads • High memory bandwidth • NVIDIA’s CUDA makes general purpose programming on GPUs possible, but not easy for the average programmer

  3. CPUs and GPUs have fundamentally different design philosophies • [Figure: a single CPU core (control, ALUs, cache, DRAM) beside multiple GPU processors (streaming multiprocessors, DRAM)] • Figure source: David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 2010.

  4. Motivation & Approach • CUDA programming model launches a batch of data-parallel threads • Can we do better with dynamic task parallelism? • Our approach • Manage task execution across multiple streaming multiprocessors (SMs) in a GPU device by introducing a hybrid work-stealing/work-sharing runtime system • Manage multiple CUDA devices for the user • Hide device memory allocation and communication from user

  5. Load Balance Results • NQueens(14) • Worst-case load imbalance: 9.8x for static subtree assignment vs. 1.17x for dynamic work-stealing

  6. Performance Results: NQueens

  7. Performance Results: Crypt

  8. Conclusions • GPU work stealing runtime which supports dynamic task parallelism, on hardware intended for data parallelism • Showed effectiveness of work stealing queues in dynamically distributing work between SMs • Future work: • Fully integrate this runtime with the CnC-HC data flow coordination language being developed at Rice University
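The contrast between static subtree assignment and dynamic work-stealing on NQueens can be illustrated with a small CPU-side simulation. This is only a sketch under stated assumptions, not the runtime described in the slides: it runs in plain Python rather than CUDA, the "workers" stand in for SMs, thieves deterministically pick the fullest deque as their victim, task cost is approximated by the number of search-tree nodes, and load imbalance is taken to be the maximum load divided by the mean load (the slides do not say which definition they use). All function names are hypothetical.

```python
from collections import deque

def children(n, task):
    """Yield the legal one-row extensions of a partial N-Queens placement."""
    row, cols, d1, d2 = task
    if row == n:  # full placement: a leaf with no children
        return
    for c in range(n):
        if c not in cols and (row - c) not in d1 and (row + c) not in d2:
            yield (row + 1, cols | {c}, d1 | {row - c}, d2 | {row + c})

def subtree_nodes(n, task):
    """Search-tree nodes in the subtree rooted at `task` (assumed cost proxy)."""
    return 1 + sum(subtree_nodes(n, child) for child in children(n, task))

def root_tasks(n):
    """One task per position of the first-row queen."""
    empty = frozenset()
    return list(children(n, (0, empty, empty, empty)))

def static_loads(n, workers):
    """Static assignment: whole first-row subtrees are dealt out round-robin."""
    loads = [0] * workers
    for i, task in enumerate(root_tasks(n)):
        loads[i % workers] += subtree_nodes(n, task)
    return loads

def stealing_loads(n, workers):
    """Work-stealing: same initial deal, but idle workers steal from others."""
    deques = [deque() for _ in range(workers)]
    for i, task in enumerate(root_tasks(n)):
        deques[i % workers].append(task)
    loads = [0] * workers
    while any(deques):
        for w in range(workers):  # workers take turns in lockstep
            if deques[w]:
                task = deques[w].pop()  # owner pops newest task (LIFO)
            else:
                victim = max(range(workers), key=lambda v: len(deques[v]))
                if not deques[victim]:
                    continue  # nothing to steal on this turn
                task = deques[victim].popleft()  # thief takes oldest task
            loads[w] += 1  # processing one node costs one unit
            deques[w].extend(children(n, task))  # spawn child tasks locally
    return loads

def imbalance(loads):
    """Maximum load divided by mean load (assumed definition)."""
    return max(loads) * len(loads) / sum(loads)

if __name__ == "__main__":
    for name, loads in (("static", static_loads(8, 3)),
                        ("stealing", stealing_loads(8, 3))):
        print(f"{name:9s} loads={loads} imbalance={imbalance(loads):.2f}x")
```

Both strategies process the same total number of nodes; only the distribution across workers differs. Stealing from the opposite end of the victim's deque hands the thief an older task, which in a depth-first search tends to be a larger subtree.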
