A Dynamic Scheduling Framework for Emerging Heterogeneous Systems Vignesh Ravi and Gagan Agrawal

Presentation Transcript

  1. A Dynamic Scheduling Framework for Emerging Heterogeneous Systems • Vignesh Ravi and Gagan Agrawal • Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210

  2. Motivation • Heterogeneous architectures are now very common • E.g., today’s desktops and notebooks • Multi-core CPU + graphics card on PCI-E, AMD APUs … • 3 of the top 5 supercomputers are heterogeneous (as of Nov 2011) • They use multi-core CPUs and a GPU (C2050) on each node • Application development for multi-core CPUs and for GPUs is still independent • Resources may be under-utilized • Can a multi-core CPU and a GPU be used simultaneously for a single computation?

  3. Outline • Challenges Involved • Analyzing Architectural Tradeoffs and Communication Patterns • Cost Model for choosing Chunk Size • Optimized Dynamic Work Distribution Schemes • Experimental Results • Conclusions

  4. Challenges Involved • CPUs and GPUs differ in compute power, memory sizes, and latencies • CPU / GPU relative performance varies across • Each application • Every combination of CPU and GPU • Different problem sizes • Effective work distribution is critical for performance • Manual or static distribution is extremely cumbersome • Dynamic distribution schemes are essential • They must consider tradeoffs due to heterogeneity in CPU and GPU • They must adapt to varying CPU/GPU performance across applications • They must adapt to different problem sizes and combinations of CPU and GPU

  5. Contributions • A general dynamic scheduling framework for data parallel loops on heterogeneous CPU/GPU systems • Identify critical factors arising from architectural tradeoffs and communication patterns • Identify “chunk size” as the key factor for performance • Develop a cost model for heterogeneous systems • Derive two optimized dynamic scheduling schemes • Non-Uniform-Chunk Distribution Scheme (NUCS) • Two-Level Hybrid Distribution Scheme • Using four applications representing two distinct communication patterns, we show: • A 35-75% performance improvement using our dynamic schemes • Performance within 7% of that obtained from the best static distribution

  6. Analysis of Architectural Tradeoffs and Communication Patterns

  7. CPU – GPU Architectural Tradeoffs Important observations based on architecture • Each CPU core is slower than a GPU • GPUs have a smaller memory capacity than CPUs • GPU memory latency is very high • CPU memory latency is much smaller by comparison Required optimizations • Minimize GPU memory transfer overheads • Minimize the number of GPU kernel invocations • Reduce potential resource idle time

  8. Analysis of Communication Patterns • We analyze communication patterns in data parallel loops • Divide the input dataset into a large number of chunklets • Chunklets can be scheduled in arbitrary order (data parallel) • Processing each element involves local and/or global updates • Global updates involve only associative/commutative operations • Global updates avoid races by privatizing global elements • Global elements may be shared by all or a subset of processing elements • In this work, we consider two distinct communication patterns: • Generalized Reduction Structure • Structured Grids (Stencil Computations)

  9. Generalized Reduction Computations
     {* Outer sequential loop *}
     While (unfinished) {
       {* Reduction loop *}
       Foreach (element e in chunklet) {
         (i, val) = compute(e)
         RObj(i) = Reduc(RObj(i), val)
       }
     }
     • Similar to the Map-Reduce model • But only one stage, Reduction • The reduction object, RObj, is exposed to the programmer • The reduction object lives in shared memory [race conditions] • The reduction operation, Reduc, is associative and commutative • All updates are global updates • Global elements are shared by all processing elements
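The reduction loop above, combined with the privatization idea from slide 8, can be sketched in Python as follows. This is only an illustration under assumptions: `process_chunklet`, `combine`, the parity-sum `compute`, and the additive `reduc` are hypothetical names and choices, not taken from the presentation. Each chunklet accumulates into its own private copy of the reduction object, and the copies are merged afterwards, which is safe because the reduction operation is associative and commutative.

```python
from collections import defaultdict


def process_chunklet(chunklet, compute, reduc, identity):
    """Reduction loop over one chunklet, using a private copy of RObj."""
    robj = defaultdict(lambda: identity)
    for e in chunklet:
        i, val = compute(e)            # which RObj slot to update, and with what
        robj[i] = reduc(robj[i], val)  # update the private copy, so no races
    return robj


def combine(private_robjs, reduc, identity):
    """Merge the private copies into the final reduction object."""
    merged = defaultdict(lambda: identity)
    for robj in private_robjs:
        for i, val in robj.items():
            merged[i] = reduc(merged[i], val)
    return dict(merged)


def compute(e):
    # Hypothetical example: bucket each number by parity.
    return e % 2, e


def reduc(a, b):
    # Addition is both associative and commutative.
    return a + b


data = list(range(10))
chunklets = [data[k:k + 3] for k in range(0, len(data), 3)]
private = [process_chunklet(c, compute, reduc, 0) for c in chunklets]
result = combine(private, reduc, 0)
print(result)  # {0: 20, 1: 25}
```

Because the chunklets are independent, they could be processed in any order, or by different devices, and the merged result would stay the same.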

  10. Structured Grid Computations • Stencil kernels are instances of structured grids • They involve nearest-neighbor computations • Input is partitioned along rows for parallelization
      Example: 2-D, 5-point stencil kernel
      For i = 1 to num_rows_chunklet {
        For j = 1 to y-1 {
          B[i,j] = C0 * (A[i,j] + A[i+1,j] + A[i-1,j] + A[i,j+1] + A[i,j-1])
        }
      }
      Rewriting the stencil kernel as a reduction
      For i = 1 to num_rows_chunklet {
        For j = 1 to y {
          if (is_local_row(i)) {
            /* local update */
            B[i,j] += C0 * A[i,j];
            B[i+1,j] += C0 * A[i,j];
            B[i-1,j] += C0 * A[i,j];
            B[i,j+1] += C0 * A[i,j];
            B[i,j-1] += C0 * A[i,j];
          }
          else
            /* global update */
            Reduc(offset) = Reduc(offset) op A[i,j];
        }
      }
      • Rewrite as a reduction while maintaining correctness • Processing involves both local and global updates • Global elements are shared by only a subset of processing elements
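A small Python sketch can check that the reduction-style (scatter) form produces the same result as the usual 5-point stencil when the grid is partitioned along rows. Everything here is an assumption made for illustration: the function names, the value of `C0`, the choice to update only interior points, and the decision to classify each individual update as local or global by its target row (the slide's pseudocode classifies by source row instead). Updates that fall outside a chunk's own rows are collected in a private halo buffer and merged afterwards.

```python
C0 = 0.2  # assumed stencil coefficient


def stencil_gather(A):
    """Reference 5-point stencil: each interior point reads its neighbors."""
    rows, cols = len(A), len(A[0])
    B = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            B[i][j] = C0 * (A[i][j] + A[i + 1][j] + A[i - 1][j]
                            + A[i][j + 1] + A[i][j - 1])
    return B


def stencil_scatter_chunk(A, row_lo, row_hi):
    """Scatter form for rows [row_lo, row_hi): each element pushes its
    contribution to itself and its four neighbors."""
    rows, cols = len(A), len(A[0])
    local = {r: [0.0] * cols for r in range(row_lo, row_hi)}
    halo = {}  # private copies of global elements (rows owned by other chunks)
    for i in range(row_lo, row_hi):
        for j in range(cols):
            contrib = C0 * A[i][j]
            for ti, tj in ((i, j), (i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                if not (1 <= ti < rows - 1 and 1 <= tj < cols - 1):
                    continue  # only interior points are updated
                if row_lo <= ti < row_hi:
                    local[ti][tj] += contrib                           # local update
                else:
                    halo.setdefault(ti, [0.0] * cols)[tj] += contrib   # global update
    return local, halo


def stencil_by_chunks(A, chunk_rows):
    """Process row chunks independently, then combine local and halo parts."""
    rows, cols = len(A), len(A[0])
    B = [[0.0] * cols for _ in range(rows)]
    for lo in range(0, rows, chunk_rows):
        local, halo = stencil_scatter_chunk(A, lo, min(lo + chunk_rows, rows))
        for part in (local, halo):
            for r, vals in part.items():
                for j, v in enumerate(vals):
                    B[r][j] += v
    return B
```

In this sketch only the rows directly above and below a chunk ever appear in its halo, which matches the slide's remark that global elements are shared by only a subset of processing elements.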

  11. Basic Distribution Scheme & Optimization Goals Uniform Distribution Scheme • Global work queue • An idle processor consumes work from the queue • FCFS policy • A fast worker ends up processing more than a slow worker • A slow worker still processes a reasonable portion of the data Optimization Goals • Ensure a sufficient number of chunks • Minimize GPU data transfer and kernel invocation overhead • Minimize the number of global element allocations • Minimize the number of distinct processes that share a global element [Diagram: a master/job scheduler serving fast and slow workers, Worker 1 to Worker n]
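The behavior of the uniform scheme can be illustrated with a small event-driven simulation in Python. This is a sketch under assumptions: `simulate_uniform`, the worker speeds, and the chunk counts are invented for the example and do not come from the presentation. Whichever worker becomes idle first takes the next equally sized chunk from the global queue.

```python
import heapq


def simulate_uniform(num_chunks, chunk_size, speeds):
    """Simulate FCFS consumption of equally sized chunks.

    speeds[w] is the (assumed) processing rate of worker w in units per second.
    Returns (chunks processed per worker, finish time per worker).
    """
    # Min-heap of (time at which the worker becomes idle, worker id).
    idle_at = [(0.0, w) for w in range(len(speeds))]
    heapq.heapify(idle_at)
    processed = [0] * len(speeds)
    finish = [0.0] * len(speeds)
    for _ in range(num_chunks):
        t, w = heapq.heappop(idle_at)       # the first idle worker gets the chunk
        done = t + chunk_size / speeds[w]
        processed[w] += 1
        finish[w] = done
        heapq.heappush(idle_at, (done, w))
    return processed, finish


# One slow worker (rate 1.0) and one worker that is four times faster.
processed, finish = simulate_uniform(num_chunks=10, chunk_size=4, speeds=[1.0, 4.0])
print(processed)  # [2, 8]
print(finish)     # [8.0, 8.0]
```

The fast worker processes most of the chunks while the slow worker still contributes, as the slide describes. With larger chunks, the slow worker's final chunk can leave the fast worker idle, which is the issue the cost model on the following slides addresses.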

  12. Cost Model for Choosing Chunk Size

  13. Chunk Size: A Key Factor Chunk size affects two factors that directly determine performance • GPU kernel invocation and data transfer cost • Resource idle time due to heterogeneous processing elements

  14. Cost Model for Choosing Chunk Size GPU Kernel Call & Transfer Overheads (GKT) • Each data transfer has 3 factors: latency, transfer cost, and kernel invocation • The 1st and 3rd factors depend on the number of chunks (and hence on the chunk size) • The 2nd factor is constant for the entire dataset size Idle Time • Happens at the last iteration of the processing • The slower processor takes more time, while the faster processor is idle • The GPU, being the fast processor, will be idle at the end Goal: Minimize(Sum(Idle time, GKT)) for the entire processing We show that: chunk size is proportional to the square root of the total processing time
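The square-root relationship can be illustrated with a deliberately simplified model; it is not the authors' actual cost model, whose details the slide does not give. Assume a fixed per-chunk overhead `o` (latency plus kernel invocation), `N` units of work, and a worst-case idle time equal to the time the slow processor needs for one chunk, `c / s`. The total overhead is then `o * N / c + c / s`; setting its derivative with respect to `c` to zero gives `c = sqrt(o * N * s)`, which grows with the square root of the total work. All names and numbers below are hypothetical.

```python
import math


def total_overhead(chunk, n_units, per_chunk_overhead, slow_rate):
    """Assumed overhead: per-chunk GPU costs plus end-of-run idle time."""
    gkt = per_chunk_overhead * (n_units / chunk)  # grows with the number of chunks
    idle = chunk / slow_rate                      # slow processor's last chunk
    return gkt + idle


def best_chunk_closed_form(n_units, per_chunk_overhead, slow_rate):
    """Minimizer of total_overhead, obtained by setting the derivative to zero."""
    return math.sqrt(per_chunk_overhead * n_units * slow_rate)


def best_chunk_search(n_units, per_chunk_overhead, slow_rate):
    """Brute-force check over all integer chunk sizes."""
    return min(range(1, n_units + 1),
               key=lambda c: total_overhead(c, n_units, per_chunk_overhead, slow_rate))


predicted = best_chunk_closed_form(10_000, 0.04, 100.0)
searched = best_chunk_search(10_000, 0.04, 100.0)
print(predicted, searched)
```

In this toy model, quadrupling the amount of work doubles the preferred chunk size, which is the qualitative behavior the slide states.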

  15. Optimized Dynamic Distribution Schemes

  16. Non-Uniform Chunk Size Scheme • Start with the initial chunk size indicated by the cost model • If a CPU requests work, a chunk of the initial size is forwarded • If a GPU requests work, a larger chunk is formed by merging smaller chunks • Minimizes GPU data transfer and device invocation overhead • At the end of processing, idle time is also minimized [Diagram: the initial data division (Chunk 1, Chunk 2, …, Chunk K) feeds the work distribution system's job scheduler, which forwards small chunks to CPU workers and merged large chunks to GPU workers]
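A minimal Python sketch of the non-uniform idea follows. The class name, the `"cpu"`/`"gpu"` strings, and in particular the fixed `gpu_merge_factor` are assumptions made for illustration; the slide does not say how many small chunks are merged for a GPU request.

```python
from collections import deque


class NonUniformScheduler:
    """Hands out small chunks to CPU workers and merged chunks to GPU workers."""

    def __init__(self, num_units, base_chunk, gpu_merge_factor):
        # Initial data division into equally sized (start, end) chunks.
        self.queue = deque(
            (s, min(s + base_chunk, num_units))
            for s in range(0, num_units, base_chunk)
        )
        self.gpu_merge_factor = gpu_merge_factor  # assumed, fixed merge count

    def request(self, worker_kind):
        """Return the next (start, end) range, or None when the work is done."""
        if not self.queue:
            return None
        take = self.gpu_merge_factor if worker_kind == "gpu" else 1
        start, end = self.queue.popleft()
        for _ in range(take - 1):
            if not self.queue:
                break
            _, end = self.queue.popleft()  # merge the adjacent chunk
        return (start, end)


sched = NonUniformScheduler(num_units=100, base_chunk=10, gpu_merge_factor=4)
print(sched.request("cpu"))  # (0, 10)
print(sched.request("gpu"))  # (10, 50)
print(sched.request("cpu"))  # (50, 60)
print(sched.request("gpu"))  # (60, 100)
print(sched.request("cpu"))  # None
```

Merging adjacent chunks means the GPU receives one contiguous range per request, so a single transfer and a single kernel launch cover it, while CPU workers keep the smaller granularity that limits idle time at the end.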

  17. Two-Level Hybrid Scheme • In the first level, data is dynamically distributed between CPU and GPU • Allows coarse-grained distribution of data • Reduces the number of global updates • In the second level, data is distributed statically and equally among CPU cores and among GPU cores • Reduces the number of subsets that share global elements (P^2 → P-1) • Reduces combination overhead [Diagram: dynamically assigned CPU chunk and GPU chunk; the CPU chunk is split among Thread 0, …, Thread k, …, Thread p-1]
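The second level of the hybrid scheme can be sketched in Python as a static, equal split of one device-level chunk among `p` threads. The helper names are hypothetical. Counting the boundaries between adjacent sub-ranges is one possible reading of the slide's "P-1" figure, offered here as an assumption: with contiguous row ranges, only neighboring threads share global elements.

```python
def split_equal(start, end, parts):
    """Statically split [start, end) into `parts` contiguous, near-equal ranges."""
    base, extra = divmod(end - start, parts)
    ranges = []
    lo = start
    for k in range(parts):
        hi = lo + base + (1 if k < extra else 0)  # spread the remainder evenly
        ranges.append((lo, hi))
        lo = hi
    return ranges


def shared_boundaries(ranges):
    """Count adjacent pairs of non-empty ranges that meet at a common row."""
    return sum(
        1
        for a, b in zip(ranges, ranges[1:])
        if a[1] == b[0] and a[0] < a[1] and b[0] < b[1]
    )


# A coarse chunk of 10 rows, assigned dynamically to the CPU at the first
# level, is split statically among 4 threads at the second level.
cpu_ranges = split_equal(0, 10, 4)
print(cpu_ranges)                     # [(0, 3), (3, 6), (6, 8), (8, 10)]
print(shared_boundaries(cpu_ranges))  # 3
```

The first level would hand out such coarse chunks on demand, for instance through a global queue, so only the device-level split adapts to relative CPU and GPU speed.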

  18. Experimental Results

  19. Experimental Setup Applications for Generalized Reduction Structure • K-Means Clustering [6.4 GB] • Principal Component Analysis (PCA) [8.5 GB] Applications for Structured Grid Computations • 2-D Jacobi Kernel [1 GB] • Sobel Filter [1 GB] • Environment 1 (CPU-centric) • AMD Opteron • 8 CPU cores • 16 GB main memory • Nvidia GeForce 9800 GTX • 512 MB device memory • Environment 2 (GPU-centric) • AMD Opteron • 4 CPU cores • 8 GB main memory • Nvidia Tesla C1060 • 4 GB device memory • September 16, 2014

  20. Experimental Goals • Validate the accuracy of the cost model for choosing chunk size • Evaluate the performance gain from using optimized work distribution schemes (using CPU+GPU simultaneously) • Study the overheads of dynamic distribution compared to the best static distribution

  21. Accuracy of the Cost Model for Choosing Chunk Size • Each application achieves its best performance at a different chunk size • Poor chunk size selection can impact performance significantly • The “Predicted” chunk size is always close to the chunk size with the best performance [Figure: chart with the “Predicted” chunk size marked; surviving labels: 15, 52, 78, 140]

  22. Performance Gains from Using CPU & GPU • For K-means and PCA, the CPU+GPU version uses NUCS • For Jacobi and Sobel, the CPU+GPU version uses the 2-Level Hybrid Scheme • In both ENV1 and ENV2, performance improvements ranging from 37% to 75% can be achieved • Shows that our dynamic scheduling framework can adapt to different hardware configurations of CPU and GPU [Figure: results for ENV 1 and ENV 2; labeled gains: 37%, 75%]

  23. Scheduling Overheads of Dynamic Distribution Schemes • We compare dynamic schemes with static schemes • “Naïve” static – distributes work equally between CPU and GPU • “M-OPT” static – obtained from exhaustive search for every problem size, application, and hardware configuration • Dynamic schemes: at most 7% slower than the “M-OPT” static • Significantly better than “Naïve” [Figure: panels for K-Means and Sobel Filter; labeled overhead: 7%]

  24. Conclusions • We present a dynamic scheduling framework for data parallel loops on heterogeneous systems • Analyze architectural and communication pattern tradeoffs to infer critical constraints for dynamic scheduling • A cost model for choosing the optimal chunk size in a heterogeneous setup • Developed two instances of optimized work distribution schemes • NUCS & 2-Level Hybrid scheme • Our evaluation includes 4 applications representing 2 distinct communication patterns • We show up to 75% improvement from using CPU & GPU simultaneously

  25. Thank You! Questions? Contacts: Vignesh Ravi - Gagan Agrawal -