
Presentation Transcript


  1. GPU DECLARATIVE FRAMEWORK: DEFG Dissertation Defense Robert Senser October 29, 2014 PhD Committee: Gita Alaghband (chair), Tom Altman (advisor), Michael Mannino, Boris Stilman, Tam Vu

  2. Presentation Outline • Motivation for Work • Background: Graphical Processing Units (GPUs) and OpenCL • GPU DECLARATIVE FRAMEWORK: DEFG • Diverse GPU Applications using DEFG • Image Filters (Sobel and Median) • Breadth-First Search • Sorting Roughly Sorted Data • Iterative Matrix Inversion • Dissertation Accomplishments • Future Research and Observations

  3. Motivation for Work • GPUs can provide high throughput • Radeon HD 7990: 2 (double-precision) TFLOPS • Developing parallel HPC software is difficult • Parallel development for GPUs is even more difficult • GPU HPC software development requires: • Understanding of unique GPU hardware characteristics • Use of specialized algorithms • Use of GPU-specific, low-level APIs • OpenCL • CUDA • Driving notion: Let software minimize the complexity and difficulty.

  4. Background: GPUs and OpenCL • Graphical Processing Unit (GPU) • Highly specialized coprocessor • Hundreds of cores • Thousands of hardware-managed threads • SIMT: Single Instruction, Multiple Thread • Variant of the common Single Instruction, Multiple Data (SIMD) model • Threads not on the execution path pause • Code executed in a “kernel” • Common GPU programming environments • OpenCL, which is Open Source • CUDA, which is NVIDIA proprietary • DEFG is designed for OpenCL

  5. High-Level GPU Architecture (diagram: CPU with RAM and virtual memory, connected by the PCIe bus to a GPU with local cache, optional cache, and global RAM) • GPU Characteristics: • Processors often connected by a Peripheral Component Interconnect Express (PCIe) bus • GPU has its own fast global RAM • Threads have a small amount of fast local memory • May or may not have a cache • Many hardware-controlled threads • Lacks CPU-style predictive branching, etc.

  6. OpenCL Overview • Specification provided by Khronos Group • Open Source, multi-vendor • Hardware device support • GPUs • CPUs • Digital signal processors (DSPs) • Field-programmable gate arrays (FPGAs) • Device kernel normally written in C • Each thread shares a common kernel • CPU-side code • C/C++ • Very low-level, detailed CPU-side application programming interface (API) • Third-party bindings for Java, Python, etc.
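For reference, a minimal OpenCL kernel sketch is shown below; the kernel and parameter names are invented for illustration and are not part of DEFG or this dissertation's code. Every GPU thread runs the same function body and uses get_global_id() to pick its own data element.

    // Illustrative OpenCL kernel: each thread scales one element of a buffer.
    // Every thread executes this same body; get_global_id(0) identifies the thread.
    __kernel void scale_buffer(__global float *data, const float factor, const int n)
    {
        int gid = get_global_id(0);      // this thread's global index
        if (gid < n)                     // guard against extra threads in the launch
            data[gid] = data[gid] * factor;
    }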

  7. GPU Applications • Three components • Application algorithms • CPU-side code • Moves kernel code to GPU • Manages GPU execution and errors • Moves application data between CPU and GPU • May contain a portion of application algorithms • GPU kernel code • Can have multiple kernels per application • Each kernel usually contains an algorithm or algorithm step • Kernel code often uses GPU-specific techniques • This work concentrates on the CPU-side code

  8. GPU Performance • Major Issues in GPU Performance • Kernel Instruction Path Divergence • Occurs with conditional instructions (ifs, loops, etc.) • Causes some threads to pause • Needs to be minimized, if not totally avoided • High Memory Latency • Each RAM access can consume the time of 200-500 instructions • Accesses to global RAM should be coalesced • “Bank conflicts” can occur with local thread memory • Rob Farber GPU suggestions [1]: • “Get the data on the GPU and leave it” • “Give the GPU enough work to do” • “Focus on data reuse to avoid memory limits” • Existing HPC code usually must be re-factored for GPU use
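A hypothetical kernel fragment illustrating both issues; the buffer names are made up for this sketch and do not come from the dissertation.

    // Divergence: when threads in the same warp/wavefront take different branches,
    // the paths are serialized and the threads not on the current path pause.
    __kernel void divergence_example(__global const float *in, __global float *out)
    {
        int gid = get_global_id(0);
        if (gid % 2 == 0)                 // even and odd threads diverge here
            out[gid] = in[gid] * 2.0f;    // half the threads idle during this path
        else
            out[gid] = in[gid] + 1.0f;    // ...and the other half idles during this one

        // Coalescing: adjacent threads reading adjacent addresses (in[gid]) combine
        // into one memory transaction; a strided pattern such as in[gid * 32] would
        // instead issue many separate, high-latency global-memory accesses.
    }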

  9. DEFG Overview • GPU software development tool for OpenCL • Generates the CPU-side of GPU Applications • Uses a Domain Specific Language (DSL) • Developer writes CPU code in DEFG’s DSL • DEFG generates the corresponding CPU C/C++ program • Relative to hand-written CPU code • Faster by using a declarative approach • Simpler by using design patterns and abstraction • Developer provides standard OpenCL GPU kernels

  10. DEFG Translator Architecture [2,3] • DEFG generates the C/C++ code for the CPU • DEFG Translator components: • DEFG Source Input • ANTLR-based Parser • XML-based Tree • Optimizer (Java) • Code Generator (C++) • Template Driven • C/C++ Output

  11. DEFG Benefits and Features • Implement OpenCL applications with less effort • Requires many fewer lines of code to be written • Encourages the developer to focus on the kernels • How is this done? • With a Domain-Specific Language (DSL) • Data characteristics are declared • Uses one or more pre-defined DEFG design patterns • Many details managed inside DEFG • Technical Features • Abstracts the OpenCL APIs and their details • Automatic optimization of buffer transfers • Handles error detection • Supports multiple GPU cards • Supports anytime algorithms

  12. DEFG Code Sample
01. declare application sobel
02. declare integer Xdim (0)
03. declare integer Ydim (0)
04. declare integer BUF_SIZE (0)
05. declare gpu gpuone ( any )
06. declare kernel sobel_filter SobelFilter_Kernels ( [[ 2D,Xdim,Ydim ]] )
07. declare integer buffer image1 ( BUF_SIZE )
08.         integer buffer image2 ( BUF_SIZE )
09. call init_input (image1(in) Xdim(out) Ydim(out) BUF_SIZE(out))
10. execute run1 sobel_filter ( image1(in) image2(out) )
11. call disp_output (image2(in) Xdim(in) Ydim(in) )
12. end

Excerpt of the generated C/C++ CPU-side code:
…
status = clSetKernelArg(sobel_filter, 1, sizeof(cl_mem), (void *)&buffer_image2);
if (status != CL_SUCCESS) { handle error }
// *** execution
size_t global_work_size[2];
global_work_size[0] = Xdim;
global_work_size[1] = Ydim;
status = clEnqueueNDRangeKernel(commandQueue, sobel_filter, 2, NULL, global_work_size, NULL, 0, NULL, NULL);
if (status != CL_SUCCESS) { handle error }
// *** result buffers
status = clEnqueueReadBuffer(commandQueue, buffer_image2, CL_TRUE, 0, BUF_SIZE * sizeof(int), image2, 0, NULL, NULL);
…

  13. DEFG Design Patterns • Invocation Patterns (Control Flow) • Sequential Flow • Single-Kernel Repeat Sequence • Multiple-Kernel • Concurrent-GPU Patterns (Multiple-GPU Support) • Multiple-Execution • Divide-Process-Merge • Overlapped-Split-Process-Concatenate • Prefix-Allocation (Buffer Allocation, without Locking) • Dynamic-Swap (Buffer Swapping) • Code-Morsel (C/C++ Code Insertion) • Anytime algorithm (Control Flow Change on External Event) • Design patterns can be combined • Example: Multiple-Kernel + Divide-Process-Merge + Code-Morsel

  14. DEFG Implementation • Lines of Code • ANTLR-based parser: 580 lines • Optimizer: 659 lines of Java • Code Generator: 1,513 lines of C++ • Templates and includes: 1,572 lines of C++ • Number of Diagnostic Programs: 15+ • Testing investment: Man Months • Faults tended to be in the C/C++ code generation • Most faults were in multi-GPU buffer management

  15. Diverse New DEFG Applications • Constructed four very diverse GPU applications • Image Processing: Sobel and Median Image Filters • Showcase for multiple GPU support • Graph Theoretic: Breadth-First Search (BFS), Large Graphs • Novel use of prefix sum to avoid GPU low-level locking • BFS processing with multiple GPUs • Sorting: Sorting Roughly Sorted Data • Implementation of novel sorting approach • Use of parallel prefix calculations in sorting optimization • Also shows multiple GPU support • Numerical: Iterative Matrix Inversion, M. Altman’s Method • Demonstrates anytime algorithms and the use of OpenCL clMath BLAS (Basic Linear Algebra Subprograms) • When a quality measure is met, the anytime algorithm stops the process • These four applications demonstrate DEFG’s applicability

  16. Filter Application: Sobel Image Filter • Sobel operator detects edges in images • Pixel gradient calculated from 3x3 mask • Uses a single GPU kernel, invoked once • A base-line test application for multiple GPUs • Example of DEFG Sobel operator processing
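A simplified sketch of a 3x3 Sobel kernel, assuming a single-channel, row-major image; the names are invented for illustration and this is not the SobelFilter_Kernels code used by the DEFG application.

    // Illustrative 3x3 Sobel operator on a grayscale image (row-major, width x height).
    __kernel void sobel3x3(__global const uchar *src, __global uchar *dst,
                           const int width, const int height)
    {
        int x = get_global_id(0);
        int y = get_global_id(1);
        if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1)
            return;                            // skip the border pixels

        // Sample the 3x3 neighborhood around (x, y).
        int idx = y * width + x;
        int p00 = src[idx - width - 1], p01 = src[idx - width], p02 = src[idx - width + 1];
        int p10 = src[idx - 1],                                 p12 = src[idx + 1];
        int p20 = src[idx + width - 1], p21 = src[idx + width], p22 = src[idx + width + 1];

        // Horizontal and vertical gradient estimates from the Sobel masks.
        int gx = (p02 + 2 * p12 + p22) - (p00 + 2 * p10 + p20);
        int gy = (p20 + 2 * p21 + p22) - (p00 + 2 * p01 + p02);

        int mag = (int)sqrt((float)(gx * gx + gy * gy));   // gradient magnitude
        dst[idx] = (uchar)min(mag, 255);                   // clamp to the 8-bit range
    }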

  17. Filter Application: Median Filter • Median filter removes “noise” from images • Median determined for 3x3 or 5x5 mask • Also uses a single GPU kernel, invoked once • 2nd base-line test application for multiple GPUs • Example of DEFG median 5x5 filter processing:
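A comparable sketch of a 3x3 median kernel, again assuming a single-channel, row-major image and invented names; the dissertation's own 3x3 and 5x5 kernels are not reproduced here.

    // Illustrative 3x3 median filter: gather the neighborhood, sort it, take the middle value.
    __kernel void median3x3(__global const uchar *src, __global uchar *dst,
                            const int width, const int height)
    {
        int x = get_global_id(0);
        int y = get_global_id(1);
        if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1)
            return;                                   // skip border pixels

        uchar v[9];
        int k = 0;
        for (int dy = -1; dy <= 1; dy++)              // collect the 3x3 neighborhood
            for (int dx = -1; dx <= 1; dx++)
                v[k++] = src[(y + dy) * width + (x + dx)];

        for (int i = 1; i < 9; i++) {                 // tiny insertion sort, 9 elements
            uchar key = v[i];
            int j = i - 1;
            while (j >= 0 && v[j] > key) { v[j + 1] = v[j]; j--; }
            v[j + 1] = key;
        }
        dst[y * width + x] = v[4];                    // median is the middle element
    }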

  18. Application: Breadth First Search (BFS) • Well-studied graph-theoretic problem • Focus: BFS with Very Large Irregular (VLI) Graphs • Social Networking, Routing, Citations, etc. • Many published GPU BFS approaches, starting with Harish [4] • Harish used “Dijkstra” BFS • Vertex frontier as a Boolean array • 1 = vertex on frontier • A GPU thread assigned to each vertex • Can result in poor thread utilization
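A minimal sketch of one Harish-style BFS step, with assumed CSR-like buffer names; it is not one of the kernels used in this work, but it shows why a thread per vertex yields poor utilization when the frontier is small.

    // Illustrative BFS step: one thread per vertex; only threads whose vertex is on
    // the frontier expand edges, the rest exit immediately.  The host swaps
    // frontier and next_frontier between steps.
    __kernel void bfs_step(__global const int *row_offsets,   // vertex -> edge range (CSR)
                           __global const int *edge_dests,    // destination of each edge
                           __global int *frontier,            // 1 = vertex on current frontier
                           __global int *next_frontier,
                           __global int *visited,
                           __global int *cost,
                           const int node_count)
    {
        int v = get_global_id(0);
        if (v >= node_count || !frontier[v])
            return;                                   // most threads do no work

        frontier[v] = 0;
        for (int e = row_offsets[v]; e < row_offsets[v + 1]; e++) {
            int w = edge_dests[e];
            if (!visited[w]) {                        // label unvisited neighbors
                visited[w] = 1;
                cost[w] = cost[v] + 1;
                next_frontier[w] = 1;                 // part of the next frontier
            }
        }
    }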

  19. BFS Vertex Frontier • Merrill approach to vertex buffer management [5] • A single buffer updated by multiple threads • Uses prefix sum to allocate cells • Generalized this buffer management in DEFG • Provided as a set of kernel functions • Useful for shared buffers with multiple GPU cards
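A minimal serial sketch of the prefix-allocation idea, with invented names; on the GPU the scan itself is done in parallel, but the allocation scheme is the same.

    /* Prefix-allocation pattern (shown serially in plain C): each worker first reports
       how many output cells it needs; an exclusive prefix sum over those counts gives
       every worker a private, non-overlapping starting offset, so the shared buffer
       can be filled without any locking. */
    void prefix_allocate(const int *counts, int *offsets, int n_workers)
    {
        int running = 0;
        for (int i = 0; i < n_workers; i++) {   // exclusive prefix sum of the counts
            offsets[i] = running;               // worker i writes starting at offsets[i]
            running += counts[i];               // total cells consumed so far
        }
        /* Worker i then writes its counts[i] results into
           buffer[offsets[i] .. offsets[i] + counts[i] - 1]. */
    }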

  20. Application: Sorting Roughly Sorted Data • Goal: Improve on the O(n log n) sorting bound when the sequence is partially sorted • Based on the prior sorting work by T. Altman, et al. [6] • k is a measure of “sortedness” • A sequence is k-sorted if no element is more than k positions out of sequence • Knowing k allows for sorts of O(n log k) • If k is small then we obtain a substantial performance gain • The k-sorted trait can be exploited on the GPU • Prefix sum used in calculating k • Parallel sorts of sub-sequences

  21. Parallel Roughly Sorting Algorithm • Step LR: Left-to-right scan, compute running max • Step RL: Right-to-left scan, compute running min • Step DM: • Uses LR-max array and RL-min array as inputs • Computes each element's distance measure • Step UB: Finds distance measure upper bound • Step Sorting: Using the distance measure • Perform sort pass one • Perform sort pass two Notion: Convert the large sort problem into many smaller, parallel sort operations.
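One plausible serial reading of the LR/RL/DM/UB steps, written in plain C for clarity; the exact distance-measure definition used in the dissertation may differ, and the GPU version replaces these loops with parallel prefix scans.

    /* Serial sketch: compute an upper bound k such that the sequence is k-sorted,
       using the LR (running max) and RL (running min) arrays named on this slide.
       Illustrative only, not the dissertation's kernels. */
    int roughly_sorted_k(const int *a, int n, int *LR, int *RL)
    {
        LR[0] = a[0];                                  /* Step LR: left-to-right running max */
        for (int i = 1; i < n; i++)
            LR[i] = (a[i] > LR[i - 1]) ? a[i] : LR[i - 1];

        RL[n - 1] = a[n - 1];                          /* Step RL: right-to-left running min */
        for (int i = n - 2; i >= 0; i--)
            RL[i] = (a[i] < RL[i + 1]) ? a[i] : RL[i + 1];

        int k = 0;                                     /* Steps DM + UB: per-element distance */
        for (int j = 0; j < n; j++) {                  /* measure, then its maximum           */
            for (int i = 0; i < j; i++) {
                if (LR[i] > RL[j]) {                   /* an earlier element must move past j */
                    if (j - i > k) k = j - i;
                    break;
                }
            }
        }
        return k;   /* upper bound: the sequence is k-sorted */
    }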

  22. Iterative Matrix Inversion (IMI) • DEFG iterative matrix inversion application using M. Altman’s method [7] • Use the Anytime-algorithm approach to manage the iterative inversion • Inversion is stoppable at “anytime” • Can balance run time against accuracy • Anytime management in DEFG, not the application • Requires GPU matrix operations • Use OpenCL clMath (APPML) • clMath integration into DEFG

  23. M. Altman IMI Approach The initial inverse approximation, R0, can be formed by: R0 = αI, where α = 1 / ||A||, ||A|| being the Euclidean norm of A and I the identity matrix. To invert matrix A, each iteration calculates: Rn+1 = Rn(3I - 3ARn + (ARn)²), with the result in Rn+1. • Better R0 estimate provides for quicker convergence • Method is self correcting • DEFG Anytime facility stops the iterations • When inversion quality measure is met • When max iterations have occurred • When max run time has occurred
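A serial C sketch of one iteration of the update above, using a naive matrix multiply for clarity; the IMIFLX application instead performs these products on the GPU through clMath BLAS calls, and the function names here are invented for the sketch.

    /* One iteration of R_{n+1} = R_n (3I - 3 A R_n + (A R_n)^2), written serially. */
    static void matmul(int n, const double *X, const double *Y, double *Z)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double s = 0.0;
                for (int k = 0; k < n; k++) s += X[i * n + k] * Y[k * n + j];
                Z[i * n + j] = s;
            }
    }

    void imi_iteration(int n, const double *A, const double *R, double *R_next,
                       double *T, double *S, double *W)   /* T, S, W: n*n scratch buffers */
    {
        matmul(n, A, R, T);                                /* T = A * R_n                  */
        matmul(n, T, T, W);                                /* W = (A * R_n)^2              */
        for (int i = 0; i < n * n; i++)                    /* S = 3I - 3T + W              */
            S[i] = -3.0 * T[i] + W[i];
        for (int i = 0; i < n; i++)
            S[i * n + i] += 3.0;
        matmul(n, R, S, R_next);                           /* R_{n+1} = R_n * S            */
    }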

  24. Accomplishments: DEFG Framework • Fully Implemented • Consists of approximately 5,000 code lines • 7 different applications and 15+ diagnostic programs • Complete User’s Guide • Packaged for general use • Design Patterns • 10+ Patterns • Patterns range from simple to complex • Delineation of DEFG Limits • Explanation for success of DSL and design patterns

  25. DEFG’s Performance • DEFG Papers • Conference: Parallel and Distributed Processing Techniques and Applications (PDPTA’13) [2] • Conference: Parallel and Distributed Processing Techniques and Applications (PDPTA’14) [3] • Analysis • Three existing OpenCL applications converted to DEFG • CPU-side re-coded in DEFG and used existing GPU kernels • Three applications: • Breadth-First Search (BFS) • All-Pairs Shortest Path (APSP/FW) • Sobel Image Filter (SOBEL) • Output results carefully verified • Comparisons between “reference” and DEFG • Lines-of-Code Comparison • Run-time Performance Comparison

  26. Lines-of-Code Comparison On average, the DEFG code is 5.6 percent of the reference code.

  27. Run-Time Performance Comparison • Shown are original run times (average of 10 runs) • Later, made manual changes to understand timing differences • FW reference version was slow due to a vendor coding error • The CPU-based BFS-4096 was likely fast due to the CPU's cache • Summary: DEFG provided equal, or better, performance

  28. New Applications • Application Implementations • Filtering, BFS, Roughly Sorting, Iterative Inversion • Implementation Goals • Show general applicability of DEFG • Multiple-GPU: Filtering, BFS, and R. Sorting • Novel Algorithm: R. Sorting and Iterative Inversion • Proof of Concept: Iterative Inversion (BLAS usage) • Application Performance results • Run-time Performance • Single-GPU and multiple-GPU configurations • Problem-size characteristics • Vast majority of tests run on the Hydra server

  29. Image Filtering Results • Image Applications Implementation • Both Sobel operator and median filter: • Overlapped-split-process-concatenate design pattern • Single and Multiple-GPU versions • Analysis with large images and multiple GPUs • Image neighborhoods • Sobel operator: 3x3 grid • Median filter: • 3x3 grid: less computationally intense • 5x5 grid: more computationally intense

  30. Sobel Operator Application Results • Single-GPU refactored for multiple-GPU use • Used existing OpenCL kernel • Three simple DEFG code changes needed • New version used two GPUs • 50% of the image plus a small overlap given to each • Produced same resultant image • Run-time performance was not impressive • Not sufficiently computationally intense • OpenCL transfer times went up • Kernel execution times stayed the same

  31. Median Filter Application Results • CPU-side DEFG code very similar to Sobel • Developed two new OpenCL kernels • 3x3 grid kernel • 5x5 grid kernel • Performance with 3x3 grid similar to Sobel • Performance with multiple-GPU, 5x5 median • Run-time improvement with all test images • With the large image: 1.062 seconds (1 GPU) down to 0.794 seconds (2 GPUs) • Speed-up: 1.34 with 2 GPUs, with the 7k x 7k image • Also, 2-GPU median filter handled larger images (22k x 22k) • Performance Analysis with 2 GPUs • Kernel execution run times dropped • CPU to GPU OpenCL transfer times increased • pthreads experiment showed need for O.S. threads

  32. Breadth-First Search Results • Breadth-First Search (BFS) Summary • DEFG generalization of the Merrill approach • Prefix-scan based buffer allocation: • “Virtual pointers” to nodes between GPUs • Used Harish sparse data structure approach • Analysis of BFS application • Characteristics • Capabilities • Run-time performance

  33. Multiple-GPU BFS Implementation • BFSDP2GPU DEFG Application • Re-factoring of previously-ported DEFG BFS application • DEFG implementation of complex OpenCL application • Management of shared buffers with prefix sum • Run-time communications between GPUs • Tested application against VLI graphs • Test graphs from SNAP and DIMACS repositories • Stanford Network Analysis Package (SNAP) [8] • Center for Discrete Mathematics and Theoretical Computer Science (DIMACS) [9] • Very large graph datasets: millions of vertices and edges

  34. BFSDP2GPU Results • Analysis of BFSDP2GPU application • Characteristics • Kernel count went from 2 kernels to 6 • Used 2 GPUs • Capabilities and Run-time performance • Single-card versus multi-card performance • Performance relative to existing BFS application • Application results with VLI graphs • Processed large graphs (4.8M nodes, 69M edges) • However, unimpressive run-time performance • Run times increased by factors of 6 to 17 • Issue: OpenCL’s lack of GPU-to-GPU communications • Lesser issue: Mixing of sparse and dense data structures

  35. Roughly Sorting • RSORT application implementation • Divide-process-merge pattern utilized • Implementation contains five kernels: LRmax, RLmin, DM, UB, and comb_sort • Sort selected for OpenCL kernel: comb sort • sort-in-place design • non-recursive • similar to bubble sort but much faster • elements a gap apart are compared
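A serial sketch of comb sort showing the gap-based comparisons; the RSORT comb_sort kernel applies the same idea to its sub-sequences on the GPU, and this plain-C version is only illustrative.

    /* Serial comb sort: like bubble sort, but elements a "gap" apart are compared and
       the gap shrinks each pass (factor of roughly 1.3).  In-place and non-recursive,
       which is why it suits a GPU kernel. */
    void comb_sort(int *a, int n)
    {
        int gap = n;
        int swapped = 1;
        while (gap > 1 || swapped) {
            gap = (gap * 10) / 13;                 /* shrink the gap */
            if (gap < 1) gap = 1;
            swapped = 0;
            for (int i = 0; i + gap < n; i++) {
                if (a[i] > a[i + gap]) {           /* compare elements "gap" apart */
                    int t = a[i]; a[i] = a[i + gap]; a[i + gap] = t;
                    swapped = 1;
                }
            }
        }
    }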

  36. RSORT Results • Run-Time Comparisons using large datasets • Generated with a set k value and dataset size • Datasets either fully or singly perturbed • Example with a k value of 4, 16 perturbed items: 5 4 3 2 1 10 9 8 7 6 15 14 13 12 11 16 • Performance analysis over 3 configurations • QSORT on the CPU, used as the baseline • RSORT with 1 GPU • RSORT with 2 GPUs

  37. RSORT Results Summary • Application implemented in 1-GPU and 2-GPU forms • RSORT run times generally faster when k is small • At k = 1000, the 2-GPU to 1-GPU speed-up is 1.73 • 2-GPU RSORT handles larger datasets than 1-GPU

  38. Iterative Matrix Inversion • IMIFLX application implementation • Used the DEFG blas statement to access clMath functions • Multiple-kernel-loop pattern used • Multiple blas statements per iteration • Blend of blas statements and kernels • Anytime used to end iterations at a time limit • Analysis of application • Inversion accuracy • Range of matrices: size and type • Used data from the University of Florida Sparse Matrix Collection [10]

  39. Application Sample Result • IMIFLX uses 13 iterations for this 500 x 500 matrix • Norm value: ||(A*Rn) - I|| • Graph shows convergence to a solution • Run time was 0.259 seconds

  40. Iterative Matrix Inversion Results • Required BLAS support in DEFG • Anytime support: traded accuracy for less run time • Hydra’s NVIDIA T20 GPU • Available RAM: 2,678 MB • Limits a double-precision matrix to just over 8,000 x 8,000

  41. Dissertation Accomplishments • Designed, Implemented, and Tested DEFG • Produced DEFG “Enabling” Design Patterns • Compared DEFG Applications to Hand-Written • DEFG applications required less code • DEFG applications produced equal run times • Applied DEFG to Diverse GPU Applications • Four diverse applications fully implemented • Impressive application run-time results • with the exception of BFS, due to an OpenCL limit

  42. Future Research • Additional DEFG Design Patterns • Multiple-GPU Load Balancing • Resource sharing • Suggest DEFG Support for NVIDIA’s CUDA • Suggest a re-factored DEFG • Internal DSL • More-standard programming environment • Enable support for more environments • Not optimistic about a declarative approach for the GPU side • Potential for other technical improvements

  43. DEFG’s Success • DEFG is a DSL focused on HPC with GPUs • Note the Farber suggestions for GPU performance [1]: • “Get the data on the GPU and leave it” • “Give the GPU enough work to do” • “Focus on data reuse to avoid memory limits” • The CPU becomes the orchestrator • DEFG provides the CPU code to orchestrate • Declarations to describe the data • Design patterns to describe the orchestration • Optimization to minimize data transfers

  44. References

  45. Additional Slides

  46. Raw Performance Numbers for Three Applications, in Milliseconds

  47. Sample DEFG Code Showing a Sequence
01. declare application floydwarshall
02. declare integer NODE_CNT (0)
03. declare integer BUF_SIZE (0)
04. declare gpu gpuone ( any )
05. declare kernel floydWarshallPass FloydWarshall_Kernels ( [[ 2D,NODE_CNT ]] )
06. declare integer buffer buffer1 ( BUF_SIZE )
07.         integer buffer buffer2 ( BUF_SIZE )
08. call init_input (buffer1(in) buffer2(in) NODE_CNT(out) BUF_SIZE(out))
09. sequence NODE_CNT times
10. execute run1 floydWarshallPass ( buffer1(inout) buffer2(out) NODE_CNT(in) DEFG_CNT(in) )
11. call disp_output (buffer1(in) buffer2(in) NODE_CNT(in))
12. end

  48. Sample DEFG Code Showing a Loop-While
declare application bfs
declare integer NODE_CNT (0)
declare integer EDGE_CNT (0)
declare integer STOP (0)
declare gpu gpuone ( any )
declare kernel kernel1 bfs_kernel ( [[ 1D,NODE_CNT ]] )
        kernel kernel2 bfs_kernel ( [[ 1D,NODE_CNT ]] )
declare struct (4) buffer graph_nodes ( NODE_CNT )
        integer buffer graph_edges ( EDGE_CNT )
        integer buffer graph_mask ( NODE_CNT )
        integer buffer updating_graph_mask ( $NODE_CNT )
        integer buffer graph_visited ( NODE_CNT )
        integer buffer cost ( NODE_CNT )
// note: init_input handles setting "source" node
call init_input (graph_nodes(out) graph_edges(out) graph_mask(out) updating_graph_mask(out) graph_visited(out) cost(out) NODE_CNT(out) EDGE_CNT(out))
loop
  execute part1 kernel1 ( graph_nodes(in) graph_edges(in) graph_mask(in) updating_graph_mask(out) graph_visited(in) cost(inout) $NODE_CNT(in) )
  // set STOP to zero each time thru...
  set STOP (0)
  // note: STOP value is returned...
  execute part2 kernel2 ( graph_mask(inout) updating_graph_mask(inout) graph_visited(inout) STOP(inout) NODE_CNT(in) )
while STOP eq 1
call disp_output (cost(in) NODE_CNT(in))
end

  49. RSORT Data

  50. IMIFLX Data
