R-Stream High-Level Transformation Tool: State-of-the-Art and Objectives

Government Purpose Rights Purchase Order Number: N/A Agreement No.: HR001‐10‐3‐0007 Contractor Name: Intel Corporation Contractor Address: 2111 NE 25th Ave M/S JF2‐60, Hillsboro, OR 97124 Expiration Date: None The Government’s rights to use, modify, reproduce, release, perform, display, or disclose this technical data are restricted by paragraphs B (1),(3) and (6) of Article VIII as incorporated within the above purchase order and Agreement. No restrictions apply after the expiration date shown above. Any reproduction of the software or portions thereof marked with this legend must also reproduce the markings. The following entities, their respective successors and assigns, shall possess the right to exercise said property rights, as if they were the Government, on behalf of the Government: University of Delaware – www.udel.edu; ET International – www.etinternational.com; Intel Corporation – www.intel.com; Reservoir Labs – www.reservoir.com; University of California – San Diego – www.ucsd.edu; University of Illinois at Urbana-Champaign – www.illinois.edu.

The R-Stream High-Level Transformation Tool: State of the Art and Objectives Within the UHPC Program N. Vasilache, R. Lethin

Outline • R-Stream Overview • UHPC Goals • Some Performance Results

Power Efficiency Driving Architectures • NUMA • Heterogeneous Processing • Distributed Local Memories • Hierarchical (including board, chassis, cabinet) • Explicitly Managed Architecture • Multiple Execution Models • Bandwidth Starved • Mixed Parallelism Types • Multiple Spatial Dimensions [Diagram: arrays of SIMD, FPGA, DMA, Memory, and GPP blocks]

Computation Choreography • Expressing it in the program: • Annotations and pragma dialects for C • Chapel subset (UHPC in progress with UIUC) • CnC subset (UHPC in progress with Intel) • Generating it: • Explicitly (e.g., new languages like CUDA, target-specific) • Implicitly (UHPC in progress: libraries, runtime abstractions, CnC) • But before expressing it, how can programmers find it? • Manual constructive procedures, art, sweat, time • Artisans get complete control over every detail • Fully automatic • Operations research problems and (advanced) autotuning • Faster, sometimes better, than a human Not our focus

Program Transformations Specification • Schedule maps iterations to multi-dimensional time: • A feasible schedule preserves dependences • Placement maps iterations to multi-dimensional space: • UHPC in progress, partially done • Layout maps data elements to multi-dimensional space: • UHPC in progress • Hierarchical by design, tiling serves separation of concerns [Figure: iteration space of a statement S(i,j), drawn with axes i, j and schedule time dimensions t1, t2]
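To make "a feasible schedule preserves dependences" concrete, here is a minimal C sketch. It is an illustration only, not R-Stream output: the kernel, the size N, and all function names are assumptions chosen for this example. Statement S(i,j) depends on S(i-1,j) and S(i,j-1). The original schedule runs S(i,j) at time (i, j); the skewed schedule runs it at time (t1, t2) = (i + j, i), so both dependences advance t1 by one and every iteration sharing a t1 value can run in parallel.

```c
#include <assert.h>
#include <string.h>

#define N 8  /* illustrative problem size (assumption of this sketch) */

/* Boundary row and column are 1, interior is 0. */
static void init(double a[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = (i == 0 || j == 0) ? 1.0 : 0.0;
}

/* Original schedule: S(i,j) executes at time (t1, t2) = (i, j). */
static void run_original(double a[N][N]) {
    for (int i = 1; i < N; i++)
        for (int j = 1; j < N; j++)
            a[i][j] = a[i - 1][j] + a[i][j - 1];   /* S(i,j) */
}

/* Skewed schedule: S(i,j) executes at time (t1, t2) = (i + j, i).
 * Both sources, S(i-1,j) and S(i,j-1), have time t1 - 1, so the
 * dependences are preserved and each wavefront t1 is a parallel loop. */
static void run_wavefront(double a[N][N]) {
    for (int t1 = 2; t1 <= 2 * (N - 1); t1++) {
        int lo = (t1 - (N - 1) > 1) ? t1 - (N - 1) : 1;
        int hi = (t1 - 1 < N - 1) ? t1 - 1 : N - 1;
        assert(lo <= hi);                 /* every wavefront is non-empty */
        for (int i = lo; i <= hi; i++) {  /* candidates for a doall loop */
            int j = t1 - i;
            a[i][j] = a[i - 1][j] + a[i][j - 1];
        }
    }
}

/* Returns 1 when both schedules produce identical arrays. */
int schedules_agree(void) {
    double a[N][N], b[N][N];
    init(a);
    init(b);
    run_original(a);
    run_wavefront(b);
    return memcmp(a, b, sizeof a) == 0;
}
```

Calling `schedules_agree()` compares the two execution orders on the same input. A schedule such as (t1, t2) = (-i, j) would read values before they are written and would not pass this check.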

Polyhedral Slogans • Parametric imperfect loop nests • Subsumes classical transformations • Compacts the transformation search space • Parallelization, locality optimization (communication avoiding) • Preserves semantics • Analytic joint formulations of optimizations • Not just for affine static control programs

R-Stream Blueprint • EDG C Front End • CnC / Chapel Front End (UHPC in progress) • Scalar Representation • Raising • Extended Representation • Polyhedral Mapper • Machine Model • Lowering • Pretty Printer (CUDA, C+annotations, pthreads …) [Diagram; other labels: High-Level C, Low-Level C, CnC]

Mapping Process for Explicitly Managed Memories • Dependencies • 1- Scheduling: Parallelism, locality, tilability • 2- Task formation: Coarse-grain atomic tasks; Master/slave side operations • 3- Placement: Assign tasks to blocks/threads • Local / global data layout optimization • Multi-buffering (explicitly managed) • Synchronization (barriers) • Bulk communications • Thread generation -> master/slave • Target-specific optimizations
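The bulk-communication and multi-buffering steps listed above can be pictured with a small C sketch. This is a hand-written illustration under stated assumptions, not code generated by R-Stream: `TILE`, `bulk_get`, `bulk_put`, and `scale_tiled` are hypothetical names, and `memcpy` stands in for a DMA transfer. Sequential C gives no real overlap of transfer and compute; the sketch only shows the buffer alternation a mapper would have to set up.

```c
#include <assert.h>
#include <string.h>

#define TILE 4   /* hypothetical scratchpad capacity, in elements */

/* Stand-ins for DMA transfers between "global" and "local" memory. */
static void bulk_get(double *local, const double *global, int count) {
    memcpy(local, global, (size_t)count * sizeof *local);
}
static void bulk_put(double *global, const double *local, int count) {
    memcpy(global, local, (size_t)count * sizeof *local);
}

static int tile_count(int n, int base) {
    return (n - base < TILE) ? n - base : TILE;
}

/* Scale x[0..n) by alpha one tile at a time, alternating two local buffers. */
void scale_tiled(double *x, int n, double alpha) {
    double local[2][TILE];
    int ntiles = (n + TILE - 1) / TILE;
    assert(n >= 0);
    if (ntiles == 0)
        return;
    bulk_get(local[0], x, tile_count(n, 0));          /* prologue: tile 0 */
    for (int t = 0; t < ntiles; t++) {
        int cur = t & 1, nxt = cur ^ 1;
        int base = t * TILE;
        int count = tile_count(n, base);
        if (t + 1 < ntiles)                           /* fetch the next tile */
            bulk_get(local[nxt], x + base + TILE, tile_count(n, base + TILE));
        for (int k = 0; k < count; k++)               /* compute locally */
            local[cur][k] *= alpha;
        bulk_put(x + base, local[cur], count);        /* write the tile back */
    }
}

/* Self-check against a direct computation; returns 1 on agreement. */
int scale_tiled_matches_reference(int n) {
    double x[64], ref[64];
    assert(n >= 0 && n <= 64);
    for (int i = 0; i < n; i++) {
        x[i] = (double)i;
        ref[i] = 3.0 * (double)i;
    }
    scale_tiled(x, n, 3.0);
    for (int i = 0; i < n; i++)
        if (x[i] != ref[i])
            return 0;
    return 1;
}
```

Prefetching tile t+1 before writing back tile t is safe here because the tiles are disjoint. With overlapping reads, as in a stencil, the ordering of gets and puts would need more care.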

Model for Scheduling Trades 3 Objectives Jointly • Loop Fission: More Parallelism, Sufficient Occupancy • Loop Fusion: More Locality, Fewer Global Memory Accesses • Memory Contiguity (+ successive thread contiguity): Better Effective Bandwidth • Patent pending

Inside the R-Stream Mapper • Optimization modules engineered to expose advanced “knobs” used by auto-tuner • Extended GDG representation • Tactics Module: Tiling, Placement, Comm. Generation, Parallelization, Locality Optimization, Memory Promotion, Sync Generation, Layout Optimization, Polyhedral Scanning, … • Jolylib, …

Optimization Across BLAS Calls
/* Optimization with BLAS */
for loop {                 /* outer loop(s) */
  …
  BLAS call 1              /* retrieve data Z from disk, store data Z back to disk */
  …
  BLAS call 2              /* retrieve data Z from disk !!! */
  …
  BLAS call n
  …
}
• Numerous cache misses
VS.
/* Global Optimization */
doall loop {               /* can parallelize outer loop(s) */
  …
  for loop {               /* loop fusion can improve locality */
    … [read from Z] … [write to Z] … [read from Z]
  }
  …
}
→ Global optimization exposes better parallelism and locality (significant speedups)
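A minimal C sketch of the contrast on this slide, under stated assumptions: `my_axpy` and `my_dot` are hand-written stand-ins for two library calls (they are not real BLAS entry points), and the fused loop is written by hand rather than produced by R-Stream. In the unfused version the array z is written by the first call and read again by the second; in the fused version each element of z is reused while it is still in a register.

```c
#include <assert.h>

enum { LEN = 6 };  /* illustrative vector length (assumption) */

/* Stand-in for a library call: z = z + a * x (reads and writes z). */
static void my_axpy(int n, double a, const double *x, double *z) {
    for (int i = 0; i < n; i++)
        z[i] += a * x[i];
}

/* Stand-in for a second library call: returns sum of z[i] * y[i]. */
static double my_dot(int n, const double *z, const double *y) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += z[i] * y[i];
    return s;
}

/* Call-by-call version: z makes two full trips through memory. */
double unfused_calls(int n, double a, const double *x, const double *y, double *z) {
    my_axpy(n, a, x, z);
    return my_dot(n, z, y);
}

/* Fused version: each z[i] is updated and consumed in the same iteration. */
double fused_loop(int n, double a, const double *x, const double *y, double *z) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        double zi = z[i] + a * x[i];
        z[i] = zi;
        s += zi * y[i];
    }
    return s;
}

/* Returns 1 when both versions agree on the scalar result and on z. */
int fusion_preserves_result(void) {
    double x[LEN], y[LEN], z1[LEN], z2[LEN];
    for (int i = 0; i < LEN; i++) {
        x[i] = (double)(i + 1);
        y[i] = (double)(2 * i);
        z1[i] = z2[i] = (double)(LEN - i);
    }
    double s1 = unfused_calls(LEN, 2.0, x, y, z1);
    double s2 = fused_loop(LEN, 2.0, x, y, z2);
    assert(LEN > 0);
    if (s1 != s2)
        return 0;
    for (int i = 0; i < LEN; i++)
        if (z1[i] != z2[i])
            return 0;
    return 1;
}
```

The test data are small integers so both versions compute exactly the same floating-point values. Fusion of this kind is only possible when the optimizer can see inside the calls, which is the point the slide makes about global optimization.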

Codelets From a HLC perspective • Codelets have: • Fine granularity • Explicit communication • Point to point, other kinds of synchronization • Can utilize scheduling and dependence information hints • Should also use placement of data and computation hints • Work from local scratch pad memories • Good match for UHPC hardware, allows good control for energy, resilience, etc.

UHPC from HLC perspective • Energy • must minimize data motion/communication • Near Threshold Voltage • must find even more parallelism • Resilience • synergy needed with new checkpointing/recovery models • Self awareness • dynamic distributed feedback and regulation

Another Observation • But programming directly in codelets is impractical: • Exposing machine details is a good thing, but we don’t want programmers to manage them. • Too complicated: getting it done, getting it right, getting it fast. (Complexity = parallelism x locality x resilience…) • Writing directly in codelets will also over-specify the program, bake it to one machine, and defeat portability • Role of HLC is to take high-level abstractions from the programmer • sequential code, • Chapel, CnC, • data-parallel idioms, • math language • Perform optimization to various levels of the target hardware hierarchy

Based on R-Stream Technology

Goal: Generating CnC • Assume a mapping from CnC -> Codelets • Advantages of CnC • More succinct expression of parallelism (the skewing problem) • Adaptable parallelism and load balancing • High-level representation of data-parallel idioms • CnC helps solve the irregular, idiomatic part of the problem • R-Stream can target optimizations across irregular idioms • Easy to test for correctness of generated code and execute efficiently on x86 / clusters.

Goal: Synergy with CnC • Represent CnC action-attribute graphs explicitly in R-Stream: • Benefit from optimization across multiple CnC steps • Explore tradeoff between fusing steps and running them in parallel: • Fused steps reduce the runtime overhead • And also the memory footprint • Generate many semantically equivalent versions and explore the design space tradeoffs • R-Stream auto-tuning mode will help a lot here

Goal: Synergy with Chapel and UIUC • Extensions to blackboxing: • User interface, can represent any program • Supports even linking with precompiled code • Integrate user-specific data distributions within R-Stream • HTAs • Locales • Find the right abstraction • The goal is for R-Stream to understand the abstraction and make good mapping decisions, not to replace the user's choices • Iterative, feedback-directed design • Language / transformation tool • Transformation tool / Runtime • Language / Runtime

Goal: Pragmatic Approach • Support multiple kinds of placement: • Explicit / implicit ; virtual / physical ; linear/ cyclic/ block cyclic/ general • Build on R-Stream’s current over-provisioning for performance: • Originally built for CUDA performance • Concepts extend to any architecture with dynamic scheduling decisions • Has implications on locality/communication granularity • Examine implications on power • Use advanced auto-tuning features for design space exploration • Explore which modes perform best with CnC: • Dependent on how over-provisioning is implemented • Over-provisioning (may) have implications on memory persistence: • Opportunities / loss of high-level reuse and communication optimizations
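The placement kinds named on this slide can be written as small owner functions. The sketch below is illustrative only: the function names and signatures are assumptions, not R-Stream interfaces, and it treats contiguous block placement as the reading of "linear", which the slide does not spell out. Each function maps an iteration (or data) index i to one of p virtual processors.

```c
#include <assert.h>

/* Block placement: p contiguous chunks of ceil(n / p) indices each. */
int owner_block(int i, int n, int p) {
    assert(p > 0 && n > 0 && i >= 0 && i < n);
    int chunk = (n + p - 1) / p;
    return i / chunk;
}

/* Cyclic placement: consecutive indices go to consecutive processors. */
int owner_cyclic(int i, int p) {
    assert(p > 0 && i >= 0);
    return i % p;
}

/* Block-cyclic placement: blocks of b indices are dealt out round-robin. */
int owner_block_cyclic(int i, int b, int p) {
    assert(p > 0 && b > 0 && i >= 0);
    return (i / b) % p;
}
```

For example, with p = 4, index 5 lands on processor 1 under cyclic placement, on processor 2 under block-cyclic placement with b = 2, and on processor 1 under block placement of n = 16 indices. Setting b = 1 in `owner_block_cyclic` gives cyclic placement, and setting b = ceil(n / p) gives block placement, which is one way a single general form can cover the listed cases.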

Goal: HLC support for Challenge Applications • Go beyond loop nest optimizations • Chapel / data-parallel support • CnC action-attribute graph optimization • SAR • New locality transformations demonstrated speedups on linear flight path (reported to DARPA) • MD • Exploring HLC optimization to neutral territory methods • Graph • High-level approaches to optimizing graph algorithms and increasing locality, new lock-free data-parallel algorithm for BFS • Chess, Hydrodynamics • TBD.

CSLC-LMS (Mapping Across Function/Library Calls) • Main comparisons: • R-Stream High-Level C Compiler 3.1.2 • Intel MKL 10.2.1 • Dual quad-core E5405 Xeon processors (8 cores total), 9GB memory, 8 thr • Configuration 1: MKL: Radar code -> MKL calls • Configuration 2: Low-level compilers: Radar code -> GCC / ICC • Configuration 3: R-Stream: Radar code -> R-Stream -> Optimized radar code -> GCC / ICC

CSLC-LMS (Mapping Across Function/Library Calls)

RTM (Exploiting Over-Provisioning for Performance)
    /* X, Y and the coefficients C0..C4 are not defined on the slide;
       they are assumed to be defined elsewhere. */
    void RTM_3D(double (*U1)[Y][X], double (*U2)[Y][X], double (*V)[Y][X],
                int pX, int pY, int pZ) {
      double temp;
      int i, j, k;
      for (k = 4; k < pZ - 4; k++) {
        for (j = 4; j < pY - 4; j++) {
          for (i = 4; i < pX - 4; i++) {
            /* center point plus 4 neighbors on each side along each axis */
            temp = C0 * U2[k][j][i] +
                   C1 * (U2[k-1][j][i] + U2[k+1][j][i] +
                         U2[k][j-1][i] + U2[k][j+1][i] +
                         U2[k][j][i-1] + U2[k][j][i+1]) +
                   C2 * (U2[k-2][j][i] + U2[k+2][j][i] +
                         U2[k][j-2][i] + U2[k][j+2][i] +
                         U2[k][j][i-2] + U2[k][j][i+2]) +
                   C3 * (U2[k-3][j][i] + U2[k+3][j][i] +
                         U2[k][j-3][i] + U2[k][j+3][i] +
                         U2[k][j][i-3] + U2[k][j][i+3]) +
                   C4 * (U2[k-4][j][i] + U2[k+4][j][i] +
                         U2[k][j-4][i] + U2[k][j+4][i] +
                         U2[k][j][i-4] + U2[k][j][i+4]);
            U1[k][j][i] = 2.0f * U2[k][j][i] - U1[k][j][i] + V[k][j][i] * temp;
          }
        }
      }
    }
25-point 8th order (in space) stencil

RTM (Exploiting Over-Provisioning for Performance) • 3D discretized wave equation kernel with single time iteration • Run on NVIDIA GTX 480 • Double Precision 256^3 Problem • High-Performance from Over-Provisioning space exploration and explicit optimization of register rotation and shared memory reuse

R-Stream to CnC Proof of Concept • Examined feasibility and benefits of automatic coordination language (CnC) generation from R-Stream: • on 4-D stencil, in-place, kernel application • coarse-grained parallelism is pipelined (i.e. wavefronts of parallel tasks) and representative of other streaming kernels • R-Stream generates a non-trivial OpenMP version • Manually transform this OpenMP version to CnC code • Process completely automatable

R-Stream to CnC Proof of Concept

Conclusion • R-Stream simplifies software development and maintenance • Does this by automatically parallelizing loop code • While optimizing for data locality, coalescing, communications reuse, etc. • Many exciting developments within UHPC
