Challenges in Binary Translation for Desktop Supercomputing David Kaeli Rodrigo Dominguez Department of Electrical and Computer Engineering Northeastern University Boston, MA
Current trends in Many-core Computing • The CPU industry has elected to jump off the cycle-time scaling bandwagon • Power/thermal constraints have become a limiting factor • We now see CPU vendors placing multiple (tens of) cores on a single chip • Clock speeds have not changed • The memory wall persists, and multiple cores that assume a shared-memory model put further pressure on this problem • Software vendors are looking for new parallelization technology • Multi-core aware operating systems • Semi-automatic parallelizing compilers
Current trends in Many-core Computing • There has been a renewed interest in parallel computing paradigms and languages • Existing many-core architectures are being considered for general-purpose platforms (e.g., Cell, GPUs, DSPs) • Heterogeneous systems are becoming a common theme • The trend will only accelerate if proper programming frameworks are available to effectively exploit many-core resources
Graphics Processors • Graphics Processing Units • More than 64% of Americans played a video game in 2009 • High-end: primarily used for 3-D rendering for videogame graphics and movie animation • Mid/low-end: primarily used for computer displays • Manufacturers include NVIDIA, AMD/ATI, and IBM (Cell) • A very competitive commodity market
GPU Performance • GPUs provide a path for performance growth • Cost and power usage numbers are also impressive • Near-exponential growth in performance for GPUs! Source: NVIDIA, 2009
Comparison of CPU and GPU Hardware Architectures CPU: Cache heavy, focused on individual thread performance GPU: ALU heavy, massively parallel, throughput-oriented
CPU/GPU Relationship CPU (host) GPU w/ local DRAM (device)
A wide range of GPU apps • Film • Financial • Languages • GIS • Holographic cinema • Machine learning • Mathematics research • Military • Mine planning • Molecular dynamics • MRI reconstruction • Multispectral imaging • N-body simulation • Network processing • Neural networks • Oceanographic research • Optical inspection • Particle physics • 3D image analysis • Adaptive radiation therapy • Acoustics • Astronomy • Audio • Automobile vision • Bioinformatics • Biological simulation • Broadcast • Cellular automata • Fluid dynamics • Computer vision • Cryptography • CT reconstruction • Data mining • Digital cinema / projections • Electromagnetic simulation • Equity trading • Protein folding • Quantum chemistry • Ray tracing • Radar • Reservoir simulation • Robotic vision / AI • Robotic surgery • Satellite data analysis • Seismic imaging • Surgery simulation • Surveillance • Ultrasound • Video conferencing • Telescope • Video • Visualization • Wireless • X-Ray
GPU as a General Purpose Computing Platform • Speedups are impressive and ever increasing!
Real Time Elimination of Undersampling Artifacts: 2300x
Lattice-Boltzmann Method for Numerical Fluid Mechanics: 1840x
Genetic Algorithm: 1000x
Total Variation Modeling: 2600x
Monte Carlo Simulation of Photon Migration: 1000x
Stochastic Differential Equations: 675x
K-Nearest Neighbor Search: 470x
Fast Total Variation for Computer Vision: 1000x
Source: CUDA Zone at www.nvidia.com/cuda/
GPGPU is becoming mainstream research • Research activities are expanding significantly [Chart: search results for the keyword “GPGPU” in the IEEE and ACM digital libraries]
NVIDIA GT200 architecture • 240 shader cores • 1.4B transistors • Up to 2GB onboard memory • ~150GB/sec BW • 1.06 TFLOPS SP • CUDA and OpenCL support • Programmable memory spaces • Tesla S1070 provides 4 GPUs in a 1U unit
[Diagram: a Streaming Processor Array of Texture Processor Clusters (TPCs); each TPC holds Streaming Multiprocessors (SMs) built from SPs and SFUs plus a Texture Unit; a grid of thread blocks maps onto multiple thread blocks, many warps of individual threads.]
AMD/ATI Radeon HD 5870 • Codename “Evergreen” • 1600 SIMD cores • L1/L2 memory architecture • 153GB/sec memory bandwidth • 2.72 TFLOPS SP • OpenCL and DirectX11 • Hidden memory microarchitecture • Provides for vectorized operation
Comparison of CPU and GPU Hardware Architectures Source: NVIDIA, AMD and Intel
Talk Outline • Introduction on GPUs • Overview of the tool chains for both CUDA and OpenCL • Motivation for pursuing this work • Comparing intermediate representations • Leveraging/analyzing benefits of Open64 optimization on AMD GPUs • Comparing challenges with fundamentally different ISAs (Superscalar SIMT versus VLIW SIMT) • Discuss PTX and IL • Describe new common IR • Two examples of PTX-to-IL binary translation • Discuss status of project and future work
GPU Programming Model • Single Instruction Multiple Threads (SIMT) • Parallelism is implicit • Programs (also called kernels or shaders) are generally small and contain nested loops • Synchronization is handled explicitly
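The implicit parallelism of SIMT can be illustrated with a toy sequential emulation (a Python sketch, purely illustrative; real hardware runs these threads concurrently in warps, not in a loop):

```python
# SIMT in miniature: every "thread" executes the same kernel body;
# parallelism is implicit in the thread index.
def kernel(tid, A, B, C):
    C[tid] = A[tid] + B[tid]     # same code, different data per thread

A, B = [1, 2, 3, 4], [10, 20, 30, 40]
C = [0] * 4
for tid in range(4):             # hardware would run these as one warp
    kernel(tid, A, B, C)
print(C)                         # [11, 22, 33, 44]
```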
Toolchains • Toolchain = compiler + runtime library
NVIDIA stack: C for CUDA / OpenCL → CUDA Runtime → Graphics driver → GPU
AMD stack: Brook+ / OpenCL → CAL Runtime → Graphics driver → GPU
CUDA Compiler
Compile time: cudafe splits the C for CUDA source into GPU and host code; Open64 compiles the GPU side to PTX*, and the host compiler builds the exe.
Execution time: the runtime and driver translate the embedded PTX into the GPU binary.
* PTX is included as data in the host application.
OpenCL (Dynamic) Compiler
Compile time: the host compiler builds the exe; kernels remain as OpenCL source.
Execution time: the OpenCL library (built on LLVM) compiles kernels to a GPU binary through the runtime and driver.
Objectives of our work • Compare two different IRs from similar massively-threaded architectures • Influence future IR design (an active topic in GPGPU research) • Leverage/analyze benefits of Open64 optimizations • Compare challenges with fundamentally different ISAs: Superscalar/SIMT versus VLIW/SIMT
CUDA Runtime • Device Management • cudaSetDevice, cudaGetDevice • Memory Management • Allocation: cudaMalloc, cudaFree • Transfer: cudaMemcpy, cudaMemset • Execution Control • Kernel launch: cudaLaunch • Config: cudaConfigureCall • Thread Management • cudaThreadSynchronize
CUDA Runtime (Vector Add example)

    __global__ void vecAdd(int A[ ], int B[ ], int C[ ]) {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

    int main() {
        int hA[ ] = {…};
        int hB[ ] = {…};
        cudaMemcpy(dA, hA, sizeof(hA), HostToDevice);
        cudaMemcpy(dB, hB, sizeof(hB), HostToDevice);
        vecAdd<<<1, N>>>(dA, dB, dC);
        cudaMemcpy(hC, dC, sizeof(hC), DeviceToHost);
    }

The <<<…>>> kernel launch expands into cudaConfigureCall, cudaSetupArgument, and cudaLaunch.
NVIDIA PTX • Low-level IR (close to ISA) • Pseudo-assembly style syntax • Load-Store instruction set • Strongly typed language • cvt.s32.u16 %r1, %tid.x; • Unlimited virtual registers • Predicate registers
AMD IL • High-level IR • Structured control flow (if-endif, while-end, switch-end) • No predication • 32-bit registers (4 components) - vectorization
Common PTX and IL instructions

vectorAdd (PTX):
    mov.u16       %rh1, %ctaid.x;
    mov.u16       %rh2, %ntid.x;
    mul.wide.u16  %r1, %rh1, %rh2;
    cvt.u32.u16   %r2, %tid.x;
    add.u32       %r3, %r2, %r1;
    ld.param.s32  %r4, [N];
    setp.le.s32   %p1, %r4, %r3;
    @%p1 bra      $LabelA;
    cvt.u64.s32   %rd1, %r3;
    mul.lo.u64    %rd2, %rd1, 4;
    ld.param.u64  %rd3, [A];
    add.u64       %rd4, %rd3, %rd2;
    ld.global.f32 %f1, [%rd4+0];
    ld.param.u64  %rd5, [B];
    add.u64       %rd6, %rd5, %rd2;
    ld.global.f32 %f2, [%rd6+0];
    add.f32       %f3, %f1, %f2;
    ld.param.u64  %rd7, [C];
    add.u64       %rd8, %rd7, %rd2;
    st.global.f32 [%rd8+0], %f3;
    $LabelA: exit;

• Data movement (mov) • Memory access (ld, st) • Arithmetic (mul, add) • Conversion (cvt) • Comparison and selection (setp) • Control flow (bra): uses predication for conditional branches
Common PTX and IL instructions

vectorAdd (IL):
    mov  r0, vThreadGrpId.x
    mov  r1, cb0[0].x
    imul r2, r0, r1
    mov  r3, vTidInGrp.x
    iadd r4, r3, r2
    mov  r5, cb1[3]
    ige  r6, r4, r5
    if_logicalz r6
      mov  r7, r4
      imul r8, r7, l0
      mov  r9, cb1[0]
      iadd r10, r9, r8
      uav_raw_load_id(0) r11, r10
      mov  r12, cb1[1]
      iadd r13, r12, r8
      uav_raw_load_id(0) r14, r13
      add  r15, r11, r14
      mov  r16, cb1[2]
      iadd r17, r16, r8
      uav_raw_store_id(0) mem.xyzw, r17, r15
    endif
    end

• Data movement (mov) • Memory access (uav_raw) • Arithmetic (imul, iadd) • No conversion instructions • Comparison and selection (ige) • Control flow (if_logicalz): structured statements
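The per-instruction correspondences above can be sketched as a small opcode-mapping table. This is a hypothetical Python illustration of the idea, not the actual Ocelot translator API; the opcode names follow the listings above, and the handling of cvt (which has no IL counterpart) is simplified to a register copy:

```python
# Illustrative PTX -> AMD IL opcode mapping for the instructions above.
OPCODE_MAP = {
    "mov":  "mov",                    # data movement
    "mul":  "imul",                   # integer arithmetic
    "add":  "iadd",
    "ld":   "uav_raw_load_id(0)",     # global memory access
    "st":   "uav_raw_store_id(0)",
    "setp": "ige",                    # comparison: predicate -> flag register
}

def translate(ptx_line: str) -> str:
    """Translate one simplified PTX line into an IL line.

    PTX type suffixes (e.g. mul.wide.u16) are dropped because IL
    registers are untyped 32-bit components; cvt has no IL
    counterpart and is emitted as a plain register copy here.
    """
    opcode, _, operands = ptx_line.partition(" ")
    base = opcode.split(".")[0]          # strip .wide.u16 etc.
    if base == "cvt":
        return f"mov {operands}"
    return f"{OPCODE_MAP.get(base, base)} {operands}"

print(translate("mul.wide.u16 %r1, %rh1, %rh2"))   # imul %r1, %rh1, %rh2
print(translate("cvt.u32.u16 %r2, %tid.x"))        # mov %r2, %tid.x
```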
Ocelot Framework* • Implemented as a CUDA library • Intercepts library calls • PTX emulation on the CPU • Parses PTX into an internal IR • Analysis: CFG, SSA, data flow, optimizations • Our work: • IR for IL programs • PTX IR -> IL IR translation • AMD/CAL back-end *Andrew Kerr, Gregory Diamos, and Sudhakar Yalamanchili. Modeling GPU-CPU Workloads and Systems. In GPGPU ’10: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pages 31–42, New York, NY, USA, 2010. ACM.
Translation Framework
Compile time: unchanged; the exe carries the PTX.
Execution time: Ocelot's PTX parser and analysis passes feed the translation to IL, and the CAL back-end submits the result through the ATI driver.
IL Control Tree • Based on Structural Analysis* • Build DFS spanning tree of the control flow graph and traverse in postorder • Form regions and collapse the nodes in the CFG • Construct the Control Tree in the process • Repeat until only 1 node is left in the CFG *S. Muchnick. Advanced Compiler Design and Implementation, chapter 7.7. Morgan Kaufmann, 1997.
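The region-collapsing loop described above can be sketched in a few lines. This is a minimal, illustrative Python model (not the actual implementation) that matches a single IF-THEN region schema and collapses it; the full algorithm would try all region schemas (blocks, if-then-else, while loops) over a postorder traversal until one node remains:

```python
# Structural analysis in miniature: match a region schema in the CFG,
# collapse its nodes into an abstract node, repeat.
def collapse_if_then(cfg):
    """Find one IF-THEN region (head -> then -> join, head -> join)
    and collapse it into its head node. Returns True on success."""
    preds = {}
    for n, succs in cfg.items():
        for s in succs:
            preds.setdefault(s, []).append(n)
    for head, succs in cfg.items():
        if len(succs) == 2:
            for then, join in (succs, succs[::-1]):
                if cfg.get(then) == [join] and preds.get(then) == [head]:
                    cfg[head] = [join]   # head absorbs 'then'
                    del cfg[then]
                    return True
    return False

# Toy CFG shaped like Example 1 below: entry -> {cvt, exit}, cvt -> exit
cfg = {"entry": ["cvt", "exit"], "cvt": ["exit"], "exit": []}
while collapse_if_then(cfg):
    pass
print(cfg)   # {'entry': ['exit'], 'exit': []}
```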
IL Control Tree
[Diagram: a control tree whose abstract nodes represent regions. An IF node has cond, true, and false basic-block (BB) children; a WHILE node has cond and body children; the leaves are BBs under an Entry node.]
Example 1 (if-then)

PTX:
    mov.u16 …
    setp.le.s32 p1, r4, r3
    @p1 bra LabelA
    cvt.u64.s32 …
    LabelA: exit

[Control tree: Entry → Block containing BB(mov, setp), an IF region whose cond is the predicated branch and whose true branch is BB(cvt), and BB(exit).]
Example 1 (if-then)

IL:
    mov …
    ige r6, r4, r5
    if_logicalz r6
      mov …
    endif
    end

[Control tree: same shape as the PTX version; the IF region maps directly to the structured if_logicalz/endif.]
Example 2 (for-loop)

PTX:
    mov.u16 …
    setp.le.s32 p1, r5, r3
    @p1 bra LabelA
    cvt.u64.s32 …
    LabelB: …
    setp.lt.s32 p2, r4, r5
    @p2 bra LabelB
    LabelA: exit

[Control tree: Entry + Block with an IF region; its true branch is a Block containing BB(cvt) and a WHILE region whose cond/body carry the setp back-edge, followed by BB(exit).]
Example 2 (for-loop)

IL:
    mov …
    ige r7, r4, r6
    if_logicalz r7
      mov …
      whileloop
        …
        if_logicalz r17
          break
        endif
      endloop
    endif
    end

[Control tree: same shape as the PTX version; the WHILE region maps to whileloop/endloop with a break for the loop exit.]
Other BT Challenges • Pointer arithmetic in CUDA needs to be emulated in CAL • Translate the Application Binary Interface (ABI), e.g., different calling conventions • Architectural bitness: Tesla and Cypress are 32-bit architectures, but Fermi is 64-bit
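One way the pointer-arithmetic emulation might look, sketched in Python (purely illustrative; the actual translator representation is not shown in these slides): a CUDA device pointer becomes a (buffer id, byte offset) pair, so integer arithmetic on the "pointer" becomes arithmetic on the offset into a UAV buffer.

```python
# Hypothetical emulation of CUDA pointer arithmetic on CAL:
# pointers become (buffer_id, byte_offset) pairs over UAV buffers.
class EmulatedPtr:
    def __init__(self, buffer_id, offset=0):
        self.buffer_id = buffer_id   # which UAV buffer the pointer is into
        self.offset = offset         # byte offset within that buffer

    def __add__(self, nbytes):       # p + n in CUDA -> offset bump in CAL
        return EmulatedPtr(self.buffer_id, self.offset + nbytes)

# &A[10] for 4-byte elements: A + 4*10 lands at byte offset 40
A = EmulatedPtr(buffer_id=0)
elem = A + 4 * 10
print(elem.buffer_id, elem.offset)   # 0 40
```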
Project Status • The main CUDA library APIs are implemented (cudaMalloc, cudaMemcpy, cudaLaunch, etc.) • Three CUDA applications from the SDK are running • Generated code quality is comparable to LLVM code generation
Next Steps • Enhance translation of the Control Tree to support other IL constructs (e.g., switch-case) • Implement other GPGPU abstractions (e.g., shared memory, textures, etc.) • Handle PTX predicated instructions (since IL does not support predication directly)
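Since IL lacks predication, each PTX predicated instruction has to be lowered into a structured IL if-block. A minimal Python sketch of one plausible lowering (illustrative only; syntax simplified, not the actual translator):

```python
# Lower a PTX predicated instruction '@%p <inst>' into an IL if-block.
def lower_predicated(ptx_line):
    """'@%p1 add %r1, %r2, %r3' -> ['if_logicalnz %p1', '  add ...', 'endif']"""
    if not ptx_line.startswith("@"):
        return [ptx_line]                      # unpredicated: pass through
    guard, _, body = ptx_line.partition(" ")
    pred = guard.lstrip("@")                   # '%p1' or '!%p1'
    test = "if_logicalz" if pred.startswith("!") else "if_logicalnz"
    return [f"{test} {pred.lstrip('!')}", f"  {body}", "endif"]

for line in lower_predicated("@%p1 add %r1, %r2, %r3"):
    print(line)
# if_logicalnz %p1
#   add %r1, %r2, %r3
# endif
```

This is correct only for side-effect-free instructions; branches guarded by predicates are instead handled by the control-tree construction shown earlier.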
Summary and Future Work • GPUs are revolutionizing desktop supercomputing • A number of critical applications have been migrated successfully • CUDA and OpenCL have made these platforms much more accessible for general purpose computing • AMD presently has the highest DP FP performance • CUDA presently produces higher performance code for NVIDIA • We are developing a platform that leverages the best of both worlds