1 / 27

OpenCL Compiler Support Based on Open64 for MPUs+GPUs



  1. OpenCL Compiler Support Based on Open64 for MPUs+GPUs Yu-Te Lin, Chung-Ju Wu, Chia-Han Lu, Shao-Chung Wang, Jenq-Kuen Lee Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan

  2. Outline • OpenCL overview • Issues on supporting OpenCL • On-going implementation on Open64 • Summary & Discussion

  3. Open Computing Language • Programming framework for heterogeneous platforms consisting of CPUs, GPUs, and other processors. • Initially developed by Apple Inc. and submitted to the Khronos Group in 2008. • Open and royalty-free. • OpenCL Language • C language extension • built-in functions • OpenCL Runtime • platform APIs • runtime APIs OpenCL tutorial, IEEE HotChips, Aug. 23, 2009

  4. OpenCL Framework Overview • An application is separated into host-side and kernel-side code fragments: OpenCL kernels and a host program. • Compiler (front-end and back-end): compiles the OpenCL C language just-in-time. • Runtime (platform APIs and runtime APIs): allows the host program to manipulate contexts. • Platform: the MPU runs the host and kernel programs; the GPU runs kernel programs.

__kernel void dot(__global const float4 *a,
                  __global const float4 *b,
                  __global float4 *c)
{
  int tid = get_global_id(0);
  c[tid] = a[tid] * b[tid];
}

int main(int argc, char **argv)
{
  ...
  clBuildProgram(program, ...);
  clCreateKernel(program, "dot", ...);
  ...
}

OpenCL_for_Halifux.pdf, OpenCL overview, Intel Visual Adrenaline

  5. OpenCL Sample Code • kernel.cl (the OpenCL compiler is embedded in the API; kernels are dispatched on devices):

__kernel void add(__global float *a,
                  __global float *b,
                  __global float *c)
{
  int gid = get_global_id(0);
  c[gid] = a[gid] + b[gid];
}

• main.c (configure platform devices, convert the kernel source into a character array, prepare memory objects through the OpenCL runtime):

int runCL(int *a, int *b, int *c, int size)
{
  // Device initialization
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
  ...
  // Program and kernel creation
  char *source = load_program_source("kernel.cl");
  cl_program program = clCreateProgramWithSource(context, 1,
                         (const char **)&source, NULL, &err);
  clBuildProgram(program, 1, &device, NULL, NULL, NULL);
  cl_kernel kernel = clCreateKernel(program, "add", &err);
  cl_mem buffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                    NWITEMS * sizeof(cl_uint4), NULL, NULL);
  ...
  // Device teardown
  clFinish(queue);
  ...
}

int main()
{
  int a[100], b[100], c[100];
  int i, size = 100;
  runCL(a, b, c, size);
  return 0;
}

• Sequential C equivalent, for reference:

int main()
{
  int a[100], b[100], c[100];
  int size = 100;
  int i;
  for (i = 0; i < size; i++)
    c[i] = a[i] + b[i];
  return 0;
}

  6. OpenCL Execution Scenario • Data-level parallelism: instead of one kernel holding code fragments 1, 2, and 3, kernel A { code 1; code 2; code 3 } is instanced by the OpenCL runtime once per element, so A:1,2,3 runs on data(0) through data(5) in parallel. • Task-level parallelism: the fragments become kernel A { code 1 }, kernel B { code 2 }, and kernel C { code 3 }, and the OpenCL runtime executes A:1, B:2, and C:3 concurrently.

  7. Supporting OpenCL • Syntax parsing by the compiler • qualifiers • vectors • built-in functions • Optimizations on a single core • Runtime implementation • handle multi-core issues • co-work with the device vendor (platform APIs, runtime APIs, MPU, GPU)

__kernel void add(__global float4 *a,
                  __global float4 *b,
                  __global float4 *c)
{
  int gid = get_global_id(0);
  float4 data = (float4)(1.0, 2.0, 3.0, 4.0);
  c[gid] = a[gid] + b[gid] + data;
}

  8. OpenCL Compiler Implementation • Nowadays, LLVM is the common solution, used by: • Apple • ATI • NVIDIA • RapidMind • ... • Clang is the front-end parser.

  9. ATI SDK OpenCL Support • We took the ATI SDK to try OpenCL programs. • Source kernel:

__kernel void memset( __global uint4 *dst )
{
  uint4 a = (uint4)(1, 2, 3, 4);
  uint4 b = (uint4)(97, 98, 99, 100);
  uint4 c = a * b;
  dst[get_global_id(0)] = c;
}

• LLVM -O0 compilation:

func 1204 ; __OpenCL_memset_kernel
......
dcl_literal l7, 0x00000001, 0x00000001, 0x00000001, 0x00000001 ; int: 1
dcl_literal l8, 0x00000004, 0x00000004, 0x00000004, 0x00000004 ; int: 4
...... (more)
dcl_literal l19, 0x00000040, 0x00000040, 0x00000040, 0x00000040 ; int: 64
dcl_literal l20, 0x00000050, 0x00000050, 0x00000050, 0x00000050 ; int: 80
dcl_num_thread_per_group 64, 1, 1
dcl_raw_uav_id(1)
...... (more)
endfunc ; ARGEND:__OpenCL_memset_kernel

func 1205 ; memset
mov r176.x___, r1.xxxx
mov r177, l7
mov r177, r177.xxxx
mov r178, l8
mov r179, l9
iadd r177, r177.xyz0, r178.000x
mov r181, l10
iadd r177, r177.xy0w, r181.00x0
mov r181, l11
mov r179, r179.xxxx
mov r182, l12
iadd r177, r177.x0zw, r181.0x00
mov r1049, l13
mov r2.x___, r176.xxxx
mov r1.x___, r1049.xxxx
call 1061 ; get32BitStorePrivate
iadd r176, r179.xyz0, r182.000x
mov r179, l14
mov r1050, l15
mov r2, r177
mov r1.x___, r1050.xxxx
call 1063 ; get128BitStorePrivate
iadd r176, r176.xy0w, r179.00x0
mov r179, l16
mov r1051, l17
mov r2, r177
mov r1.x___, r1051.xxxx
call 1063 ; get128BitStorePrivate
iadd r176, r176.x0zw, r179.0x00
mov r1052, l18
mov r2, r176
mov r1.x___, r1052.xxxx
call 1063 ; get128BitStorePrivate
mov r1053, l19
mov r2, r176
mov r1.x___, r1053.xxxx
call 1063 ; get128BitStorePrivate
mov r1054, l17
mov r1.x___, r1054.xxxx
call 1066 ; get128BitLoadPrivate
mov r177, r1
imul r176, r177, r176
mov r1055, l20
mov r2, r176
mov r1.x___, r1055.xxxx
call 1063 ; get128BitStorePrivate
mov r1056, l13
mov r1.x___, r1056.xxxx
call 1064 ; get32BitLoadPrivate
mov r176.x___, r1.xxxx
mov r177, l13
mov r1.x___, r177.xxxx
call 1027 ; get_global_id
mov r177.x___, r1.xxxx
ishl r177.x___, r177.xxxx, r178.xxxx
iadd r176.x___, r176.xxxx, r177.xxxx
mov r1057, l20
mov r1.x___, r1057.xxxx
call 1066 ; get128BitLoadPrivate
mov r177, r1
mov r2, r177
mov r1.x___, r176.xxxx
call 1080 ; get128BitStoreUAV
ret
endfunc ; memset

• After the internal and linker optimizations, a * b is folded into the constant vector, assembled component by component: { 97, 97, 97, 97 } → { 97, 97, 97, 400 } → { 97, 97, 297, 400 } → { 97, 196, 297, 400 }:

dcl_literal l7, 0x00000061, 0x00000061, 0x00000061, 0x00000061 ; int: 97
dcl_literal l8, 0x00000190, 0x00000190, 0x00000190, 0x00000190 ; int: 400
.........
dcl_literal l10, 0x00000129, 0x00000129, 0x00000129, 0x00000129 ; int: 297
dcl_literal l11, 0x000000c4, 0x000000c4, 0x000000c4, 0x000000c4 ; int: 196
.........

func 1210 ; memset
mov r176.x___, r1.xxxx
call 1111 ; __amdil_get_global_id_int
mov r177, r1
mov r178, l7
mov r178, r178.xxxx
mov r179, l8
mov r177, r177.x000
mov r181, l9
iadd r178, r178.xyz0, r179.000x
mov r179, l10
ishl r177.x___, r177.xxxx, r181.xxxx
iadd r178, r178.xy0w, r179.00x0
mov r179, l11
iadd r176.x___, r176.xxxx, r177.xxxx
iadd r177, r178.x0zw, r179.0x00
mov r2, r177
mov r1.x___, r176.xxxx
call 1080 ; get128BitStoreUAV
ret
endfunc ; memset

  10. Our Experiences on Open64 • PACDSP Compiler • based on the Open64 compiler • Intermediate Representation: WHIRL • CGIR-level implementation • PACDSP impact • Distributed Register Files • 5-way VLIW

  11. Previous Works for Distributed Register Files
• Local register allocation
  • Compiler Supports and Optimizations for PAC VLIW DSP Processors [Lin, LCPC'05]
  • Register Allocation for VLIW DSP Processors with Irregular Register Files [Lin, CPC'06]
  • PALF: Compiler Supports for Irregular Register Files in Clustered VLIW DSP Processors [Lin, CC:PE'07]
  • Effective Code Generation for Distributed and Ping-Pong Register Files: a Case Study on PAC VLIW DSP Cores [Lin, JoVSPS'08]
  • Expression Rematerialization for VLIW DSP Processors with Distributed Register Files [Wu, CPC'09]
• Global register allocation
  • A Local-Conscious Global Register Allocator for VLIW DSP Processors with Distributed Register Files [Lu, CPC'07]
  • LC-GRFA: Global Register File Assignment with Local Consciousness for VLIW DSP Processors with Non-uniform Register Files [Lu, CC:PE'09]
• Local optimization
  • Copy Propagation Optimizations for VLIW DSP Processors with Distributed Register Files [Wu, LCPC'06]
• Loop optimization
  • Enabling Compiler Flow for Embedded VLIW DSP Processors with Distributed Register Files [Chen, LCTES'07]

  12. Our on-going implementation (figure: compilation flow — .c → GCC 4 parser (cc142) → GCC 4 tree structure (.spin) → wgen42 → WHIRL (.B) → Very High / High / Mid / Low / Very Low WHIRL → CGIR → optimized asm (.s) → vendor toolkits for the vendor architecture; inputs: OpenCL spec., vendor runtime library; outputs: GPU asm code, CPU asm code)
• Qualifier • WHIRL symbol table • preserve qualifier information
• Vector • initialization • operation • vector type optimizations
• Built-in function • extract the essential module from the ATI SDK

  13. Qualifier Parsing • Support OpenCL qualifiers: __global, __local, __constant, __private, and __kernel. • Add a function "c_parser_ocl_qualifiers" to parse a qualifier and build the attribute tree node. • Add functions "handle_global_attribute", "handle_local_attribute", etc., to set the flags.

static tree
c_parser_ocl_qualifiers (c_parser *parser)
{
  ......
  attr = build_tree_list (attr_name, NULL_TREE);
  attrs = chainon (attrs, attr);
  ......
  return attrs;
}

static tree
handle_global_attribute (tree *node, ......)
{
  DECL_GLOBAL (*node) = 1;
  return NULL_TREE;
}

  14. Qualifier Parsing

  15. Vector Parsing • Built-in Vector Data Types • Support OpenCL vector data types such as float4, int4, ... • With the help of the GCC vector extension. • The idea is to use a union to define an OpenCL vector. • Create the built-in vector tree node in the GCC front-end.

typedef float __ocl_float4 __attribute__((vector_size(16)));
typedef union {
  struct { float x, y, z, w; };
  __ocl_float4 _ocl_vec;
} float4;

d = build_decl (FIELD_DECL, get_identifier ("w"), float_type_node);
TREE_CHAIN (d) = decls;
......
vec = build_vector_type_for_mode (float_type_node, V4SFmode);
d = build_decl (FIELD_DECL, get_identifier ("_ocl_vec"), vec);
TREE_CHAIN (d) = decls;
......
record_builtin_type (RID_TYPEDEF, "float4", float4_type_node);

  16. Vector Parsing • Vector Initialization • Add c_parser_parend_init() for vector data initialization. • Similar to c_parser_braced_init(), which is used for array initialization. • a = (float4)(3.0, 5.0, 7.0, 9.0) • Vector Assignment • Add additional fields in c_expr to record vector expressions. • a.x = b.y or a.xyzw = b.xxyy • Modify c_parser_postfix_expression_after_primary() to reference vector elements and store them in c_expr. • Use build_modify_expr() to build the vector assignment. • Vector Operation • Type casting for scalars, or vectors with a different number of components. • Use GCC's vector operations where possible.

  17. Vector Parsing • Data Initialization (figure: OpenCL source lowered to WHIRL, with INIT entries in the symbol table)

  18. Vector Parsing • Binary Operation

  19. Runtime Compilation Flow on Multi-Core CPU (x86) • When the runtime API clBuildProgram is invoked... (figure: compilation pipeline — ATI SDK side: libatiocl.so, the clc front-end on kernel.cl, builtin-x86.bc, prelink.bc, the internal optimizer and linker, opt.s, as, ld; our side: reuse the stub code and metadata, compile the kernel with Open64's opencc via lib.c into OpenCL_kernel.s, and apply llvm-extract/llc to builtin-x86.bc) • clc: OpenCL-LLVM compiler front-end • .bc: LLVM Intermediate Representation

  20. ATI Sample Test

  21. Experiment Set 1 • x86 optimization testing • 4-core CPU, Linux OS, ATI SDK • Small programs for common optimizations:

If-conversion:
unsigned int count = 0;
unsigned int i;
count += (input[i%100] > 5) ? 1 : 2;

Global code hoisting:
for (i = 0; i < 5000; i++) {
  for (j = 0; j < 2000; j++) {
    output[101] = sum * 3;
    output[102] = sum / 2;
    output[i%100] = i + j;
  }
}

Copy propagation:
int a, b, c, d, e;
a = value; b = a; c = b; d = c; e = d;
output[100] = e;

Loop nest optimization:
for (i = 0; i < 100; i++)
  for (j = 0; j < 100; j++)
    for (k = 0; k < 9000; k++)
      output[tid] = input[tid] * multiplier;

Dead code elimination:
int i, j, k;
int sum, a = 1, b = 2;
sum = a + b;
for (i = 0; i < 100; i++)
  for (j = 0; j < 100; j++)
    for (k = 0; k < 9000; k++)
      a = i + j + k + b;
output[100] = sum;

Constant folding:
int index, sum;
int a = 10, b = 2;
sum = 1 + 2 + 3 + 4;
sum = sum + 5 + 6 + 7 + 8 + 9 + 10;
sum += a * b + index;

Common available expression:
int i;
int a, b, c, d, e, f;
d = a + b + c + 20;
e = (a + b + c) * 5;
f = a + b + c - 99;

  22. Experiment Set 2 • ATI OpenCL samples testing • 4-core CPU, Linux OS, ATI SDK • Real-world OpenCL samples:

DCT:
for (uint k = 0; k < blockWidth; k++) {
  uint index1 = (inverse) ? i*blockWidth + k : k*blockWidth + i;
  uint index2 = getIdx(groupIdx, groupIdy, j, k, blockWidth, width);
  acc += dct8x8[index1] * input[index2];
}
inter[j*blockWidth + i] = acc;
barrier(CLK_LOCAL_MEM_FENCE);
acc = 0.0f;
for (uint k = 0; k < blockWidth; k++) {
  uint index1 = i*blockWidth + k;
  uint index2 = (inverse) ? j*blockWidth + k : k*blockWidth + j;
  acc += inter[index1] * dct8x8[index2];
}
output[idx] = acc;

EigenValue:
uint count = 0;
float prev_diff = (diagonal[0] - x);
count += (prev_diff < 0) ? 1 : 0;
for (uint i = 1; i < width; i += 1) {
  float diff = (diagonal[i] - x) -
               ((offDiagonal[i-1] * offDiagonal[i-1]) / prev_diff);
  count += (diff < 0) ? 1 : 0;
  prev_diff = diff;
}
return count;

Mandelbrot:
int tid = get_global_id(0);
int i = tid % width;
int j = tid / width;
float x0 = ((i*scale) - ((scale/2)*width)) / width;
float y0 = ((j*scale) - ((scale/2)*width)) / width;
float x = x0;
float y = y0;
float x2 = x*x;
float y2 = y*y;
float scaleSquare = scale * scale;
uint iter = 0;
for (iter = 0; (x2 + y2 <= scaleSquare) && (iter < maxIterations); ++iter) {
  y = 2 * x * y + y0;
  x = x2 - y2 + x0;
  x2 = x*x;
  y2 = y*y;
}

MatrixMultiplication:
sum0.x += tempA0.x*tempB0.x + tempA0.y*tempB1.x + tempA0.z*tempB2.x + tempA0.w*tempB3.x;
sum0.y += tempA0.x*tempB0.y + tempA0.y*tempB1.y + tempA0.z*tempB2.y + tempA0.w*tempB3.y;
sum0.z += tempA0.x*tempB0.z + tempA0.y*tempB1.z + tempA0.z*tempB2.z + tempA0.w*tempB3.z;
sum0.w += tempA0.x*tempB0.w + tempA0.y*tempB1.w + tempA0.z*tempB2.w + tempA0.w*tempB3.w;
sum1.x += tempA1.x*tempB0.x + tempA1.y*tempB1.x + tempA1.z*tempB2.x + tempA1.w*tempB3.x;
sum1.y += tempA1.x*tempB0.y + tempA1.y*tempB1.y + tempA1.z*tempB2.y + tempA1.w*tempB3.y;
sum1.z += tempA1.x*tempB0.z + tempA1.y*tempB1.z + tempA1.z*tempB2.z + tempA1.w*tempB3.z;
sum1.w += tempA1.x*tempB0.w + tempA1.y*tempB1.w + tempA1.z*tempB2.w + tempA1.w*tempB3.w;
sum2.x += tempA2.x*tempB0.x + tempA2.y*tempB1.x + tempA2.z*tempB2.x + tempA2.w*tempB3.x;
sum2.y += tempA2.x*tempB0.y + tempA2.y*tempB1.y + tempA2.z*tempB2.y + tempA2.w*tempB3.y;
sum2.z += tempA2.x*tempB0.z + tempA2.y*tempB1.z + tempA2.z*tempB2.z + tempA2.w*tempB3.z;
sum2.w += tempA2.x*tempB0.w + tempA2.y*tempB1.w + tempA2.z*tempB2.w + tempA2.w*tempB3.w;
sum3.x += tempA3.x*tempB0.x + tempA3.y*tempB1.x + tempA3.z*tempB2.x + tempA3.w*tempB3.x;
sum3.y += tempA3.x*tempB0.y + tempA3.y*tempB1.y + tempA3.z*tempB2.y + tempA3.w*tempB3.y;
sum3.z += tempA3.x*tempB0.z + tempA3.y*tempB1.z + tempA3.z*tempB2.z + tempA3.w*tempB3.z;
sum3.w += tempA3.x*tempB0.w + tempA3.y*tempB1.w + tempA3.z*tempB2.w + tempA3.w*tempB3.w;

  23. Common Optimization Experiment • Taking a multi-core CPU as the OpenCL device to test the optimizations • Comparison of execution time; the y-axis is the improvement

  24. ATI OpenCL Sample Experiment • The performance comparison against the ATI SDK's LLVM is shown in the following graph. • We took four classic sample codes from the ATI SDK.

  25. Reviewers' Comments • Is it sufficient to test it on a CPU instead of a GPU? • The differences are back-end code generation and the runtime compilation flow. • How do you handle the multi-dimensional thread execution model? • From the OpenCL perspective, the programmer is in charge of it. • How is a kernel function compiled differently from a non-kernel function? • Host program: host compiler. Kernel program: OpenCL compiler. • Does your OpenCL compiler handle both MPU and GPU code, or just GPU code? If both, what is the MPU architecture? • The OpenCL compiler generates only ONE target assembly code. • Why is the ATI GPU targeted rather than the NVIDIA GPU? • We simply picked one to start the research work.

  26. Summary/Discussion • Supporting OpenCL • OpenCL compiler • OpenCL runtime • OpenCL compiler on Open64 • front-end & back-end • Preliminary experiments • Open64 performs good optimizations • refinements are still required • Future work • continue work on the GPU targ-info (target information)

  27. Thank you !!
