
Procedure Cloning and Integration for Converting Parallelism from Coarse to Fine Grain


Presentation Transcript


  1. Procedure Cloning and Integration for Converting Parallelism from Coarse to Fine Grain Won So & Alex Dean Center for Embedded Systems Research Department of Electrical and Computer Engineering NC State University

  2. Overview • I. Introduction • II. Integration Methods • III. Overview of the Experiment • IV. Experimental Results • V. Conclusions and Future Work

  3. I. Introduction • Motivation • Multimedia applications are pervasive and demand higher performance than previous workloads. • Digital signal processors are adopting ILP architectures such as VLIW/EPIC: Philips TriMedia TM1000, TI VelociTI architecture, BOPS ManArray, StarCore SC120, etc. • Typical utilization is low, ranging from 1/8 to 1/2: there are not enough independent instructions within the limited instruction window, and a single instruction stream has limited ILP. • Exploit thread-level parallelism (TLP) together with ILP: find far more distant independent instructions (coarse-grain parallelism), which exist at various levels (e.g. loop level, procedure level).

  4. I. Introduction (cont.) • Software Thread Integration (STI) • A software technique that interleaves multiple threads at the machine instruction level. • Previous work focused on Hardware-to-Software Migration (HSM) for low-end embedded processors. • STI for high-end embedded processors • Integration can produce better performance: it increases the number of independent instructions, so the compiler generates a more efficient instruction schedule. • Fusing/jamming multiple procedure calls into one converts procedure-level parallelism to ILP. • Goal: Help programmers make multithreaded programs run faster on a uniprocessor

  5. I. Introduction (cont.) • Previous work I: Multithreaded architectures • SMT, Multiscalar • SM (Speculative Multithreading), DMT, XIMD • Welding, Superthreading  STI achieves multithreading on uniprocessors with no architectural support. • Previous work II: Software Techniques • Loop jamming or fusion  STI fuses entire procedures, removing the loop-boundary constraints. • Procedure Cloning  STI makes procedure clones which do the work of multiple procedures/calls concurrently.

  6. II. Integration Methods • Identify the candidate procedures • Examine parallelism • Perform integration • Select an execution model to invoke the best clone

  7. II-1. Identify the Candidate Procedures • Profile the application. • In multimedia applications, the candidates are typically DSP kernels: filter operations (FIR/IIR) and frequency-time transformations (FFT, DCT). • Example: JPEG (from gprof)

  8. II-2. Examine Parallelism • Integration requires concurrent execution. • If not already identified, find purely independent procedure-level data parallelism: each procedure call handles its own data set (input and output), and those data sets are independent of each other. • Such parallelism is abundant because multimedia applications typically process their large data by dividing it into blocks (e.g. FDCT/IDCT); a sketch of this pattern follows. • More details in [So02]
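
As a concrete illustration of this kind of independence, consider a caller that applies a block kernel to a sequence of 8x8 blocks. This is a minimal C sketch; the function and variable names are illustrative, not taken from the JPEG sources:

    /* Procedure-level data parallelism: each call reads and writes
       only its own 8x8 block, so any two calls are independent. */
    #define BLOCK_SIZE 64

    static void idct_block(short *blk)   /* stand-in for the real kernel */
    {
        for (int i = 0; i < BLOCK_SIZE; i++)
            blk[i] /= 2;
    }

    void decode_blocks(short *blocks, int nblocks)
    {
        for (int i = 0; i < nblocks; i++)
            idct_block(&blocks[i * BLOCK_SIZE]);  /* no call touches
                                                     another call's data */
    }

Because no call reads or writes another call's block, any two calls can safely execute concurrently, which is exactly the property integration exploits.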

  9. II-3. Perform Integration • Design the control structure of the integrated procedure using the Control Dependence Graph (CDG). • Take care with data-dependent predicates. • Techniques can be applied repeatedly and hierarchically. • Case a: Function with a loop (P1 is data-independent); see the sketch below.
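
A minimal C sketch of case a, assuming a simple kernel whose loop bound is data-independent (the names and kernel body are hypothetical): the clone's locals are renamed per copy and the loop bodies are interleaved, so each iteration presents the compiler with two independent instruction streams.

    /* Discrete version: one call processes one block. */
    void scale(short *blk)
    {
        for (int i = 0; i < 64; i++)   /* loop bound (P1) is data-independent */
            blk[i] >>= 1;
    }

    /* Integrated 2-threaded clone: loops jammed, locals renamed per copy. */
    void scale_sti2(short *blk_a, short *blk_b)
    {
        for (int i = 0; i < 64; i++) {
            short t_a = (short)(blk_a[i] >> 1);   /* copy A */
            short t_b = (short)(blk_b[i] >> 1);   /* copy B: independent of A */
            blk_a[i] = t_a;
            blk_b[i] = t_b;
        }
    }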

  10. II-3. Perform Integration (cont.) • Case b: Loop with a conditional (P1 is data-dependent and P2 is not) • Case c: Loop with different iterations (P1 is data-dependent)
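
For case c, the two copies may iterate different numbers of times, known only at run time. A hedged sketch of one way to handle this (hypothetical names): the jammed loop runs while both copies have work, and short cleanup loops finish whichever copy has iterations left.

    /* Case c sketch: loops with data-dependent, unequal trip counts. */
    void accumulate_sti2(const short *src_a, int n_a,
                         const short *src_b, int n_b,
                         long *sum_a, long *sum_b)
    {
        int min_n = (n_a < n_b) ? n_a : n_b;
        int i;
        for (i = 0; i < min_n; i++) {   /* jammed portion: both copies active */
            *sum_a += src_a[i];
            *sum_b += src_b[i];
        }
        for (int j = i; j < n_a; j++)   /* cleanup: copy A only */
            *sum_a += src_a[j];
        for (int j = i; j < n_b; j++)   /* cleanup: copy B only */
            *sum_b += src_b[j];
    }

Case b (a data-dependent conditional inside the loop) is handled analogously: each copy keeps its own conditional inside the jammed loop body, guarded by its own predicate.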

  11. II-3. Perform Integration (cont.) • Two levels of integration: assembly and HLL • Assembly: Better control, but requires a scheduler. • HLL: Use compilers to schedule instructions. • Which is better depends on the capabilities of the tools and compilers. • Code transform: Fusing (or jamming) two blocks • Duplicate and interleave the code. • Rename and allocate new local variables and parameters. • A superset of loop jamming: not only the loops are jammed but also the rest of the function, which allows a larger variety of threads to be jammed together; see the sketch below. • Two side effects can hurt performance: • Code size increase: if the integrated code exceeds the I-cache size. • Register pressure: if live values exceed the number of physical registers.
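
A short sketch of why this is a superset of loop jamming (hypothetical kernel): the straight-line code before and after the loop is duplicated and interleaved along with the loop body, and every local is renamed per copy.

    /* Fusing whole functions: prologue and epilogue code is
       interleaved too, not just the loop bodies. */
    void kernel(short *blk)
    {
        short bias = (short)(blk[0] >> 2);   /* prologue */
        for (int i = 0; i < 64; i++)
            blk[i] += bias;
        blk[0] = 0;                          /* epilogue */
    }

    void kernel_sti2(short *blk_a, short *blk_b)
    {
        short bias_a = (short)(blk_a[0] >> 2);   /* prologues interleaved, */
        short bias_b = (short)(blk_b[0] >> 2);   /* locals renamed per copy */
        for (int i = 0; i < 64; i++) {
            blk_a[i] += bias_a;
            blk_b[i] += bias_b;
        }
        blk_a[0] = 0;                            /* epilogues interleaved */
        blk_b[0] = 0;
    }

The two side effects above fall directly out of this transform: every duplicated instruction enlarges the clone, and every renamed temporary occupies another register.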

  12. II-4. Select an Execution Model • Two approaches: ‘Direct call’ and ‘Smart RTOS’ • 1) Direct call: Statically bind at compile time • Modify the caller to invoke a specific version of the procedure every time (e.g. the 2-threaded clone). • Simple and appropriate for a simple system. • The same approach is used in Procedure Cloning. • If multiple procedures have been cloned, each may have a different optimal # of threads. • 2) Call via RTOS: Dynamically bind at run time • The RTOS selects a version at run time based on expected performance. • Adaptive and appropriate for a more complex system. [Diagram: in the direct-call model, the application invokes the integrated procedure clones directly; in the call-via-RTOS model, it reaches them through the RTOS.]
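
Under the direct-call model the binding is purely static. A minimal sketch, assuming a 2-threaded clone and a discrete fallback for an odd leftover call (names hypothetical, in the spirit of the IDCT_STI2 versions listed later):

    /* Direct-call model: the caller is rewritten at compile time to
       invoke the 2-threaded clone; a leftover block falls back to the
       discrete version. Both are assumed defined elsewhere, e.g. as
       in the earlier sketches. */
    void idct(short *blk);                /* discrete version */
    void idct_sti2(short *a, short *b);   /* integrated 2-threaded clone */

    void decode_blocks_direct(short *blocks, int nblocks)
    {
        int i;
        for (i = 0; i + 1 < nblocks; i += 2)
            idct_sti2(&blocks[i * 64], &blocks[(i + 1) * 64]);
        if (i < nblocks)
            idct(&blocks[i * 64]);        /* odd remainder */
    }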

  13. II-4. Select an Execution Model (cont.) • Smart RTOS model: 3 levels of execution • Applications (TA, TB, TC): issue thread-forking requests for kernel procedures. • Thread library: contains discrete threads (T1, T2, T3: low ILP, slow) and integrated threads (T1_2, T1_3, T2_3: high ILP, fast). • Smart RTOS: queues pending requests, and its scheduler chooses the most efficient version of the thread. [Diagram: applications send fork requests to the Smart RTOS, whose queue and scheduler dispatch threads from the library.]
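
A minimal sketch of what such run-time selection might look like, assuming the RTOS tracks how many fork requests for a given kernel are pending (all names are hypothetical; a real dispatcher would also handle scheduling, arguments, and synchronization):

    /* Hypothetical Smart-RTOS dispatch: pick the widest integrated
       clone the pending requests can fill; fall back to the discrete
       thread when only one request is waiting. */
    typedef void (*thread_fn)(void);

    #define MAX_WIDTH 3
    extern thread_fn versions[MAX_WIDTH];  /* {T1, T1_2, T1_3}: discrete,
                                              2-integrated, 3-integrated */

    void dispatch(int pending)
    {
        int width = (pending < MAX_WIDTH) ? pending : MAX_WIDTH;
        if (width > 0)
            versions[width - 1]();          /* consumes 'width' requests */
    }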

  14. III. Overview of the Experiment • Objective • Establish a general approach for performing STI. • Examine the performance benefits and bottlenecks of STI. • Did not focus on a Smart RTOS model. • Sample application: JPEG • Standard image compression algorithm, obtained from Mediabench. • Input: 512x512x24-bit lena.ppm • Two applications: JPEG compression (CJPEG) and decompression (DJPEG)

  15. III. Overview of the Experiment (cont.) • Integration method • Integrated procedures • FDCT (Forward DCT) in CJPEG • Encode (Huffman encoding) in CJPEG • IDCT (Inverse DCT) in DJPEG • Methods • Manually integrate threads at the C source level. • Build two integrated versions: integrating 2 and 3 threads. • Execute them with a ‘direct call’ model. • Experiment • Compile with various compilers: GCC, SGI Pro64, ORC (Open Research Compiler), and Intel C++ Compiler. • Run on an EPIC machine: Itanium™ running Linux for IA-64. • Evaluate the performance with the PMU (Performance Monitoring Unit) in Itanium™, using the software tool pfmon.

  16. III. Overview of the Experiment (cont.) [Diagram: experimental flow. Nine thread versions (IDCT, FDCT, and Encode, each as NOSTI, STI2, and STI3) are compiled with each compiler/optimization combination, run on the platform, and measured for performance/IPC and cycle breakdown.] • Compilers and optimizations: GCC (GNU C Compiler): -O2 / -O3; Pro64 (SGI Pro64 Compiler): -O2; ORCC (Open Research Compiler): -O2; Intel (Intel C++ Compiler): -O2 / -O3 / -O2u0 (-O2 without loop unrolling).

  17. IV. Experimental Results • Measured and plotted data • CPU cycles (execution time), speedup by STI, and IPC. • Normalized performance: compared with NOSTI/GCC-O2. • Speedup by STI: compared with NOSTI compiled with each compiler. • IPC = number of instructions retired / CPU cycles. • Cycle and speedup breakdown • Cycle breakdown: two categories of cycles, inherent execution and stall; seven sources of stall: instruction access, data access, RSE, dependencies, issue limit, branch resteer, and taken branches. • Speedup breakdown: sources of speedup and slowdown. • Code size • Code size of the procedure

  18. IV. Experimental Results – FDCT in CJPEG • Sweet spot varies between one, two, and three threads. • STI speeds up the best compiler (Intel -O2u0) by 17%. • Code expansion for the function is 75% to 255%. • IPC does NOT correlate well with performance.

  19. IV. Cycle Breakdown – FDCT in CJPEG • I-cache misses are crucial with Intel (large code size caused by loop unrolling). • I-cache misses are reduced significantly after disabling loop unrolling. • Sources of speedup: Inh.Exe, DataAcc, Dep, IssueLim • Sources of slowdown: InstAcc, BrRes

  20. IV. Experimental Results – EOB in CJPEG • NOSTI: best with Intel -O3. • Speedup by STI: 0.69% to 17.38%. • 13.61% speedup over the best compiler.

  21. IV. Cycle Breakdown – EOB in CJPEG • I-cache misses are not crucial, though they tend to increase after integration. • Sources of speedup: Inh.Exe, DataAcc, IssueLim • Sources of slowdown: InstAcc, Dep

  22. IV. Experimental Results – IDCT in DJPEG • Wide performance variation for code from different compilers. • Wide variation in STI impact as well.

  23. IV. Cycle Breakdown – IDCT in DJPEG • I-cache misses are crucial with both ORCC and Intel. • I-cache misses are reduced significantly after disabling loop unrolling.

  24. IV. Experimental Results – Overall CJPEG App.

  25. IV. Experimental Results – Overall DJPEG App.

  26. IV. Experimental Results (cont.) • Speedup by STI • Procedure speedup of up to 18%. • Application speedup of up to 11%. • STI does not always improve performance: the limited Itanium I-cache is a major bottleneck. • Compiler variations • ‘Good’ compilers (those other than GCC) have many optimization features (e.g. speculation, predication). • They emit more instructions than GCC, and both their absolute performance and their speedup by STI are larger. • But they are more susceptible to the code size limitation, so optimizations like loop unrolling must be applied carefully.

  27. V. Conclusions and Future Work • Summary • Developed an STI technique for converting abundant TLP to ILP on VLIW/EPIC architectures. • Introduced static and dynamic execution models. • Demonstrated the potential for significant performance improvement by STI. • Relevant to high-end embedded processors with ILP support running multimedia applications. • Future Work • Extend the proposed methodology to various threads. • Examine the performance with other realistic workloads. • Develop a tool to automate the integration process at the appropriate level. • Build a detailed model and algorithm for a dynamic approach. Contact: alex_dean@ncsu.edu, www.cesr.ncsu.edu/agdean
