
ASCI Winterschool on Embedded Systems March 2004 Renesse



  1. ASCI Winterschool on Embedded Systems, March 2004, Renesse. Compilers with emphasis on ILP compilation. Henk Corporaal, Peter Knijnenburg

  2. Compiling for ILP Architectures Overview: • Motivation and Goals • Measuring and exploiting available parallelism • Compiler basics • Scheduling for ILP architectures • Source level transformations • Compilation frameworks • Summary and Conclusions

  3. Motivation • Performance requirements increase • Applications may contain much instruction level parallelism • Processors offer lots of hardware concurrency Problem to be solved: • how to exploit this concurrency automatically?

  4. Goals of code generation • High speedup • Exploit all the hardware concurrency • Extract all application parallelism • obey true dependencies only • No code rewriting: automatic parallelization • However: application tuning may be required • Limit code expansion

  5. Overview • Motivation and Goals • Measuring and exploiting available parallelism • Compiler basics • Scheduling for ILP architectures • Source level transformations • Compilation frameworks • Summary and Conclusions

  6. Measuring and exploiting available parallelism • How to measure parallelism within applications? • Using an existing compiler • Using trace analysis: • Track all real data dependences (RaW) of instructions in the issue window • register dependences • memory dependences • Check for correct branch prediction • if the prediction is correct, continue • if wrong, flush the schedule and start in the next cycle
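
A minimal sketch in C (with a made-up trace format) of the idea behind such trace analysis: every instruction is placed in the earliest cycle allowed by its RaW dependences, and the available parallelism is the trace length divided by the resulting schedule depth.

    /* Oracle-style ILP measurement over an instruction trace (illustrative sketch).
     * Each trace entry names one destination register and up to two sources;
     * only true (RaW) dependences constrain placement, latency is one cycle. */
    #include <stdio.h>

    #define NREGS 32

    typedef struct { int dst, src1, src2; } Instr;   /* -1 marks an unused source */

    int main(void) {
        Instr trace[] = {                             /* tiny example trace */
            {1, 2, 3}, {4, 1, -1}, {5, 2, 3}, {6, 4, 5}, {7, 2, -1}
        };
        int n = (int)(sizeof trace / sizeof trace[0]);
        int ready[NREGS] = {0};   /* cycle in which each register value becomes available */
        int depth = 0;

        for (int i = 0; i < n; i++) {
            int cycle = 0;        /* earliest cycle permitted by the RaW dependences */
            if (trace[i].src1 >= 0 && ready[trace[i].src1] > cycle) cycle = ready[trace[i].src1];
            if (trace[i].src2 >= 0 && ready[trace[i].src2] > cycle) cycle = ready[trace[i].src2];
            ready[trace[i].dst] = cycle + 1;
            if (cycle + 1 > depth) depth = cycle + 1;
        }
        printf("instructions: %d, schedule depth: %d, ILP = %.2f\n", n, depth, (double)n / depth);
        return 0;
    }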

  7. Measuring and exploiting available parallelism • Different effects reduce the exploitable parallelism: • Reducing window size • i.e., the number of instructions to choose from • Non-perfect branch prediction • perfect (oracle model) • dynamic predictor (e.g. 2 bit prediction table with finite number of entries) • static prediction (using profiling) • no prediction • Restricted number of registers for renaming • typical superscalars have O(100) registers • Restricted number of other resources, like FUs
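
To make the dynamic-predictor model above concrete, a minimal sketch of a 2-bit saturating-counter prediction table in C; the table size, the indexing by PC and the test loop are assumptions for illustration only.

    /* 2-bit saturating counter branch prediction (illustrative sketch).
     * Counter values 0,1 predict not-taken; 2,3 predict taken. */
    #include <stdio.h>

    #define TABLE_SIZE 1024                      /* finite number of entries */
    static unsigned char table[TABLE_SIZE];      /* counters start at 0: strongly not-taken */

    static int predict(unsigned pc) { return table[pc % TABLE_SIZE] >= 2; }

    static void update(unsigned pc, int taken) {
        unsigned char *c = &table[pc % TABLE_SIZE];
        if (taken  && *c < 3) (*c)++;            /* saturate at 3 */
        if (!taken && *c > 0) (*c)--;            /* saturate at 0 */
    }

    int main(void) {
        /* A loop branch that is taken 9 times and then falls through, repeated:
         * after warm-up the 2-bit counter mispredicts only the final iteration. */
        int correct = 0, total = 0;
        for (int pass = 0; pass < 10; pass++)
            for (int i = 0; i < 10; i++) {
                int taken = (i < 9);
                correct += (predict(0x40) == taken);
                update(0x40, taken);
                total++;
            }
        printf("prediction accuracy: %d/%d\n", correct, total);
        return 0;
    }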

  8. Measuring and exploiting available parallelism • Different effects reduce the exploitable parallelism (cont'd): • Non-perfect alias analysis (memory disambiguation). Models to use: • perfect • inspection: no dependence is assumed in cases like the following, where the two references use the same base register with different offsets, or base registers that point into provably disjoint regions (stack via fp versus globals via gp):
    r1 := 0(r9)    followed by    4(r9) := r2
    r1 := 0(fp)    followed by    0(gp) := r2
A more advanced analysis may disambiguate most stack and global references, but not the heap references • none • Important: good branch prediction, 128 registers for renaming, alias analysis on stack and global accesses, and for FP a large window size
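
A minimal sketch of the 'inspection' model in C: two references certainly do not alias when they use the same base register with different offsets (assuming equal, aligned access sizes), or when their base registers point into provably disjoint regions such as the stack (fp) and global data (gp). The naming and the region classification are assumptions for illustration.

    /* Memory disambiguation by inspection (illustrative sketch).
     * A memory reference is offset(base), e.g. 0(r9) or 4(fp). */
    #include <stdio.h>
    #include <string.h>

    typedef struct { int offset; char base[4]; } MemRef;

    /* Assumed region classification: fp -> stack, gp -> global data, others unknown. */
    static int region(const char *base) {
        if (strcmp(base, "fp") == 0) return 1;
        if (strcmp(base, "gp") == 0) return 2;
        return 0;                                /* unknown: could be heap */
    }

    /* Returns 1 only if the two references certainly do NOT alias. */
    static int independent_by_inspection(MemRef a, MemRef b) {
        if (strcmp(a.base, b.base) == 0)         /* same base register: */
            return a.offset != b.offset;         /*   different aligned offsets cannot overlap */
        if (region(a.base) && region(b.base) && region(a.base) != region(b.base))
            return 1;                            /* disjoint regions: stack versus globals */
        return 0;                                /* otherwise a dependence must be assumed */
    }

    int main(void) {
        MemRef ld1 = {0, "r9"}, st1 = {4, "r9"}; /* r1 := 0(r9)  and  4(r9) := r2 */
        MemRef ld2 = {0, "fp"}, st2 = {0, "gp"}; /* r1 := 0(fp)  and  0(gp) := r2 */
        printf("case 1 independent: %d\n", independent_by_inspection(ld1, st1));
        printf("case 2 independent: %d\n", independent_by_inspection(ld2, st2));
        return 0;
    }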

  9. Measuring and exploiting available parallelism • How much parallelism is there in real programs? Compiler models used: • Limited: look within basic blocks only • Real: inter-basic-block scheduling (ILP compiler) • Oracle-a: trace analysis, within functions only • Oracle-b: trace analysis, within the whole program

  10. Conclusions • Amount of parallelism is limited • higher in Multi-Media • higher in kernels • Trace analysis detects all types of parallelism • task, data and operation types • Detected parallelism depends on • quality of compiler • hardware • source-code transformations

  11. Overview • Motivation and Goals • Measuring and exploiting available parallelism • Compiler basics • Scheduling for ILP architectures • Source level transformations • Compilation frameworks • Summary and Conclusions

  12. Compiler basics • Overview • Compiler trajectory / structure / passes • Abstract Syntax Tree (AST) • Control Flow Graph (CFG) • Basic optimizations • Register allocation • Code selection

  13. Compiler basics: trajectory • Source program → Preprocessor → Compiler → Assembler → Loader/Linker → Object program • The compiler reports error messages back to the programmer; the loader/linker brings in library code

  14. Compiler basics: structure / passes • Source code • Lexical analyzer and Parsing: token generation, syntax checking, semantic checking, parse tree generation • Intermediate code • Code optimization: data flow analysis, local optimizations, global optimizations • Code generation: code selection, peephole optimizations • Register allocation: making the interference graph, graph coloring, spill code insertion, caller/callee save and restore code • Sequential code • Scheduling and allocation: exploiting ILP • Object code

  15. Compiler basics: structure. Simple compilation example for position := initial + rate * 60:
• Lexical analyzer: id1 := id2 + id3 * 60
• Syntax analyzer: parse tree := (id1, + (id2, * (id3, 60)))
• Intermediate code generator:
    temp1 := inttoreal(60)
    temp2 := id3 * temp1
    temp3 := id2 + temp2
    id1 := temp3
• Code optimizer:
    temp1 := id3 * 60.0
    id1 := id2 + temp1
• Code generator:
    movf id3, r2
    mulf #60.0, r2, r2
    movf id2, r1
    addf r2, r1
    movf r1, id1

  16. Compiler basics: structure. SUIF-1 toolkit example • Front-ends: FORTRAN (via FORTRAN-to-C pre-processing and FORTRAN-specific transformations) and a C front-end; non-standard structures are converted to SUIF • Analysis and optimization passes: constant propagation, forward propagation, induction variable identification, scalar privatization analysis, reduction analysis, locality optimization and parallelism analysis, high-SUIF to low-SUIF lowering, constant propagation, strength reduction, dead-code elimination • Back-end: register allocation, parallel code generation, assembly code generation • Output utilities: SUIF to text, SUIF to postscript, SUIF to C

  17. Compiler basics: Abstract Syntax Tree (AST) • C input code ('infinite' nesting possible): if (a > b) { r = a % b; } else { r = b % a; } • Parse tree (indentation shows nesting):
    Stat IF
      Cmp > (Var a, Var b)
      Statlist: Stat Expr Assign (Var r, Binop % (Var a, Var b))
      Statlist: Stat Expr Assign (Var r, Binop % (Var b, Var a))
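
A minimal sketch of how this AST could be represented in C; the node kinds and field names are illustrative, not taken from the slides.

    /* Illustrative AST representation for the if-statement above. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef enum { N_IF, N_ASSIGN, N_BINOP, N_VAR } NodeKind;

    typedef struct Node {
        NodeKind kind;
        char op;                 /* for N_BINOP: '>' or '%'; for N_VAR: the variable name */
        struct Node *kids[3];    /* IF: cond/then/else; ASSIGN: lhs/rhs; BINOP: left/right */
    } Node;

    static Node *node(NodeKind kind, char op, Node *k0, Node *k1, Node *k2) {
        Node *n = calloc(1, sizeof *n);
        n->kind = kind; n->op = op;
        n->kids[0] = k0; n->kids[1] = k1; n->kids[2] = k2;
        return n;
    }

    int main(void) {
        /* if (a > b) r = a % b; else r = b % a; */
        Node *ast = node(N_IF, 0,
            node(N_BINOP, '>', node(N_VAR, 'a', 0, 0, 0), node(N_VAR, 'b', 0, 0, 0), 0),
            node(N_ASSIGN, 0, node(N_VAR, 'r', 0, 0, 0),
                 node(N_BINOP, '%', node(N_VAR, 'a', 0, 0, 0), node(N_VAR, 'b', 0, 0, 0), 0), 0),
            node(N_ASSIGN, 0, node(N_VAR, 'r', 0, 0, 0),
                 node(N_BINOP, '%', node(N_VAR, 'b', 0, 0, 0), node(N_VAR, 'a', 0, 0, 0), 0), 0));
        printf("root node kind: %d (N_IF)\n", ast->kind);
        return 0;
    }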

  18. Compiler basics: Control flow graph (CFG) • C input code: if (a > b) { r = a % b; } else { r = b % a; } • CFG:
    BB1: sub t1, a, b
         bgz t1, 2, 3
    BB2: rem r, a, b
         goto 4
    BB3: rem r, b, a
         goto 4
    BB4: ...
• A Program is a collection of Functions, each Function is a collection of Basic Blocks, each Basic Block contains a set of Instructions, and each Instruction consists of several Transports, ...
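
A minimal sketch of a matching CFG data structure in C (field names are illustrative): a function is a set of basic blocks, and each block ends with at most two successors (fall-through and branch target).

    /* Illustrative CFG data structure for the four-block example above. */
    #include <stdio.h>

    #define MAXINS 8

    typedef struct BasicBlock {
        int id;
        const char *ins[MAXINS];        /* instructions of the block */
        int nins;
        struct BasicBlock *succ[2];     /* successor blocks (NULL when absent) */
    } BasicBlock;

    int main(void) {
        BasicBlock b1 = {1, {"sub t1,a,b", "bgz t1,2,3"}, 2, {0, 0}};
        BasicBlock b2 = {2, {"rem r,a,b", "goto 4"},      2, {0, 0}};
        BasicBlock b3 = {3, {"rem r,b,a", "goto 4"},      2, {0, 0}};
        BasicBlock b4 = {4, {"..."},                      1, {0, 0}};
        b1.succ[0] = &b2; b1.succ[1] = &b3;     /* conditional branch: two successors */
        b2.succ[0] = &b4; b3.succ[0] = &b4;     /* unconditional jumps to BB4 */

        for (BasicBlock *b = &b1; b; b = b->succ[0])    /* walk one path: BB1 -> BB2 -> BB4 */
            printf("BB%d (%d instructions)\n", b->id, b->nins);
        return 0;
    }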

  19. Compiler basics: Basic optimizations • Machine independent optimizations • Machine dependent optimizations

  20. Compiler basics: Basic optimizations • Machine independent optimizations • Common subexpression elimination • Constant folding • Copy propagation • Dead-code elimination • Induction variable elimination • Strength reduction • Algebraic identities • Commutative expressions • Associativity: tree height reduction • Note: not always allowed (due to limited precision)
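
A small illustrative C fragment, before and after a few of these machine-independent optimizations; the function and variable names are made up for the example.

    #include <stdio.h>

    /* Before optimization */
    int before(int a, int b) {
        int x = a * b + 4 * 8;      /* 4 * 8 is a compile-time constant: constant folding */
        int y = a * b + 1;          /* a * b is a common subexpression                    */
        int z = x;                  /* z is merely a copy of x: copy propagation          */
        int unused = y * 2;         /* never used: dead-code elimination removes it       */
        return z + y;
    }

    /* After constant folding, CSE, copy propagation and dead-code elimination */
    int after(int a, int b) {
        int t = a * b;              /* common subexpression computed once */
        return (t + 32) + (t + 1);  /* constant folded, copy propagated, dead code removed */
    }

    int main(void) {
        printf("%d %d\n", before(3, 5), after(3, 5));   /* both print 63 */
        return 0;
    }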

  21. Compiler basics: Basic optimizations • Machine dependent optimization example: what is the optimal implementation of a*34? • Use the multiplier: mul Tb, Ta, 34 • Pro: no thinking required • Con: may take many cycles • Alternative:
    SHL Tc, Ta, 1       ; Tc = 2*a
    ADD Tb, Tc, Tzero   ; Tb = 2*a
    SHL Tc, Tc, 4       ; Tc = 32*a
    ADD Tb, Tb, Tc      ; Tb = 34*a
• Pro: may take fewer cycles • Cons: • uses more registers • additional instructions (I-cache load / code size)
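
The same decomposition written in C to make the shift/add sequence concrete: 34 = 32 + 2, so a*34 = (a << 5) + (a << 1). The function mirrors the four instructions above; the small test in main is only illustrative.

    #include <stdio.h>

    /* Strength-reduced multiply by the constant 34. */
    static int mul34_shift(int a) {
        int tc = a << 1;        /* SHL Tc, Ta, 1     : Tc = 2*a  */
        int tb = tc;            /* ADD Tb, Tc, Tzero : Tb = 2*a  */
        tc = tc << 4;           /* SHL Tc, Tc, 4     : Tc = 32*a */
        return tb + tc;         /* ADD Tb, Tb, Tc    : Tb = 34*a */
    }

    int main(void) {
        for (int a = 0; a <= 5; a++)
            printf("%d * 34 = %d (shift/add: %d)\n", a, a * 34, mul34_shift(a));
        return 0;
    }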

  22. Compiler basics: Register allocation • Register organization: conventions are needed for parameter passing and register usage across function calls, for example:
    r31 - r21: callee saved registers
    r20 - r11: caller saved registers / temporaries
    r10 - r1:  argument and result transfer
    r0:        hard-wired 0

  23. Register allocation using graph coloring Given a set of registers, what is the most efficient mapping of registers to program variables in terms of execution time of the program? • A variable is defined at a point in program when a value is assigned to it. • A variable is used at a point in a program when its value is referenced in an expression. • The live range of a variable is the execution range between definitions and uses of a variable.

  24. Register allocation using graph coloring • Example: live ranges • Program:
    1. a :=
    2. c :=
    3. b :=
    4.   := b
    5. d :=
    6.   := a
    7.   := c
    8.   := d
• Live ranges (from definition to last use): a: 1-6, b: 3-4, c: 2-7, d: 5-8

  25. Register allocation using graph coloring • Interference graph: an edge connects two variables whose live ranges overlap; here a interferes with b, c and d, and c additionally interferes with b and d • Coloring: • a = red • b = green • c = blue • d = green • The graph needs 3 colors => the program needs 3 registers
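
A minimal sketch in C of coloring this interference graph greedily: each variable, visited in program order, receives the lowest color not used by an already-colored neighbour. The edge list is read off the live ranges of the example; a production allocator would instead use the simplify/select scheme shown two slides below.

    /* Greedy coloring of the interference graph from the example above.
     * Variables: 0=a, 1=b, 2=c, 3=d; an edge means the live ranges overlap. */
    #include <stdio.h>

    #define N 4

    int main(void) {
        const char *name[N] = {"a", "b", "c", "d"};
        int interferes[N][N] = {        /* edges: a-b, a-c, a-d, b-c, c-d */
            {0, 1, 1, 1},
            {1, 0, 1, 0},
            {1, 1, 0, 1},
            {1, 0, 1, 0},
        };
        int color[N], maxcolor = 0;

        for (int v = 0; v < N; v++) {
            int used[N] = {0};
            for (int u = 0; u < v; u++)         /* colors taken by already-colored neighbours */
                if (interferes[v][u]) used[color[u]] = 1;
            int c = 0;
            while (used[c]) c++;                /* lowest free color = machine register */
            color[v] = c;
            if (c + 1 > maxcolor) maxcolor = c + 1;
        }
        for (int v = 0; v < N; v++)
            printf("%s -> r%d\n", name[v], color[v]);
        printf("registers needed: %d\n", maxcolor);   /* prints 3, matching the slide */
        return 0;
    }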

  26. Register allocation using graph coloring • Spill/reload code is needed when there are not enough colors (registers) to color the interference graph • Example: only two registers available! Program with spill code for c:
    1.  a :=
    2.  c :=
    3.  store c
    4.  b :=
    5.    := b
    6.  d :=
    7.    := a
    8.  load c
    9.    := c
    10.   := d
• The store/load pair splits the live range of c into two short ranges, so the remaining live ranges can be colored with two registers

  27. Register allocation for a monolithic RF • Scheme of the optimistic register allocator: Renumber → Build → Spill costs → Simplify → Select, with spill code inserted (and the allocator rerun) when Select cannot find a color • The Select phase selects a color (= machine register) for a variable that minimizes the heuristic h1 = fdep(col, var) + caller_callee(col, var), where fdep(col, var) is a measure for the introduction of false dependencies and caller_callee(col, var) is the cost for mapping var on a caller or callee saved register

  28. Compiler basics: Code selection • CISC era • Code size important • Determine the shortest sequence of code • Many options may exist • Pattern matching. Example M68020: D1 := D1 + M[ M[10+A1] + 16*D2 + 20 ] → ADD ([10,A1], D2*16, 20), D1 • RISC era • Performance important • Only few possible code sequences • New implementations of old architectures optimize the RISC part of the instruction set only, e.g. i486 / Pentium / M68020

  29. Overview • Motivation and Goals • Measuring and exploiting available parallelism • Compiler basics • Scheduling for ILP architectures • Source level transformations • Compilation frameworks • Summary and Conclusions

  30. What is scheduling? • Time allocation: • Assigning instructions or operations to time slots • Preserve dependences: • Register dependences • Memory dependences • Optimize code with respect to performance/ code size/ power consumption/ .. In practice scheduling may also integrate allocation of resources: • Space allocation (satisfy resource constraints): • Bind operations to FUs • Bind variables to registers/ register files • Bind transports to buses

  31. Why scheduling? Let’s look at the execution time: Texecution = Ncycles x Tcycle = Ninstructions x CPI x Tcycle Scheduling may reduce Texecution • Reduce CPI (cycles per instruction) • early scheduling of long latency operations • avoid pipeline stalls due to structural, data and control hazards • allow Nissue > 1 and therefore CPI < 1 • Reduce Ninstructions • compact many operations into each instruction (VLIW)
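
A quick illustration of the formula (all numbers assumed, not from the slides): a program of 10^8 instructions at CPI = 1.5 and Tcycle = 2 ns takes 10^8 x 1.5 x 2 ns = 0.3 s; if scheduling halves the CPI to 0.75, Texecution drops to 0.15 s.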

  32. Scheduling: Structural hazards • Basic pipelining diagram: consecutive instructions from the instruction stream enter the pipeline one cycle apart, each passing through the stages • IF: instruction fetch • ID: instruction decode • OF: operand fetch • EX: execute • WB: write back • Pipeline stalls occur due to lack of resources, e.g. a shared memory port (a load conflicting with instruction fetch) or a single FU that occupies EX for several cycles: following instructions wait, leaving empty pipeline stages

  33. Scheduling: Data dependences • Three types: RaW, WaR and WaW. Examples:
    add r1, r2, 5   ; r1 := r2+5
    sub r4, r1, r3  ; RaW of r1

    add r1, r2, 5
    sub r2, r4, 1   ; WaR of r2

    add r1, r2, 5
    sub r1, r1, 1   ; WaW of r1

    st r1, 5(r2)    ; M[r2+5] := r1
    ld r5, 0(r4)    ; RaW if 5+r2 = 0+r4
• WaW and WaR can be solved through renaming!

  34. Scheduling: RaW dependence • Example: add r1, r2, 5 ; r1 := r2+5 followed by sub r4, r1, r3 ; RaW of r1 • Without bypass circuitry the sub must wait in the pipeline until the add has written r1 back • With bypass circuitry the result of the add is forwarded directly to the sub, which saves two cycles

  35. Scheduling: RaW dependence • Bypassing circuitry: buffers at the ALU inputs allow an operand to be taken either from the register file or directly from a result that has not yet been written back to the register file

  36. Scheduling: RaW dependence • Avoiding RaW stalls: reordering of instructions by the compiler • Example: avoiding the one-cycle load interlock for the code a = b + c; d = e - f
Unscheduled code:
    Lw R1,b
    Lw R2,c
    Add R3,R1,R2    ; interlock
    Sw a,R3
    Lw R1,e
    Lw R2,f
    Sub R4,R1,R2    ; interlock
    Sw d,R4
Scheduled code:
    Lw R1,b
    Lw R2,c
    Lw R5,e         ; extra register needed!
    Add R3,R1,R2
    Lw R2,f
    Sw a,R3
    Sub R4,R5,R2
    Sw d,R4

  37. Scheduling: Control hazards • Pipeline diagram (predict not taken): the instructions fetched after a taken branch to label L must be squashed, wasting their pipeline slots • A branch requires 3 actions: • Compute the new address • Determine the condition • Perform the actual branch (if taken): PC := new address

  38. Control hazards: what's the penalty? CPI = CPIideal + fbranch x Pbranch Pbranch = Ndelayslots x miss_rate • Superscalars tend to have large branch penalty Pbranch due to many pipeline stages • Note that penalties have larger effect when CPIideal is low
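
A small worked example (numbers assumed for illustration): with fbranch = 0.2, Ndelayslots = 3 and a miss rate of 10%, Pbranch = 3 x 0.1 = 0.3, so the branch term adds 0.2 x 0.3 = 0.06 to the CPI. At CPIideal = 1 that is a 6% slowdown, but at CPIideal = 0.25 (a 4-issue machine) the same 0.06 is roughly a 24% slowdown.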

  39. Scheduling: Control hazards • What can we do about control hazards and CPI penalty? • Keep penalty Pbranch low: • Early computation of new PC • Early determination of condition • Visible delay slots filled by compiler (MIPS) • Branch prediction • Reduce control dependencies (control height reduction) [Schlansker and Kathail, Micro’95] • Remove branches: if-conversion • Conditional instructions: CMOVE, cond skip next • Guarding all instructions: TriMedia

  40. Scheduling: Control height reduction • Reduce the number of branches (control height) along a trace [Schlansker and Kathail, Micro’95] • Problem with stores: they may not be moved above branches

  41. Scheduling: Control height reduction • Original code: a trace of four blocks, each consisting of a store followed by a conditional exit branch:
    store 0; branch 0: if c0 goto exit 0
    store 1; branch 1: if c1 goto exit 1
    store 2; branch 2: if c2 goto exit 2
    store 3; branch 3: if c3 goto exit 3
    fall-through

  42. Scheduling: Control height reduction • New code: the on-trace code performs store 0, evaluates c0..c3, and takes a single fall-through (FT) branch to the off-trace code when any exit condition holds; stores 1-3 then execute on-trace before the fall-through • The off-trace code repeats the individual tests, performs the remaining stores and takes the corresponding exit (exit 0 .. exit 3) • Note that stores 1-3 may also be guarded; this eliminates the branch latency altogether along the on-trace path

  43. Scheduling: Conditional instructions • Example: CMOVE (supported by Alpha) • Source: if (A == 0) S = T; assume r1: A, r2: S, r3: T • Object code with a branch:
    Bnez r1, L
    Mov r2, r3
    L: . . . .
• After conversion: Cmovz r2, r3, r1

  44. Scheduling: Conditional instructions Conditional instructions are useful, however: • Squashed instructions still take execution time and execution resources • Consequence: long target blocks can not be if-converted • Condition has to be known early • Moving operations across multiple branches requires complicated predicates • Compatibility: change of ISA (instruction set architecture) Practice: • Current superscalars support a limited set of conditional instructions • CMOVE: alpha, MIPS, PowerPC, SPARC • HP PA: any RR instruction can conditionally squash next instruction Large VLIWs profit from making all instructions conditional • guarded execution: TriMedia, Intel/HP IA-64, TI C6x

  45. Scheduling: Conditional instructions • Full guard support: if-conversion of conditional code • Assume: • tbranch: branch latency • pbranch: branching probability • ttrue: execution time of the TRUE branch • tfalse: execution time of the FALSE branch • Execution times of the original and the if-converted code for a non-ILP architecture:
    toriginal_code = (1 + pbranch) x tbranch + pbranch x ttrue + (1 - pbranch) x tfalse
    tif_converted_code = ttrue + tfalse
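
A small worked example (numbers assumed): with tbranch = 2, pbranch = 0.5 and ttrue = tfalse = 2, the original code costs (1 + 0.5) x 2 + 0.5 x 2 + 0.5 x 2 = 5 cycles on average while the if-converted code costs 2 + 2 = 4 cycles, a speedup of 1.25. Doubling the blocks to ttrue = tfalse = 4 gives 7 versus 8 cycles, which is why, on a non-ILP architecture, if-conversion only pays off for short target blocks.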

  46. Scheduling: Conditional instructions Speedup of if-converted code for non-ILP architectures Only interesting for short target blocks

  47. Scheduling: Conditional instructions Speedup of if-converted code for ILP architectures with sufficient resources tif_converted = max(ttrue, tfalse) Much larger area of interest !!

  48. Scheduling: Conditional instructions • Full guard support for large ILP architectures has a number of advantages: • Removing unpredictable branches • Enlarging scheduling scope • Enabling software pipelining • Enhancing code motion when speculation is not allowed • Resource sharing; even when speculation is allowed guarding may be profitable

  49. Scheduling: Overview • Transforming a sequential program into a parallel program:
    read sequential program
    read machine description file
    for each procedure do
        perform function inlining
    for each procedure do
        transform an irreducible CFG into a reducible CFG
        perform control flow analysis
        perform loop unrolling
        perform data flow analysis
        perform memory reference disambiguation
        perform register allocation
        for each scheduling scope do
            perform instruction scheduling
    write parallel program

  50. Scheduling: Integer Linear Programming • Integer linear programming scheduling method • Introduce decision variables: xi,j = 1 if operation i is scheduled in cycle j • Constraints like: • Limited resources: for every cycle j and operation type t, the sum of xi,j over all operations i of type t must not exceed Mt, the number of resources of type t • Data dependence constraints • Timing constraints • Problem: too many decision variables
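
A sketch of the full formulation in the same notation (assumed here; the slide only names the constraint types): every operation is scheduled exactly once, i.e. the sum over j of xi,j equals 1; the resource constraint requires, for every cycle j and type t, that the sum of xi,j over all operations i of type t does not exceed Mt; and a data dependence from operation i to operation k with latency Li becomes sum over j of j*xk,j >= sum over j of j*xi,j + Li. Even a modest scheduling scope already yields on the order of (number of operations) x (number of cycles) binary decision variables.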
