Processor Architectures and Program Mapping

Exploiting ILP part 2: code generation
TU/e 5kk10
Henk Corporaal, Jef van Meerbergen, Bart Mesman

Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  • C6
  • TM
  • TTA
• Clustering
• Code generation
• Hands-on

Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman

Compiler basics
• Overview
  • Compiler trajectory / structure / passes
  • Control Flow Graph (CFG)
• Mapping and Scheduling
  • Basic block list scheduling
  • Extended scheduling scope
  • Loop scheduling

Compiler basics: trajectory
Source program → Preprocessor → Compiler → Assembler → Loader/Linker → Object program
• The compiler reports error messages
• The loader/linker pulls in library code

Compiler basics: structure / passes
Source code
• Lexical analyzer
  • token generation
• Parsing
  • check syntax
  • check semantic
  • parse tree generation
Intermediate code
• Code optimization
  • data flow analysis
  • local optimizations
  • global optimizations
• Code generation
  • code selection
  • peephole optimizations
• Register allocation
  • making interference graph
  • graph coloring
  • spill code insertion
  • caller / callee save and restore code
Sequential code
• Scheduling and allocation
  • exploiting ILP
Object code

Compiler basics: structure
Simple compilation example

    position := initial + rate * 60

Lexical analyzer:
    id1 := id2 + id3 * 60
Syntax analyzer:
    parse tree: := has children id1 and +; + has children id2 and *; * has children id3 and 60
Intermediate code generator:
    temp1 := inttoreal(60)
    temp2 := id3 * temp1
    temp3 := id2 + temp2
    id1 := temp3
Code optimizer:
    temp1 := id3 * 60.0
    id1 := id2 + temp1
Code generator:
    movf id3, r2
    mulf #60, r2, r2
    movf id2, r1
    addf r2, r1
    movf r1, id1

Compiler basics: Control flow graph (CFG)
C input code:
    if (a > b) { r = a % b; }
    else       { r = b % a; }

CFG (one basic block per number):
    1:  sub t1, a, b
        bgz t1, 2, 3
    2:  rem r, a, b
        goto 4
    3:  rem r, b, a
        goto 4
    4:  ...

A program is a collection of functions; each function is a collection of basic blocks; each basic block contains a set of instructions; each instruction consists of several transports, ...

Mapping / Scheduling: placing operations in space and time
    d = a * b;
    e = a + d;
    f = 2 * b + d;
    r = f - e;
    x = z + y;
[Figure: Data Dependence Graph (DDG) of the code above; the nodes are the operations *, *, +, +, +, -, the edges carry the values a, b, 2, d, e, f, z, y, x, r]

How to map these operations?
• Architecture constraints:
  • One Function Unit
  • All operations single cycle latency

    cycle   operation
    1       *
    2       *
    3       +
    4       +
    5       -
    6       +

How to map these operations?
• Architecture constraints:
  • One Add-sub and one Mul unit
  • All operations single cycle latency

    cycle   Mul   Add-sub
    1       *     +
    2       *     +
    3             +
    4             -
    5
    6

There are many mapping solutions
[Figure: Pareto curve (solution space); execution time T against cost, each x is one mapping solution]
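Extracting the Pareto curve from a cloud of mapping solutions can be sketched in a few lines of Python. This is only an illustration: the (cost, execution time) pairs are made-up values, not data from the slide, and `pareto_front` is a hypothetical helper name.

```python
def pareto_front(points):
    """Return the points that no other point dominates.

    A point (cost, time) is dominated when another point is no worse in
    both dimensions and strictly better in at least one.
    """
    front = []
    best_time = float("inf")
    # Sort by cost, then time: a point is on the front only if it is
    # strictly faster than every cheaper (or equally cheap) point seen so far.
    for cost, time in sorted(set(points)):
        if time < best_time:
            front.append((cost, time))
            best_time = time
    return front


# Hypothetical (cost, execution time) pairs for a handful of mappings.
solutions = [(1, 6), (2, 4), (2, 5), (3, 4), (4, 3), (5, 3), (6, 2)]
front = pareto_front(solutions)
```

Every point that is not on `front` is a mapping that costs at least as much as some other mapping without being faster.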

Basic Block Scheduling
• Make a dependence graph
• Determine minimal length
• Determine ASAP, ALAP, and slack of each operation
• Place each operation in the first cycle with sufficient resources
Note:
• Scheduling order is sequential
• Priority is determined by the heuristic used, e.g. slack
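The ASAP, ALAP and slack values can be computed with two passes over the dependence graph. The sketch below applies this to the DDG of the earlier mapping example (d = a*b; e = a+d; f = 2*b+d; r = f-e; x = z+y), assuming single-cycle latency for every operation; the node names and the function `asap_alap_slack` are hypothetical.

```python
# Successor map of the example DDG; node names are illustrative labels.
succs = {
    "mul_d": ["add_e", "add_f"],   # d = a * b
    "mul_2b": ["add_f"],           # 2 * b
    "add_e": ["sub_r"],            # e = a + d
    "add_f": ["sub_r"],            # f = 2*b + d
    "sub_r": [],                   # r = f - e
    "add_x": [],                   # x = z + y
}


def asap_alap_slack(succs, latency=1):
    preds = {v: [] for v in succs}
    for u, vs in succs.items():
        for v in vs:
            preds[v].append(u)

    # Topological order (Kahn's algorithm).
    indeg = {v: len(preds[v]) for v in succs}
    order, work = [], [v for v in succs if indeg[v] == 0]
    while work:
        u = work.pop()
        order.append(u)
        for v in succs[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                work.append(v)

    # ASAP: earliest cycle (starting at 1) allowed by the predecessors.
    asap = {}
    for v in order:
        asap[v] = max((asap[u] + latency for u in preds[v]), default=1)
    length = max(asap.values())          # minimal schedule length

    # ALAP: latest cycle that still meets the minimal length.
    alap = {}
    for v in reversed(order):
        alap[v] = min((alap[w] - latency for w in succs[v]), default=length)

    slack = {v: alap[v] - asap[v] for v in succs}
    return asap, alap, slack, length


asap, alap, slack, length = asap_alap_slack(succs)
```

Operations with zero slack lie on the critical path; in this example only `add_x` can be moved freely.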

Basic Block Scheduling
[Figure: dependence graph with ADD, SUB, NEG, LD and MUL operations on inputs A, B, C, producing X, y and z. Each operation is annotated with <ASAP cycle, ALAP cycle>: <1,1>, <2,2>, <3,3>, <4,4>, <1,3>, <2,3>, <2,4>, <1,4>. Slack = ALAP cycle - ASAP cycle]

Cycle based list scheduling
    proc Schedule(DDG = (V,E))
    beginproc
      ready = { v | ¬∃ (u,v) ∈ E }
      ready' = ready
      sched = ∅
      current_cycle = 0
      while sched ≠ V do
        for each v ∈ ready' do
          if ¬ResourceConfl(v, current_cycle, sched) then
            cycle(v) = current_cycle
            sched = sched ∪ {v}
          endif
        endfor
        current_cycle = current_cycle + 1
        ready = { v | v ∉ sched ∧ ∀ (u,v) ∈ E, u ∈ sched }
        ready' = { v | v ∈ ready ∧ ∀ (u,v) ∈ E, cycle(u) + delay(u,v) ≤ current_cycle }
      endwhile
    endproc
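A runnable sketch of cycle based list scheduling, applied to the example DDG (d = a*b; e = a+d; f = 2*b+d; r = f-e; x = z+y). The resource model (a count of units per unit type), the priority heuristic (longest path to a sink, then number of successors) and all names are assumptions made for illustration; the slide leaves the heuristic open. Cycles are numbered from 0, as in the pseudocode.

```python
def list_schedule(succs, unit_of, units, delay=1):
    """Cycle based list scheduling of a DDG given as a successor map.

    unit_of maps each operation to a unit type; units maps each unit
    type to the number of such units (each must be at least 1).
    """
    preds = {v: [] for v in succs}
    for u, vs in succs.items():
        for v in vs:
            preds[v].append(u)

    # Assumed priority: longest path to a sink first.
    height = {}

    def h(v):
        if v not in height:
            height[v] = 1 + max((h(w) for w in succs[v]), default=0)
        return height[v]

    for v in succs:
        h(v)

    cycle = {}
    current = 0
    while len(cycle) < len(succs):
        # ready': unscheduled, all predecessors scheduled and finished.
        ready = [v for v in succs if v not in cycle
                 and all(u in cycle and cycle[u] + delay <= current
                         for u in preds[v])]
        ready.sort(key=lambda v: (-height[v], -len(succs[v]), v))
        used = {}
        for v in ready:
            t = unit_of[v]
            if used.get(t, 0) < units[t]:      # no resource conflict
                cycle[v] = current
                used[t] = used.get(t, 0) + 1
        current += 1
    return cycle


succs = {
    "mul_d": ["add_e", "add_f"], "mul_2b": ["add_f"],
    "add_e": ["sub_r"], "add_f": ["sub_r"],
    "sub_r": [], "add_x": [],
}
unit_of = {"mul_d": "mul", "mul_2b": "mul", "add_e": "addsub",
           "add_f": "addsub", "sub_r": "addsub", "add_x": "addsub"}

# One Mul unit and one Add-sub unit.
two_units = list_schedule(succs, unit_of, {"mul": 1, "addsub": 1})
# A single function unit that executes every operation.
one_unit = list_schedule(succs, {v: "fu" for v in succs}, {"fu": 1})
```

With these assumptions the schedule takes 4 cycles on the Mul plus Add-sub architecture and 6 cycles on a single function unit. List scheduling is a heuristic: a different tie-break between the two multiplications would give a 5-cycle schedule on the two-unit machine.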

Extended basic block scheduling: Code Motion
    A:  a) add r4, r4, 4
        b) beq . . .
    B:  c) add r1, r1, r2
    C:  d) sub r1, r1, r2
    D:  e) st r1, 8(r4)
• Downward code motions?
  • a → B, a → C, a → D, c → D, d → D
• Upward code motions?
  • c → A, d → A, e → B, e → C, e → A

Extended Scheduling scope
Code:
    A;
    If cond Then B Else C;
    D;
    If cond Then E Else F;
    G;
CFG (Control Flow Graph):
[Figure: A branches to B and C, which join in D; D branches to E and F, which join in G]

Scheduling scopes
• Trace
• Superblock
• Decision tree
• Hyperblock/region

Code movement (upwards) within regions
[Figure: an add operation is moved upwards from the source block, past intermediate blocks (I), to the destination block. Legend: copy needed; intermediate block; check for off-liveness; code movement]

Extended basic block scheduling: Code Motion
• A dominates B ⇒ A is always executed before B
  • Consequently: A does not dominate B ⇒ code motion from B to A requires code duplication
• B post-dominates A ⇒ B is always executed after A
  • Consequently: B does not post-dominate A ⇒ code motion from B to A is speculative
[Figure: CFG with blocks A, B, C, D, E, F]
Q1: does C dominate E?
Q2: does C dominate D?
Q3: does F post-dominate D?
Q4: does D post-dominate B?
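Dominance and post-dominance can be computed with a simple iterative data-flow algorithm. The sketch below assumes the CFG of the "Extended Scheduling scope" slide (A; if B else C; D; if E else F; G), which may differ from the figure the questions refer to; the function names are hypothetical.

```python
def dominators(succs, entry):
    """Iteratively compute, for every node, the set of nodes dominating it."""
    nodes = set(succs)
    preds = {n: set() for n in nodes}
    for u, vs in succs.items():
        for v in vs:
            preds[v].add(u)

    dom = {n: set(nodes) for n in nodes}   # start from "everything dominates"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n], changed = new, True
    return dom


def reverse(succs):
    """Reverse all edges; dominators of the result are post-dominators."""
    rev = {n: [] for n in succs}
    for u, vs in succs.items():
        for v in vs:
            rev[v].append(u)
    return rev


# Assumed CFG: A -> B | C -> D -> E | F -> G
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"],
       "D": ["E", "F"], "E": ["G"], "F": ["G"], "G": []}

dom = dominators(cfg, "A")             # dom[x]: blocks that dominate x
pdom = dominators(reverse(cfg), "G")   # pdom[x]: blocks that post-dominate x

c_dominates_e = "C" in dom["E"]
d_postdominates_b = "D" in pdom["B"]
```

For this assumed CFG, C does not dominate E (the path through B bypasses C), while D post-dominates B (every path from B to the exit passes through D).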

Scheduling: Loops
Loop optimizations:
• Loop unrolling
• Loop peeling
[Figure: a loop with blocks A, B, C, D, shown before and after unrolling and peeling; C' and C'' are copies of block C]

Scheduling: Loops
• Problems with unrolling:
  • Exploits only parallelism within sets of n iterations
  • Iteration start-up latency
  • Code expansion
[Figure: resource utilization over time for basic block scheduling, basic block scheduling and unrolling, and software pipelining]

Software pipelining
• Software pipelining a loop is:
  • Scheduling the loop such that iterations start before preceding iterations have finished
  Or:
  • Moving operations across the backedge

Example: y = a.x, with loop body LD, ML, ST
• No overlap: 3 cycles/iteration
• Unrolling (3 iterations overlapped in 5 cycles): 5/3 cycles/iteration
• Software pipelining: 1 cycle/iteration
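The cycle counts of the example follow from a short calculation. The sketch below assumes a loop body that is a chain of single-cycle operations (LD, ML, ST), each on its own unit, so that a new iteration can start every cycle; `cycles_per_iteration` is a hypothetical helper.

```python
from fractions import Fraction


def cycles_per_iteration(stages, unroll):
    """Average cycles per iteration when `unroll` iterations are overlapped.

    The last of the overlapped iterations starts in cycle `unroll` and
    needs `stages` cycles, so the whole group takes unroll + stages - 1.
    """
    return Fraction(unroll + stages - 1, unroll)


no_overlap = cycles_per_iteration(stages=3, unroll=1)     # 3
unrolled_3 = cycles_per_iteration(stages=3, unroll=3)     # 5/3
unrolled_100 = cycles_per_iteration(stages=3, unroll=100)
```

As the unroll factor grows, the value approaches the 1 cycle/iteration that software pipelining reaches in its steady state, but the code size grows with it.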

Software pipelining (cont'd)
Basic techniques:
• Modulo scheduling (Rau, Lam)
  • list scheduling with modulo resource constraints
• Kernel recognition techniques
  • unroll the loop
  • schedule the iterations
  • identify a repeating pattern
  • Examples:
    • Perfect pipelining (Aiken and Nicolau)
    • URPR (Su, Ding and Xia)
    • Petri net pipelining (Allan)
• Enhanced pipeline scheduling (Ebcioğlu)
  • fill first cycle of iteration
  • copy this instruction over the backedge

Software pipelining: Modulo scheduling
Example: Modulo scheduling a loop

(a) Example loop
    for (i = 0; i < n; i++)
        a[i+6] = 3 * a[i] - 1;

(b) Code without loop control
    ld  r1,(r2)
    mul r3,r1,3
    sub r4,r3,1
    st  r4,(r5)

(c) Software pipeline (one row per cycle, one column per iteration)
    Prologue   ld  r1,(r2)
               mul r3,r1,3   ld  r1,(r2)
               sub r4,r3,1   mul r3,r1,3   ld  r1,(r2)
    Kernel     st  r4,(r5)   sub r4,r3,1   mul r3,r1,3   ld  r1,(r2)
    Epilogue                 st  r4,(r5)   sub r4,r3,1   mul r3,r1,3
                                           st  r4,(r5)   sub r4,r3,1
                                                         st  r4,(r5)

• Prologue fills the SW pipeline with iterations
• Epilogue drains the SW pipeline
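The prologue / kernel / epilogue structure can be generated mechanically. The sketch below assumes an initiation interval of 1, single-cycle operations, enough function units to run all four operations in parallel, and at least as many iterations as stages; register and address updates are ignored. `software_pipeline` is a hypothetical helper, not part of any tool mentioned here.

```python
def software_pipeline(stages, n_iter):
    """Return one (phase, ops) row per cycle for II = 1.

    ops lists (iteration, operation) pairs issued in that cycle:
    iteration i executes stage t - i in cycle t.
    """
    depth = len(stages)
    rows = []
    for t in range(n_iter + depth - 1):
        ops = [(i, stages[t - i]) for i in range(n_iter) if 0 <= t - i < depth]
        if t < depth - 1:
            phase = "prologue"     # pipeline still filling
        elif t < n_iter:
            phase = "kernel"       # all stages busy
        else:
            phase = "epilogue"     # pipeline draining
        rows.append((phase, ops))
    return rows


body = ["ld r1,(r2)", "mul r3,r1,3", "sub r4,r3,1", "st r4,(r5)"]
rows = software_pipeline(body, n_iter=4)
phases = [phase for phase, _ in rows]
```

With four iterations there is exactly one kernel cycle; each additional iteration adds one more kernel cycle, while prologue and epilogue stay three cycles each.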

Software pipelining: determine II, the Initiation Interval
Cyclic data dependences for the loop
    for (i = 0; ...) A[i+6] = 3*A[i] - 1
[Figure: DDG of ld r1,(r2); mul r3,r1,3; sub r4,r3,1; st r4,(r5). Edges are labelled (delay, distance): a (1,0) edge and a (0,1) edge between each pair of neighbouring operations, plus one (1,6) edge]
Every dependence edge (u,v) imposes:
    cycle(v) ≥ cycle(u) + delay(u,v) - II · distance(u,v)

Modulo scheduling constraints
MII, the minimum initiation interval, is bounded by cyclic dependences and resources:
    MII = max{ ResMII, RecMII }
[Equations: the resource bound (Resources) and the recurrence bound (Cycles, Therefore, Or)]
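A sketch of how the two bounds are commonly computed, using the standard textbook definitions rather than formulas recovered from this slide: ResMII is the largest ratio of operations per resource type to units of that type, and RecMII is the largest ratio of total delay to total distance over all dependence cycles. The resource counts and the direction of the (1,6) edge (from st back to ld) are assumptions made for illustration.

```python
def ceil_div(a, b):
    return -(-a // b)


def res_mii(uses, available):
    """uses[r]: operations per iteration on resource r; available[r]: units."""
    return max(ceil_div(uses[r], available[r]) for r in uses)


def rec_mii(cycles):
    """cycles: list of dependence cycles, each a list of (delay, distance).

    Every cycle must have a total distance of at least 1.
    """
    return max((ceil_div(sum(d for d, _ in c), sum(k for _, k in c))
                for c in cycles), default=1)


def mii(uses, available, cycles):
    return max(res_mii(uses, available), rec_mii(cycles))


# Assumed dependence cycle ld -> mul -> sub -> st -> ld for a[i+6] = 3*a[i] - 1:
# three (1,0) edges inside the iteration, one (1,6) edge across six iterations.
recurrence = [(1, 0), (1, 0), (1, 0), (1, 6)]

# Hypothetical machine: ld and st share one load/store unit.
uses = {"ldst": 2, "mul": 1, "alu": 1}
available = {"ldst": 1, "mul": 1, "alu": 1}

example_mii = mii(uses, available, [recurrence])

# If the loop were a[i+1] = 3*a[i] - 1, the same cycle would have distance 1.
tight_recurrence = [(1, 0), (1, 0), (1, 0), (1, 1)]
tight_mii = mii(uses, available, [tight_recurrence])
```

On this hypothetical machine the distance-6 recurrence does not limit the loop (RecMII = 1), so the shared load/store unit sets MII = 2; with a distance of 1 the recurrence dominates and MII = 4.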

The Role of the Compiler
9 steps required to translate an HLL program:
1. Front-end compilation
2. Determine dependencies
3. Graph partitioning: make multiple threads (or tasks)
4. Bind partitions to compute nodes
5. Bind operands to locations
6. Bind operations to time slots: Scheduling
7. Bind operations to functional units
8. Bind transports to buses
9. Execute operations and perform transports

Division of responsibilities between hardware and compiler
Steps from application to execution: Frontend, Determine Dependencies, Binding of Operands, Scheduling, Binding of Operations, Binding of Transports, Execute.
The compiler is responsible for all steps up to and including the one listed; the hardware is responsible for the rest:

    Architecture      Last step done by the compiler
    Superscalar       Frontend
    Dataflow          Determine Dependencies
    Multi-threaded    Binding of Operands
    Indep. Arch       Scheduling
    VLIW              Binding of Operations
    TTA               Binding of Transports

Hands-on (not this year)
• Map JPEG to a TTA processor
  • see web page: http://www.ics.ele.tue.nl/~heco/courses/pam
• Install TTA tools (compiler and simulator)
• Go through all listed steps
• Perform DSE: design space exploration
• Add SFU
• 1 or 2 page report in 2 weeks

Hands-on
• Let's look at DSE: Design Space Exploration
• We will use the Imagine processor
  • http://cva.stanford.edu/projects/imagine/

Mapping applications to processors: MOVE framework
[Figure: the MOVE framework. An optimizer, steered by user interaction, sends architecture parameters to a parametric compiler and a hardware generator and receives feedback from both. The compiler produces parallel object code and the hardware generator produces the chip; together they form a TTA based system. Inset: Pareto curve (solution space), execution time against cost]

Code generation trajectory for TTAs
• Frontend: GCC or SUIF (adapted)
[Figure: Application (C) → Compiler frontend → Sequential code → Compiler backend → Parallel code. The sequential code is run in a sequential simulation (with input/output), which yields profiling data; the compiler backend also takes an architecture description; the parallel code is run in a parallel simulation (with input/output)]

Exploration: TTA resource reduction

Exploration: TTA connectivity reduction
[Figure: execution time against number of connections removed. Annotations: reducing bus delay; FU stage constrains cycle time; critical connections disappear]

Can we do better?
Yes! How?
• Transformations
• SFUs: Special Function Units
• Multiple Processors
[Figure: execution time against cost]

Transforming the specification
Based on associativity of the + operation:
    a + (b + c) = (a + b) + c
[Figure: a chain of three additions rewritten as a balanced tree of three additions]
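The effect of re-associating a sum can be illustrated by comparing the depth of the two expression trees. This is a minimal sketch with hypothetical helper names; it assumes exact (integer) addition, since floating-point addition is not associative in general.

```python
def chain(terms):
    """Left-to-right sum: ((a + b) + c) + d."""
    expr = terms[0]
    for t in terms[1:]:
        expr = ("+", expr, t)
    return expr


def balanced(terms):
    """Re-associated sum: (a + b) + (c + d)."""
    if len(terms) == 1:
        return terms[0]
    mid = len(terms) // 2
    return ("+", balanced(terms[:mid]), balanced(terms[mid:]))


def depth(expr):
    """Number of additions on the longest path, i.e. cycles at unit latency."""
    if isinstance(expr, tuple):
        return 1 + max(depth(expr[1]), depth(expr[2]))
    return 0


def evaluate(expr, env):
    if isinstance(expr, tuple):
        return evaluate(expr[1], env) + evaluate(expr[2], env)
    return env[expr]


terms = ["a", "b", "c", "d"]
env = {"a": 1, "b": 2, "c": 3, "d": 4}
chain_depth = depth(chain(terms))
tree_depth = depth(balanced(terms))
```

Both trees contain three additions and produce the same value, but the balanced form needs two dependent steps instead of three, provided two adders are available in the first step.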

Transforming the specification
Original:
    d = a * b;
    e = a + d;
    f = 2 * b + d;
    r = f - e;
    x = z + y;
Transformed:
    r = 2*b - a;
    x = z + y;
[Figure: DDG of the transformed code; 2*b is computed as b << 1, followed by one subtraction for r and one addition for x]
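The transformation follows by substitution: r = f - e = (2*b + d) - (a + d) = 2*b - a, so d cancels and the multiplication a*b is no longer needed. The check below compares both versions over a small range of integers; it is a sanity check under Python's unbounded integer arithmetic, not a proof for fixed-width machine arithmetic, and the function names are hypothetical.

```python
from itertools import product


def original(a, b, y, z):
    d = a * b
    e = a + d
    f = 2 * b + d
    r = f - e
    x = z + y
    return r, x


def transformed(a, b, y, z):
    r = (b << 1) - a      # 2*b as a shift, d eliminated
    x = z + y
    return r, x


# Compare both versions on every combination of small inputs.
values = range(-4, 5)
equivalent = all(original(a, b, y, z) == transformed(a, b, y, z)
                 for a, b, y, z in product(values, repeat=4))
```

The transformed code needs one shift, one subtraction and one addition, against two multiplications, three additions and one subtraction in the original.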

Changing the architecture: adding SFUs (special function units)
[Figure: a tree of three 2-input adders replaced by a single 4-input adder]
4-input adder: why is this faster?

Changing the architecture: adding SFUs (special function units)
In the extreme case, put everything into one unit!
Spatial mapping: no control flow
However: no flexibility / programmability!

SFUs: fine grain patterns
• Why use fine grain SFUs:
  • Code size reduction
  • Register file #ports reduction
  • Could be cheaper and/or faster
  • Transport reduction
  • Power reduction (avoid charging non-local wires)
  • Supports whole application domain!
• Which patterns need support?
  • Detection of recurring operation patterns needed

SFUs: covering results

Exploration: resulting architecture
• Architecture for image processing
• Note the reduced connectivity
[Figure: resulting TTA with stream input, stream output, 4 Addercmp FUs, 2 Multiplier FUs, 2 Diffadd FUs, 4 RFs and 9 buses]

Conclusions
• Billions of embedded processing systems
  • How to design these systems quickly, cheaply, correctly, at low power, ...?
  • What will their processing platform look like?
• VLIWs are very powerful and flexible
  • Can be easily tuned to the application domain
• TTAs are even more flexible, scalable, and lower power

Conclusions
• Compilation for ILP architectures is maturing and is entering the commercial area
• However:
  • Great discrepancy between available and exploitable parallelism
  • Advanced code scheduling techniques are needed to exploit ILP

Bottom line:
Do not pay for hardware if you can do it by software!
