
Processor Architectures and Program Mapping

Processor Architectures and Program Mapping. Exploiting ILP part 2: code generation. TU/e 5kk10 Henk Corporaal Jef van Meerbergen Bart Mesman. Overview. Enhance performance: architecture methods Instruction Level Parallelism VLIW Examples C6 TM TTA Clustering Code generation


Presentation Transcript


  1. Processor Architectures and Program Mapping. Exploiting ILP part 2: code generation. TU/e 5kk10. Henk Corporaal, Jef van Meerbergen, Bart Mesman

  2. Overview
     • Enhance performance: architecture methods
     • Instruction Level Parallelism
     • VLIW
     • Examples: C6, TM, TTA
     • Clustering
     • Code generation
     • Hands-on
     Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman

  3. Compiler basics
     • Overview
     • Compiler trajectory / structure / passes
     • Control Flow Graph (CFG)
     • Mapping and Scheduling
     • Basic block list scheduling
     • Extended scheduling scope
     • Loop scheduling

  4. Compiler basics: trajectory
     Source program → Preprocessor → Compiler → Assembler → Loader/Linker → Object program
     • The compiler also produces error messages
     • The loader/linker also takes library code as input

  5. Compiler basics: structure / passes
     Source code
     • Lexical analyzer: token generation
     • Parsing: check syntax, check semantic, parse tree generation
     Intermediate code
     • Code optimization: data flow analysis, local optimizations, global optimizations
     • Code generation: code selection, peephole optimizations
     • Register allocation: making interference graph, graph coloring, spill code insertion, caller / callee save and restore code
     Sequential code
     • Scheduling and allocation: exploiting ILP
     Object code

  6. Compiler basics: structure. Simple compilation example
     Source: position := initial + rate * 60
     • Lexical analyzer: id := id + id * 60
     • Syntax analyzer: parse tree with root :=, whose children are id and +; + has children id and *; * has children id and 60
     • Intermediate code generator: temp1 := intoreal(60); temp2 := id3 * temp1; temp3 := id2 + temp2; id1 := temp3
     • Code optimizer: temp1 := id3 * 60.0; id1 := id2 + temp1
     • Code generator: movf id3, r2; mulf #60, r2, r2; movf id2, r1; addf r2, r1; movf r1, id1

  7. Compiler basics: Control flow graph (CFG)
     C input code: if (a > b) { r = a % b; } else { r = b % a; }
     CFG:
     • Block 1: sub t1, a, b; bgz t1, 2, 3
     • Block 2: rem r, a, b; goto 4
     • Block 3: rem r, b, a; goto 4
     • Block 4: …
     A program is a collection of functions; each function is a collection of basic blocks; each basic block contains a set of instructions; each instruction consists of several transports.

  8. Mapping / Scheduling: placing operations in space and time
     d = a * b; e = a + d; f = 2 * b + d; r = f - e; x = z + y;
     [Figure: Data Dependence Graph (DDG) of these statements, with inputs a, b, 2, z, y and results d, e, f, r, x]
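The dependence graph for the five statements on this slide can be derived mechanically. The following is a minimal sketch, not the course's tooling: it assumes every name is assigned only once, tracks true (read-after-write) dependences only, and splits `f = 2 * b + d` into two operations through a hypothetical temporary `t1`.

```python
# Each operation: (result, operator, operand1, operand2).
# t1 is a hypothetical temporary introduced to split f = 2 * b + d.
OPS = [
    ("d",  "*", "a",  "b"),
    ("e",  "+", "a",  "d"),
    ("t1", "*", "2",  "b"),
    ("f",  "+", "t1", "d"),
    ("r",  "-", "f",  "e"),
    ("x",  "+", "z",  "y"),
]

def build_ddg(ops):
    """Map each result to the set of earlier results it reads (true dependences)."""
    produced = set()
    ddg = {}
    for result, _op, lhs, rhs in ops:
        ddg[result] = {src for src in (lhs, rhs) if src in produced}
        produced.add(result)
    return ddg

ddg = build_ddg(OPS)
for node, preds in ddg.items():
    print(node, "<-", sorted(preds))
```

The operation producing `x` has no predecessors and no successors, which is why a scheduler is free to place it in any cycle.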

  9. How to map these operations?
     • Architecture constraints:
       • One Function Unit
       • All operations single cycle latency
     Schedule (cycle: operation): 1: *, 2: *, 3: +, 4: +, 5: -, 6: +
     [Figure: the DDG of the previous slide next to the schedule]

  10. How to map these operations?
     • Architecture constraints:
       • One Add-sub and one Mul unit
       • All operations single cycle latency
     cycle   Mul   Add-sub
     1       *     +
     2       *     +
     3             +
     4             -
     Cycles 5 and 6 stay empty.
     [Figure: the DDG of slide 8 next to the schedule]

  11. There are many mapping solutions
     [Figure: scatter plot of mapping solutions, execution time T versus cost; the Pareto curve bounds the solution space]
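To make the Pareto curve concrete, the sketch below filters a set of mapping solutions down to the non-dominated ones. The `(cost, execution_time)` pairs are hypothetical values chosen for illustration, not data from the slide; lower is better on both axes.

```python
def pareto_front(points):
    """Return the points that no other point beats or equals on both cost and time."""
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1]
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)

# Hypothetical (cost, execution_time) pairs for seven mapping solutions.
points = [(1, 9), (2, 6), (3, 7), (4, 3), (5, 4), (6, 2), (7, 2)]
front = pareto_front(points)
print(front)
```

Every point that is dropped is both no cheaper and no faster than some point on the front, so only the front needs to be considered when trading cost against execution time.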

  12. Basic Block Scheduling
     • Make a dependence graph
     • Determine minimal length
     • Determine ASAP, ALAP, and slack of each operation
     • Place each operation in the first cycle with sufficient resources
     Note:
     • Scheduling order is sequential
     • Priority is determined by the heuristic used, e.g. slack
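The ASAP, ALAP, and slack values named in these steps can be computed as follows. This is a sketch under stated assumptions: single-cycle latency, cycles numbered from 1, and the dependence graph of slide 8 with a hypothetical temporary `t1` for the `2 * b` product.

```python
# Predecessors of each operation in the DDG of slide 8 (t1 is a hypothetical temporary).
PREDS = {
    "d": [], "t1": [], "x": [],
    "e": ["d"],
    "f": ["t1", "d"],
    "r": ["f", "e"],
}

def asap_alap_slack(preds):
    """Return {op: (asap, alap, slack)} for unit-latency operations."""
    succs = {v: [] for v in preds}
    for v, ps in preds.items():
        for u in ps:
            succs[u].append(v)

    asap = {}
    def earliest(v):
        # One cycle after the latest predecessor, or cycle 1 if there is none.
        if v not in asap:
            asap[v] = max((earliest(u) + 1 for u in preds[v]), default=1)
        return asap[v]

    length = max(earliest(v) for v in preds)  # minimal schedule length

    alap = {}
    def latest(v):
        # One cycle before the earliest successor, or the last cycle if there is none.
        if v not in alap:
            alap[v] = min((latest(s) - 1 for s in succs[v]), default=length)
        return alap[v]

    return {v: (asap[v], latest(v), latest(v) - asap[v]) for v in preds}

table = asap_alap_slack(PREDS)
for op, (a, l, s) in table.items():
    print(f"{op}: <{a},{l}> slack {s}")
```

Operations on the critical path get slack 0; a slack-based priority heuristic schedules those first.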

  13. Basic Block Scheduling
     [Figure: DDG with operations ADD, SUB, NEG, LD, and MUL over inputs A, B, C and results X, y, z. Each operation is annotated with <ASAP cycle, ALAP cycle>, from which its slack follows. Annotations shown: <1,1>, <2,2>, <3,3>, <4,4>, <1,3>, <2,3>, <2,4>, <1,4>]

  14. Cycle based list scheduling
     proc Schedule(DDG = (V,E))
     beginproc
       ready = { v | ¬∃ (u,v) ∈ E }
       ready’ = ready
       sched = ∅
       current_cycle = 0
       while sched ≠ V do
         for each v ∈ ready’ do
           if ¬ResourceConfl(v, current_cycle, sched) then
             cycle(v) = current_cycle
             sched = sched ∪ {v}
           endif
         endfor
         current_cycle = current_cycle + 1
         ready = { v | v ∉ sched ∧ ∀ (u,v) ∈ E, u ∈ sched }
         ready’ = { v | v ∈ ready ∧ ∀ (u,v) ∈ E, cycle(u) + delay(u,v) ≤ current_cycle }
       endwhile
     endproc
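A runnable version of cycle-based list scheduling might look like the sketch below. It is an illustration rather than the course's scheduler: the function and parameter names are hypothetical, cycles are numbered from 1, the priority is simply the order of the `ops` list, and the resource model (one Mul unit, one Add-sub unit, single-cycle latency) is taken from slide 10.

```python
def list_schedule(ops, preds, unit_of, units, delay=1):
    """Assign a cycle to every operation.

    ops      operations in priority order
    preds    {op: list of predecessor ops}
    unit_of  {op: unit type it needs}
    units    {unit type: number of units available per cycle (at least 1)}
    """
    cycle_of = {}
    cycle = 1
    while len(cycle_of) < len(ops):
        free = dict(units)  # units still unused in this cycle
        for v in ops:
            if v in cycle_of:
                continue
            # Data-ready: all predecessors scheduled and their results available.
            data_ready = all(
                u in cycle_of and cycle_of[u] + delay <= cycle for u in preds[v]
            )
            if data_ready and free[unit_of[v]] > 0:
                cycle_of[v] = cycle
                free[unit_of[v]] -= 1
        cycle += 1
    return cycle_of

# DDG of slide 8; t1 is a hypothetical temporary for the 2 * b product.
PREDS = {"d": [], "e": ["d"], "t1": [], "f": ["t1", "d"], "r": ["f", "e"], "x": []}
UNIT_OF = {"d": "mul", "t1": "mul", "e": "addsub", "f": "addsub", "r": "addsub", "x": "addsub"}
ORDER = ["d", "e", "t1", "f", "r", "x"]

schedule = list_schedule(ORDER, PREDS, UNIT_OF, {"mul": 1, "addsub": 1})
print(schedule)
```

With this priority order the result takes four cycles, matching the Mul / Add-sub table of slide 10. Placing `t1` ahead of `d` in `ORDER` delays `e` and `f` and lengthens the schedule to five cycles, which is why the priority heuristic matters.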

  15. Extended basic block scheduling: Code Motion
     • Block A: a) add r4, r4, 4   b) beq . . .
     • Block B: c) add r1, r1, r2
     • Block C: d) sub r1, r1, r2
     • Block D: e) st r1, 8(r4)
     Downward code motions?
     • a → B, a → C, a → D, c → D, d → D
     Upward code motions?
     • c → A, d → A, e → B, e → C, e → A

  16. Extended Scheduling scope
     Code: A; If cond Then B Else C; D; If cond Then E Else F; G;
     CFG (Control Flow Graph): A → B, A → C, B → D, C → D, D → E, D → F, E → G, F → G

  17. Scheduling scopes
     • Trace
     • Superblock
     • Decision tree
     • Hyperblock/region

  18. Code movement (upwards) within regions
     [Figure: an add operation is moved from its source block up to a destination block, passing intermediate blocks (I). Legend: copy needed; intermediate block; check for off-liveness; code movement]

  19. Extended basic block scheduling: Code Motion
     • A dominates B ⇒ A is always executed before B
       • Consequently: A does not dominate B ⇒ code motion from B to A requires code duplication
     • B post-dominates A ⇒ B is always executed after A
       • Consequently: B does not post-dominate A ⇒ code motion from B to A is speculative
     Q1: does C dominate E? Q2: does C dominate D? Q3: does F post-dominate D? Q4: does D post-dominate B?
     [Figure: CFG with blocks A, B, C, D, E, F]
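Dominance and post-dominance can be computed with a simple iterative data-flow pass. The sketch below is an assumption-laden illustration: it uses the CFG of slide 16 (A; if B else C; D; if E else F; G), not the figure of this slide, which the transcript does not preserve, and it assumes every block is reachable from the entry.

```python
def dominators(succs, entry):
    """Return {node: set of nodes that dominate it}, by iterating to a fixed point."""
    nodes = set(succs)
    preds = {n: set() for n in nodes}
    for u, vs in succs.items():
        for v in vs:
            preds[v].add(u)

    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            if not preds[n]:
                continue  # unreachable node; left at its initial value
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

def reverse(succs):
    """Reverse every edge; dominators of the reversed CFG are post-dominators."""
    rev = {n: [] for n in succs}
    for u, vs in succs.items():
        for v in vs:
            rev[v].append(u)
    return rev

# CFG of slide 16.
CFG = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E", "F"],
       "E": ["G"], "F": ["G"], "G": []}

dom = dominators(CFG, "A")
pdom = dominators(reverse(CFG), "G")
print("B dominates D:", "B" in dom["D"])
print("D post-dominates B:", "D" in pdom["B"])
```

In this CFG, B does not dominate D, so moving code from D up into B would also require a copy in C. E does not post-dominate D, so moving code from E up into D would be speculative.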

  20. Scheduling: Loops
     Loop optimizations: loop peeling and loop unrolling
     [Figure: CFGs over blocks A, B, C, D showing the original loop and the transformed versions, with copies C’ and C’’ of the loop body]

  21. Scheduling: Loops
     • Problems with unrolling:
       • Exploits only parallelism within sets of n iterations
       • Iteration start-up latency
       • Code expansion
     [Figure: resource utilization over time for basic block scheduling, basic block scheduling with unrolling, and software pipelining]

  22. Software pipelining
     • Software pipelining a loop is:
       • Scheduling the loop such that iterations start before preceding iterations have finished
       Or:
       • Moving operations across the backedge
     Example: y = a.x, with loop body LD, ML, ST
     • No overlap: 3 cycles/iteration
     • Unrolling (three overlapped iterations: LD | LD ML | LD ML ST | ML ST | ST): 5/3 cycles/iteration
     • Software pipelining: 1 cycle/iteration
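The cycles-per-iteration figures on this slide follow from simple arithmetic, sketched below. The function name is hypothetical, and the model assumes one pipeline stage per cycle and enough resources to start a new iteration every cycle.

```python
from fractions import Fraction

def cycles_per_iteration(stages, overlapped):
    """Average cycles per iteration when `overlapped` iterations are started
    one cycle apart and the pipeline is then allowed to drain."""
    total_cycles = overlapped + stages - 1
    return Fraction(total_cycles, overlapped)

no_overlap = cycles_per_iteration(3, 1)     # plain loop body: LD, ML, ST
unrolled_3 = cycles_per_iteration(3, 3)     # three overlapped iterations
long_run = cycles_per_iteration(3, 1000)    # approaches the software-pipelined rate

print(no_overlap, unrolled_3, float(long_run))
```

Unrolling pays the fill and drain cost once per group of unrolled iterations, whereas software pipelining pays it once for the whole loop, so its rate tends to 1 cycle per iteration.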

  23. Software pipelining (cont’d)
     Basic techniques:
     • Modulo scheduling (Rau, Lam)
       • list scheduling with modulo resource constraints
     • Kernel recognition techniques
       • unroll the loop
       • schedule the iterations
       • identify a repeating pattern
       • Examples:
         • Perfect pipelining (Aiken and Nicolau)
         • URPR (Su, Ding and Xia)
         • Petri net pipelining (Allan)
     • Enhanced pipeline scheduling (Ebcioğlu)
       • fill first cycle of iteration
       • copy this instruction over the backedge

  24. Software pipelining: Modulo scheduling
     Example: Modulo scheduling a loop
     (a) Example loop: for (i = 0; i < n; i++) a[i+6] = 3* a[i] - 1;
     (b) Code without loop control: ld r1,(r2); mul r3,r1,3; sub r4,r3,1; st r4,(r5)
     (c) Software pipeline: four overlapped copies of the loop body (ld, mul, sub, st), divided into Prologue, Kernel, and Epilogue
     • Prologue fills the SW pipeline with iterations
     • Epilogue drains the SW pipeline
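The prologue, kernel, and epilogue structure can be generated for the four-operation loop body with a few lines of code. This sketch assumes an initiation interval of 1 and one operation per pipeline stage, and it ignores register renaming and resource limits; the function name is hypothetical.

```python
STAGES = ["ld r1,(r2)", "mul r3,r1,3", "sub r4,r3,1", "st r4,(r5)"]

def software_pipeline(stages, n_iter):
    """Return, per cycle, the list of (iteration, operation) pairs issued (II = 1)."""
    depth = len(stages)
    rows = []
    for cycle in range(n_iter + depth - 1):
        # Stage s of the pipeline works on the iteration started s cycles earlier.
        row = [(cycle - s, stages[s]) for s in range(depth) if 0 <= cycle - s < n_iter]
        rows.append(row)
    return rows

rows = software_pipeline(STAGES, n_iter=4)
for cycle, row in enumerate(rows):
    if len(row) == len(STAGES):
        phase = "kernel"
    elif cycle < len(STAGES) - 1:
        phase = "prologue"
    else:
        phase = "epilogue"
    print(cycle, phase, [f"i{it}: {op}" for it, op in row])
```

Rows that contain all four operations form the kernel. With more iterations only the number of kernel rows grows, while the prologue and epilogue stay at three rows each.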

  25. Software pipelining: determine II, the Initiation Interval
     Loop: for (i = 0; ...) A[i+6] = 3*A[i] - 1
     Cyclic data dependences, with edges labelled (delay, distance):
     • between ld r1, (r2) and mul r3, r1, 3: (1,0) and (0,1)
     • between mul r3, r1, 3 and sub r4, r3, 1: (1,0) and (0,1)
     • between sub r4, r3, 1 and st r4, (r5): (1,0) and (0,1)
     • edge with distance 6, from A[i+6] to A[i]: (1,6)
     Constraint for every edge (u,v): cycle(v) ≥ cycle(u) + delay(u,v) - II · distance(u,v)

  26. Modulo scheduling constraints
     MII, the minimum initiation interval, is bounded by cyclic dependences and resources:
     MII = max{ ResMII, RecMII }
     [The slide derives the two bounds under the labels Resources, Cycles, Therefore, and Or; the equations are not preserved in the transcript]
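The two bounds can be illustrated on the loop of slide 25. The sketch below uses the textbook formulation: ResMII is the largest ratio of uses to available units, and RecMII is the smallest II for which no dependence cycle has positive total weight delay - II · distance. The edge directions for the (0,1) labels, the resource classes, and all function names are assumptions made for illustration.

```python
from math import ceil

def res_mii(uses, units):
    """Resource bound: all uses of each unit type must fit within II cycles."""
    return max(ceil(uses[r] / units[r]) for r in uses)

def has_positive_cycle(nodes, edges, ii):
    """Longest-path Floyd-Warshall on edge weights delay - ii * distance."""
    neg = float("-inf")
    dist = {u: {v: neg for v in nodes} for u in nodes}
    for u, v, delay, distance in edges:
        dist[u][v] = max(dist[u][v], delay - ii * distance)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] > dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return any(dist[v][v] > 0 for v in nodes)

def rec_mii(nodes, edges):
    """Recurrence bound: smallest II that satisfies every dependence cycle."""
    bound = max(1, sum(delay for _u, _v, delay, _d in edges))
    for ii in range(1, bound + 1):
        if not has_positive_cycle(nodes, edges, ii):
            return ii
    raise ValueError("a dependence cycle has distance 0; no II is feasible")

NODES = ["ld", "mul", "sub", "st"]
EDGES = [  # (from, to, delay, distance); directions of the (0,1) edges are assumed
    ("ld", "mul", 1, 0), ("mul", "sub", 1, 0), ("sub", "st", 1, 0),
    ("mul", "ld", 0, 1), ("sub", "mul", 0, 1), ("st", "sub", 0, 1),
    ("st", "ld", 1, 6),
]
USES = {"mem": 2, "mul": 1, "alu": 1}        # ld and st share the memory unit class
ONE_MEM_UNIT = {"mem": 1, "mul": 1, "alu": 1}  # hypothetical machine

mii = max(res_mii(USES, ONE_MEM_UNIT), rec_mii(NODES, EDGES))
print("RecMII =", rec_mii(NODES, EDGES), "ResMII =", res_mii(USES, ONE_MEM_UNIT), "MII =", mii)
```

For this loop the recurrences allow II = 1, so on the hypothetical machine with a single memory unit the resources are the limiting factor. With two memory units ResMII also drops to 1.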

  27. The Role of the Compiler
     9 steps required to translate an HLL program:
     1. Front-end compilation
     2. Determine dependencies
     3. Graph partitioning: make multiple threads (or tasks)
     4. Bind partitions to compute nodes
     5. Bind operands to locations
     6. Bind operations to time slots: Scheduling
     7. Bind operations to functional units
     8. Bind transports to buses
     9. Execute operations and perform transports

  28. Division of responsibilities between hardware and compiler
     Steps, from application to execution: Frontend, Determine Dependencies, Binding of Operands, Scheduling, Binding of Operations, Binding of Transports, Execute.
     Architecture      Last step that is the responsibility of the compiler
     Superscalar       Frontend
     Dataflow          Determine Dependencies
     Multi-threaded    Binding of Operands
     Indep. Arch       Scheduling
     VLIW              Binding of Operations
     TTA               Binding of Transports
     All remaining steps, including Execute, are the responsibility of the hardware.

  29. Overview
     • Enhance performance: architecture methods
     • Instruction Level Parallelism
     • VLIW
     • Examples: C6, TM, TTA
     • Clustering
     • Code generation
     • Hands-on

  30. Hands-on (not this year)
     • Map JPEG to a TTA processor
       • see web page: http://www.ics.ele.tue.nl/~heco/courses/pam
       • Install TTA tools (compiler and simulator)
       • Go through all listed steps
     • Perform DSE: design space exploration
     • Add SFU
     • 1 or 2 page report in 2 weeks

  31. Hands-on
     • Let’s look at DSE: Design Space Exploration
     • We will use the Imagine processor
       • http://cva.stanford.edu/projects/imagine/

  32. Mapping applications to processors: MOVE framework
     [Figure: the MOVE framework. An optimizer, steered by user interaction, sends architecture parameters to a parametric compiler and a hardware generator and receives feedback from both. The compiler produces parallel object code and the hardware generator produces a chip; together they form a TTA based system. Inset: Pareto curve (solution space), execution time versus cost]

  33. Code generation trajectory for TTAs
     • Frontend: GCC or SUIF (adapted)
     [Figure: Application (C) → Compiler frontend → Sequential code → Compiler backend → Parallel code. Sequential simulation (with input/output) of the sequential code yields profiling data; the compiler backend also takes the architecture description and the profiling data as input; the parallel code is checked by parallel simulation (with input/output)]

  34. Exploration: TTA resource reduction

  35. Exploration: TTA connectivity reduction
     [Figure: execution time versus number of connections removed. Annotations: reducing bus delay; FU stage constrains cycle time; critical connections disappear]

  36. Can we do better? Yes! How?
     • Transformations
     • SFUs: Special Function Units
     • Multiple Processors
     [Figure: execution time versus cost]

  37. Transforming the specification
     Based on associativity of the + operation: a + (b + c) = (a + b) + c
     [Figure: two arrangements of three + operations, before and after the transformation]
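The benefit of reassociation is a shorter dependence chain. The sketch below compares the depth of a left-to-right chain of additions with a balanced tree over the same operands; the tuple-based expression representation and function names are hypothetical. Note that the transformation is exact for integer addition, whereas floating-point addition is not associative in general.

```python
def chain(operands):
    """Left-to-right chain: ((a + b) + c) + d."""
    expr = operands[0]
    for x in operands[1:]:
        expr = (expr, x)
    return expr

def balance(operands):
    """Balanced tree: (a + b) + (c + d)."""
    if len(operands) == 1:
        return operands[0]
    mid = len(operands) // 2
    return (balance(operands[:mid]), balance(operands[mid:]))

def depth(expr):
    """Number of additions on the longest path from an operand to the result."""
    if isinstance(expr, str):
        return 0
    left, right = expr
    return 1 + max(depth(left), depth(right))

ops = ["a", "b", "c", "d"]
print(depth(chain(ops)), depth(balance(ops)))
```

Both forms use three additions, but the balanced form needs only two dependent steps, so with two adders it can finish in two cycles instead of three.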

  38. Transforming the specification
     Original: d = a * b; e = a + d; f = 2 * b + d; r = f - e; x = z + y;
     Transformed: r = 2*b - a; x = z + y;
     [Figure: DDG of the transformed code, computing 2*b as b << 1, followed by the subtraction for r and the addition for x]
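The transformed code is equivalent because d cancels out: r = f - e = (2*b + d) - (a + d) = 2*b - a. The sketch below checks this on a range of integers, using a shift for 2*b as in the figure; it assumes unbounded integer arithmetic, and the function names are hypothetical.

```python
def original(a, b, y, z):
    d = a * b
    e = a + d
    f = 2 * b + d
    r = f - e
    x = z + y
    return r, x

def transformed(a, b, y, z):
    r = (b << 1) - a   # 2*b as a shift; the multiplications have disappeared
    x = z + y
    return r, x

# Exhaustive check on a small range, including negative values.
equivalent = all(
    original(a, b, 3, 4) == transformed(a, b, 3, 4)
    for a in range(-8, 9)
    for b in range(-8, 9)
)
print(equivalent)
```

The transformed version needs one shift, one subtraction, and one addition, and the two results r and x can be computed in parallel.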

  39. Changing the architecture: adding SFUs (special function units)
     4-input adder: why is this faster?
     [Figure: three + operations versus one 4-input adder]

  40. Changing the architecture: adding SFUs (special function units)
     In the extreme case, put everything into one unit!
     • Spatial mapping: no control flow
     • However: no flexibility / programmability!

  41. SFUs: fine grain patterns
     • Why use fine-grain SFUs:
       • Code size reduction
       • Register file #ports reduction
       • Could be cheaper and/or faster
       • Transport reduction
       • Power reduction (avoid charging non-local wires)
       • Supports whole application domain!
     Which patterns need support?
     • Detection of recurring operation patterns needed

  42. SFUs: covering results

  43. Exploration: resulting architecture
     • Architecture for image processing
     • Note the reduced connectivity
     [Figure: stream input, 4 Addercmp FUs, 2 Multiplier FUs, 2 Diffadd FUs, 4 RFs, and stream output, connected by 9 buses]

  44. Conclusions
     • Billions of embedded processing systems
       • how to design these systems quickly, cheaply, correctly, with low power, ...?
       • what will their processing platform look like?
     • VLIWs are very powerful and flexible
       • can be easily tuned to the application domain
     • TTAs are even more flexible, scalable, and lower power

  45. Conclusions
     • Compilation for ILP architectures is maturing and is entering the commercial area
     • However:
       • There is a great discrepancy between available and exploitable parallelism
       • Advanced code scheduling techniques are needed to exploit ILP

  46. Bottom line: Do not pay for hardware if you can do it in software!
