
Topic 2: Fundamentals in Basic Back-End Optimization


joyce




Presentation Transcript


  1. Topic 2: Fundamentals in Basic Back-End Optimization: Instruction Selection, Instruction Scheduling, Register Allocation \course\cpeg421-10F\Topic-2.ppt

  2. ABET Outcome • Ability to apply knowledge of basic code generation techniques, e.g., instruction selection, instruction scheduling, and register allocation, to solve code generation problems. • Ability to analyze the basic algorithms for the above techniques and conduct experiments to show their effectiveness. • Ability to use a modern compiler development platform and tools for the practice of the above. • Knowledge of contemporary issues on this topic.

  3. General Compiler Framework Goals: good IPO; good LNO; good global optimization; good integration of IPO/LNO/OPT; smooth information passing between FE and CG; complete and flexible support of inner-loop scheduling (SWP), instruction scheduling, and register allocation. [Diagram: Source → Front End → Middle End (Inter-Procedural Optimization (IPA) → Loop Nest Optimization (LNO) → Global Optimization (OPT)) → Back End/Code Generator (global instruction scheduling, innermost-loop scheduling, register allocation, local instruction scheduling, guided by architecture models) → Executable]

  4. Three Basic Back-End Optimizations Instruction selection • Mapping IR into assembly code • Assumes a fixed storage mapping & code shape • Combines operations, uses address modes Instruction scheduling • Reordering operations to hide latencies • Assumes a fixed program (set of operations) • Changes demand for registers Register allocation • Deciding which values will reside in registers • Changes the storage mapping; may add false sharing • Concerned with placement of data & memory operations

  5. Instruction Selection Some slides are from a CS 640 lecture at George Mason University

  6. Reading List (1) K. D. Cooper & L. Torczon, Engineering a Compiler, Chapter 11 (2) Dragon Book, Sections 8.7 and 8.9 Some slides are from a CS 640 lecture at George Mason University

  7. Objectives • Introduce the complexity and importance of instruction selection • Study practical issues and solutions • Case study: EBO (Extended Basic block Optimization) in Open64

  8. Instruction Selection: Retargetable [Diagram: Front End → Middle End → Back End; a machine description feeds a back-end generator that emits tables, which a pattern-matching engine in the back-end infrastructure uses for description-based retargeting.] The machine description should also help with scheduling & allocation. This is a simplistic but useful view.

  9. Complexity of Instruction Selection Modern computers have many ways to do things. Consider a register-to-register copy • The obvious operation is: mv rj, ri • Many others exist: add rj, ri, 0; sub rj, ri, 0; rshiftI rj, ri, 0; mul rj, ri, 1; or rj, ri, 0; divI rj, ri, 1; xor rj, ri, 0; others …

  10. Complexity of Instruction Selection (Cont.) • Multiple addressing modes • Each alternative sequence has its own cost • Complex ops (mult, div): several cycles • Memory ops: latencies vary • Sometimes, cost is context-related • Use of under-utilized FUs • Dependent on objectives: speed, power, code size

  11. Complexity of Instruction Selection (Cont.) • Additional constraints on specific operations • Load/store multiple words: contiguous registers • Multiply: needs a special accumulator register • Interaction between instruction selection, instruction scheduling, and register allocation • For scheduling, instruction selection predetermines latencies and functional units • For register allocation, instruction selection pre-colors some variables, e.g. non-uniform registers (such as registers for multiplication)

  12. Instruction Selection Techniques Tree Pattern-Matching • Tree-oriented IR suggests pattern matching on trees • Tree-patterns as input, matcher as output • Each pattern maps to a target-machine instruction sequence • Use dynamic programming or bottom-up rewrite systems Peephole-based Matching • Linear IR suggests using some sort of string matching • Inspired by peephole optimization • Strings as input, matcher as output • Each string maps to a target-machine instruction sequence In practice, both work well; matchers are quite different.

  13. A Simple Tree-Walk Code Generation Method • Assume starting with a Tree-like IR • Starting from the root, recursively walking through the tree • At each node use a simple (unique) rule to generate a low-level instruction
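The tree-walk method can be sketched in a few lines of Python. This is a hypothetical toy, not any real compiler's code generator: node kinds, instruction mnemonics, and the NextRegister++ counter are all illustrative. Each node kind has exactly one rule, so no selection decisions are made.

```python
import itertools
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                      # 'num', 'var', '+', or '*'
    val: object = None           # constant value or variable name
    kids: list = field(default_factory=list)

def gen(root):
    """Tree-walk code generation: one fixed rule per node kind,
    registers handed out with a NextRegister++ counter."""
    code, counter = [], itertools.count(1)
    def walk(n):
        if n.op == 'num':                       # constant -> load immediate
            r = f"r{next(counter)}"
            code.append(f"loadI {n.val} => {r}")
            return r
        if n.op == 'var':                       # variable -> load from memory
            r = f"r{next(counter)}"
            code.append(f"load {n.val} => {r}")
            return r
        r1 = walk(n.kids[0])                    # recurse on children first,
        r2 = walk(n.kids[1])                    # then combine their results
        r = f"r{next(counter)}"
        op = {'+': 'add', '*': 'mult'}[n.op]
        code.append(f"{op} {r1}, {r2} => {r}")
        return r
    walk(root)
    return code

# a + 2*b
ast = Node('+', kids=[Node('var', 'a'),
                      Node('*', kids=[Node('num', 2), Node('var', 'b')])])
code = gen(ast)
# code == ['load a => r1', 'loadI 2 => r2', 'load b => r3',
#          'mult r2, r3 => r4', 'add r1, r4 => r5']
```

Because every node maps to exactly one rule, the output is correct but naive; the tiling methods on the following slides exist precisely to pick cheaper instruction sequences.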

  14. Tree Pattern-Matching • Assumptions • A tree-like IR – an AST • For each subtree of the IR, there is a corresponding set of tree patterns (or “operation trees” – low-level abstract syntax trees) • Problem formulation: find a best mapping of the AST to operations by “tiling” the AST with operation trees (where a tiling is a collection of (AST-node, operation-tree) pairs).

  15. Tile AST [Figure: an example AST (a “gets” tree built from -, +, *, ref, val, num, and lab nodes) covered by six tiles, Tile 1 through Tile 6.]

  16. Tile AST with Operation Trees Goal is to “tile” the AST with operation trees. • A tiling is a collection of <ast-node, op-tree> pairs ◊ ast-node is a node in the AST ◊ op-tree is an operation tree ◊ <ast-node, op-tree> means that op-tree could implement the subtree at ast-node • A tiling “implements” an AST if it covers every node in the AST and the overlap between any two trees is limited to a single node ◊ <ast-node, op-tree> in the tiling means ast-node is also covered by a leaf in another operation tree in the tiling, unless it is the root ◊ Where two operation trees meet, they must be compatible (i.e., they expect the value in the same location)

  17. Tree Walk by Tiling: An Example a = a + 22; [Figure: AST for the assignment: MOVE at the root; its left child is the address computation + (SP, a); its right child is + (MEM(+(SP, a)), 22).]

  18. Example a = a + 22; Tiling the AST yields: ld t1, [sp+a]; add t2, t1, 22; add t3, sp, a; st [t3], t2

  19. Example: An Alternative a = a + 22; A different tiling folds the address computation into the store: ld t1, [sp+a]; add t2, t1, 22; st [sp+a], t2

  20. Finding Matches to Tile the Tree • Compiler writer connects operation trees to AST subtrees ◊ Provides a set of rewrite rules ◊ Encodes tree syntax in linear form ◊ Associated with each is a code template

  21. Generating Code in Tilings Given a tiled tree •Postorder treewalk, with node-dependent order for children ◊ Do right child before its left child •Emit code sequence for tiles, in order •Tie boundaries together with register names ◊ Can incorporate a “real” register allocator or can simply use “NextRegister++” approach

  22. Optimal Tilings • The best tiling corresponds to the least-cost instruction sequence • Optimal tiling: no two adjacent tiles can be combined into a tile of lower cost

  23. Dynamic Programming for Optimal Tiling • For a node x, let f(x) be the cost of the optimal tiling for the whole expression tree rooted at x. Then f(x) = min over all tiles T covering x of ( cost(T) + Σ f(y), summed over each child y of tile T )

  24. Dynamic Programming for Optimal Tiling (Con’t) • Maintain a table mapping each node x to the optimal tiling covering x and its cost • Start from the root recursively: • Check the table for an optimal tiling for this node • If not yet computed, try all possible tilings, find the optimal one, store the lowest-cost tile in the table, and return • Finally, use the entries in the table to emit code
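A minimal sketch of this dynamic program, assuming a hypothetical toy tile set (the tile names, patterns, and costs are illustrative, not from any real target). Patterns are nested tuples; '?' is a wildcard leaf marking where a child subproblem y begins, and the memo table from the slide is provided by `lru_cache`:

```python
from functools import lru_cache

# Hypothetical tile set; names, patterns, and costs are illustrative.
TILES = [
    ("reg",    ('reg',),                      0),  # value already in a register
    ("loadI",  ('num',),                      1),
    ("add",    ('+', '?', '?'),               1),
    ("load",   ('mem', '?'),                  3),
    ("loadAI", ('mem', ('+', '?', ('num',))), 3),  # folds reg+offset addressing
]

def match(pat, tree, subs):
    """Match pat against tree; append wildcard subtrees (the children y) to subs."""
    if pat == '?':
        subs.append(tree)
        return True
    return (isinstance(tree, tuple) and len(pat) == len(tree)
            and pat[0] == tree[0]
            and all(match(p, t, subs) for p, t in zip(pat[1:], tree[1:])))

@lru_cache(maxsize=None)          # the table of already-computed nodes
def f(tree):
    """f(x) = min over tiles T covering x of (cost(T) + sum of f(y))."""
    best = None
    for _name, pat, cost in TILES:
        subs = []
        if match(pat, tree, subs):
            total = cost + sum(f(y) for y in subs)
            if best is None or total < best:
                best = total
    return best

# MEM(reg + num): the loadAI tile costs 3, beating load + add + loadI = 5
best_cost = f(('mem', ('+', ('reg',), ('num',))))
```

The recursion mirrors the formula on slide 23 directly: each matching tile contributes its own cost plus the optimal costs of the subtrees hanging off its wildcard leaves.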

  25. Peephole-based Matching • Basic idea inspired by peephole optimization • Compiler can discover local improvements ◊ Look at a small set of adjacent operations ◊ Move a “peephole” over the code & search for improvement • A classic example is a store followed by a load: Original code: st $r1,($r0); ld $r2,($r0) Improved code: st $r1,($r0); move $r2,$r1
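The store-followed-by-load rewrite can be sketched as a tiny two-instruction window matcher. This is an illustrative sketch, not a real selector: instructions are modeled as (opcode, register, address) triples.

```python
# Slide a two-instruction peephole over the code: a load from an address
# that was just stored to becomes a register copy.
def peephole(code):
    out, i = [], 0
    while i < len(code):
        if i + 1 < len(code):
            op1, r1, addr1 = code[i]
            op2, r2, addr2 = code[i + 1]
            if op1 == "st" and op2 == "ld" and addr1 == addr2:
                out.append(("st", r1, addr1))
                out.append(("move", r2, r1))   # redundant load -> copy
                i += 2
                continue
        out.append(code[i])
        i += 1
    return out

before = [("st", "$r1", "($r0)"), ("ld", "$r2", "($r0)")]
after = peephole(before)
# after == [("st", "$r1", "($r0)"), ("move", "$r2", "$r1")]
```

Real peephole systems hold many such patterns and, as the next slides show, split the work into expansion, simplification, and matching phases.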

  26. Implementing Peephole Matching Early systems used a limited set of hand-coded patterns; the window size ensured quick processing. Modern peephole instruction selectors break the problem into three tasks: Expander (IR → LLIR), Simplifier (LLIR → LLIR), Matcher (LLIR → ASM). LLIR: Low-Level IR; ASM: Assembly Code

  27. Implementing Peephole Matching (Con’t) (Expander: IR → LLIR; Simplifier: LLIR → LLIR; Matcher: LLIR → ASM) Expander •Turns IR code into a low-level IR (LLIR) •Operation-by-operation, template-driven rewriting •LLIR form includes all direct effects •Significant, albeit constant, expansion of size Simplifier •Looks at the LLIR through a window and rewrites it •Uses forward substitution, algebraic simplification, local constant propagation, and dead-effect elimination •Performs local optimization within the window •This is the heart of the peephole system; the benefit of peephole optimization shows up in this step Matcher •Compares simplified LLIR against a library of patterns •Picks the low-cost pattern that captures the effects •Must preserve LLIR effects, may add new ones •Generates the assembly code output

  28. Some Design Issues of Peephole Optimization • Dead values • Recognizing dead values is critical to removing useless effects, e.g., condition codes • Expander • Constructs a list of dead values for each low-level operation by a backward pass over the code • Example: consider the code sequence: r1 = ri * rj; cc = fx(ri, rj) // is this cc dead? r2 = r1 + rk; cc = fx(r1, rk)
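The backward pass the expander can use is sketched below. As an assumption for illustration, each operation is reduced to a (defs, uses) pair of value names; walking backward, any value defined but not live immediately afterwards is dead, so its effect (e.g. a condition-code write) can be discarded.

```python
def mark_dead(ops, live_out=frozenset()):
    """For each op, compute the set of values it defines that are dead."""
    live = set(live_out)                 # values live after the current op
    dead_after = [None] * len(ops)
    for i in range(len(ops) - 1, -1, -1):
        defs, uses = ops[i]
        dead_after[i] = {d for d in defs if d not in live}
        live = (live - set(defs)) | set(uses)   # standard liveness update
    return dead_after

# The slide's sequence: the multiply's cc is dead because the add
# redefines cc before anyone reads it.
ops = [
    ({"r1", "cc"}, {"ri", "rj"}),        # r1 = ri * rj ; cc = fx(ri, rj)
    ({"r2", "cc"}, {"r1", "rk"}),        # r2 = r1 + rk ; cc = fx(r1, rk)
]
dead = mark_dead(ops, live_out={"r2", "cc"})
# dead == [{"cc"}, set()] -- the first cc definition has no use
```

This answers the slide's question: the first `cc = fx(ri, rj)` is dead and the simplifier may delete that effect.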

  29. Some Design Issues of Peephole Optimization (Cont.) • Control flow and predicated operations • A simple way: clear the simplifier’s window when it reaches a branch, a jump, or a labeled or predicated instruction • A more aggressive way: to be discussed next

  30. Some Design Issues of Peephole Optimization (Cont.) • Physical vs. Logical Window • The simplifier uses a window containing adjacent low-level operations • However, adjacent operations may not operate on the same values • In practice, they tend to be independent, for parallelism or resource-usage reasons

  31. Some Design Issues of Peephole Optimization (Cont.) • Use a Logical Window • The simplifier can link each definition with the next use of its value in the same basic block • The simplifier is largely based on forward substitution • No need for operations to be physically adjacent • More aggressively, extend to larger scopes beyond a basic block.
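Building those definition-to-next-use links takes a single forward scan over the block. A sketch, assuming an illustrative (dest, opcode, operands) encoding for operations:

```python
def next_use_links(block):
    """Link each definition to the next use of its value in the block,
    so the simplifier can pair ops that are not physically adjacent."""
    links = {}       # def index -> index of the next use of its value
    last_def = {}    # value name -> index of its most recent definition
    for i, (dest, _op, srcs) in enumerate(block):
        for s in srcs:
            if s in last_def and last_def[s] not in links:
                links[last_def[s]] = i   # first (i.e. next) use wins
        last_def[dest] = i
    return links

block = [
    ("r11", "loadI", ["@y"]),         # 0
    ("r12", "add",   ["r0", "r11"]),  # 1: next use of r11
    ("r13", "load",  ["r12"]),        # 2: next use of r12
]
links = next_use_links(block)
# links == {0: 1, 1: 2}
```

With these links, the logical window puts a definition and its next use side by side even when other, independent operations sit between them.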

  32. An Example Expand. LLIR Code: r10 ← 2; r11 ← @y; r12 ← r0 + r11; r13 ← MEM(r12); r14 ← r10 * r13; r15 ← @x; r16 ← r0 + r15; r17 ← MEM(r16); r18 ← r17 - r14; r19 ← @w; r20 ← r0 + r19; MEM(r20) ← r18, where @x, @y, @w are the offsets of x, y, and w from a global location stored in r0. (Annotations: r12 holds y’s memory address, r20 holds w’s, r13 holds y, r14 holds t1, r17 holds x, r18 holds w.)

  33. An Example (Con’t) Simplify. The 12-operation LLIR code from slide 32 becomes: r13 ← MEM(r0 + @y); r14 ← 2 * r13; r17 ← MEM(r0 + @x); r18 ← r17 - r14; MEM(r0 + @w) ← r18

  34. An Example (Con’t) Match. The simplified LLIR code maps to ILOC assembly: loadAI r0, @y ⇒ r13; multI r13, 2 ⇒ r14; loadAI r0, @x ⇒ r17; sub r17, r14 ⇒ r18; storeAI r18 ⇒ r0, @w. This introduced all memory operations & temporary names, and turned out pretty good code.

  35.–45. Simplifier (3-operation window), stepping through the LLIR code of slide 32. At each step the window holds three operations; forward substitution folds a value into its use, the dead definition drops out, and the oldest operation eventually rolls out of the window as emitted code.
  35. Window: r10 ← 2; r11 ← @y; r12 ← r0 + r11 ⇒ r10 ← 2; r12 ← r0 + @y; r13 ← MEM(r12)
  36. Window: r10 ← 2; r12 ← r0 + @y; r13 ← MEM(r12) ⇒ r10 ← 2; r13 ← MEM(r0 + @y); r14 ← r10 * r13
  37. Window: r10 ← 2; r13 ← MEM(r0 + @y); r14 ← r10 * r13 ⇒ r13 ← MEM(r0 + @y); r14 ← 2 * r13; r15 ← @x
  38. The 1st op, r13 ← MEM(r0 + @y), has rolled out of the window (emitted). Window: r14 ← 2 * r13; r15 ← @x; r16 ← r0 + r15
  39. Window ⇒ r14 ← 2 * r13; r16 ← r0 + @x; r17 ← MEM(r16)
  40. Window ⇒ r14 ← 2 * r13; r17 ← MEM(r0 + @x); r18 ← r17 - r14
  41. Emitted so far: r13, r14. Window: r17 ← MEM(r0 + @x); r18 ← r17 - r14; r19 ← @w
  42. Emitted so far: r13, r14, r17. Window: r18 ← r17 - r14; r19 ← @w; r20 ← r0 + r19
  43. Window ⇒ r18 ← r17 - r14; r20 ← r0 + @w; MEM(r20) ← r18
  44. Window ⇒ r18 ← r17 - r14; MEM(r0 + @w) ← r18
  45. Final window state (no further simplification): r18 ← r17 - r14; MEM(r0 + @w) ← r18

  46. An Example (Con’t) Simplify. After the window passes, the 12-operation LLIR code from slide 32 has become: r13 ← MEM(r0 + @y); r14 ← 2 * r13; r17 ← MEM(r0 + @x); r18 ← r17 - r14; MEM(r0 + @w) ← r18
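The net effect of the simplifier on this example (forward-substituting constants, symbolic offsets, and address arithmetic into their single use) can be sketched as a fixpoint over nested-tuple expressions. Note this models only the substitution effect, not the 3-operation window mechanics, and the (dest, expr) representation is illustrative:

```python
def count_uses(name, expr):
    """Occurrences of register `name` inside a nested-tuple expression."""
    if expr == name:
        return 1
    if isinstance(expr, tuple):
        return sum(count_uses(name, e) for e in expr)
    return 0

def subst(expr, name, repl):
    """Replace register `name` by expression `repl` inside expr."""
    if expr == name:
        return repl
    if isinstance(expr, tuple):
        return tuple(subst(e, name, repl) for e in expr)
    return expr

def forward_substitute(block):
    """Fold constants, symbolic offsets, and address arithmetic ('add')
    into their single use; loads, mul, and sub stay as real operations."""
    changed = True
    while changed:
        changed = False
        for i, (dest, expr) in enumerate(block):
            foldable = not isinstance(expr, tuple) or expr[0] == "add"
            if dest is None or not foldable:
                continue                      # stores and real ops stay put
            uses = [j for j, (_d, e) in enumerate(block) if count_uses(dest, e)]
            if len(uses) == 1:                # single use: substitute & drop
                j = uses[0]
                block[j] = (block[j][0], subst(block[j][1], dest, expr))
                del block[i]
                changed = True
                break
    return block

# The 12-operation LLIR code of slide 32 as (dest, expr) pairs;
# dest None marks the store's side effect.
block = [
    ("r10", 2),
    ("r11", "@y"),
    ("r12", ("add", "r0", "r11")),
    ("r13", ("load", "r12")),
    ("r14", ("mul", "r10", "r13")),
    ("r15", "@x"),
    ("r16", ("add", "r0", "r15")),
    ("r17", ("load", "r16")),
    ("r18", ("sub", "r17", "r14")),
    ("r19", "@w"),
    ("r20", ("add", "r0", "r19")),
    (None,  ("store", "r20", "r18")),
]
simplified = forward_substitute(block)   # 5 operations remain
```

Running it reproduces the five-operation result above: two loads with folded addresses, the multiply, the subtract, and the store.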

  47. Making It All Work • LLIR is largely machine independent • Target machine described as LLIR → ASM patterns • Actual pattern matching: use a hand-coded pattern matcher, or turn the patterns into a grammar & use an LR parser • Several important compilers use this technology • It seems to produce good portable instruction selectors • Key strength appears to be late low-level optimization

  48. Case Study: LLIR and Instruction Selection • LLIR stands for low-level intermediate representation • Examples: GCC RTL, Open64 CGIR • Instruction selection techniques in real compilers: • GCC: peephole-based matching • LCC: tree pattern-matching method

  49. Case Study: Peephole Optimization in KCC/Open64 • Aggressive simplifier using a logical window that spans a set of basic blocks (Extended Basic Blocks: EBO) • Basis • One definition for multiple uses • Constant propagation • Forward substitution • Others: pattern matching

  50. KCC/Open64: Where Does Instruction Selection Happen? [Compilation flow, roughly:] Front End: C++/C/Fortran source → Scanner → Parser → RTL → WHIRL (gfec, gfecc, f90; GCC-based compile, guided by a machine description). Very High WHIRL: VHO (Very High WHIRL Optimizer), standalone inliner, W2C/W2F; then lowering. Middle End: IPA (IPL, pre-IPA; IPA_LINK, main IPA: analysis and optimization); LNO on High WHIRL (loop unrolling, loop reversal, loop fission, loop fusion, loop tiling, loop peeling, …; DDG; W2C/W2F); lowering to Middle WHIRL; PREOPT/WOPT in SSA form: SSAPRE (Partial Redundancy Elimination), VNFRE (Value Numbering based Full Redundancy Elimination), RVI-1 (Register Variable Identification); lowering to Low WHIRL (with a machine model): RVI-2, IVR (Induction Variable Recognition), some peephole optimization; lowering to Very Low WHIRL. Back End: WHIRL-to-TOP lowering into CGIR (CFG/DDG); Cflow (control-flow opt), HBS (hyperblock scheduling), EBO (Extended Block Opt.), GCM (Global Code Motion), PQS (Predicate Query System), SWP, loop unrolling; then IGLS (pre-pass), GRA, LRA, IGLS (post-pass) → Assembly Code. • IGLS (Global and Local Instruction Scheduling) • GRA (Global Register Allocation) • LRA (Local Register Allocation)
