Advanced Register Allocation Strategies in Compiler Optimization

Register Allocation btw, there are lots of examples, but I will probably forget to stop and let everyone digest or ask questions. Please feel free to stop me if I do.

Overview • Base case algorithm • Optimization goals • Improvements at basic block level • Improvements at function level • Practicalities • Going further

Register allocation • Registers hold values • Sometimes only certain types • Used for the input and output of instructions • Exclusively, in the case of RISC architectures • There are not many • Generally fewer for CISCs than RISCs • Need to swap values in and out of memory • “Fills” and “spills” • Optimization problem: minimize # fills, spills • Hard (in the human and theoretical sense)

Pseudo-registers • Isolate the complexity of registers with p-regs • Postpone the decision of what to put in registers • The idea: pretend you have infinite registers • Simplifies AST → IR • Simplifies high-level IR optimizations • Infinite, but not dynamically addressable • Each use refers to one static p-reg, no indirection: add $1001, $999, $1000; mul $1002, $1001, $1000 • Contrast with “the store” • Allows loads from dynamic addresses: lw $2, 4($1)

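[Figure: pseudo-registers ($41, $42, ...) next to the store (addresses 0xffff0, 0xffff4, 0xffff8, 0xffffc). $41 holds the address 0xffff0; lw $42, 4($41) copies the value 7 from the store into $42.]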

High-level organization • Pipeline: lex, parse, check, IR codegen → optimization → register allocation → low-level optimization / ISA codegen • Before register allocation: use p-regs; put as much work as possible here… • After register allocation: … and not here

Pseudo-registers • After high-level optimization, IR → ISA • No more pseudo-registers • The idea: • Not enough registers: “spill” to memory • Need spilled contents: “fill” from memory • Important distinction: • Loads/stores (IR) vs. fills/spills (Regalloc) • For example: • [$1] := $2 • IR: 1 store, 2 reads • Regalloc: 1 store, 0-2 fills • $2 := [$1] • IR: 1 load, 1 read, 1 write • Regalloc: 1 load, 0-1 fills, 0-1 spills

Simplest approach • Give every pseudo-register a home • The home is in memory, but separate from the store • Keep things in registers for as little time as possible • Every write to a p-reg spills • Every read from a p-reg fills • Efficient? • Extremely • Just kidding, this is the worst possible

Example • Source: A[x] = 3 • IR: mul p, x, 4; add q, p, A; [q] := 3 • Code generated: lw $t1, x($fp); muli $t2, $t1, #4; sw $t2, p($fp); lw $t1, p($fp); lw $t2, A($fp); add $t3, $t1, $t2; sw $t3, q($fp); ldi $t1, 3; lw $t2, q($fp); sw $t1, 0($t2) • But on the bright side, there only need be as many registers as there are operands.
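The "every p-reg has a home" scheme can be sketched in a few lines of Python. This is only an illustration: the tuple-based IR, the `homes` offset table, and the function name `naive_codegen` are assumptions made for the sketch, and unlike the slide it loads constants with `ldi` instead of using an immediate form such as `muli`.

```python
# Hypothetical three-address IR: (op, dest, src1, src2, ...).
# Sources are p-reg names (str) or integer constants.
def naive_codegen(ir, homes):
    """Emit code in which every p-reg read fills and every p-reg write spills.

    homes maps each p-reg name to its (assumed) frame offset.
    """
    out = []
    for op, dest, *srcs in ir:
        regs = []
        for i, s in enumerate(srcs):
            r = f"$t{i + 1}"
            if isinstance(s, int):
                out.append(f"ldi {r}, {s}")             # constant operand
            else:
                out.append(f"lw {r}, {homes[s]}($fp)")  # fill from home
            regs.append(r)
        d = f"$t{len(srcs) + 1}"
        out.append(f"{op} {d}, {', '.join(regs)}")
        out.append(f"sw {d}, {homes[dest]}($fp)")       # spill to home
    return out


ir = [("mul", "p", "x", 4), ("add", "q", "p", "A")]
homes = {"x": 0, "p": 4, "A": 8, "q": 12}
for line in naive_codegen(ir, homes):
    print(line)
```

Note that `p` is stored and immediately reloaded, which is exactly the redundancy the later slides try to remove.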

Optimization goals • Clearly room for improvement • Want to minimize fills and spills • Same constraint as any optimization: • Preserve the observable behavior • That means for any execution path

Basic block level • Keep track of what p-reg is in what register • Avoid obviously redundant fill/fill, spill/fill • Handle multiple incoming paths to BB entry: • Reset records at beginning of BB • Handle multiple paths from BB to exit: • Spill every p-reg currently in a register • Described in more detail in the Dragon Book

Function level • Global register allocation • Old term, “global” means intra-procedural • Minor improvement: • Use dataflow analysis • Live variables • Avoid useless spills: • Do not spill if not live • At end of BB • When spilling to make room

Function level • ...and now the big algorithm • Major improvement • Same idea at the heart of modern compilers • Actually, two approaches: • Top-down • Bottom-up (a.k.a. graph coloring)

Global register allocation • Top-down register allocation [Chow, 84] • “Use high-level information to make allocation decisions.” • Priority function determines ordering • More pessimistic: assumes nothing live at the start • More conservative: coarser definition of interference • [Briggs, 92] found O(n log n) for bottom-up and O(n^2) for top-down • Research appears to favor bottom-up (# papers) • ? Industry too ?

Global register allocation • Bottom-up register allocation [Chaitin, 81][Briggs, 94] • Step 0: dataflow analyses • Step 1: build webs • Step 2: build interference graph • Step 3: coalesce • Step 4: compute spill costs • Step 5: color • Step 6: spill

Step 0: dataflow analyses • Given: IR • Build the CFG • Find reaching definitions • Find live variables
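As a rough illustration of the live-variables part of Step 0, here is a small iterative (round-robin) solver. The CFG encoding (`blocks` as lists of `(defs, uses)` pairs and a `succ` map) is a hypothetical representation chosen for the sketch, not something the slides prescribe.

```python
def live_variables(blocks, succ):
    """Backward dataflow: live_in[b] = use[b] | (live_out[b] - defs[b]).

    blocks: {block name: [(defs, uses), ...]} in statement order.
    succ:   {block name: [successor block names]}.
    Returns (live_in, live_out) as dicts of sets.
    """
    use, defs = {}, {}
    for b, stmts in blocks.items():
        u, d = set(), set()
        for s_defs, s_uses in stmts:
            u |= set(s_uses) - d      # read before any write in this block
            d |= set(s_defs)
        use[b], defs[b] = u, d

    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:                    # iterate to a fixed point
        changed = False
        for b in blocks:
            out = set().union(*(live_in[s] for s in succ.get(b, [])))
            inn = use[b] | (out - defs[b])
            if out != live_out[b] or inn != live_in[b]:
                live_out[b], live_in[b] = out, inn
                changed = True
    return live_in, live_out


blocks = {
    "entry": [(["x"], [])],                        # def x
    "loop":  [(["y"], ["x"]), (["x"], ["x"])],     # y := f(x); x := g(x)
    "exit":  [([], ["y"])],                        # use y
}
succ = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}
live_in, live_out = live_variables(blocks, succ)
```

Reaching definitions can be computed with the same loop structure run forward over predecessors instead of backward over successors.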

Step 0: dataflow analyses [Figure: example CFG with several defs and uses of x]

Step 0: dataflow analyses • Live variables [Figure: the example CFG annotated with where x is live]

Step 0: dataflow analyses • Reaching defs [Figure: the example CFG annotated with which defs of x reach each use]

Step 1: build webs • A web is: • a set of statements whose definitions and uses of a given pseudo-register must share a physical register • (In the classic approach) all or nothing: • All reads/writes fill/spill or none do • Cardinality: one pseudo-register can have many webs (1 to *); many webs can share one physical register (* to 1)

Step 1: build webs • Web building approach: • Initially: • Each use and def points to a web containing only itself • For each statement: • For each use U of a p-reg in the statement: • For each reaching definition D (from step 0): • Merge D’s and U’s webs
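The merge loop above maps naturally onto a union-find structure. The following is a minimal sketch for the defs and uses of a single p-reg; the string ids such as `"d1"` and `"u1"` and the function name `build_webs` are made up for the example.

```python
def build_webs(defs, uses_reaching):
    """Group defs and uses of one p-reg into webs.

    defs:          iterable of definition ids.
    uses_reaching: {use id: [ids of definitions that reach this use]}.
    Returns a list of webs, each a set of def/use ids.
    """
    # Initially every def and use is in a web of its own.
    parent = {n: n for n in list(defs) + list(uses_reaching)}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # path halving
            n = parent[n]
        return n

    # Merge each use's web with the web of every def that reaches it.
    for u, reaching in uses_reaching.items():
        for d in reaching:
            parent[find(d)] = find(u)

    webs = {}
    for n in parent:
        webs.setdefault(find(n), set()).add(n)
    return list(webs.values())


webs = build_webs(
    defs=["d1", "d2", "d3"],
    uses_reaching={"u1": ["d1", "d2"], "u2": ["d3"]},
)
# d1 and d2 both reach u1, so they end up in one web; d3/u2 form another.
```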

Step 1: build webs • Webs [Figure: the example CFG with the defs and uses of x grouped into webs]

Step 2: interference graph • An interference graph is a graph where: • Nodes are webs (step 1) • Edges connect webs that cannot occupy the same physical register • Overly conservative approach: • Two webs interfere if they are both live at any statement • Better: • Two webs interfere if one is live at the other’s definition

Step 2: interference graph Interference graph building approach: • For each web W: • For each defining statement S in W: • For each reaching and live (step 0) definition D at statement S that is not in W: • W interferes with the web containing D Store results as both: • Triangular adjacency matrix: • Efficient form for coalescing step • Adjacency list: • Efficient form for coloring step
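A simplified sketch of the "live at the other's definition" rule follows. It assumes liveness has already been mapped onto webs, so each statement is given as a hypothetical pair `(defined_web, live_after)`. A set of unordered pairs stands in for the triangular bit matrix; a real allocator would more likely use a packed bit array.

```python
def build_interference(stmts):
    """Build both graph representations from per-statement liveness.

    stmts: list of (defined_web_or_None, live_after) where live_after is the
           set of webs live immediately after the statement.
    Returns (adj, matrix): adjacency lists and a set of unordered pairs.
    """
    adj = {}
    matrix = set()
    for web, live_after in stmts:
        if web is None:               # statement defines nothing
            continue
        adj.setdefault(web, set())
        for other in live_after:
            if other == web:
                continue
            adj.setdefault(other, set())
            adj[web].add(other)       # adjacency list: used when coloring
            adj[other].add(web)
            matrix.add(frozenset((web, other)))  # O(1) "do a and b interfere?"
    return adj, matrix


stmts = [
    ("a", {"a"}),         # def a
    ("b", {"a", "b"}),    # def b while a is live -> a and b interfere
    ("c", {"b", "c"}),    # def c after last use of a -> only b and c interfere
    (None, set()),
]
adj, matrix = build_interference(stmts)
```

The membership test `frozenset((x, y)) in matrix` is the query coalescing needs, while `adj[x]` is what the coloring loop walks.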

Step 2: interference graph [Figure: example CFG with defs and uses of x, y, and z, and its interference graph]

Step 2: interference graph • Alive-at-def vs. alive-at-same-statement [Figure: the same example comparing the two definitions of interference]

Step 3: coalesce • Given a copy statement: a := b • If a’s web and b’s web do not interfere: • All uses of a can be replaced with b or vice versa • a or b could be fixed (parameter or return register) • Eliminate the copy instruction • Redundant copies are often introduced by optimizations • Can have a negative effect: • Live range is longer → less coloring flexibility → more spilling • Optimistic coalescing [Park, 98] • Changes interference graph • Just merging edges is too conservative • Need to go back to step 2

Step 4: compute spill costs • Order webs by how expensive it is to spill • Take into account: • Number of uses and defs • Loop nesting depth • Possibility of rematerialization • This is a heuristic • Cannot generally know branch frequency, loop trip count (without profiling data)
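One common shape for such a heuristic weights each def or use by a power of its loop nesting depth. The sketch below is only an illustration: the base of 10 per loop level and the 0.5 discount for rematerializable webs are arbitrary assumptions, not values taken from the slides.

```python
def spill_cost(ref_depths, rematerializable=False,
               loop_weight=10, remat_discount=0.5):
    """Estimate how expensive it is to spill a web.

    ref_depths: loop nesting depth of each def/use in the web
                (0 = not inside any loop).
    A reference at depth d is guessed to execute loop_weight**d times.
    Rematerializable webs are cheaper: the value can be recomputed
    instead of stored and reloaded.
    """
    cost = sum(loop_weight ** depth for depth in ref_depths)
    return cost * remat_discount if rematerializable else cost


# One reference outside any loop, two inside a singly nested loop.
print(spill_cost([0, 1, 1]))                      # 1 + 10 + 10
# Same idea for a constant-like web that could be rematerialized.
print(spill_cost([0, 2], rematerializable=True))  # (1 + 100) * 0.5
```

With profiling data, the `loop_weight ** depth` guess could be replaced by measured execution counts.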

Step 5: color • Problem: assign physical registers to webs • Reduces to map maker’s coloring problem: • Give each node of the interference graph a color property • Color = physical register • No two adjacent nodes can have the same color • Adjacent ⇔ edge between ⇔ cannot share a register • # available colors = # available registers • To address ISA restrictions: • Register classes: • Separate graphs • Other: • Add a node to the graph for every register • Register nodes are fully connected • Add an edge between a register and every web that cannot be allocated to that register

Step 5: color • Graph coloring is NP-complete for N >= 3 colors • But there are heuristics: • Don’t try to find the minimum: • Given k registers, try to use k colors • Just pick the best looking at the time • Might not be best overall • May not find a solution, even if one exists • Acceptable and fast

Step 5: color • Optimistic graph coloring approach: [Chaitin, 81][Briggs, 94] • Initially: • Each node’s degree is the number of adjacent nodes • Until every node has been removed from the graph: • If there is a node with degree < k • Choose it • Otherwise • Choose the node with the lowest spill cost (Step 4) (optimism here) • Remove the chosen node and lower the degree of its neighbors • Push it onto the stack • For each node popped from the stack (LIFO): • If there is a color not yet assigned to its neighbors: • Use that color • Else (optimism failed; cold, hard reality) • Mark as spilled, keep uncolored
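The simplify-then-select loop can be sketched as follows. The graph encoding (`adj` as a dict of neighbor sets, `cost` as a dict of spill costs) and the alphabetical tie-breaking are assumptions made for the example; colors are the integers 0..k-1 standing for physical registers.

```python
def color(adj, cost, k):
    """Optimistic coloring sketch. Returns (colors, spilled)."""
    degree = {n: len(adj[n]) for n in adj}
    remaining = set(adj)
    stack = []

    # Simplify: remove nodes one at a time and push them on the stack.
    while remaining:
        easy = [n for n in remaining if degree[n] < k]
        if easy:
            n = min(easy, key=str)    # any low-degree node will do
        else:
            # Optimism: push the cheapest spill candidate anyway.
            n = min(remaining, key=lambda m: (cost[m], str(m)))
        remaining.remove(n)
        stack.append(n)
        for m in adj[n]:
            if m in remaining:
                degree[m] -= 1

    # Select: pop nodes and give each the first color its neighbors lack.
    colors, spilled = {}, set()
    while stack:
        n = stack.pop()
        used = {colors[m] for m in adj[n] if m in colors}
        free = [c for c in range(k) if c not in used]
        if free:
            colors[n] = free[0]
        else:
            spilled.add(n)            # optimism failed
    return colors, spilled


# A 4-cycle has no node of degree < 2, yet it is 2-colorable.
square = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
colors, spilled = color(square, {n: 1 for n in square}, k=2)
```

The 4-cycle is the standard case where deferring the spill decision pays off: a pessimistic allocator would spill a node as soon as no node has degree below k, whereas here all four nodes receive a color.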

Step 5: color • Trivial if # registers >= # webs [Figure: the x, y, z example colored with one register per web]

Step 5: color [Figure: coloring of the x, y, z example]

Step 5: color • Add some ISA restrictions: [Figure: the x, y, z example with register nodes added to the interference graph]

Step 6: spill • Maybe do not even need to spill: • Rematerialization (should be chosen first, lowest spill cost) • Better register usage: • Insert new load and store instructions • Creates new, very short webs • Interference graph changed, need to restart at Step 2 • Need to modify spill cost to make the new webs’ cost = ∞ • Simple approach: • Keep a set of registers reserved for filling/spilling • Add “spilt” flag to web • When emitting an instruction: • Load spilt webs of input into reserved registers • Execute • Use reserved register as destination of spilt web, then store
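The "simple approach" with reserved registers might look like the sketch below. For brevity it keys everything by p-reg name rather than by web, and the instruction tuple format, the `slots` table of stack offsets, and the default reserved registers are all assumptions of the example.

```python
def emit(instr, assignment, slots, reserved=("$t8", "$t9")):
    """Emit one instruction, filling/spilling through reserved registers.

    instr:      (op, dest, [srcs]) over p-reg names.
    assignment: {p-reg: physical register} for colored webs.
    slots:      {p-reg: stack offset} for webs flagged as spilt.
    """
    op, dest, srcs = instr
    out, operands = [], []
    scratch = iter(reserved)          # one reserved register per spilt input
    for s in srcs:
        if s in slots:
            r = next(scratch)
            out.append(f"lw {r}, {slots[s]}($sp)")   # fill
            operands.append(r)
        else:
            operands.append(assignment[s])
    if dest in slots:
        # Inputs are read before the result is written, so a reserved
        # register can be reused as the destination.
        d = reserved[0]
        out.append(f"{op} {d}, {', '.join(operands)}")
        out.append(f"sw {d}, {slots[dest]}($sp)")    # spill
    else:
        out.append(f"{op} {assignment[dest]}, {', '.join(operands)}")
    return out


# neg $1, $2 with $1 in $s1 and $2 spilt to 28($sp)
print(emit(("neg", "$1", ["$2"]), {"$1": "$s1"}, {"$2": 28},
           reserved=("$t1", "$t2")))
```

The number of reserved registers has to cover the maximum number of source operands of any instruction, which is the price of this approach: those registers are never available for coloring.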

Epilogue: codegen • Now we know: • For each use/def, its web • For each web, whether it spills or not • If the web spills: • Same as the base case: reads fill, writes spill • Otherwise • Just use the web’s register as the operand • Example, IR: neg $1, $2 • With $1 → $s1 and $2 spilling to 28($sp): lw $t1, 28($sp); neg $s1, $t1 • With $1 → $s1 and $2 → $s2: neg $s1, $s2

Practicalities • Webs that span calls: • Doesn’t appear to be addressed much [at all?] in literature • Caller- and callee-preserved registers • If a web spans a call, does it get split in two? • Not if it is callee-saved • Chicken-egg problem: • Splitting a caller-saved web changes the interference graph, makes it more colorable, which could change whether this web spans the call... • Simple heuristic: • During allocation: mark webs as either call-spanning or not • When picking registers: • Prefer caller-saved for non-spanning • Prefer callee-saved for spanning • If no callee-saved left, use caller-saved and spill/fill
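The register-picking part of that heuristic can be written down directly. This is a hedged sketch: the function name, the ordered `free` list, and the returned "needs save around call" flag are inventions for the example, and it ignores the prologue/epilogue cost of using a callee-saved register.

```python
def pick_register(free, spans_call, callee_saved):
    """Choose a register for a web during coloring.

    free:         ordered list of registers not used by the web's neighbors.
    spans_call:   True if the web is live across a call.
    callee_saved: set of callee-saved register names.
    Returns (register or None, needs_save_around_call).
    """
    # Call-spanning webs prefer callee-saved; others prefer caller-saved.
    preferred = [r for r in free if (r in callee_saved) == spans_call]
    if preferred:
        return preferred[0], False
    if not free:
        return None, False            # nothing left: the web must spill
    r = free[0]
    # A call-spanning web stuck in a caller-saved register has to be
    # spilled before each call it spans and filled afterwards.
    return r, spans_call and r not in callee_saved


callee_saved = {"$s0", "$s1"}
print(pick_register(["$t0", "$s0"], True, callee_saved))   # prefers $s0
print(pick_register(["$t0"], True, callee_saved))          # falls back to $t0
print(pick_register(["$t0", "$s0"], False, callee_saved))  # prefers $t0
```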

Practical details • Parameters (passed on the stack) and globals • Want to keep in registers • Cannot for globals unless: • Simple: no calls • 10x harder: inter-procedural analysis says it’s ok • Need to insert “import” statements • Otherwise there will be use statements for a variable with no reaching defs; messes up algorithms • Where to put the imports? • CFG head • Makes long webs • Especially if variable only used near the end • As late as possible • Requires an analysis like Partial Redundancy Elimination

Further optimizations • Live-range splitting • Create contains graph • During coloring, use contains graph to split before resorting to spilling • Other variations in [Cooper, 04] [Figure: four versions of a CFG with overlapping live ranges of x and y. Panels: “Need to spill x or y” (original code); “Bad: spill x” (every def of x spills, every use of x fills); “Bad: spill y” (every def of y spills, every use of y fills); “Good: split x” (x is spilled once before the region where y is live and filled once after it)]

Further optimizations • Stack allocation for fills/spills • Goal: minimize stack usage • Essentially, it’s the same problem we just solved: • P-reg is to register file as home is to stack memory

Further optimizations • Alias analysis for heap • a and y have pseudo-registers, so they may be kept live • a->x does not have a pseudo-register: it has a dynamic location. Loads and stores are generated during code generation, even before register allocation. • Start by creating indirect pseudo-registers and postponing loads/stores • More problems... • Example: void foo(A *a) { int y = 1; for (; a->x < 1000; ++a->x) y += a->x; }

Further optimizations • Alias analysis for heap • Need to generate import of a->x before first use. • Only when the original program would have. Cannot introduce new memory accesses. • Aliasing is a problem: • Does a2 point to the same object as a? • Inter-procedural calls: • Does bar() modify the object a points to? • Solutions: • Easy: very conservative • 10x harder: points-to analysis • Example: void foo(A *a, A *a2) { int y = 1; for (; a->x < 1000; ++a->x) { bar(); y += a->x; a2->x *= a->x; } }

Further optimizations • Interaction with instruction scheduling and selection • Naïve approach: select, allocate, schedule • Not orthogonal • Scheduling goal: put as much space as possible between a write and the reads that depend on it. • Allocation goal: want short live ranges, so put definitions and uses close together. • Need to balance both interests • GCC: select, allocate 1, schedule 1, allocate 2, schedule 2

Further optimizations • List of techniques at end of chapter 13 in: Keith D. Cooper and Linda Torczon. Engineering a Compiler. 2004

References
[Briggs, 92] Preston Briggs. Register Allocation via Graph Coloring. Tech. Rept. CRPC-TR92218, Ctr. for Research on Parallel Computation, Rice Univ., Houston, TX, Apr. 1992.
[Briggs, 94] Preston Briggs, Keith D. Cooper, and Linda Torczon. Improvements to graph coloring register allocation. ACM Transactions on Programming Languages and Systems, 16(3):428-455, May 1994.
[Chaitin, 81] Gregory J. Chaitin. Register allocation and spilling via graph coloring. United States Patent 4,571,678, February 1986.
[Chow, 84] Frederick C. Chow and John L. Hennessy. Register allocation by priority-based coloring. SIGPLAN Notices, 19(6):222-232, June 1984. Proceedings of the ACM SIGPLAN ’84 Symposium on Compiler Construction.
[Cooper, 04] Keith D. Cooper and Linda Torczon. Engineering a Compiler. 2004.
[Park, 98] Jinpyo Park and Soo-Mook Moon. Optimistic register coalescing. In Proceedings of the 1998 International Conference on Parallel Architecture and Compilation Techniques (PACT), pages 196-204, October 1998.
