Run Time Optimization


Presentation Transcript


  1. Run Time Optimization 15-745: Optimizing Compilers Pedro Artigas

  2. Motivation • A good reason • Compiling a language that contains run-time constructs • Java dynamic class loading • Perl or Matlab eval(“statement”) • Faster than interpreting • A better reason • May use program information only available at run time

  3. Example of run-time information • The processor that will be used to run the program • inc ax is faster on a Pentium III • add ax,1 is faster on a Pentium 4 • No need to recompile if generating code at run time • The actual program input/run-time behavior • Is my profile information accurate for the current program input? YES!

  4. The life cycle of a program • Compile: scope is one object file → global (per-file) analysis • Link: one binary → whole-program analysis • Load/Run: one process → run-time analysis? • The larger the scope, the better the information about program behavior

  5. New strategies are possible • Pessimistic vs. optimistic approaches • Ex: Does int *a point to the same location as int *b? • Compile time/Pessimistic: prove that in ANY execution those pointers point to different addresses • Run time/Optimistic: up to now in the current execution a and b point to different locations • Assume this holds • If the assumption breaks, invalidate the generated code and generate new code

  6. A sanity check • Using run-time information does not require run-time code generation • Example: versioning • The ISA may allow cheaper tests (IA-64, Transmeta)

    if (a != b) {
      <code compiled assuming a != b>
    } else {
      <code compiled assuming a == b>
    }
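  As an illustration (in Python rather than generated machine code), versioning might look like the sketch below: two pre-built variants of a routine and one cheap run-time test selecting between them. The routine itself and the identity-based aliasing test are assumptions made for the example, not from the slides.

    def scale_no_alias(dst, src, n):
        # Variant "compiled" under the assumption that dst and src do not
        # alias: all reads can be done up front in one batch.
        dst[:n] = [2 * x for x in src[:n]]

    def scale_safe(dst, src, n):
        # Conservative variant: element by element, correct under aliasing.
        for i in range(n):
            dst[i] = 2 * src[i]

    def scale(dst, src, n):
        # Versioning: one cheap run-time test picks the optimistic variant.
        if dst is not src:
            scale_no_alias(dst, src, n)
        else:
            scale_safe(dst, src, n)

    a = list(range(8))
    scale(a, a, 8)          # aliasing case: the safe variant runs
    print(a)                # [0, 2, 4, 6, 8, 10, 12, 14]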

  7. Drawbacks • Code generation has to be FAST • Rule of thumb: almost linear in program size • Code quality: compromise on quality to achieve fast code generation • Shoot for good, not great • This usually also means: no time for classical iterative data-flow analysis (IDFA) at run time

  8. No classical IDFA: Solutions • Quasi-static and/or staged compilation • Perform IDFA at compile time • Specialize the dynamic code generator for the obtained information • That is, encode the obtained data-flow information in the "binary" • Do not rely on classical IDFA • Use algorithms that do not require it • Ex: dominator-based value numbering (coming up!) • Generate code in a style that does not require it • Ex: one-entry, multiple-exit traces, as in deco and Dynamo

  9. Code generation Strategies • Compiling a language that requires run-time code generation: • Compile adaptively: • Use a very simple and fast code generation scheme • Re-compile frequently used regions using more advanced techniques

  10. Adaptive Compilation: Motivation • Very simple code generation → higher execution cost • Elaborate code generation → higher compilation cost • Problem: we may not know in advance how frequently a region will execute • Measure frequencies and re-compile dynamically • (Figure: total cost versus execution count for a fast compiler and an optimizing compiler, crossing at a cost threshold)
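  A toy Python sketch of the adaptive idea: a region starts in its cheaply generated form and is swapped for an optimized form once an invocation counter crosses a threshold. The counter-based trigger, the specific threshold, and modeling "compilation" as choosing between two Python implementations are illustrative assumptions, not from the slides.

    THRESHOLD = 1000   # hypothetical cost threshold (invocation count)

    class Region:
        def __init__(self, cheap_code, optimized_code):
            self.code = cheap_code           # fast-to-generate version
            self.optimized = optimized_code  # expensive-to-generate version
            self.count = 0

        def run(self, *args):
            self.count += 1
            if self.count == THRESHOLD:
                # Hot region: "re-compile" by swapping in the optimized code.
                self.code = self.optimized
            return self.code(*args)

    # Usage: sum of squares, naive loop vs. a closed-form "optimized" version.
    cheap = lambda n: sum(i * i for i in range(n))
    opt   = lambda n: (n - 1) * n * (2 * n - 1) // 6
    region = Region(cheap, opt)
    for _ in range(2000):
        region.run(100)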

  11. Code generation Strategies • Compiling selected regions that benefit from run-time code generation: • Pick only the regions that should benefit the most • Which regions? • Select them statically • Use profile information • Re-compile (that is, select them dynamically) • Usually all of the above

  12. Code Optimization Unit • What is the run-time unit of optimization? • Option: procedures/static code regions • Similar to static compilers • Option: traces • Start at the target of a backward branch • Include all the instructions along a path • May include procedure calls and returns • Branches: fall through = remain in the trace; target = exit the trace • (Figure: a four-block CFG, blocks 1–4, and the trace formed along one path through it)

  13. Current strategies

  14. Run-time code generation: Case studies • Two examples of algorithms that are suitable for run-time code generation • Run-time CSE/PRE replacement: dominator-based value numbering • Run-time register allocation: linear scan register allocation

  15. Sidebar • With traces, CSE/PRE become almost trivial • No need for register allocation if optimizing a binary (ex: Dynamo) • (Figure: a repeated A+B expression along a trace, removed by PRE and by CSE)

  16. Review: Local value numbering • Store expressions already computed (in a hash table) • Store the variable name → VN mapping in the VN array • Store the VN → variable name mapping in the Name array • Same value number → same value, within each basic block

    Table.empty()
    for each computed expression "x = y op z"
      if V = Table.lookup("y op z")        // computed in the past; check if the result is still available
        VN["x"] = V
        if VN[Name[V]] == V                // expression result is still there
          replace "x = y op z" with "x = Name[V]"
        else
          Name[V] = "x"
      else                                 // new expression, add to the table
        VN["x"] = new_value_number()
        Table.insert("y op z", VN["x"])
        Name[VN["x"]] = "x"
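  As a concreteness check, here is a runnable Python sketch of the algorithm above; the tuple-based IR, the "copy" pseudo-instruction, and the variable names are assumptions made for the example.

    def local_value_numbering(block):
        # block: list of (dst, op, a, b) instructions for one basic block.
        table = {}   # (op, VN, VN) -> value number
        vn    = {}   # variable name -> value number
        name  = {}   # value number -> variable name holding that value
        fresh = iter(range(10**9))
        out = []
        for dst, op, a, b in block:
            key = (op, vn.get(a, a), vn.get(b, b))   # value-number the operands
            if key in table:                          # computed in the past
                v = table[key]
                vn[dst] = v
                if vn.get(name[v]) == v:              # result still available
                    out.append((dst, "copy", name[v], None))
                    continue
                name[v] = dst                         # rebind the name
            else:                                     # new expression
                v = next(fresh)
                vn[dst] = v
                table[key] = v
                name[v] = dst
            out.append((dst, op, a, b))
        return out

    block = [("t1", "+", "a", "b"), ("t2", "+", "a", "b"), ("t3", "*", "t1", "c")]
    print(local_value_numbering(block))   # t2 = a + b becomes a copy of t1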

  17. Local value numbering • Works in linear time in program size • Assuming accesses to the array and the hash table take constant time • Can we make it work in a scope larger than a basic block? (Hint: yes) • What are the potential problems?

  18. Problems • How do we propagate the hash-table contents across basic blocks? • How do we make sure that it is safe to access the location containing the expression in other basic blocks? • How do we make sure the location containing the expression is fresh? • Remember: no IDFA

  19. Control flow issues • At split points things are simple • Just keep the contents of the hash table from the predecessor • What about merge points? • We do not know if the same expression was computed along all incoming paths • We do not want to check that fact anyway (why?) • Reset the hash table to a safe state it had in the past • Which program point in the past? • The immediate dominator of the merge block

  20. Data flow issues • Making sure the def of an expression is fresh and reaches the blocks of interest • How? • By construction! SSA • All names are fresh (single assignment) • All defs dominate their uses (regular uses, not φ-function inputs) • Because, by construction, we introduce new defs using φ functions at every point where this would not otherwise hold

  21. Dominator/SSA based value numbering

    DVN(Block B)
      Table.PushScope()
      for each exp "n = φ(...)"              // first process the φ expressions
        if exp is redundant or meaningless   // meaningless: φ(x0, x0)
          VN["n"] = Table.lookup("φ(...)") or "x0"
          remove("n = φ(...)")
        else
          VN["n"] = "n"
          Table.insert("φ(...)", VN["n"])
      for each exp "x = y op z"              // then the regular ones
        if "v" = Table.lookup("y op z")
          VN["x"] = "v"
          remove("x = y op z")
        else
          VN["x"] = "x"
          Table.insert("y op z", VN["x"])
      for each successor s of B              // propagate info about φ inputs
        adjust the φ inputs in s
      for each dominator-tree child c, in CFG reverse post-order
        DVN(c)                               // recurse over the dominator tree
      Table.PopScope()
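  The key data structure is the scoped hash table: pushing a scope on entry to a block and popping it after the block's dominator subtree has been processed restores exactly the table state of the immediate dominator, which is the reset slide 19 asks for. Below is a minimal Python sketch under stated assumptions (a ChainMap-based table, a tuple IR, φ handling omitted); none of it is from the original slides.

    from collections import ChainMap

    class ScopedTable:
        """Hash table with scopes: pop() restores the state at push()."""
        def __init__(self):
            self.scopes = ChainMap()
        def push(self):
            self.scopes = self.scopes.new_child()
        def pop(self):
            self.scopes = self.scopes.parents
        def lookup(self, key):
            return self.scopes.get(key)
        def insert(self, key, val):
            self.scopes[key] = val    # writes go to the innermost scope only

    def dvn(b, blocks, dom_children, table, vn):
        # blocks[b]: list of (dst, op, a, a2) in SSA form.
        table.push()                      # entering b: extend the dominator's state
        for dst, op, a, a2 in blocks[b]:
            key = (op, vn.get(a, a), vn.get(a2, a2))
            v = table.lookup(key)
            if v is not None:             # redundant: reuse the earlier name
                vn[dst] = v
                print(f"{dst} = {v}  (redundancy removed)")
            else:
                vn[dst] = dst
                table.insert(key, dst)
        for c in dom_children[b]:         # children in CFG reverse post-order
            dvn(c, blocks, dom_children, table, vn)
        table.pop()                       # leaving b's dominator subtree

    blocks = {1: [("u0", "+", "a0", "b0")],
              2: [("u1", "+", "a0", "b0")],   # dominated by 1: u1 is redundant
              3: [("u3", "+", "a0", "b0")]}   # also dominated by 1
    dvn(1, blocks, {1: [2, 3], 2: [], 3: []}, ScopedTable(), {})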

  22. VN Example

    Block 1 (entry):  u0 = a0 + b0;  v0 = c0 + d0;  w0 = e0 + f0
    Block 2:          x0 = c0 + d0;  y0 = c0 + d0
    Block 3:          u1 = a0 + b0;  x1 = e0 + f0;  y1 = e0 + f0
    Block 4 (merge):  u2 = φ(u0, u1);  x2 = φ(x0, x1);  y2 = φ(y0, y1);  u3 = a0 + b0

    Block 1 dominates blocks 2, 3, and 4, so every recomputed expression outside block 1 is found redundant.

  23. Problems • Does not catch redundancies where no earlier occurrence dominates the later one, ex:

    Diamond (computed in both branches):   Loop (recomputed in the body):
      x0 = a0 + b0   (branch A)              x0 = a0 + b0   (before the loop)
      x1 = a0 + b0   (branch B)              x1 = φ(x0, x2) (loop header)
      x2 = φ(x0, x1) (merge)                 x2 = a0 + b0   (loop body)

  • But it performs almost as well as CSE • And runs much faster • Linear time? (YES? NO?)

  24. Homework #4 • The DVN algorithm scans the CFG in a similar way to the second phase of SSA translation • SSA translation phase #1: placing φ functions • SSA translation phase #2: assigning unique numbers to variables • Combine both and save one pass • Gives us a smaller constant • But, at run time, it pays off!

  25. Run time register allocation • Graph coloring? Not an option • Even the simple stack-based heuristic shown in class is O(n²) • Not even counting: building the interference graph, the move-coalescing optimization • But register allocation is VERY important for performance • Remember, memory is REALLY slow • We need a simple but effective (almost) linear-time algorithm

  26. Let's start simple • Start with a local (basic block) linear-time algorithm • Assume only one def and one use per variable (more constrained than SSA) • Assume that if a variable is spilled it must remain spilled (Why?) • Can we find an optimum linear-time algorithm? (Hint: yes) • Ideas? • Think about liveness first …

  27. Simple Algorithm: Computing Liveness • One def and one use per variable, only one block • A live range is merely the interval between the def and the use • Live interval: the interval between the first def and the last use • OBS: live range = live interval if there is no control flow and only one def and use • We can compute live intervals in one linear scan if we record each def instruction (the beginning of the interval) in a hash table

  28. Example

    S1: A = 1
    S2: B = 2
    S3: C = 3
    S4: D = A
    S5: E = B
    S6: use(E)
    S7: use(D)
    S8: use(C)
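  Under the stated assumptions (one def, uses after the def, a single block), one linear pass suffices. A small Python sketch, applied to the example above; the tuple encoding of statements is an assumption made for illustration:

    def live_intervals(instrs):
        # instrs: list of (defined_var_or_None, [used_vars]), one per statement.
        start, end = {}, {}
        for i, (d, uses) in enumerate(instrs, 1):
            if d is not None and d not in start:
                start[d] = i                  # the (first) def opens the interval
            for u in uses:
                end[u] = i                    # the last use seen closes it
        return {v: (start[v], end.get(v, start[v])) for v in start}

    prog = [("A", []), ("B", []), ("C", []),             # S1-S3
            ("D", ["A"]), ("E", ["B"]),                  # S4-S5
            (None, ["E"]), (None, ["D"]), (None, ["C"])] # S6-S8
    print(live_intervals(prog))
    # {'A': (1, 4), 'B': (2, 5), 'C': (3, 8), 'D': (4, 7), 'E': (5, 6)}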

  29. Now Register Allocation • Another linear scan • Keep the currently live intervals in a list (active) • Assumption: an interval, once spilled, remains spilled • Two scenarios at each interval start: • #1: a register is free → no problem • #2: all R registers are taken → must spill • Which interval?

  30. Spilling heuristic • Since there is no second chance (a spilled variable will always remain spilled): • Spill the interval that ends last • Intuition: as one spill must occur anyway … • Pick the one that leaves the remaining allocation least constrained • That is, the interval that ends last • This is the provably optimum solution (given all the constraints)

  31. Linear Scan Register Allocation

    active = {}                      // live intervals, sorted by end point
    freeregs = {all_registers}
    for each interval I (in order of increasing start point)
      // Expire old intervals
      for each interval J in active
        if J.end > I.start
          continue
        active.remove(J)
        freeregs.insert(J.register)
      if active.length() == R
        // Must spill: pick either the last interval in active or the new one
        spill_candidate = active.last()
        if spill_candidate.end > I.end
          I.register = spill_candidate.register
          spill(spill_candidate)
          active.remove(spill_candidate)
          active.insert_sorted(I)    // sorted by end point
        else
          spill(I)
      else
        // No constraints
        I.register = freeregs.pop()  // get any register from the free list
        active.insert_sorted(I)      // sorted by end point
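  A compact runnable Python rendering of the pseudocode above; the (name, start, end) interval encoding and the register naming are assumptions made for the example.

    def linear_scan(intervals, R):
        # intervals: list of (name, start, end); R: number of registers.
        reg, spilled = {}, set()
        active = []                              # (end, name), kept sorted by end
        free = [f"r{i}" for i in range(R)]
        for name, start, end in sorted(intervals, key=lambda t: t[1]):
            # Expire intervals that ended before this one starts.
            while active and active[0][0] <= start:
                _, old = active.pop(0)
                free.append(reg[old])
            if len(active) == R:                 # must spill
                last_end, last = active[-1]
                if last_end > end:               # steal the register of the
                    reg[name] = reg.pop(last)    # active interval ending last
                    spilled.add(last)
                    active.pop()
                    active.append((end, name))
                    active.sort()
                else:
                    spilled.add(name)            # the new interval ends last
            else:
                reg[name] = free.pop()
                active.append((end, name))
                active.sort()
        return reg, spilled

    ivals = [("A", 1, 4), ("B", 2, 5), ("C", 3, 8), ("D", 4, 7), ("E", 5, 6)]
    print(linear_scan(ivals, 2))   # C, which ends last, gets spilled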

  32. Example (R = 2)

    S1: A = 1
    S2: B = 2
    S3: C = 3
    S4: D = A
    S5: E = B
    S6: use(E)
    S7: use(D)
    S8: use(C)

    (Figure: the live intervals of A, B, C, D, and E drawn as bars over S1–S8)
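  Tracing the scan on these intervals (A: S1–S4, B: S2–S5, C: S3–S8, D: S4–S7, E: S5–S6), and assuming an interval expires when its end point equals the next start point: A and B take the two registers; when C starts, C ends last among the live intervals, so C is spilled; A expires as D starts, freeing a register for D; B expires as E starts, freeing one for E. Result: A and D share one register, B and E share the other, and C lives in memory.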

  33. Is the second pass really linear? • Invariant: active.length() <= R • Complexity: O(R·n) • R is usually a small constant (128 at most) • Therefore: O(n)

  34. And we are done! Right? • YES and NO • Use the same algorithm as before for register assignment • Program representation: a linear list of instructions • Live intervals are no longer precise given control flow and multiple defs/uses • Not optimum, but still FAST • Code quality: within 10% of graph coloring for the SPEC95 benchmarks (one problem with this claim)

  35. The worst problem: Obtaining precise live intervals • How do we obtain precise live-interval information FAST? • The 10% claim relies on live intervals obtained via liveness analysis (IDFA) • IDFA is SLOW, O(n³) • Most recent solutions: • Use the local interval algorithm for variables that live only inside one basic block • Use liveness analysis for the more global variables • This alleviates the problem, but does not fully solve it

  36. More problems: Live intervals may not be precise • A live interval may contain lifetime holes, points where the value is not actually live • OBS: the idea of lifetime holes leads to allocators that also try to use these holes to assign the same register to other live ranges (bin packing) • Such an allocator is used in the Alpha family of compilers (the GEM compilers)

  37. Other problems: Linearization order • Register allocation quality depends on the chosen block linearization order • Orders that work well in practice: • layout order • a depth-first traversal of the CFG • Both yield code only about 10% slower than graph coloring

  38. Graph coloring versus Linear scan • (Figure: compilation cost scaling of graph coloring versus linear scan as program size grows)

  39. Conclusion • Run-time code generation provides new optimization opportunities • Challenges: • Identify new optimization opportunities • Design new compilation strategies (example: optimistic versus conservative) • Design algorithms and implementations that: • minimize run-time overhead • do not compromise much on code quality • Recent examples indicate that extending fast local methods is a promising way to obtain fast run-time code generation
