
A Study of superinstructions and dynamic mixin optimizations


Presentation Transcript


  1. A Study of superinstructions and dynamic mixin optimizations

    08D37074 Salikh ZAKIROV Chiba laboratory Advisors: Etsuya SHIBAYAMA Shigeru CHIBA
  2. Outline Introduction Superinstructions Dynamic mixin optimization Evaluation in compiled environment Conclusion
  3. Dynamic languages Exist since the 1970s, but became popular again in the 1990s. Provide high productivity, but performance is worse than for static languages, which results in limited applicability. Hardware progress helps, but performance is still low. Performance research is important!
  4. Dynamic language implementation Design trade-off between performance, implementation complexity, and dynamicity (responsiveness to run-time change).
  5. Typical implementation approaches (Diagram: approaches plotted on a performance vs. dynamicity plane: interpreter, inline caching, AOT compiler, JIT compiler, trace compiler.)
  6. The problem Trade-off of performance with dynamicity: popular dynamic languages are slow. We use Ruby. Metaprogramming is essential to high productivity and requires dynamicity. Existing high-performance techniques have low dynamicity.
  7. Focus of my work The Ruby language, whose main implementation is a VM interpreter. Performance improvement while keeping high dynamicity.
  8. Contributions of this work Researched the application of superinstructions to Ruby and found a novel approach with higher benefit: arithmetic superinstructions. Proposed inline caching for dynamic mixin: fine-grained state tracking and alternate caching.
  9. Position of superinstructions Known for dispatch overhead reduction, but of low benefit for Ruby on modern hardware. Arithmetic superinstructions are a novel application of superinstructions, with a benefit on numeric applications. Response to dynamic updates does not differ from the original interpreter.
  10. Position of dynamic mixin A variant of delegation, getting popular in recent research and practice, but very slow with existing techniques. We proposed a novel optimization scheme: fine-grained state tracking and alternate caching.
  11. Outline Introduction Superinstructions Dynamic mixin optimization Evaluation in compiled environment Conclusion
  12. Interpreter optimization efforts Many techniques to optimize interpreters have been proposed: threaded interpretation, stack-top caching, pipelining, superinstructions. Superinstructions merge the code of operations executed in sequence and are the focus of this presentation.
  13. Superinstructions (example)
    PUSH: // put <imm> argument on stack
        stack[sp++] = *pc++;
        goto **pc++;
    ADD: // add two topmost values on stack
        sp--;
        stack[sp-1] += stack[sp];
        goto **pc++;

    PUSH_ADD: // naive concatenation, dispatch eliminated
        stack[sp++] = *pc++;
        // goto **pc++;
        sp--;
        stack[sp-1] += stack[sp];
        goto **pc++;

    PUSH_ADD: // after C compiler optimizations: add <imm> to stack top
        stack[sp-1] += *pc++;
        goto **pc++;
  14. Superinstructions (effects) Reduce dispatch overhead; eliminate some jumps; provide more context for the indirect branch predictor by replicating indirect jump instructions; allow more optimizations within a VM op.
  15. Prior research result: good for reducing dispatch overhead Superinstructions help when VM operations are small (~10 hardware ops per VM op) and dispatch overhead is high (~50%). Examples of successful use in prior research: ANSI C interpreter, 2-3x improvement (Proebsting 1995); OCaml, more than 50% improvement (Piumarta 1998); Forth, 20-80% improvement (Ertl 2003).
  16. Ruby does not fit well Superinstructions help when VM operations are small (~10 hardware ops per VM op) and dispatch overhead is high (~50%). For Ruby there is only 1-3% misprediction overhead on interpreter dispatch, BUT 60-140 hardware ops per VM op (hardware profiling data on Intel Core 2 Duo).
  17. Superinstructions for Ruby We experimentally evaluated the effect of “naive” superinstructions on Ruby. Superinstructions are selected statically: combinations of length 2 that occur frequently in a training run are selected as superinstructions; the training run uses the same benchmark. Superinstructions are constructed by concatenating C source code, with C compiler optimizations applied. A sketch of the selection step follows.
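  A minimal C sketch of how frequent length-2 sequences could be counted during a training run; the names (opcode_trace, NUM_OPS, NUM_SUPERINSNS) are illustrative assumptions, not the actual Ruby 1.9 superinstruction infrastructure.

    #include <stdio.h>

    #define NUM_OPS 128          /* assumed number of VM opcodes */
    #define NUM_SUPERINSNS 30    /* how many length-2 sequences to select */

    /* Counts of adjacent opcode pairs observed during the training run. */
    static unsigned long pair_count[NUM_OPS][NUM_OPS];

    /* Record one executed-opcode trace from the training benchmark. */
    void record_trace(const int *opcode_trace, long len)
    {
        for (long i = 0; i + 1 < len; i++)
            pair_count[opcode_trace[i]][opcode_trace[i + 1]]++;
    }

    /* Report the most frequent pairs as superinstruction candidates. */
    void select_superinstructions(void)
    {
        for (int n = 0; n < NUM_SUPERINSNS; n++) {
            int best_a = 0, best_b = 0;
            for (int a = 0; a < NUM_OPS; a++)
                for (int b = 0; b < NUM_OPS; b++)
                    if (pair_count[a][b] > pair_count[best_a][best_b]) {
                        best_a = a;
                        best_b = b;
                    }
            if (pair_count[best_a][best_b] == 0)
                break;                      /* no more observed pairs */
            printf("superinsn %d: ops (%d, %d), %lu occurrences\n",
                   n, best_a, best_b, pair_count[best_a][best_b]);
            pair_count[best_a][best_b] = 0; /* exclude from next rounds */
        }
    }

  Each selected pair's handler is then generated by concatenating the C source of the two operations, as slide 13 illustrates.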
  18. Naive superinstructions: effect on Ruby Limited benefit and unpredictable effects. (Chart: normalized execution time vs. number of superinstructions used, for 4 benchmarks.)
  19. Branch mispredictions (Chart: normalized execution time vs. number of superinstructions used, for 2 benchmarks: mandelbrot and spectral_norm.)
  20. Branch mispredictions, reordered (Chart: normalized execution time vs. number of superinstructions used, reordered by execution time, for the same 2 benchmarks: mandelbrot and spectral_norm.)
  21. So why is Ruby slow? Profile of numeric benchmarks: garbage collection takes significant time, and boxed floating point values dominate allocation.
  22. Floating point value boxing A typical Ruby 1.9 VM operation:
    OPT_PLUS:
        VALUE a = *(sp-2);
        VALUE b = *(sp-1);
        /* ... */
        if (CLASS_OF(a) == Float && CLASS_OF(b) == Float) {
            sp--;
            *(sp-1) = NEW_FLOAT(DOUBLE_VALUE(a) + DOUBLE_VALUE(b));
        } else {
            CALL(1/*argnum*/, PLUS, a);
        }
        goto **pc++;
    A new “box” object is allocated on each operation.
  23. Proposal: use superinstructions for boxing optimization Two operations per allocation instead of one: boxing of the intermediate result is eliminated.
    OPT_MULT_OPT_PLUS:
        VALUE a = *(sp-3);
        VALUE b = *(sp-2);
        VALUE c = *(sp-1);
        /* ... */
        if (CLASS_OF(a) == Float && CLASS_OF(b) == Float && CLASS_OF(c) == Float) {
            sp -= 2;
            *(sp-1) = NEW_FLOAT(DOUBLE_VALUE(a) + DOUBLE_VALUE(b)*DOUBLE_VALUE(c));
        } else {
            CALL(1/*argnum*/, MULT/*method*/, b/*receiver*/);
            CALL(1/*argnum*/, PLUS/*method*/, a/*receiver*/);
        }
        goto **pc++;
  24. Implementation VM operations that handle floating point values directly: opt_plus, opt_minus, opt_mult, opt_div, opt_mod. We implemented all 25 combinations of length 2. Based on Ruby 1.9.1, using the existing Ruby infrastructure for superinstructions with some modifications.
  25. Limitations Coding-style sensitive; can be fixed by adding getVariable superinstructions. Not applicable to other types (e.g. Fixnum, Bignum, String): Fixnum is already unboxed, while Bignum and String cannot be unboxed. Sequences of 3 or more arithmetic instructions are virtually non-existent: no occurrences in the benchmarks.
  26. Results 0–22% faster on numeric benchmarks (avg 12%) No slowdown on other benchmarks
  27. Evaluation GC reduction explains most of the speedup: the reduction in boxing translates to a reduction in GC count.
  28. Example: mandelbrot tweak A slight modification produces a 20% difference in performance: 4 of 9 arithmetic instructions get merged into 2 superinstructions, a 24% reduction in float allocation.
    ITER.times do
    -        tr = zrzr - zizi + cr
    +        tr = cr + (zrzr - zizi)
    -        ti = 2.0*zr*zi + ci
    +        ti = ci + 2.0*zr*zi
    An alternative solution: introduce OP-getdynamic-OP superinstructions.
  29. Discussion of alternative approaches Faster GC (would reduce the benefit of superinstructions); tagged values (64-bit platforms only); stack allocation of intermediate results; dynamic specialization; type inference.
  30. Summary A naive approach to superinstructions does not produce substantial benefit for Ruby. Floating point value boxing overhead is a problem for Ruby; superinstructions provide some help (up to 23%, 12% on average). The implementation is about 2000 SLOC and regular, thus automatically generatable. Dynamicity is the same as in the original interpreter.
  31. Outline Introduction Superinstructions Dynamic mixin optimization Evaluation in compiled environment Conclusion
  32. Mixin: a code composition technique (Diagram: mixin use declaration and mixin semantics; Server inherits from BaseServer, and the AdditionalSecurity mixin is inserted between them.)
  33. Dynamic mixin A temporary change in the class hierarchy; available in Ruby, Python, JavaScript. (Diagram: the AdditionalSecurity mixin inserted between Server and BaseServer, then removed.)
  34. Dynamic mixin (2) A powerful technique of dynamic languages: enables dynamic patching and dynamic monitoring. Can be used to implement aspect-oriented programming and context-oriented programming. Widely used in Ruby and Python, e.g. for Object-Relational Mapping.
  35. Dynamic mixin in Ruby Ruby has dynamic mixin, but only an “install” operation and no “remove”, because there is uncertainty in “remove” semantics with transitive module inclusion. “remove” can be implemented easily, in 23 lines of code; a sketch follows.
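  As an illustration only, a rough C sketch of what a VM-level “remove” could look like, assuming (as in CRuby's include) that an included module is represented by a proxy class spliced into the superclass chain. The structure and names are assumptions, not the actual 23-line implementation, and transitive inclusion is ignored here.

    /* Hypothetical structures; CRuby's real ones differ. */
    struct rclass {
        struct rclass *super;     /* next class in the method lookup chain */
        /* ... method table, flags, etc. ... */
    };

    /* Unlink the proxy class standing for an included module.
     * Returns 1 if the module was found and removed, 0 otherwise. */
    int remove_mixin(struct rclass *klass, struct rclass *module_proxy)
    {
        for (struct rclass *c = klass; c != NULL && c->super != NULL; c = c->super) {
            if (c->super == module_proxy) {
                c->super = module_proxy->super;  /* splice it out */
                /* caches must be invalidated here, e.g. by bumping the
                 * state objects introduced later in this talk */
                return 1;
            }
        }
        return 0;  /* module was not installed on this class */
    }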
  36. Target application Mixin is installed and removed frequently: an application server with dynamic features.
    class BaseServer
      def process() … end
    end
    class Server < BaseServer
      def process()
        if request.isSensitive()
          Server.class_eval { include AdditionalSecurity }
        end
        super   # delegate to superclass
        …       # remove mixin
      end
    end
    module AdditionalSecurity
      def process()
        …       # security check
        super   # delegate to superclass
      end
    end
  37. Overhead is high Possible reasons: invalidation granularity (clearing the whole method cache and invalidating all inline caches means the next calls require a full method lookup); inline caching saves just one target, which changes with mixin operations, even though mixin operations are mostly repeated.
  38. Our research target Improve the performance of applications that use dynamic mixin frequently: make the invalidation granularity smaller, and make the dynamic dispatch target cacheable in the presence of dynamic mixin operations.
  39. Proposal Reduce the granularity of inline cache invalidation: fine-grained state tracking. Cache multiple dispatch targets: polymorphic inline caching. Enable cache reuse on repeated mixin installation and removal: alternate caching.
  40. Basics: inline caching Consider a call site cat.speak(). Dynamic dispatch (executable code):
    method = lookup(cat, ”speak”)
    method(cat)
    Expensive! But the result is mostly the same. Inline caching:
    if (cat has type ic.class) {
        ic.method(cat)
    } else {
        ic.method = lookup(cat, ”speak”)
        ic.class = cat.class
        ic.method(cat)
    }
    (Diagram: instance cat of class Cat, a subclass of Animal which implements speak(); the inline cache ic stores the class and the speak method for the call site.)
  41. Inline caching: problem What if the method has been overridden, e.g. by a Training mixin that defines its own speak()? The check on ic.class alone does not notice the change, so the cached target becomes stale. (Diagram: Training inserted between Cat and Animal; the same inline caching code as above still dispatches to the old method.)
  42. Inline caching: invalidation Check a global state value in the cache; it changes whenever the method tables change:
    if (cat has type ic.class && state == ic.state) {
        ic.method(cat)
    } else {
        ic.method = lookup(cat, ”speak”)
        ic.class = cat.class; ic.state = state
        ic.method(cat)
    }
    A single global state object gives too coarse an invalidation granularity. (Diagram: the global state value changes from 1 to 2 when the Training mixin overrides speak().)
  43. Fine-grained state tracking Use many state objects, for a small invalidation extent, shared as much as possible: one state object for each family of methods called from the same call site. State objects are associated with lookup path links and updated during method lookups.
  44. Lookup procedure Look up normally, noting the state objects encountered, starting from the state object in the inline cache. Choose the last state object; the other state objects are marked for one-time invalidation and marked as overridden. Store the pointer to the chosen state object in every class on the lookup path. A sketch of this procedure is given below.
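  A rough C sketch of such a lookup, only to make the steps concrete; the structures (rclass, method_entry, state_object) and the placement of state pointers in method entries are assumptions and differ from the actual Ruby 1.9.2 implementation (which also inserts placeholder method entries into classes that do not define the method).

    #include <stdlib.h>
    #include <string.h>

    struct state_object { unsigned value; int overridden; };

    struct method_entry {
        const char *name;
        void *code;                    /* NULL for a placeholder entry */
        struct state_object *state;    /* state object of this method family */
    };

    struct rclass {
        struct rclass *super;
        struct method_entry methods[16];
        int nmethods;
    };

    static struct method_entry *find_local(struct rclass *c, const char *name)
    {
        for (int i = 0; i < c->nmethods; i++)
            if (strcmp(c->methods[i].name, name) == 0)
                return &c->methods[i];
        return NULL;
    }

    /* Walk the superclass chain.  The first entry found is the dispatch
     * target.  All entries for `name` on the path end up sharing one state
     * object; any other state object met on the way is bumped once
     * (one-time invalidation) and marked overridden. */
    struct method_entry *
    lookup_with_state(struct rclass *klass, const char *name,
                      struct state_object **pstate /* in: IC state, out: chosen */)
    {
        struct state_object *chosen = *pstate;
        struct method_entry *target = NULL;

        for (struct rclass *c = klass; c != NULL; c = c->super) {
            struct method_entry *me = find_local(c, name);
            if (me == NULL)
                continue;
            if (target == NULL && me->code != NULL)
                target = me;                    /* dispatch target */
            if (me->state != NULL && me->state != chosen) {
                if (chosen != NULL) {
                    chosen->value++;            /* one-time invalidation */
                    chosen->overridden = 1;
                }
                chosen = me->state;             /* keep the later one */
            }
        }
        if (chosen == NULL)
            chosen = calloc(1, sizeof *chosen);
        for (struct rclass *c = klass; c != NULL; c = c->super) {
            struct method_entry *me = find_local(c, name);
            if (me != NULL)
                me->state = chosen;             /* share along the path */
        }
        *pstate = chosen;
        return target;
    }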
  45. Invariant 1: lookup path After any number of lookups, for any IC = (pstate, state, class, name), either of the following holds: (a) the IC is invalidated, or (b) for every class’ ∊ lookup_path(class, name), (class’, name) ↪ state. Proof scheme (induction over lookups): the initial state is invalid; the lookup procedure establishes the invariant; on lookups that update state object links, the state is linked transitively or invalidated.
  46. Invariant 2: validity of IC After any number of updates, for any IC = (pstate, state, class, name, target), either of the following holds: (a) the IC is invalidated, or (b) target = lookup(class, name). Proof scheme (induction over updates): any update either does not affect the lookup or invalidates the IC.
  47. State object allocation Inline caching code:
    if (cat has type ic.class && ic.pstate.state == ic.state) {
        ic.method(cat)
    } else {
        ic.method, ic.pstate = lookup(cat, ”speak”, ic.pstate)
        ic.class = cat.class; ic.state = ic.pstate.state
        ic.method(cat)
    }
    (Diagram: Animal defines speak() { *1* }; Cat holds a placeholder entry for speak with no implementation; both point to state object 1, and the inline cache at cat.speak() records method speak *1*, pstate, and state 1.)
  48. Mixin installation (Same inline caching code as above.) (Diagram: the Training module with speak() { *2* } is inserted between Cat and Animal; the state object value is bumped from 1 to 2, so the next call misses, re-looks-up, and caches speak *2* with state 2.)
  49. Mixin removal (Same inline caching code as above.) (Diagram: Training is removed; the state object value is bumped from 2 to 3, so the next call misses again and caches speak *1* with state 3.)
  50. Alternate caching Detect repetition of mixin installation and removal: an alternate cache remembers the state value assigned to each alternative (e.g. Training installed vs. not), so a repeated mixin operation reuses a previously assigned value instead of a fresh one. Conflicts are detected by the state check. Without this, the inline cache contents oscillate between the two targets. (Diagram: alternate cache for super/speak mapping Training to 4 and Animal to 3; the IC oscillates between speak *1* with state 3 and speak *2* with state 4.)
  51. Polymorphic caching Use multiple entries in the inline cache: the PIC keeps both (state 3, speak *1*) and (state 4, speak *2*), so the oscillating state value now hits in the cache on every call. (Diagram: same alternate cache as before; the IC holds two entries.) A sketch of such a cache is given below.
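  A rough C sketch of a polymorphic inline cache whose entries are keyed by receiver class and state value, so both alternating values stay cached. All names here are illustrative; lookup_with_state stands for a hypothetical full lookup routine (analogous to the one sketched after slide 44, simplified here to return the target function directly).

    #define PIC_SIZE 4

    struct object;                                   /* receiver */
    typedef void (*method_fn)(struct object *);

    struct state_object { unsigned value; };
    struct rclass;                                   /* opaque here */
    struct object { struct rclass *klass; };

    struct pic_entry { struct rclass *klass; unsigned state; method_fn target; };

    struct pic {
        struct state_object *pstate;                 /* tracked state object */
        struct pic_entry entries[PIC_SIZE];
        int n;
    };

    /* Hypothetical full lookup that also maintains state objects. */
    extern method_fn lookup_with_state(struct rclass *klass, const char *name,
                                       struct state_object **pstate);

    method_fn pic_dispatch(struct pic *ic, struct object *recv, const char *name)
    {
        unsigned s = ic->pstate->value;
        for (int i = 0; i < ic->n; i++)
            if (ic->entries[i].klass == recv->klass && ic->entries[i].state == s)
                return ic->entries[i].target;        /* hit: both alternating states stay cached */

        /* miss: full lookup, then record a new entry (eviction policy elided) */
        method_fn m = lookup_with_state(recv->klass, name, &ic->pstate);
        if (ic->n < PIC_SIZE) {
            ic->entries[ic->n].klass = recv->klass;
            ic->entries[ic->n].state = ic->pstate->value;
            ic->entries[ic->n].target = m;
            ic->n++;
        }
        return m;
    }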
  52. Invariant 3: validity of alternate caching After any number of updates, for any IC = (pstate, state, class, name, target), either of the following holds: (a) value(pstate) != state (the IC is invalid), or (b) target = lookup(class, name). Proof scheme (induction on updates): any update either introduces a fresh value(pstate) for invalidation or preserves the correctness of the cached target.
  53. State object merge (Diagram: two call sites, animal.speak() in a loop and cat.speak(), with state objects S and Q; when the mixin is removed their lookup paths come to share one state object, the states are merged, and the overridden state object Q receives a one-time invalidation.)
  54. Alternate caching limitations Independent mixins do not conflict. If two mixins override the same method, their dynamic mixin scopes need to be properly nested. (Diagram: state object values for some method over time, with nested dynamic mixin scopes.)
  55. Overheads of proposed scheme Increased memory use: one state object per polymorphic method family, additional method entries, the alternate cache, and polymorphic inline cache entries. Some operations become slower: lookup needs to track and update state objects, and method dispatch performs explicit state object checks.
  56. Generalizations (beyond Ruby) Delegation object model: track arbitrary delegation pointer changes. Thread-local delegation: allow thread-local modification of the delegation pointer by giving state objects thread-local values.
  57. Evaluation Implementation based on Ruby 1.9.2, about 1000 lines of code. Hardware: Intel Core i7 860, 2.8 GHz.
  58. Evaluation: microbenchmarks Single method call overhead: inline cache hit with state checks, 1%; polymorphic inline caching, 49% overhead; full lookup, 2x slowdown.
  59. Dynamic mixin-heavy microbenchmark (smaller is better)
  60. Evaluation: application Application server with dynamic mixin on each request (smaller is better)
  61. Evaluation Fine-grained state tracking considerably reduces the overhead. Alternate caching brings only a small improvement: the number of call sites affected by the mixin is low, and the ratio of lookup cost to inline cache hit cost is low, about 1.6x on Ruby.
  62. Related work Dependency tracking in Self: focused on reducing recompilation rather than reducing method lookups. Inline caching for Objective-C: a state object associated with each method, but no dynamic mixin support.
  63. Summary We proposed a combination of techniques (fine-grained state tracking, alternate caching, polymorphic inline caching) to increase the efficiency of inline caching under frequent dynamic mixin installation and removal.
  64. Outline Introduction Superinstructions Dynamic mixin optimization Evaluation in compiled environment Conclusion
  65. Evaluation in compiled environment Dynamic mixin optimizations are applicable to compiled systems too. To confirm this hypothesis, we implemented a dynamic compiler for the IO language and evaluated the performance of inline caching.
  66. Idea: efficient dynamic mixin Compile the call site server.f() with a chain of state guards, one per state value seen under the repeated dynamic mixin install/removal, each guarding an inlined method body; a failing guard falls back to inline cache miss handling. (Control flow graph: if (s == 1) run inlined f1; if (s == 2) run inlined f2; otherwise handle the inline cache miss; then continue.) A sketch is given below.
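  A minimal C sketch of what such a compiled call site could look like; the names (callsite_state, f1_inlined, f2_inlined, handle_ic_miss) are illustrative assumptions and the inlined bodies are stubs.

    struct state_object { unsigned value; };
    struct server;

    /* State object tracked by this call site; its value alternates between
     * 1 and 2 as the mixin is repeatedly installed and removed. */
    static struct state_object *callsite_state;

    static void f1_inlined(struct server *srv) { /* inlined body of f without the mixin */ }
    static void f2_inlined(struct server *srv) { /* inlined body of f with the mixin installed */ }
    extern void handle_ic_miss(struct server *srv, const char *name);

    void server_f_callsite(struct server *server)
    {
        unsigned s = callsite_state->value;   /* single load of the state value */
        if (s == 1) {
            f1_inlined(server);               /* guard for state 1 */
        } else if (s == 2) {
            f2_inlined(server);               /* guard for state 2 */
        } else {
            handle_ic_miss(server, "f");      /* unexpected state: full lookup / recompile */
        }
        /* continue */
    }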
  67. Dynamic compiler Source language: IO (only a subset implemented). Compilation uses profiles collected in PICs; PICs are initialized by interpreted execution.
  68. Microbenchmark results The overhead of state checks is reasonable (less than 16 CPU cycles); the most common case has just 1 cycle of overhead.
  69. Summary and ongoing work We evaluated the inline caching optimization for dynamic mixin in a compiled environment and verified that the optimization is effective: the most common case has very low overhead. Future work: the dynamic mixin switch operation may be slow and needs to be addressed; full IO language support.
  70. Conclusion We researched two techniques: superinstructions for Ruby and dynamic mixin optimization. Our techniques improve performance while keeping the advantages of high dynamicity and low implementation complexity.
  71. Publications S. Zakirov, S. Chiba, E. Shibayama. How to select superinstructions, IPSJ PRO 2010. S. Zakirov, S. Chiba, E. Shibayama. Optimizing dynamic dispatch with fine-grained state tracking, DLS 2010.