Cross-Module Optimization Thomas Lindgren ftl@acm.org

Overview
• OM - optimization manager
  • Erlang-to-Erlang optimizer (mostly)
  • ~20k lines of Erlang
  • intended to accelerate large applications
• The rest of this talk
  • What does OM do?
  • How well does it work?

OM overview
[Figure: pipeline diagram. Box labels, in the order they appear on the slide: source code; profiling code; annotation trees; training exec; higher-order elimination; apply open-coding; outlining; module splitting; aggregation (with other modules); inlining; simplification; production exec.]

Profiling and annotation
• Instrument code with profiling counters
  • standard counters (per function clause, per call site, …)
  • which modules call each other, how often
  • which function is used at apply
• Annotations saved as syntax trees + counters
• Post-training: read counters, decorate annotation trees, optimize the result
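
The counter instrumentation described above might look roughly like the following sketch. It assumes an ETS table keyed by a site identifier. The module name om_prof_sketch, the table name, and the key shape {clause, Function, ClauseNo} are all hypothetical and are not taken from OM.

```erlang
-module(om_prof_sketch).
-export([init/0, hit/1, count/1, demo/0]).

%% Hypothetical counter table; OM's real representation is not shown in the talk.
-define(TAB, om_prof_sketch_counters).

%% Create the counter table, or clear it if it already exists.
init() ->
    case ets:info(?TAB) of
        undefined -> ets:new(?TAB, [named_table, public, set]);
        _ -> ets:delete_all_objects(?TAB)
    end,
    ok.

%% Increment the counter for Key, inserting {Key, 0} first if it is missing.
hit(Key) ->
    ets:update_counter(?TAB, Key, {2, 1}, {Key, 0}).

%% Read a counter; keys that were never hit count as 0.
count(Key) ->
    case ets:lookup(?TAB, Key) of
        [{_, N}] -> N;
        [] -> 0
    end.

%% An instrumented function: each clause bumps its own counter
%% before doing the original work.
len([]) -> hit({clause, len, 1}), 0;
len([_ | T]) -> hit({clause, len, 2}), 1 + len(T).

%% Run the instrumented code once and report both clause counters.
demo() ->
    init(),
    3 = len([a, b, c]),
    {count({clause, len, 1}), count({clause, len, 2})}.
```

After a training run, the optimizer would read such counters back and attach them to the saved syntax trees.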

Per-module optimizations
• Higher-order elimination: replace lists:map, lists:foldl, and others with specialized functions where suitable
• Apply open-coding: replace apply with explicit (open-ended) switch
• Outlining: cold (seldom-executed) clauses are moved out-of-line
• Module splitting: cold code moved into new module

Higher-order elimination
Original call:
    lists:map(fun(X) -> X+Y end, Xs)
Rewritten call:
    lists_map_0(Xs,Y)
with the specialized function:
    lists_map_0([X|A],Y) -> [X+Y|lists_map_0(A,Y)];
    lists_map_0([],Y) -> [].
(The equivalent is done for most functions in lists)
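
As a self-contained sketch of this rewrite, the module below places the generic call and its specialized form side by side so they can be compared. The module name hoe_sketch and the wrapper names generic/2 and specialized/2 are assumptions made for illustration; only lists_map_0 comes from the slide.

```erlang
-module(hoe_sketch).
-export([generic/2, specialized/2]).

%% Original form: a closure is built and lists:map/2 applies it per element.
generic(Xs, Y) ->
    lists:map(fun(X) -> X + Y end, Xs).

%% After higher-order elimination: the body of the fun is folded into a
%% first-order copy of map, and the free variable Y becomes a parameter.
specialized(Xs, Y) ->
    lists_map_0(Xs, Y).

lists_map_0([X | A], Y) -> [X + Y | lists_map_0(A, Y)];
lists_map_0([], _Y) -> [].
```

Both functions return the same list for the same arguments. The specialized version avoids creating the fun and the indirect call on every element.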

Apply open-coding
• apply(M,F,[A1,…,An])
• Profiling reveals that certain {Mod,Func,Arity} tuples are most common
• Switch on likely functions
• Enables inlining of explicit call (e.g., m1:f1(A1,A2))
    case {M,F,length(As)} of
        {m1,f1,2} -> [A1,A2] = As, m1:f1(A1,A2);
        …
        _ -> apply(M,F,As)
    end
(most general case; optimization possible when arity known, when call is local, …)
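
A runnable version of the switch above, as a sketch. It assumes the profile found {lists,reverse,1} and {erlang,max,2} to be the common targets. Those two targets, the module name apply_sketch, and the function name call/3 are chosen only for illustration and stand in for the slide's m1:f1.

```erlang
-module(apply_sketch).
-export([call/3]).

%% Open-coded apply: test for the targets the (assumed) profile marked as
%% hot, call them explicitly, and fall back to apply/3 for everything else.
call(M, F, As) ->
    case {M, F, length(As)} of
        {lists, reverse, 1} ->
            [A1] = As,
            lists:reverse(A1);
        {erlang, max, 2} ->
            [A1, A2] = As,
            erlang:max(A1, A2);
        _ ->
            %% Open-ended default: any other target still works.
            apply(M, F, As)
    end.
```

The explicit calls in the first two clauses are ordinary remote calls, so later passes can treat them like any other call site.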

Outlining
• Move cold function clauses, switch clauses, ... out-of-line
• Reduces function size => more inlining possible
  • outlining + inlining = (structured) partial inlining
• Sometimes improves pattern matching code
Before:
    case read_file(FD,Len) of
        {error,closed} -> …;
        {error,prot} -> …;
        {ok,{incomplete,Data}} -> …;
        {ok,{complete,Data}} -> …;
        X -> ...
    end
After:
    case read_file(FD,Len) of
        {ok,{complete,Data}} -> …;
        Else -> 'OUTLINED'(Else)
    end
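
The outlined form can be sketched as a complete module. The clause bodies are elided on the slide, so the return values below ({done, Data}, closed, and so on) are placeholders invented for this sketch. The call to read_file/2 is replaced by a plain argument so the module is self-contained.

```erlang
-module(outline_sketch).
-export([handle/1]).

%% Hot path: only the common case stays in line. Everything else is
%% passed to the outlined function.
handle(Result) ->
    case Result of
        {ok, {complete, Data}} -> {done, Data};
        Else -> 'OUTLINED'(Else)
    end.

%% Cold clauses moved out of line. With module splitting, a function
%% like this is a candidate for the cold module.
'OUTLINED'({error, closed}) -> closed;
'OUTLINED'({error, prot}) -> protocol_error;
'OUTLINED'({ok, {incomplete, Data}}) -> {more, Data};
'OUTLINED'(X) -> {unexpected, X}.
```

Because handle/1 is now small, it becomes a better candidate for inlining at its own call sites, which is the partial-inlining effect mentioned above.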

Module splitting
• Hot code retained in original module
• Cold functions moved into "cold module"
  • currently: duplicate entire original module
• Calls to cold functions re-routed to cold module
  • outlined function clauses often end up in cold module
• Benefit: reduces hot module size => more aggregation
  • drawback: total code size increases (unimportant?)

Aggregation
• Optimization across module boundaries
• Merge optimized hot modules into aggregates
  • optimize each aggregate aggressively
• Problem: in Erlang, any module can be replaced at any time, including at runtime ("hot code loading")
  • how to do it?

Hot code loading
• Remote calls m:f(X) logically do the following:
  • lookup module named m
  • lookup function named f/1 in the found module
  • call the found function
• A new version of m can be loaded at any time
  • but this occurs seldom in practice (every month? week?)
  • (an aside: OTP further structures code replacement)
  • we do not take advantage of this

Hot code loading (2)
• Inlining of remote calls is not possible
  • what if the inlined module subsequently changes?
  • worse, remote calls are very common
• Merging two modules into one is problematic
  • making remote calls into local calls changes behaviour
  • safe approach: speculate that code has not changed.

Hot code loading (3)
• Remote call is rewritten into test + common-case local call + backup remote call
• latest(m) can be implemented in linker
  • initially, always true
  • when new m loaded, becomes always false
Original call:
    m:f(X1,X2)
Rewritten call:
    (case latest(m) of
        true -> local_f(X1,X2);
        false -> m:f(X1,X2)
     end)
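
The guarded call can be tried out with a simulated latest/1. In the sketch below the linker-maintained test is replaced by a flag stored with persistent_term (available in OTP 21.2 and later). That choice, the module name latest_sketch, the helpers mark_reloaded/1 and reset/1, and the use of lists:append/2 as the merged remote function are all assumptions made for illustration.

```erlang
-module(latest_sketch).
-export([call_f/2, mark_reloaded/1, reset/1]).

%% Stand-in for the linker-maintained test: true until a newer version
%% of Mod has been loaded. Defaults to true when no flag is stored.
latest(Mod) ->
    persistent_term:get({?MODULE, latest, Mod}, true).

%% Simulate "a new version of Mod was loaded": latest(Mod) is now false.
mark_reloaded(Mod) ->
    persistent_term:put({?MODULE, latest, Mod}, false).

%% Forget the flag again so that latest(Mod) returns to its default.
reset(Mod) ->
    _ = persistent_term:erase({?MODULE, latest, Mod}),
    ok.

%% Rewritten form of the remote call lists:append(X1, X2):
%% test + common-case local call + backup remote call.
call_f(X1, X2) ->
    case latest(lists) of
        true -> local_append(X1, X2);
        false -> lists:append(X1, X2)
    end.

%% Local copy of the remote function, as it would exist in an aggregate.
local_append(X1, X2) -> X1 ++ X2.
```

Once mark_reloaded(lists) has been called, call_f/2 takes the remote path, so a newly loaded version of the module is picked up as it would be with the original remote call.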

Aggregation
• Merge modules that call each other often
  • use module-module call profile
  • remote calls are rewritten to use latest(m)
  • aggregation limited by size
• Widely-shared modules (e.g., lists) are engulfed
  • copy engulfed module into the calling module
  • necessary to enable high-quality aggregation without huge aggregates
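
One plausible way to turn a module-module call profile into size-limited aggregates is a greedy merge: visit module pairs from most to least frequent calls and merge their groups while the combined size stays under a limit. The talk does not state which algorithm OM uses, so the sketch below is an assumption. The module name aggr_sketch, the input shapes, and the size units are hypothetical, and engulfing is not modelled.

```erlang
-module(aggr_sketch).
-export([aggregate/3]).

%% Calls:  [{{Caller, Callee}, Count}]  module-module call profile
%% Sizes:  #{Module => Size}            must contain every module in Calls
%% Limit:  maximum total size of one aggregate
%% Returns a sorted list of aggregates, each a sorted list of modules.
aggregate(Calls, Sizes, Limit) ->
    Mods = maps:keys(Sizes),
    %% Every module starts in its own group, named after itself.
    Groups0 = maps:from_list([{M, [M]} || M <- Mods]),
    Owner0 = maps:from_list([{M, M} || M <- Mods]),
    %% Most frequently communicating pairs are considered first.
    Sorted = lists:reverse(lists:keysort(2, Calls)),
    {Groups, _Owner} =
        lists:foldl(
          fun({{A, B}, _Count}, Acc) -> merge(A, B, Sizes, Limit, Acc) end,
          {Groups0, Owner0}, Sorted),
    lists:sort([lists:sort(Ms) || Ms <- maps:values(Groups)]).

%% Merge the groups of A and B unless they are already the same group
%% or the merged group would exceed the size limit.
merge(A, B, Sizes, Limit, {Groups, Owner} = Acc) ->
    GA = maps:get(A, Owner),
    GB = maps:get(B, Owner),
    case GA =:= GB of
        true ->
            Acc;
        false ->
            Ms = maps:get(GA, Groups) ++ maps:get(GB, Groups),
            case lists:sum([maps:get(M, Sizes) || M <- Ms]) =< Limit of
                true ->
                    Groups1 = maps:put(GA, Ms, maps:remove(GB, Groups)),
                    Owner1 = lists:foldl(
                               fun(M, O) -> maps:put(M, GA, O) end,
                               Owner, Ms),
                    {Groups1, Owner1};
                false ->
                    Acc
            end
    end.
```

With a profile in which a calls b most often, the pair {a,b} is merged first. A later pair that would push that group past the limit is skipped rather than splitting an existing aggregate.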

Post-aggregation optimization
• Profile-guided inlining
  • consider call sites in order of importance (# calls)
  • total amount of inlining limited by code size increase
  • avoids pitfalls of static inlining: working on wrong code, too conservative for important sites
• Simplification of resulting code
  • dead function removal (occurs due to engulfing, inlining)
  • case-of-case, beta reduction, ...
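
The site-selection policy for profile-guided inlining can be sketched as a greedy pass over call sites under a code-growth budget. This illustrates the two bullets above and is not OM's actual heuristic. The module name inline_sketch, the {SiteId, CallCount, SizeCost} tuple shape, and the notion of a single numeric budget are assumptions.

```erlang
-module(inline_sketch).
-export([select/2]).

%% Sites:  [{SiteId, CallCount, SizeCost}]
%% Budget: total code size increase allowed for inlining
%% Returns the ids of the sites chosen for inlining, hottest first.
select(Sites, Budget) ->
    %% Consider call sites in order of importance (number of calls).
    ByHeat = lists:reverse(lists:keysort(2, Sites)),
    {Chosen, _Left} =
        lists:foldl(
          fun({Id, _Calls, Cost}, {Acc, Left}) when Cost =< Left ->
                  %% The site fits in the remaining budget: inline it.
                  {[Id | Acc], Left - Cost};
             (_Site, State) ->
                  %% Too expensive for what is left: skip this site.
                  State
          end,
          {[], Budget}, ByHeat),
    lists:reverse(Chosen).
```

A static inliner without counts would have to guess which sites matter. Here a rarely executed site is considered last, so it is inlined only if budget remains after the hotter sites have been handled.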

Results
• Benchmarks: important subsystems of OTP, daily use
  • (decode1: protocol processing "inner loop")
  • beam: beam compiler on lists.erl
  • gen_tcp: small messages over local socket
  • ldapv2: encoding and decoding LDAPv2 ASN.1 PDUs
  • mnesia: realtime database running simple pseudo-HLR
• Benchmark suite freely available from author

Results (2)
• Each benchmark compiled with OM
  • same input used for training and production
  • latest(m) simulated with cheap test
• Each benchmark run 30-40 times for baseline and optimized
  • removed outliers for gen_tcp and mnesia to get more focussed speedup values

Conclusions
• Optimization across modules beneficial
• Profile-driven optimization practical and beneficial
• Future work:
  • try real applications (100s-1000s of modules)
  • more optimizations
  • tune optimizations
  • automate reprofiling/recompilation