Optimizing the Use of High Performance Software Libraries

Samuel Z. Guyer, Calvin Lin
University of Texas at Austin

Overview
• Libraries are programming languages
• Extend the capabilities of the base language
• Define new data types and operators
• Problem: languages without compilers
• Programmers are responsible for performance
• Optimization opportunities are lost
• Solution: a compiler for libraries
• Extend the compiler to library types and operators
• Support library-level optimizations

Outline
• Motivating example
• System architecture
• Details, by example
• Related work
• Conclusions

Example: Math Library

Source code:

    for (i=1; i<=N; i++) {
        d1 = 2.0 * i;
        d2 = pow(x, i);
        d3 = 1.0/z;
        d4 = cos(z);
        uint = uint/4;
        d5 = sin(y)/cos(y);
    }

Traditional optimizer:

    d1 = 0.0;
    d3 = 1.0/z;
    for (i=1; i<=N; i++) {
        d1 += 2.0;
        d2 = pow(x, i);
        d4 = cos(z);
        uint = uint >> 2;
        d5 = sin(y)/cos(y);
    }

Library-level optimizer:

    d1 = 0.0;
    d2 = 1.0;
    d3 = 1.0/z;
    d4 = cos(z);
    for (i=1; i<=N; i++) {
        d1 += 2.0;
        d2 *= x;
        uint = uint >> 2;
        d5 = tan(y);
    }

• How can a compiler do this automatically?
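The `d1` and `d2` rewrites in the library-level version can be checked with a small C sketch. Everything here is an assumption made for illustration: the function names `baseline` and `reduced` are hypothetical, and `naive_pow` is a stand-in for `pow(x, i)` written as repeated multiplication so the sketch does not depend on the math library.

```c
/* Hypothetical stand-in for pow(x, i) with a non-negative integer exponent. */
double naive_pow(double x, int i) {
    double r = 1.0;
    for (int k = 0; k < i; k++) {
        r *= x;
    }
    return r;
}

/* Source form: d1 and d2 are recomputed from scratch in every iteration. */
void baseline(double x, int n, double *d1_out, double *d2_out) {
    double d1 = 0.0, d2 = 1.0;
    for (int i = 1; i <= n; i++) {
        d1 = 2.0 * i;
        d2 = naive_pow(x, i);
    }
    *d1_out = d1;
    *d2_out = d2;
}

/* Library-level form: the multiplication becomes a running sum and the
   power becomes a running product, both updated once per iteration. */
void reduced(double x, int n, double *d1_out, double *d2_out) {
    double d1 = 0.0, d2 = 1.0;
    for (int i = 1; i <= n; i++) {
        d1 += 2.0;
        d2 *= x;
    }
    *d1_out = d1;
    *d2_out = d2;
}
```

With the real `pow`, the running product may differ from the library result in the last bits, which is one reason such a rewrite needs library-specific knowledge rather than a general-purpose rule.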

System Architecture
• The Broadway compiler
• Configurable compiler mechanisms
• Annotation file
• Conveys library-specific information
• Accompanies the usual library files

[Diagram: the application source code and the library (header files + source code + annotations) go into the Broadway compiler, which produces integrated, optimized compiled code.]

Benefits
• Practical
• One set of annotations for many applications
• Works for existing libraries and applications
• Development process essentially unchanged
• Conceptual: separation of concerns
• Compiler provides the mechanisms
• Annotations provide the library-specific expertise
• Application developer can focus on design
• OK, but how does it work?

Specifying Optimizations
• Problem: configurability
• Each library has its own optimizations
• Solution: pattern-based transformations
• Pattern: code template with meta-variables
• Action: replace, remove, move code

    pattern { ${obj:y} = sin(${obj:x})/cos(${obj:x}); }
    {
      replace { $y = tan($x); }
    }

• What about non-functional interfaces?

Non-functional Interfaces
• Problem: library calls have side-effects
• Pass-by-reference using pointers
• Complex data structures
• Example: PLAPACK parallel linear algebra [van de Geijn 1997]
• Manipulate distributed matrices through views

    PLA_Obj_horz_split_2(A, height, &A_upper, &A_lower)

[Figure: at the surface, the variables A, A_upper, A_lower; internally, view1, view2, view3, and data1.]

Dependence Annotations
• Solution: explicit dependence information
• Summarizes library routine behavior
• Requires heavy-duty pointer analyzer [Wilson & Lam, 1995]
• Supports many traditional optimizations

    procedure PLA_Obj_horz_split_2(A, height, A_upper, A_lower)
    {
      on_entry { A → view1 → data1; }
      access { view1, height }
      modify { }
      on_exit { A_upper → new view2 → data1;
                A_lower → new view3 → data1; }
    }
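The pointer structure that this annotation summarizes can be pictured with a toy C model. This is a sketch under stated assumptions, not PLAPACK's real data structures or API: the types `Data` and `View` and the function `horz_split_2` are hypothetical names chosen for illustration.

```c
#include <stdlib.h>

/* Hypothetical model: a view selects a band of rows in shared data. */
typedef struct { double *elems; int rows, cols; } Data;
typedef struct { Data *data; int row0, nrows; } View;

/* Mirrors the annotation: reads the input view and height, changes no
   matrix data, and produces two new views that point to the same data.
   Returns 0 on success, -1 on a bad height or allocation failure. */
int horz_split_2(const View *a, int height, View **upper, View **lower) {
    if (height < 0 || height > a->nrows) {
        return -1;
    }
    View *u = malloc(sizeof *u);
    View *l = malloc(sizeof *l);
    if (u == NULL || l == NULL) {
        free(u);
        free(l);
        return -1;
    }
    u->data = a->data;            /* A_upper -> new view2 -> data1 */
    u->row0 = a->row0;
    u->nrows = height;
    l->data = a->data;            /* A_lower -> new view3 -> data1 */
    l->row0 = a->row0 + height;
    l->nrows = a->nrows - height;
    *upper = u;
    *lower = l;
    return 0;
}
```

Because both results alias the data of the input view, a compiler that sees only the call signature cannot tell which later calls conflict; the `on_entry` and `on_exit` clauses state that aliasing explicitly.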

Domain Information
• PLAPACK matrices are distributed
• Optimizations exploit special cases
• Example: Matrix multiply
• Problem: how to extract this information?
• How do we describe this property?
• How do we track it through the program?

[Figure: processor grid. Depending on how the operands are distributed, PLA_Gemm( , , ) ⇒ PLA_Local_gemm or PLA_Gemm( , , ) ⇒ PLA_Rankk.]

Library-specific Analysis
• Solution: configurable dataflow analyzer
• Compiler provides interprocedural framework
• Library defines flow values
• Each library routine defines transfer functions
• Issue: how much configurability?
• Avoid exposing the underlying lattice theory
• Simple flow value type system

Analysis Annotations
• Accompany the dependence annotations

    property Distribution :
      map-of<object, {General, RowPanel, ColPanel, Local, Empty}>;

    procedure PLA_Obj_horz_split_2(A, height, A_upper, A_lower)
    {
      on_entry { A → view1 → data1; }
      access { view1, height }
      modify { }
      on_exit { A_upper → new view2 → data1;
                A_lower → new view3 → data1; }
      analyze Distribution {
        (view1 == General)  => view2 = RowPanel, view3 = General;
        (view1 == ColPanel) => view2 = Local,    view3 = ColPanel;
        (view1 == Local)    => view2 = Local,    view3 = Empty;
      }
    }
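Read as a transfer function, the `analyze` rules map the flow value of the input view to flow values for the two new views. The C sketch below is one way to picture that mapping; it is not how Broadway implements it. The names `Distribution`, `split_transfer`, and the `DIST_*` constants are hypothetical, and `DIST_UNKNOWN` is an added assumption for inputs the slide gives no rule for.

```c
/* Flow values from the property declaration, plus an assumed fallback. */
typedef enum {
    DIST_GENERAL,
    DIST_ROW_PANEL,
    DIST_COL_PANEL,
    DIST_LOCAL,
    DIST_EMPTY,
    DIST_UNKNOWN   /* assumption: used when no rule applies */
} Distribution;

/* Transfer function for the horizontal split.
   Returns 1 if one of the three annotated rules applied, 0 otherwise. */
int split_transfer(Distribution view1,
                   Distribution *view2, Distribution *view3) {
    switch (view1) {
    case DIST_GENERAL:
        *view2 = DIST_ROW_PANEL;
        *view3 = DIST_GENERAL;
        return 1;
    case DIST_COL_PANEL:
        *view2 = DIST_LOCAL;
        *view3 = DIST_COL_PANEL;
        return 1;
    case DIST_LOCAL:
        *view2 = DIST_LOCAL;
        *view3 = DIST_EMPTY;
        return 1;
    default:
        *view2 = DIST_UNKNOWN;
        *view3 = DIST_UNKNOWN;
        return 0;
    }
}
```

The dataflow framework would apply a function like this at every call to the routine, so the library author writes only the rules and never touches the underlying lattice machinery.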

Using the Results of Analysis
• Patterns can test the flow values

    pattern { PLA_Gemm(${obj:A}, ${obj:B}, ${obj:C}); }
    {
      when ((Distribution[viewA] == Local) &&
            (Distribution[viewB] == Local) &&
            (Distribution[viewC] == Local))
        replace { PLA_Local_gemm($A, $B, $C); }
      on_entry { A → viewA → dataA;
                 B → viewB → dataB;
                 C → viewC → dataC; }
    }

• DFA and patterns are complementary
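The `when` clause amounts to a guard evaluated on the analysis results at each matching call site. A minimal C sketch of that decision follows; `select_gemm`, `GemmChoice`, and the `DIST_*` constants are hypothetical names used only for illustration, and the real compiler makes this choice at compile time rather than through a runtime function.

```c
/* Flow values of the Distribution property. */
typedef enum {
    DIST_GENERAL,
    DIST_ROW_PANEL,
    DIST_COL_PANEL,
    DIST_LOCAL,
    DIST_EMPTY
} Distribution;

/* Which routine the call site should end up invoking. */
typedef enum { CALL_PLA_GEMM, CALL_PLA_LOCAL_GEMM } GemmChoice;

/* Replace the general routine only when all three operand views are
   known to be Local; otherwise leave the original call in place. */
GemmChoice select_gemm(Distribution a, Distribution b, Distribution c) {
    if (a == DIST_LOCAL && b == DIST_LOCAL && c == DIST_LOCAL) {
        return CALL_PLA_LOCAL_GEMM;
    }
    return CALL_PLA_GEMM;
}
```

Keeping the original call as the default makes the transformation conservative: an imprecise analysis loses an optimization but does not change program behavior.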

Status
• Prototype
• Partially automated
• Significant speed-up for PLAPACK
• Current status
• Interprocedural pointer and dependence analyzer
• Continuing work on annotation language
• Dataflow analyzer and pattern matcher in progress

Related Work
• Supporting work
• Dataflow analysis, pointer analysis
• Pattern matching, partial evaluation
• Pattern-based code generators
• Configurable compilers
• Optimizer generators (Genesis)
• PAG abstract interpretation system
• Open compilers (Magik, SUIF, MOPS)
• Software generators and transformation systems
• Specialization (Synthetix, Speckle)

Conclusions
• Many opportunities
• Many existing libraries and applications
• Future: class libraries
• Not easy
• Complexity of libraries
• Configurability: power versus usability
• We have a promising solution
• Good initial results
• Many interesting research directions

PLAPACK Results
• Compared three Cholesky programs
• Baseline: clean and simple, but still fast
• Hand-optimized by PLAPACK group
• Broadway: automatic analysis, manual transforms

[Plots: Cholesky (3072×3072) against Processors (0 to 40); PLA_Trsm kernel against Matrix Size (500 to 4500). Vertical axes: MFLOPS and MFLOPS/Proc. Series: Broadway, Hand-optimized, Baseline.]
