
Active Library Infrastructure for Multicore Performance




  1. Active Library Infrastructure for Multicore Performance
  Paul Kelly, Software Performance Optimisation Group, Imperial College London, http://www.doc.ic.ac.uk/~phjk
  Joint work with Jay Cornwall, Tony Field, Francis Russell, Michael Mellor, Lee Howes, Olav Beckmann and others (at Imperial); Bruno Nicoletti, Phil Parsonage (at The Foundry Ltd)
  Dagstuhl, September 2007

  2. PMUP
  • Parallel Programming for Ubiquitous Parallelism…
  • Just don’t do it
  • Promote high-level programming
    • With a simple recipe for performance
    • That we can map and remap to diverse architectures
  • It’s not general-purpose
    • That’s a good thing
  • Let’s mop up lots of application domains
  • Let’s argue in five years about what’s left

  3. Active libraries
  • Active libraries are libraries that play an active part in the compilation of their client code
    • For performance optimisation
    • For detecting errors
  • Potential performance benefits:
    • Cross-call loop fusion
    • Code variant selection
    • Matching pre-optimised composite operations
    • Intermediate results: choosing representation, layout, alignment, distribution
  • This talk reviews some of our experience building active libraries
  • Aim: design generic infrastructure to support active libraries
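Cross-call loop fusion is easiest to see with a delayed-evaluation sketch. The code below is hypothetical illustration, not DESOLIN's or FIPL's actual API: a handle records elementwise operations instead of executing them, so a chain of library calls evaluates in a single fused loop over the data.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical sketch: a handle that records elementwise operations
// instead of executing them, so a chain of calls runs as one loop.
struct DelayedVector {
    const std::vector<double>* source;
    std::function<double(double)> op;

    explicit DelayedVector(const std::vector<double>* src)
        : source(src), op([](double x) { return x; }) {}
    DelayedVector(const std::vector<double>* src,
                  std::function<double(double)> f)
        : source(src), op(std::move(f)) {}

    // Compose instead of computing: no intermediate vector is built.
    DelayedVector map(std::function<double(double)> f) const {
        auto g = op;
        return DelayedVector(source,
                             [g, f](double x) { return f(g(x)); });
    }

    // Force point: one pass over the data evaluates the whole chain.
    std::vector<double> evaluate() const {
        std::vector<double> result;
        result.reserve(source->size());
        for (double x : *source) result.push_back(op(x));
        return result;
    }
};
```

At the force point the two client calls run as one loop with no intermediate array — the same payoff fusion gives the image pipelines discussed later, here at function-composition granularity.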

  4. Active libraries - examples
  • A selection of active libraries we’ve developed:
  • DESOBLAS (LCR98)
    • Parallel dense BLAS for clusters
    • Automatically selects array alignment to minimise redistribution
  • DESOLIN (LCSD06)
    • Serial dense BLAS (standard C++ and Matlab interfaces)
    • Aggressive loop fusion
  • MayaVi/DSI (LCPC05)
    • Large Python CFD visualisation tool based on VTK
    • Transparently parallelised for SMP and clusters (+ smart LoD, RoI)
  • Aggregation of remote method invocations in Java and .Net
    • Various run-time, static and hybrid implementations (Middleware 03, CC07)
  • Visual Effects for The Foundry (LCPC07)
    • Redesign of The Foundry’s Fundamental Image Processing Library
    • Aggressive, skewed loop fusion, vectorisation (multicore PC)

  5. Case study: Visual Effects
  • The Foundry is a small London company building visual effects plug-ins for the movie/TV industry (http://www.thefoundry.co.uk/)
    • Core competence: image processing algorithms
    • Core value: a large body of C++ code based on a library of image-based primitives
  • Opportunity 1:
    • Competitive advantage from exploiting whatever hardware the customer may have - SSE, multicore, vendor libraries, GPUs
  • Opportunity 2:
    • Redesign of FIPL, the Foundry’s Image Processing Primitives Library
  • Risk:
    • Performance hacking delays delivery
    • Performance hacking reduces the value of the core codebase

  6. Case study: Visual Effects
  • The brief:
    • Recommend an architecture for FIPL that supports mapping onto diverse upcoming hardware
  • Key concerns:
    • Loop fusion across multiple components is essential to performance
    • Without it, SSE is useless
    • Without it, multicore speedup is small
  • (See our forthcoming LCPC07 paper for the full story)

  7. Apple’s Shake compositing tool (http://www.apple.com/shake/)
  • The Foundry’s visual effects appear as nodes in the data flow graph on the right
  • We aim to optimise individual effects for multicore CPUs and to tunnel optimisations across graph boundaries at runtime

  8. The image degraining visual effect – a complete plug-in written by The Foundry
  • Random texturing noise introduced by photographic film is removed without compromising the clarity of the picture, either through analysis or by matching against a database of known film grain patterns
  • Based on an undecimated wavelet transform
  • Up to several seconds per frame

  9. One iteration of image degraining in component form
  • In the complete degraining effect this is repeated four times in sequence
  • Blue boxes represent components, green boxes represent data handles
  • The programmer explicitly constructs a representation of the dataflow by combining image handles:

      ImageHandle input(new Image(width, height, components));
      ImageHandle pSum1(new Image(width, height, components));
      ImageHandle pSum2(new Image(width, height, components));
      ImageHandle LL(new Image(width, height, components));
      ImageHandle highY, lowY, HH, LH, HL, HHx, LHx, HLx;

      VertDWT(input, highY, lowY, filterHeight, pass);
      HorizDWT(highY, HH, LH, filterWidth, pass);
      HorizDWT(lowY, HL, LL, filterWidth, pass);
      Proprietary(HH, HHx);
      Proprietary(LH, LHx);
      Proprietary(HL, HLx);
      Sum(HHx, LHx, pSum1);
      Sum(pSum1, HLx, pSum2);

  10. Each component is an instance of a generic “skeleton”
  • Loop structure & dependence metadata come from the skeleton
  • The vertical wavelet transform step is an instance of the Filter1D skeleton
  • AccessRegion is an instance parameter giving the stencil width
  • Multiple implementations of the loop body can be provided
    • Scalar and Vector shown - GPU etc. planned

  11. Code generation
  • Code is generated at run-time:
    • Using the wonderful CLooG library (Cédric Bastoul et al, http://www.cloog.org/)
  • Walk the component dataflow DAG
  • Construct a CLooG loop schedule
    • Selecting loop kernels from components
    • Respecting the dependence constraints imposed by the dependence metadata the components carry
  • Get CLooG to generate C loops
  • Compile, link into the app and run*
  [* The Foundry hate the idea of doing this dynamically - we actually implement an offline generator workaround so end-customers don’t need a compiler]

  12. Code generation - fusion
  • Key optimisation is loop fusion
  • A little tricky… for example:

      for (i = 1; i < N; i++)
        V[i] = (U[i-1] + U[i+1]) / 2;
      for (i = 1; i < N; i++)
        W[i] = (V[i-1] + V[i+1]) / 2;

  • “Stencil” loops are not directly fusable

  13. We make them fusable by shifting:

      V[1] = (U[0] + U[2]) / 2;
      for (i = 2; i < N; i++) {
        V[i] = (U[i-1] + U[i+1]) / 2;
        W[i-1] = (V[i-2] + V[i]) / 2;
      }
      W[N-1] = (V[N-2] + V[N]) / 2;

  • The middle loop is fusable
  • We get lots of little edge bits

  14. [Figure: component dataflow DAG annotated with per-dimension shift factors, e.g. (2,2), (0,2)]
  • We walk the dataflow graph and calculate the shift factor (in each dimension) required to enable fusion
  • Shift factors accumulate at each layer of the DAG
  • We build this shift factor into the CLooG schedule
  • CLooG takes care of generating all the loops, including the edge bits
  • We also need to introduce buffering to carry values from previous rows/columns (array contraction of intermediate images)
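One way the accumulation step could look, as a hedged one-dimensional sketch (the real system works per dimension, with pairs like (2,2); the names `Node`, `accumulatedShift` and the exact accumulation rule are illustrative assumptions, not the implemented algorithm): each node's total shift is its own stencil radius plus the largest accumulated shift among its consumers, so shifts compound layer by layer down the DAG.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Illustrative model: one DAG node per component, carrying its own
// stencil radius and the names of the components that consume it.
struct Node {
    int stencilRadius;                  // shift this component needs
    std::vector<std::string> consumers; // downstream components
};

// Accumulated shift = own radius + max accumulated shift downstream.
int accumulatedShift(const std::map<std::string, Node>& dag,
                     const std::string& name) {
    const Node& n = dag.at(name);
    int downstream = 0;
    for (const auto& c : n.consumers)
        downstream = std::max(downstream, accumulatedShift(dag, c));
    return n.stencilRadius + downstream;
}
```

For a three-stage chain of radii 2, 2 and 0 (like a DWT step feeding another stencil feeding a pointwise sum), the input-most component ends up shifted by 4.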

  15. With enough shifting, all the loops in the degrain algorithm can be fused
  • Across all four iterations
  • Lots of edge bits
    • 90 loops
    • >15,000 lines of C code
  • It’s not always optimal to fuse blindly
  [Figure: performance of degraining on a 4000x3000x3 padded RGB float32 image. Compiled with Intel C/C++ 10.0.025 using flags ’-O3 -funroll-loops’ and an architecture-matched -x flag (-xW for Opteron) on a Linux 2.6 kernel in 64-bit mode where compatible.]

  16. Wavelet-based degrain algorithm
  • 4000x3000x3 padded RGB image
  • Dual Quad-Core Xeon X5355 2.6GHz (8 cores) with 8MB L2 cache
  • Much faster
    • Enables vectorisation to pay off
    • Enables multicore to pay off
  • With >4 cores, limited by main memory bandwidth
  • They had already parallelised - but weren’t getting performance
  [Figure: compiled with Intel C/C++ 10.0.025 using flags ’-O3 -funroll-loops’ and an architecture-matched -x flag (-xW for Opteron) on a Linux 2.6 kernel in 64-bit mode where compatible.]

  17. Delivering active libraries…
  • Active library delivery techniques:
  • Source-to-source
    • ROSE, Stratego, TXL
  • Template meta-programming
    • C++, but also Template Haskell, OCaml camlp4
  • Compiler plug-in (IR-to-IR)
    • Cf. Engler’s Magik; Condate and Drvar for GCC
  • Binary-to-binary, bytecode-to-bytecode
    • Several Java and MSIL tools, including our DeepWeaver
    • Offline or at class-load time
    • Aspect weavers?
  • JIT plug-in
    • Valgrind skins, Pin, Dynamo/RIO, LLVM
    • Java and MSIL?
  • Delayed evaluation
    • Generic proxy libraries for Python, Matlab
    • Call interposition - aspect weavers
    • Domain-specific interpreter
  • Runtime code generation
  • Explicit flow graph construction
    • Workflow composition tools
  • Multistage programming (MetaOCaml, our TaskGraph metaprogramming library)

  18. Why are active libraries part of the multicore revolution?
  • Why does multicore make active libraries (more) necessary?
  • What happens when different visual effects from different vendors contend for resources?
  • What is the methodology for remapping to a new target architecture?
  • What expertise is needed to build active libraries?
  • What common tools, infrastructure and curriculum are needed to support this?

  19. Background
  • Active Libraries: Rethinking the Roles of Compilers and Libraries. Todd L. Veldhuizen and Dennis Gannon, 1998.
  • LCSD workshop series - Library-Centric Software Design.
  • Impact of Economics on Compiler Optimization. Arch Robison, ISCOPE/Java Grande 2001.
  • Domain-Specific Program Generation. Lengauer et al. (eds), Springer LNCS 3016, 2003.

  20. Aggregating Java RMI - Motivation
  [Figure: client–server message sequences for calls f, g, h - unaggregated: six messages (a / x / a,x / y / a,y / z); aggregated: two messages, no need to copy x and y]

      void m(RemoteObject r, int a) {
        int x = r.f(a);
        int y = r.g(a, x);
        int z = r.h(a, y);
        System.out.println(z);
      }

  • Aggregation
    • A sequence of calls to the same server can be executed in a single message exchange
    • Reduces the number of messages
    • Also reduces the amount of data transferred
      • Common parameters
      • Results passed from one call to another

  21. MayaVi
  • Tool for visualising fluid flows
  • GUI supports interactive construction of visualisation pipelines
    • E.g. fluid flow past a heated sphere: temperature isosurface with temperature-shaded streamtubes
  • I’m going to show you how we dramatically improved MayaVi interactivity:
    • By parallel execution on SMP
    • By parallel execution on a Linux cluster
    • By caching pre-calculated results
    • Without changing a single line of MayaVi or VTK code*
    • Without writing a compiler
  [* Actually we did change a few lines in VTK to fix a problem with Python’s Global Interpreter Lock]

  22. MayaVi: Working on partitioned data
  • Our ocean simulations are generated in parallel
  • Input data consists of a set of partitions (and an XML index)
  • Normally, VTK fuses these partitions into one mesh as they are read

  23. MayaVi: Working on partitioned data
  • Our ocean simulations are generated in parallel
  • Input data consists of a set of partitions (and an XML index)
  • Normally, VTK fuses these partitions into one mesh as they are read
  • Some – many – analyses can operate partition-by-partition

  24. MayaVi: what the DSI has to do
  • Capture all delayable calls to methods from a DSL through a proxy layer
  • A force point is a call which requires an immediate result - in this case, to render on screen
  • A recipe is the set of calls between consecutive force points (which can be executed in parallel)
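The proxy layer's bookkeeping can be sketched as follows (a hypothetical C++ illustration of the idea; the actual DSI is a Python proxy over VTK, and the method names here are invented): delayable calls are appended to the current recipe instead of executing, and a force point — here, `render()` — executes the accumulated recipe, which is the point at which the DSI is free to run it in parallel or on a cluster, then starts a new recipe.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical proxy: delayable calls are recorded, not executed.
class Proxy {
public:
    // Capture one delayable call into the current recipe.
    void capture(std::string name, std::function<void()> call) {
        recipe_.push_back({std::move(name), std::move(call)});
    }

    // Force point: the result is needed now, so execute the whole
    // recipe (the DSI could run it in parallel here), then reset.
    std::size_t render() {
        for (auto& step : recipe_) step.call();
        std::size_t executed = recipe_.size();
        recipe_.clear();  // next recipe starts after this force point
        return executed;
    }

private:
    struct Step { std::string name; std::function<void()> call; };
    std::vector<Step> recipe_;
};
```

Because nothing runs until the force point, the proxy sees the whole recipe at once — the same whole-pipeline visibility that lets the image-processing work fuse loops across component boundaries.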
