
STAPL: A High Productivity Programming Infrastructure for Parallel & Distributed Computing



  1. STAPL: A High Productivity Programming Infrastructure for Parallel & Distributed Computing Lawrence Rauchwerger Parasol Lab, Dept of Computer Science Texas A&M University http://parasol.tamu.edu/~rwerger/

  2. Motivation • Parallel programming is costly • Parallel programs are not portable • Scalability & efficiency are (usually) poor • Dynamic programs are even harder • Small-scale parallel machines are ubiquitous

  3. Our Approach: STAPL • STAPL: Parallel components library • Extensible, open ended • Parallel superset of STL • Sequential inter-operability • Layered architecture: User - Developer - Specialist • Portable (only the lowest layer needs to be specialized) • High productivity environment • components have (almost) sequential interfaces (see the sketch below)
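
To illustrate the "(almost) sequential interfaces" claim, the sketch below contrasts an STL call with what the STAPL-style equivalent might look like. The stapl::pvector / stapl::p_sort names and header are assumptions modeled on the pContainer/pAlgorithm terminology used in these slides, not the library's exact API.

    #include <algorithm>
    #include <vector>

    // Sequential STL: sort a std::vector.
    void sequential_sort(std::vector<double>& v) {
      std::sort(v.begin(), v.end());
    }

    // STAPL-style parallel equivalent (hypothetical names; the real
    // library's headers and signatures may differ):
    //
    //   #include <stapl/pvector.h>
    //   void parallel_sort(stapl::pvector<double>& pv) {
    //     stapl::p_sort(pv.get_prange());   // pRange binds the pAlgorithm
    //   }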

  4. STAPL Specification • STL philosophy • Shared object view • User layer: no explicit communication • Machine layer: architecture-dependent code • Distributed objects • no replication • no software coherence • Portable efficiency • Runtime system virtualizes the underlying architecture • Concurrency & communication layer • SPMD (for now) parallelism

  5. STAPL Applications • Motion Planning • Probabilistic roadmap methods for motion planning, with applications to protein folding, intelligent CAD, animation, robotics, etc. • Molecular Dynamics • A discrete event simulation that computes interactions between particles. [Figures: motion planning scene with start, goal, and obstacles; protein folding]

  6. STAPL Applications • Particle Transport Computation • Efficient massively parallel implementation of discrete ordinates particle transport calculation. • Seismic Ray Tracing • Simulation of the propagation of seismic rays in the earth's crust. [Figures: seismic ray tracing; particle transport simulation]

  7. STAPL Overview • Data is stored in pContainers • Parallel equivalents of all STL containers & more (e.g., pGraph) • STAPL provides generic pAlgorithms • Parallel equivalents of STL algorithms & more (e.g., list ranking) • pRanges bind pAlgorithms to pContainers • Similar to STL iterators, but also support parallelism [Diagram: pAlgorithm and pContainer connected by a pRange, on top of the Runtime System with its Scheduler and Executor]

  8. STAPL Overview • pContainers • pRange • pAlgorithms • RTS & ARMI Communication Infrastructure • Applications using STAPL

  9. pContainer Overview pContainer: A distributed (no replication) data structure with parallel (thread-safe) methods • Ease of Use • Shared object view • Handles data distribution and remote data access internally (no explicit communication) • Efficiency • De-centralized distribution management • OO design to optimize specific containers • Minimum overhead over STL containers • Extensibility • A set of base classes with basic functionality • New pContainers can be derived from the base classes with extended and optimized functionality

  10. pContainer Layered Architecture • pContainer provides different views for users with different needs/levels of expertise • Basic user view: • a single address space • interfaces similar to STL containers • Advanced user view: • access to data distribution info to optimize methods • can provide customized distributions that exploit knowledge of the application [Diagram: non-partitioned and partitioned shared memory views of data, layered over the STAPL pContainer, the STAPL RTS and ARMI, and data in shared or distributed memory]

  11. pContainer Design • BasePContainer built from: • Base sequential container • STL containers used to store the data • Distribution manager • provides the shared object view (see the sketch below)
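
A minimal sketch of this design, assuming a hypothetical block distribution; this is illustrative, not STAPL's actual code.

    #include <cstddef>
    #include <vector>

    // Hypothetical distribution manager: maps a global index to the
    // thread that owns it and to a local offset.
    struct DistributionManager {
      std::size_t num_threads;
      std::size_t block;                        // block-distribution size
      std::size_t owner(std::size_t i) const { return i / block; }
      std::size_t local(std::size_t i) const { return i % block; }
    };

    template <typename T>
    class BasePContainer {
      std::vector<T> local_data_;               // base sequential container
      DistributionManager dist_;                // provides shared object view
      std::size_t my_id_;
    public:
      // Shared-object view: the caller uses a global index; access to a
      // remote owner would go through ARMI in the real library.
      T get(std::size_t i) const {
        if (dist_.owner(i) == my_id_)
          return local_data_[dist_.local(i)];   // local access
        // else: issue an ARMI request to the owning thread (omitted)
        return T{};
      }
    };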

  12. STAPL Overview • pContainers • pRange • pAlgorithms • RTS & ARMI Communication Infrastructure • Applications using STAPL

  13. pRange Overview • Interface between pAlgorithms and pContainers • pAlgorithms expressed in terms of pRanges • pContainers provide pRanges • Similar to STL iterator • Parallel programming support • Expression of computation as a parallel task graph • Stores DDGs (data dependence graphs) used in processing subranges • Less abstract than STL iterator • Access to pContainer methods • Expresses the data/task parallelism duality

  14. pRange • View of a work space • Set of tasks in a parallel computation • Can be recursively partitioned into subranges • Defined on disjoint portions of the work space • Leaf subrange in the hierarchy • Represents a single task • Smallest schedulable entity • Task: • Function object to apply • Using the same function for all subranges results in SPMD • Description of the data to which the function is applied (see the sketch below)
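
A minimal sketch of the idea (illustrative, not STAPL's API): a range recursively split into leaf subranges, each pairing a function object with the disjoint slice of the work space it applies to.

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <vector>

    struct SubRange {
      std::size_t begin, end;                    // disjoint slice of work space
      std::function<void(std::size_t)> work;     // function object to apply
    };

    // Split [begin, end) into `parts` leaf subranges sharing one function
    // object -- using the same function everywhere gives SPMD execution.
    std::vector<SubRange> partition(std::size_t begin, std::size_t end,
                                    std::size_t parts,
                                    std::function<void(std::size_t)> f) {
      std::vector<SubRange> leaves;
      std::size_t chunk = (end - begin + parts - 1) / parts;
      for (std::size_t lo = begin; lo < end; lo += chunk)
        leaves.push_back({lo, std::min(lo + chunk, end), f});
      return leaves;
    }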

  15. pRange Example • pRange defined on application data; application data stored in a pMatrix • Each subrange is a task • Boundary of each subrange is a set of cut edges • Data from several threads may lie in a subrange • If the pRange partition matches the data distribution, then all data access is local [Diagram: six subranges with function objects, distributed across two threads, with dependences between subranges]

  16. pRange Example • Subranges of the pRange • Matrix elements may lie in several subranges • Each subrange has a boundary and a function object • Data from several threads may lie in a subrange • pMatrix is distributed • If the subrange partition matches the data distribution, then all data access is local • DDGs can be defined on subranges of the pRange and on elements inside each subrange (no DDG is shown here) • Subranges can be recursively partitioned

  17. Overview • pContainers • pRange • pAlgorithms • RTS & ARMI Communication Infrastructure • Applications using STAPL

  18. pAlgorithms • pAlgorithm is a set of parallel task objects (see the sketch below) • input for parallel tasks specified by the pRange • (intermediate) results stored in pContainers • ARMI used for communication between parallel tasks • pAlgorithms in STAPL • Parallel counterparts of STL algorithms provided in STAPL • STAPL contains additional parallel algorithms • List ranking • Parallel strongly connected components • Parallel Euler tour • etc.
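
A sketch of the pAlgorithm idea: a parallel for_each built as one task per subrange. This stands in for the real thing with std::thread rather than the STAPL executor and ARMI; the function name is invented.

    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    template <typename T, typename Func>
    void p_for_each_sketch(std::vector<T>& data, Func f,
                           unsigned nthreads = std::thread::hardware_concurrency()) {
      if (nthreads == 0) nthreads = 1;
      std::vector<std::thread> workers;
      std::size_t chunk = (data.size() + nthreads - 1) / nthreads;
      for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t lo = t * chunk;
        std::size_t hi = std::min(data.size(), lo + chunk);
        // each task: apply the function object to one subrange
        workers.emplace_back([&data, f, lo, hi] {
          for (std::size_t i = lo; i < hi; ++i) f(data[i]);
        });
      }
      for (auto& w : workers) w.join();
    }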

  19. Algorithm Adaptivity in STAPL • Problem: parallel algorithms are highly sensitive to: • Architecture - number of processors, memory interconnection, cache, available resources, etc. • Environment - thread management, memory allocation, operating system policies, etc. • Data characteristics - input type, layout, etc. • Solution: adaptively choose the best algorithm from a library of options at run-time • Adaptive patterns?

  20. Adaptive Framework: Overview of Approach • Given: multiple implementation choices for the same high-level algorithm. • STAPL installation: analyze each pAlgorithm's performance on the system (installation benchmarks) and create a selection model. • Program execution: gather parameters, query the model, and use the predicted algorithm. [Diagram: architecture & environment, algorithm performance, and data characteristics feed a data repository and performance model; the STAPL user's code plus the parallel algorithm choices yield an adaptive executable that uses run-time tests to invoke the selected algorithm]

  21. Model Generation • Installation benchmarking • Determine parameters that may affect performance (i.e., num procs, input size, algorithm specific...) • Run all algorithms on a sampling of the instance space • Insert timings from the runs into a performance database • Create a decision model • Generic interface enables learners to compete • Currently: decision tree, neural net, naive Bayes classifier • Selection based on predicted accuracy (10-way validation test) • The "winning" learner outputs a query function in C++: func* predict_pAlgorithm(attribute1, attribute2, ...) (see the sketch below)
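
A hypothetical example of the kind of query function a decision-tree learner might emit. The algorithm stubs and all thresholds below are invented for illustration; they are not output of the real framework.

    typedef void (*sort_fn)();

    void sample_sort() { /* stand-ins for the real pAlgorithms */ }
    void radix_sort()  { /* ... */ }
    void column_sort() { /* ... */ }

    // Each branch encodes one path through the learned decision tree.
    sort_fn predict_pAlgorithm(int num_procs, long input_size,
                               double dist_norm, long max_value) {
      if (max_value < (1L << 16))  return radix_sort;   // few radix passes needed
      if (dist_norm < 0.05)        return column_sort;  // nearly sorted input
      if (num_procs >= 64 && input_size > 1000000) return column_sort;
      return sample_sort;
    }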

  22. Runtime Algorithm Selection • Gather parameters • Immediately available (e.g., num procs) • Computed (e.g., disorder estimate for sorting) • Query the model and execute • Query function statically linked at compile time. Current work: dynamic linking with online model refinement.

  23. Experiments • Investigated two operations • Parallel sorting • Parallel matrix multiplication • Three platforms • 128-processor SGI Altix • 1152-node, dual-processor Xeon cluster • 68-node, 16-way IBM SMP cluster

  24. Parallel Sorting Algorithms • Sample Sort (see the sketch below) • Samples the input to define per-processor bucket thresholds • Scans and distributes elements to buckets • Each processor sorts its local elements • Radix Sort • Parallel version of the linear-time sequential approach • Passes over the data multiple times, each time considering r bits • Column Sort • O(lg n) running time • Requires 4 local sorts and 4 communication steps • Uses the pMatrix data structure for workspace
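
A minimal single-node sketch of sample sort's three phases, with buckets standing in for processors; the STAPL version runs the phases in parallel and exchanges buckets via ARMI.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    std::vector<std::vector<int>>
    sample_sort_sketch(const std::vector<int>& in, std::size_t p) {
      if (p == 0 || in.size() < 4 * p) p = 1;   // degenerate cases: one bucket
      // 1. Sample the input to pick p-1 bucket thresholds (splitters).
      std::vector<int> sample(in.begin(), in.begin() + 4 * p);
      std::sort(sample.begin(), sample.end());
      std::vector<int> splitters;
      for (std::size_t i = 1; i < p; ++i)
        splitters.push_back(sample[i * sample.size() / p]);
      // 2. Scan and distribute elements to the buckets ("processors").
      std::vector<std::vector<int>> buckets(p);
      for (int x : in) {
        std::size_t b = std::upper_bound(splitters.begin(), splitters.end(), x)
                        - splitters.begin();
        buckets[b].push_back(x);
      }
      // 3. Each processor sorts its local elements.
      for (auto& b : buckets) std::sort(b.begin(), b.end());
      return buckets;
    }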

  25. Sorting Attributes • Attributes used to model the sorting decision: • Processor count • Data type • Input size • Max value (smaller value ranges may favor radix sort by reducing the number of required passes) • Presortedness (training data generated by varying the initial state - sorted, random, reversed - and % displacement; quantified at runtime with a normalized average distance metric derived from input sampling; see the sketch below)
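
A sketch of one plausible sampled disorder estimate; the exact metric the framework uses is not specified on the slide, so the formula below is an assumption for illustration only.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Fraction of out-of-order sampled adjacent pairs: 0 ~ sorted,
    // larger values ~ more disorder (an assumed stand-in for dist_norm).
    double dist_norm_sketch(const std::vector<int>& v, std::size_t samples = 1000) {
      if (v.size() < 2) return 0.0;
      std::size_t stride = std::max<std::size_t>(1, v.size() / samples);
      double sum = 0.0;
      std::size_t n = 0;
      for (std::size_t i = 0; i + stride < v.size(); i += stride, ++n)
        if (v[i] > v[i + stride]) sum += 1.0;
      return n ? sum / n : 0.0;
    }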

  26. Training Set Creation • 1000 training inputs per platform, by uniform random sampling of the parameter space (table of parameter ranges not reproduced here) • *P = 64 on the Linux cluster and frost • **only for sorted and reverse-sorted inputs

  27. Model Error Rate • Model accuracy with all training inputs is 94%, 98%, and 94% on the Cluster, Altix, and SMP Cluster, respectively.

  28. Model Attributes Selected per Platform • Altix: F(p, dist_norm) • Cluster: F(p, n, dt, dist_norm, max) • SMP Cluster: F(p, n, dt, dist_norm, max)

  29. Parallel Sorting: Experimental Results [Figures: the learned Altix decision-tree model and its validation on Altix (N=120M); annotations interpret dist_norm values - 0 = sorted, 0.5 = reversed, random inputs in between]

  30. Overview • pContainers • pRange • pAlgorithms • RTS & ARMI Communication Infrastructure • Applications using STAPL

  31. Current Implementation Protocols • Shared-Memory (OpenMP/Pthreads) • shared request queues • Message Passing (MPI-1.1) • sends/receives • Mixed-Mode • combination of MPI with threads • flat view of parallelism (for now) • take advantage of shared-memory

  32. STAPL Run-Time System • Scheduler • Determines an execution order (DDG) • Policies: • Automatic: static, block, dynamic, partial self scheduling, complete self scheduling • User-defined • Executor • Executes the DDG • Processor assignment • Synchronization and communication [Diagram: a pAlgorithm handed to the Run-Time, which maps work onto Clusters 1-4; e.g., Cluster 4 contains Procs 12-15]

  33. ARMI: STAPL Communication Infrastructure • ARMI: Adaptive Remote Method Invocation • abstracts the shared-memory and message-passing communication layers • programmer expresses fine-grain parallelism that ARMI adaptively coarsens • support for sync, async, point-to-point, and group communication • ARMI can be as easy/natural as shared memory and as efficient as message passing

  34. ARMI Communication Primitives • armi_sync • ask a thread something (invoke a remote method) • blocking version • function doesn't return until the answer is received from the RMI • non-blocking version • function returns without the answer • program can poll with rtnHandle.ready() and then access the RMI's return value with rtnHandle.value() • collective operations • armi_broadcast, armi_reduce, etc. • can adaptively set groups for communication • arguments are always passed by value (see the sketch below)
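
A usage sketch of the non-blocking pattern just described. The primitive name and the ready()/value() handle methods come from the slide; the handle type and the exact signature below are assumptions for illustration.

    template <typename T>
    struct rtn_handle {                  // hypothetical return-value handle
      bool ready() const;                // has the answer arrived?
      T value() const;                   // the RMI's return value
    };

    // hypothetical non-blocking form: ask thread `dest` to run `method`
    rtn_handle<int> armi_sync(int dest, int (*method)());

    void example(int dest, int (*remote_count)()) {
      rtn_handle<int> h = armi_sync(dest, remote_count);  // returns immediately
      /* ... overlap local computation with the outstanding request ... */
      while (!h.ready()) { /* poll */ }
      int answer = h.value();            // access the answer when ready
      (void)answer;
    }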

  35. ARMI Synchronization Primitives • armi_fence, armi_barrier • tree-based barrier • implements a distributed termination algorithm to ensure that all outstanding ARMI requests have been sent, received, and serviced • armi_wait • blocks until at least one (possibly more) ARMI request is received and serviced • armi_flush • empties the local send buffer, pushing outstanding ARMI requests to remote destinations (see the usage sketch below)
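
A sketch of why the fence matters: after a batch of asynchronous remote updates, armi_fence guarantees every outstanding request has been serviced before any thread reads the results. The armi_fence name comes from the slide; armi_async and its signature are assumptions for illustration.

    void armi_async(int dest, void (*method)(int), int arg);  // hypothetical
    void armi_fence();                                        // from the slide

    void phase(int nthreads, void (*update)(int)) {
      for (int t = 0; t < nthreads; ++t)
        armi_async(t, update, /*arg=*/t);  // fire-and-forget remote updates
      armi_fence();   // distributed termination: all requests serviced
      // safe to read data modified by the updates from here on
    }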

  36. Overview • pContainers • pRange • pAlgorithms • RTS & ARMI Communication Infrastructure • Applications using STAPL

  37. Particle Transport • Q: What is the particle transport problem? • A: Particle transport is all about counting particles (such as neutrons). Given a physical volume, we want to know how many particles there are, along with their locations, directions, and energies. • Q: Why is it an important problem? • A: It is needed for the accurate simulation of complex physical systems such as nuclear reactions, and it requires an estimated 50-80% of the total execution time in multi-physics simulations.

  38. Transport Problem Applications • Oil Well Logging Tool • Shaft dug at possible well location • Radioactive sources placed in shaft with monitoring equipment • Simulation allows for verification of new techniques and tools

  39. Discrete Ordinates Method • An iterative method for solving the first-order form of the transport equation. It discretizes: • Ω, the angular directions • R, the spatial domain • E, the energy variable • Algorithm: for each direction in Ω, for each grid cell in R, for each energy group in E (see the sketch below)
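
A sketch of the loop structure this algorithm implies; schematic only, with the per-cell solve and data layout left as placeholders.

    #include <cstddef>

    void discrete_ordinates_iteration(std::size_t n_directions,  // |Omega|
                                      std::size_t n_cells,       // |R|
                                      std::size_t n_groups) {    // |E|
      for (std::size_t m = 0; m < n_directions; ++m)     // each direction in Omega
        for (std::size_t c = 0; c < n_cells; ++c)        // each grid cell in R
          for (std::size_t g = 0; g < n_groups; ++g) {
            // solve the discretized transport equation for (m, c, g);
            // in a sweep, cells are visited in dependence order (slide 42)
          }
    }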

  40. Discrete Ordinates Method

  41. TAXI Algorithm

  42. Transport Sweeps • Involves a sweep of the spatial grid for each direction in Ω. • For orthogonal grids there are only eight distinct sweep orderings. • Note: a full transport sweep must process each direction.

  43. Multiple Simultaneous Sweeps • One approach is to sequentially process each direction. • Another approach is to process all directions simultaneously. • The sequential approach requires processors to process each direction one at a time during the sweep.

  44. Sweep Dependence • Each sweep direction generates a unique dependence graph. A sweep starting from cell 1 is shown. • For example, cell 3 must wait until cell 1 has been processed, and must be processed before cells 5 and 7. • Note that all cells in the same diagonal plane can be processed simultaneously (see the sketch below).
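
A sketch of the diagonal-wavefront property on a 2-D orthogonal grid (illustrative; the cell numbering and per-cell work are placeholders): all cells with the same i+j lie on one diagonal plane and have no dependences on each other, so each wavefront can run in parallel.

    #include <cstddef>

    void sweep(std::size_t nx, std::size_t ny) {
      for (std::size_t wave = 0; wave <= nx + ny - 2; ++wave) {
        // every cell on this diagonal is independent -> parallel loop
        for (std::size_t i = 0; i <= wave && i < nx; ++i) {
          std::size_t j = wave - i;
          if (j < ny) {
            // process_cell(i, j): uses results of (i-1, j) and (i, j-1),
            // both of which belong to earlier wavefronts
          }
        }
      }
    }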

  45. pRange Dependence Graph • angle-set A, angle-set B, angle-set C, angle-set D • Numbers are cellset indices • Colors indicate processors [Diagram: sweep orderings of the cellsets, one grid per angle-set]

  46. Adding a reflecting boundary • angle-set A, angle-set B, angle-set C, angle-set D [Diagram: cellset sweep orderings for each angle-set with a reflecting boundary added]

  47. Opposing reflecting boundary • angle-set A, angle-set B, angle-set C, angle-set D [Diagram: cellset sweep orderings for each angle-set with opposing reflecting boundaries]

  48. Strong Scalability • System specs • Large, dedicated IBM cluster at LLNL • 68 nodes • 16 x 375 MHz Power3 processors and 16 GB RAM per node • Nodes connected by an IBM SP switch • Problem info • 64x64x256 grid • 6,291,456 unknowns

  49. Work in Progress (Open Topics) • STAPL algorithms • STAPL adaptive containers • ARMI v2 (multi-threaded, communication pattern library) • STAPL RTS -- K42 interface • A compiler for STAPL: • A high-level, source-to-source compiler • Understands STAPL blocks • Optimizes composition • Automates composition • Generates checkers for STAPL programs

  50. References • [1] "STAPL: An Adaptive, Generic Parallel C++ Library", Ping An, Alin Jula, Silvius Rus, Steven Saunders, Tim Smith, Gabriel Tanase, Nathan Thomas, Nancy Amato and Lawrence Rauchwerger, 14th Workshop on Languages and Compilers for Parallel Computing (LCPC), Cumberland Falls, KY, August 2001. • [2] "ARMI: An Adaptive, Platform Independent Communication Library", Steven Saunders and Lawrence Rauchwerger, ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), San Diego, CA, June 2003. • [3] "Finding Strongly Connected Components in Parallel in Particle Transport Sweeps", W. C. McLendon III, B. A. Hendrickson, S. J. Plimpton, and L. Rauchwerger, 13th ACM Symposium on Parallel Algorithms and Architectures (SPAA), Crete, Greece, July 2001. • [4] "A Framework for Adaptive Algorithm Selection in STAPL", N. Thomas, G. Tanase, O. Tkachyshyn, J. Perdue, N. M. Amato, L. Rauchwerger, ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Chicago, IL, June 2005 (to appear).
