
STAPL: A High Productivity Programming Infrastructure for Parallel & Distributed Computing



  1. STAPL: A High Productivity Programming Infrastructure for Parallel & Distributed Computing Lawrence Rauchwerger Parasol Lab, Dept of Computer Science Texas A&M University http://parasol.tamu.edu/~rwerger/

  2. Motivation • Parallel programming is costly • Parallel programs are not portable • Scalability & efficiency are (usually) poor • Dynamic programs are even harder • Small-scale parallel machines are ubiquitous

  3. Our Approach: STAPL • STAPL: Parallel components library • Extensible, open ended • Parallel superset of STL • Sequential inter-operability • Layered architecture: User - Developer - Specialist • Portable (only the lowest layer needs to be specialized) • High productivity environment • components have (almost) sequential interfaces (see the sketch below)
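
To illustrate the "(almost) sequential interfaces" claim, the sketch below contrasts an STL call with what the STAPL-style equivalent might look like. The stapl::pvector / stapl::p_sort names and header are assumptions modeled on the pContainer/pAlgorithm terminology used in these slides, not the library's exact API.

    #include <algorithm>
    #include <vector>

    // Sequential STL: sort a std::vector.
    void sequential_sort(std::vector<double>& v) {
      std::sort(v.begin(), v.end());
    }

    // STAPL-style parallel equivalent (hypothetical names; the real
    // library's headers and signatures may differ):
    //
    //   #include <stapl/pvector.h>
    //   void parallel_sort(stapl::pvector<double>& pv) {
    //     stapl::p_sort(pv.get_prange());   // pRange binds the pAlgorithm
    //   }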

  4. STAPL Specification • STL philosophy • Shared object view • User layer: no explicit communication • Machine layer: architecture-dependent code • Distributed objects • no replication • no software coherence • Portable efficiency • Runtime system virtualizes the underlying architecture • Concurrency & communication layer • SPMD (for now) parallelism

  5. STAPL Applications • Motion Planning • Probabilistic roadmap methods for motion planning, with applications to protein folding, intelligent CAD, animation, robotics, etc. • Molecular Dynamics • A discrete event simulation that computes interactions between particles. [Figures: motion planning scene with start, goal, and obstacles; protein folding]

  6. STAPL Applications • Particle Transport Computation • Efficient massively parallel implementation of discrete ordinates particle transport calculation. • Seismic Ray Tracing • Simulation of the propagation of seismic rays in the earth's crust. [Figures: seismic ray tracing; particle transport simulation]

  7. STAPL Overview • Data is stored in pContainers • Parallel equivalents of all STL containers & more (e.g., pGraph) • STAPL provides generic pAlgorithms • Parallel equivalents of STL algorithms & more (e.g., list ranking) • pRanges bind pAlgorithms to pContainers • Similar to STL iterators, but also support parallelism [Diagram: pAlgorithm and pContainer connected by a pRange, on top of the Runtime System with its Scheduler and Executor]

  8. STAPL Overview • pContainers • pRange • pAlgorithms • RTS & ARMI Communication Infrastructure • Applications using STAPL

  9. pContainer Overview pContainer: A distributed (no replication) data structure with parallel (thread-safe) methods • Ease of Use • Shared object view • Handles data distribution and remote data access internally (no explicit communication) • Efficiency • De-centralized distribution management • OO design to optimize specific containers • Minimum overhead over STL containers • Extensibility • A set of base classes with basic functionality • New pContainers can be derived from the base classes with extended and optimized functionality

  10. pContainer Layered Architecture • pContainer provides different views for users with different needs/levels of expertise • Basic user view: • a single address space • interfaces similar to STL containers • Advanced user view: • access to data distribution info to optimize methods • can provide customized distributions that exploit knowledge of the application [Diagram: non-partitioned and partitioned shared memory views of data, layered over the STAPL pContainer, the STAPL RTS and ARMI, and data in shared or distributed memory]

  11. pContainer Design • BasePContainer built from: • Base sequential container • STL containers used to store the data • Distribution manager • provides the shared object view (see the sketch below)
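
A minimal sketch of this design, assuming a hypothetical block distribution; this is illustrative, not STAPL's actual code.

    #include <cstddef>
    #include <vector>

    // Hypothetical distribution manager: maps a global index to the
    // thread that owns it and to a local offset.
    struct DistributionManager {
      std::size_t num_threads;
      std::size_t block;                        // block-distribution size
      std::size_t owner(std::size_t i) const { return i / block; }
      std::size_t local(std::size_t i) const { return i % block; }
    };

    template <typename T>
    class BasePContainer {
      std::vector<T> local_data_;               // base sequential container
      DistributionManager dist_;                // provides shared object view
      std::size_t my_id_;
    public:
      // Shared-object view: the caller uses a global index; access to a
      // remote owner would go through ARMI in the real library.
      T get(std::size_t i) const {
        if (dist_.owner(i) == my_id_)
          return local_data_[dist_.local(i)];   // local access
        // else: issue an ARMI request to the owning thread (omitted)
        return T{};
      }
    };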

  12. STAPL Overview • pContainers • pRange • pAlgorithms • RTS & ARMI Communication Infrastructure • Applications using STAPL

  13. pRange Overview • Interface between pAlgorithms and pContainers • pAlgorithms expressed in terms of pRanges • pContainers provide pRanges • Similar to STL iterator • Parallel programming support • Expression of computation as a parallel task graph • Stores DDGs (data dependence graphs) used in processing subranges • Less abstract than STL iterator • Access to pContainer methods • Expresses the data/task parallelism duality

  14. pRange • View of a work space • Set of tasks in a parallel computation • Can be recursively partitioned into subranges • Defined on disjoint portions of the work space • Leaf subrange in the hierarchy • Represents a single task • Smallest schedulable entity • Task: • Function object to apply • Using the same function for all subranges results in SPMD • Description of the data to which the function is applied (see the sketch below)
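
A minimal sketch of the idea (illustrative, not STAPL's API): a range recursively split into leaf subranges, each pairing a function object with the disjoint slice of the work space it applies to.

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <vector>

    struct SubRange {
      std::size_t begin, end;                    // disjoint slice of work space
      std::function<void(std::size_t)> work;     // function object to apply
    };

    // Split [begin, end) into `parts` leaf subranges sharing one function
    // object -- using the same function everywhere gives SPMD execution.
    std::vector<SubRange> partition(std::size_t begin, std::size_t end,
                                    std::size_t parts,
                                    std::function<void(std::size_t)> f) {
      std::vector<SubRange> leaves;
      std::size_t chunk = (end - begin + parts - 1) / parts;
      for (std::size_t lo = begin; lo < end; lo += chunk)
        leaves.push_back({lo, std::min(lo + chunk, end), f});
      return leaves;
    }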

  15. pRange Example • pRange defined on application data; application data stored in a pMatrix • Each subrange is a task • Boundary of each subrange is a set of cut edges • Data from several threads may lie in a subrange • If the pRange partition matches the data distribution, then all data access is local [Diagram: six subranges with function objects, distributed across two threads, with dependences between subranges]

  16. pRange Example • Subranges of the pRange • Matrix elements may lie in several subranges • Each subrange has a boundary and a function object • Data from several threads may lie in a subrange • pMatrix is distributed • If the subrange partition matches the data distribution, then all data access is local • DDGs can be defined on subranges of the pRange and on elements inside each subrange (no DDG is shown here) • Subranges can be recursively partitioned

  17. Overview • pContainers • pRange • pAlgorithms • RTS & ARMI Communication Infrastructure • Applications using STAPL

  18. pAlgorithms • pAlgorithm is a set of parallel task objects (see the sketch below) • input for parallel tasks specified by the pRange • (intermediate) results stored in pContainers • ARMI used for communication between parallel tasks • pAlgorithms in STAPL • Parallel counterparts of STL algorithms provided in STAPL • STAPL contains additional parallel algorithms • List ranking • Parallel strongly connected components • Parallel Euler tour • etc.
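
A sketch of the pAlgorithm idea: a parallel for_each built as one task per subrange. This stands in for the real thing with std::thread rather than the STAPL executor and ARMI; the function name is invented.

    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    template <typename T, typename Func>
    void p_for_each_sketch(std::vector<T>& data, Func f,
                           unsigned nthreads = std::thread::hardware_concurrency()) {
      if (nthreads == 0) nthreads = 1;
      std::vector<std::thread> workers;
      std::size_t chunk = (data.size() + nthreads - 1) / nthreads;
      for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t lo = t * chunk;
        std::size_t hi = std::min(data.size(), lo + chunk);
        // each task: apply the function object to one subrange
        workers.emplace_back([&data, f, lo, hi] {
          for (std::size_t i = lo; i < hi; ++i) f(data[i]);
        });
      }
      for (auto& w : workers) w.join();
    }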

  19. Algorithm Adaptivity in STAPL • Problem: parallel algorithms are highly sensitive to: • Architecture - number of processors, memory interconnection, cache, available resources, etc. • Environment - thread management, memory allocation, operating system policies, etc. • Data characteristics - input type, layout, etc. • Solution: adaptively choose the best algorithm from a library of options at run-time • Adaptive patterns?

  20. Adaptive Framework: Overview of Approach • Given: multiple implementation choices for the same high-level algorithm. • STAPL installation: analyze each pAlgorithm's performance on the system (installation benchmarks) and create a selection model. • Program execution: gather parameters, query the model, and use the predicted algorithm. [Diagram: architecture & environment, algorithm performance, and data characteristics feed a data repository and performance model; the STAPL user's code plus the parallel algorithm choices yield an adaptive executable that uses run-time tests to invoke the selected algorithm]

  21. Model Generation • Installation benchmarking • Determine parameters that may affect performance (i.e., num procs, input size, algorithm specific...) • Run all algorithms on a sampling of the instance space • Insert timings from the runs into a performance database • Create a decision model • Generic interface enables learners to compete • Currently: decision tree, neural net, naive Bayes classifier • Selection based on predicted accuracy (10-way validation test) • The "winning" learner outputs a query function in C++: func* predict_pAlgorithm(attribute1, attribute2, ...) (see the sketch below)
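
A hypothetical example of the kind of query function a decision-tree learner might emit. The algorithm stubs and all thresholds below are invented for illustration; they are not output of the real framework.

    typedef void (*sort_fn)();

    void sample_sort() { /* stand-ins for the real pAlgorithms */ }
    void radix_sort()  { /* ... */ }
    void column_sort() { /* ... */ }

    // Each branch encodes one path through the learned decision tree.
    sort_fn predict_pAlgorithm(int num_procs, long input_size,
                               double dist_norm, long max_value) {
      if (max_value < (1L << 16))  return radix_sort;   // few radix passes needed
      if (dist_norm < 0.05)        return column_sort;  // nearly sorted input
      if (num_procs >= 64 && input_size > 1000000) return column_sort;
      return sample_sort;
    }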

  22. Runtime Algorithm Selection • Gather parameters • Immediately available (e.g., num procs) • Computed (e.g., disorder estimate for sorting) • Query the model and execute • Query function statically linked at compile time. Current work: dynamic linking with online model refinement.

  23. Experiments • Investigated two operations • Parallel sorting • Parallel matrix multiplication • Three platforms • 128-processor SGI Altix • 1152-node, dual-processor Xeon cluster • 68-node, 16-way IBM SMP cluster

  24. Parallel Sorting Algorithms • Sample Sort (see the sketch below) • Samples the input to define per-processor bucket thresholds • Scans and distributes elements to buckets • Each processor sorts its local elements • Radix Sort • Parallel version of the linear-time sequential approach • Passes over the data multiple times, each time considering r bits • Column Sort • O(lg n) running time • Requires 4 local sorts and 4 communication steps • Uses the pMatrix data structure for workspace
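
A minimal single-node sketch of sample sort's three phases, with buckets standing in for processors; the STAPL version runs the phases in parallel and exchanges buckets via ARMI.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    std::vector<std::vector<int>>
    sample_sort_sketch(const std::vector<int>& in, std::size_t p) {
      if (p == 0 || in.size() < 4 * p) p = 1;   // degenerate cases: one bucket
      // 1. Sample the input to pick p-1 bucket thresholds (splitters).
      std::vector<int> sample(in.begin(), in.begin() + 4 * p);
      std::sort(sample.begin(), sample.end());
      std::vector<int> splitters;
      for (std::size_t i = 1; i < p; ++i)
        splitters.push_back(sample[i * sample.size() / p]);
      // 2. Scan and distribute elements to the buckets ("processors").
      std::vector<std::vector<int>> buckets(p);
      for (int x : in) {
        std::size_t b = std::upper_bound(splitters.begin(), splitters.end(), x)
                        - splitters.begin();
        buckets[b].push_back(x);
      }
      // 3. Each processor sorts its local elements.
      for (auto& b : buckets) std::sort(b.begin(), b.end());
      return buckets;
    }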

  25. Sorting Attributes • Attributes used to model the sorting decision: • Processor count • Data type • Input size • Max value (smaller value ranges may favor radix sort by reducing the number of required passes) • Presortedness (training data generated by varying the initial state - sorted, random, reversed - and % displacement; quantified at runtime with a normalized average distance metric derived from input sampling; see the sketch below)
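
A sketch of one plausible sampled disorder estimate; the exact metric the framework uses is not specified on the slide, so the formula below is an assumption for illustration only.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Fraction of out-of-order sampled adjacent pairs: 0 ~ sorted,
    // larger values ~ more disorder (an assumed stand-in for dist_norm).
    double dist_norm_sketch(const std::vector<int>& v, std::size_t samples = 1000) {
      if (v.size() < 2) return 0.0;
      std::size_t stride = std::max<std::size_t>(1, v.size() / samples);
      double sum = 0.0;
      std::size_t n = 0;
      for (std::size_t i = 0; i + stride < v.size(); i += stride, ++n)
        if (v[i] > v[i + stride]) sum += 1.0;
      return n ? sum / n : 0.0;
    }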

  26. Training Set Creation • 1000 training inputs per platform, by uniform random sampling of the parameter space (table of parameter ranges not reproduced here) • *P = 64 on the Linux cluster and frost • **only for sorted and reverse-sorted inputs

  27. Model Error Rate • Model accuracy with all training inputs is 94%, 98%, and 94% on the Cluster, Altix, and SMP Cluster, respectively.

  28. Model Attributes Selected per Platform • Altix: F(p, dist_norm) • Cluster: F(p, n, dt, dist_norm, max) • SMP Cluster: F(p, n, dt, dist_norm, max)

  29. Parallel Sorting: Experimental Results [Figures: the learned Altix decision-tree model and its validation on Altix (N=120M); annotations interpret dist_norm values - 0 = sorted, 0.5 = reversed, random inputs in between]

  30. Overview • pContainers • pRange • pAlgorithms • RTS & ARMI Communication Infrastructure • Applications using STAPL

  31. Current Implementation Protocols • Shared-Memory (OpenMP/Pthreads) • shared request queues • Message Passing (MPI-1.1) • sends/receives • Mixed-Mode • combination of MPI with threads • flat view of parallelism (for now) • take advantage of shared-memory

  32. STAPL Run-Time System • Scheduler • Determines an execution order (DDG) • Policies: • Automatic: static, block, dynamic, partial self scheduling, complete self scheduling • User-defined • Executor • Executes the DDG • Processor assignment • Synchronization and communication [Diagram: a pAlgorithm handed to the Run-Time, which maps work onto Clusters 1-4; e.g., Cluster 4 contains Procs 12-15]

  33. ARMI: STAPL Communication Infrastructure • ARMI: Adaptive Remote Method Invocation • abstracts the shared-memory and message-passing communication layers • programmer expresses fine-grain parallelism that ARMI adaptively coarsens • support for sync, async, point-to-point, and group communication • ARMI can be as easy/natural as shared memory and as efficient as message passing

  34. ARMI Communication Primitives • armi_sync • ask a thread something (invoke a remote method) • blocking version • function doesn't return until the answer is received from the RMI • non-blocking version • function returns without the answer • program can poll with rtnHandle.ready() and then access the RMI's return value with rtnHandle.value() • collective operations • armi_broadcast, armi_reduce, etc. • can adaptively set groups for communication • arguments are always passed by value (see the sketch below)
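
A usage sketch of the non-blocking pattern just described. The primitive name and the ready()/value() handle methods come from the slide; the handle type and the exact signature below are assumptions for illustration.

    template <typename T>
    struct rtn_handle {                  // hypothetical return-value handle
      bool ready() const;                // has the answer arrived?
      T value() const;                   // the RMI's return value
    };

    // hypothetical non-blocking form: ask thread `dest` to run `method`
    rtn_handle<int> armi_sync(int dest, int (*method)());

    void example(int dest, int (*remote_count)()) {
      rtn_handle<int> h = armi_sync(dest, remote_count);  // returns immediately
      /* ... overlap local computation with the outstanding request ... */
      while (!h.ready()) { /* poll */ }
      int answer = h.value();            // access the answer when ready
      (void)answer;
    }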

  35. ARMI Synchronization Primitives • armi_fence, armi_barrier • tree-based barrier • implements a distributed termination algorithm to ensure that all outstanding ARMI requests have been sent, received, and serviced • armi_wait • blocks until at least one (possibly more) ARMI request is received and serviced • armi_flush • empties the local send buffer, pushing outstanding ARMI requests to remote destinations (see the usage sketch below)
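
A sketch of why the fence matters: after a batch of asynchronous remote updates, armi_fence guarantees every outstanding request has been serviced before any thread reads the results. The armi_fence name comes from the slide; armi_async and its signature are assumptions for illustration.

    void armi_async(int dest, void (*method)(int), int arg);  // hypothetical
    void armi_fence();                                        // from the slide

    void phase(int nthreads, void (*update)(int)) {
      for (int t = 0; t < nthreads; ++t)
        armi_async(t, update, /*arg=*/t);  // fire-and-forget remote updates
      armi_fence();   // distributed termination: all requests serviced
      // safe to read data modified by the updates from here on
    }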

  36. Overview • pContainers • pRange • pAlgorithms • RTS & ARMI Communication Infrastructure • Applications using STAPL

  37. Particle Transport • Q: What is the particle transport problem? • A: Particle transport is all about counting particles (such as neutrons). Given a physical volume, we want to know how many particles there are, along with their locations, directions, and energies. • Q: Why is it an important problem? • A: It is needed for the accurate simulation of complex physical systems such as nuclear reactions, and it requires an estimated 50-80% of the total execution time in multi-physics simulations.

  38. Transport Problem Applications • Oil Well Logging Tool • Shaft dug at possible well location • Radioactive sources placed in shaft with monitoring equipment • Simulation allows for verification of new techniques and tools

  39. Discrete Ordinates Method • An iterative method for solving the first-order form of the transport equation. It discretizes: • Ω, the angular directions • R, the spatial domain • E, the energy variable • Algorithm: for each direction in Ω, for each grid cell in R, for each energy group in E (see the sketch below)
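
A sketch of the loop structure this algorithm implies; schematic only, with the per-cell solve and data layout left as placeholders.

    #include <cstddef>

    void discrete_ordinates_iteration(std::size_t n_directions,  // |Omega|
                                      std::size_t n_cells,       // |R|
                                      std::size_t n_groups) {    // |E|
      for (std::size_t m = 0; m < n_directions; ++m)     // each direction in Omega
        for (std::size_t c = 0; c < n_cells; ++c)        // each grid cell in R
          for (std::size_t g = 0; g < n_groups; ++g) {
            // solve the discretized transport equation for (m, c, g);
            // in a sweep, cells are visited in dependence order (slide 42)
          }
    }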

  40. Discrete Ordinates Method

  41. TAXI Algorithm

  42. Transport Sweeps • Involves a sweep of the spatial grid for each direction in Ω. • For orthogonal grids there are only eight distinct sweep orderings. • Note: a full transport sweep must process each direction.

  43. Multiple Simultaneous Sweeps • One approach is to sequentially process each direction. • Another approach is to process all directions simultaneously. • The sequential approach requires processors to process each direction one at a time during the sweep.

  44. Sweep Dependence • Each sweep direction generates a unique dependence graph. A sweep starting from cell 1 is shown. • For example, cell 3 must wait until cell 1 has been processed, and must be processed before cells 5 and 7. • Note that all cells in the same diagonal plane can be processed simultaneously (see the sketch below).
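
A sketch of the diagonal-wavefront property on a 2-D orthogonal grid (illustrative; the cell numbering and per-cell work are placeholders): all cells with the same i+j lie on one diagonal plane and have no dependences on each other, so each wavefront can run in parallel.

    #include <cstddef>

    void sweep(std::size_t nx, std::size_t ny) {
      for (std::size_t wave = 0; wave <= nx + ny - 2; ++wave) {
        // every cell on this diagonal is independent -> parallel loop
        for (std::size_t i = 0; i <= wave && i < nx; ++i) {
          std::size_t j = wave - i;
          if (j < ny) {
            // process_cell(i, j): uses results of (i-1, j) and (i, j-1),
            // both of which belong to earlier wavefronts
          }
        }
      }
    }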

  45. pRange Dependence Graph • angle-set A, angle-set B, angle-set C, angle-set D • Numbers are cellset indices • Colors indicate processors [Diagram: sweep orderings of the cellsets, one grid per angle-set]

  46. Adding a reflecting boundary • angle-set A, angle-set B, angle-set C, angle-set D [Diagram: cellset sweep orderings for each angle-set with a reflecting boundary added]

  47. Opposing reflecting boundary • angle-set A, angle-set B, angle-set C, angle-set D [Diagram: cellset sweep orderings for each angle-set with opposing reflecting boundaries]

  48. Strong Scalability • System specs • Large, dedicated IBM cluster at LLNL • 68 nodes • 16 x 375 MHz Power3 processors and 16 GB RAM per node • Nodes connected by an IBM SP switch • Problem info • 64x64x256 grid • 6,291,456 unknowns

  49. Work in Progress (Open Topics) • STAPL algorithms • STAPL adaptive containers • ARMI v2 (multi-threaded, communication pattern library) • STAPL RTS -- K42 interface • A compiler for STAPL: • A high-level, source-to-source compiler • Understands STAPL blocks • Optimizes composition • Automates composition • Generates checkers for STAPL programs

  50. References • [1] "STAPL: An Adaptive, Generic Parallel C++ Library", Ping An, Alin Jula, Silvius Rus, Steven Saunders, Tim Smith, Gabriel Tanase, Nathan Thomas, Nancy Amato and Lawrence Rauchwerger, 14th Workshop on Languages and Compilers for Parallel Computing (LCPC), Cumberland Falls, KY, August 2001. • [2] "ARMI: An Adaptive, Platform Independent Communication Library", Steven Saunders and Lawrence Rauchwerger, ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), San Diego, CA, June 2003. • [3] "Finding Strongly Connected Components in Parallel in Particle Transport Sweeps", W. C. McLendon III, B. A. Hendrickson, S. J. Plimpton, and L. Rauchwerger, 13th ACM Symposium on Parallel Algorithms and Architectures (SPAA), Crete, Greece, July 2001. • [4] "A Framework for Adaptive Algorithm Selection in STAPL", N. Thomas, G. Tanase, O. Tkachyshyn, J. Perdue, N. M. Amato, L. Rauchwerger, ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Chicago, IL, June 2005 (to appear).
