
STAPL The C++ Standard Template Adaptive Parallel Library




Presentation Transcript


  1. STAPL: The C++ Standard Template Adaptive Parallel Library. Alin Jula, Department of Computer Science, Texas A&M. Ping An, Silvius Rus, Steven Saunders, Tim Smith, Gabriel Tanase, Nathan Thomas, Nancy M. Amato and Lawrence Rauchwerger. http://www.cs.tamu.edu/research/parasol

  2. Motivation STAPL – C++ Standard Template Adaptive Parallel Library • Building block library • Nested parallelism • Inter-operability with existing code • Superset of STL • Portability and Performance • Layered architecture • Run-time adaptivity

  3. Philosophy • Interface Layer • STL compatible • Concurrency & Communication Layer • Generic parallelism, synchronization • Software Implementation Layer • Instantiates concurrency & communication • Machine Layer • Architecture dependent code

  4. Related Work (comparison table of related approaches; * denotes a parallel programming language)

  5. STL Overview (figure: Iterators bind Algorithms to Containers)
  • Data is stored in Containers
  • STL provides standardized Algorithms
  • Iterators bind Algorithms to Containers; they are generalized pointers
  • Example:
    vector<int> vect;
    ... // initialization of 'vect' variable
    sort(vect.begin(), vect.end());

  6. STAPL Overview • Data is stored in pContainers • STAPL provides standardized pAlgorithms • pRanges bind pAlgorithms to pContainers • Similar to STL Iterators, but must also support parallelism

  7. pRange • pRange is the Parallel Counterpart of STL iterator: • Binds pAlgorithms to pContainers • Provides an abstract view of a scoped data space • data space is (recursively) partitioned into subranges • More than an iterator since it supports parallelization • Scheduler/distributor decides how computation and data structures should be mapped to the machine • Data dependences among subranges can be represented by a data dependence graph (DDG) • Executor launches parallel computation, manages communication, and enforces dependences

  8. pRange • Provides random access to a partition of the data space • View and access provided by a collection of iterators describing pRange boundary • pRanges are partitioned into subranges • Automatically by STAPL based on machine characteristics, number of processors, partition factors, etc. • Manually according to user-specified partitions • pRange can represent relationships among subspaces as Data Dependence Graphs (DDG) (for scheduling)

  9. pRange (figure: a data space recursively partitioned into subspaces)
  • Each subspace is disjoint and could itself be a pRange
  • Nested parallelism
    stapl::pRange<stapl::pVector<int>::iterator> dataRange(segBegin, segEnd);
    dataRange.partition();
    stapl::pRange<stapl::pVector<int>::iterator> dataSubrange = dataRange.get_subrange(3);
    dataSubrange.partition_like(<0.25,0.25,0.25,0.25> * size);


  15. pContainer (figure: a pVector maintaining its internal pRange over an STL vector)
  • pContainer is the parallel counterpart of STL container
  • Provides parallel and concurrent methods
  • Maintains internal pRange, updated during insert/delete operations; minimizes redistribution
  • Completed: pVector, pList, pTree

  16. pAlgorithm
  • pAlgorithm is the parallel counterpart of STL algorithm
  • Parallel algorithms take as input a pRange and a work function, and apply the work function to all subranges
    template<class SubRange>
    class pAddOne : public stapl::pFunction {
    public:
      ...
      void operator()(SubRange& spr) {
        typename SubRange::iterator i;
        for (i = spr.begin(); i != spr.end(); ++i)
          (*i)++;
      }
    };
    ...
    p_transform(pRange, pAddOne<SubRange>());

  17. Run-Time System (figure: a pAlgorithm mapped through the run-time onto clusters of processors)
  • Support for different architectures: HP V2200; SGI Origin 2000, SGI Power Challenge
  • Support for different paradigms: OpenMP, Pthreads; MPI
  • Memory allocation: HOARD

  18. Run-Time System • Scheduler • Determine an execution order (DDG) • Policies: • Automatic : Static, Block, Dynamic, Partial Self Scheduling, Complete Self Scheduling • User defined • Distributor • Hierarchical data distribution • Automatic and user defined • Executor • Execute DDG • Processor assignment • Synchronization and Communication

  19. STL to STAPL Automatic Translation
  • A C++ preprocessor converts STL code into STAPL parallel code
  • Iterators are used to construct pRanges
  • The user is responsible for safe parallelization
    Original code:
      #include <start_STAPL>
      accumulate(x.begin(), x.end(), 0);
      for_each(x.begin(), x.end(), foo());
      #include <stop_STAPL>
    Preprocessing phase:
      pi_accumulate(x.begin(), x.end(), 0);
      pi_for_each(x.begin(), x.end(), foo());
    pRange construction:
      p_accumulate(x_pRange, 0);
      p_for_each(x_pRange, foo());
  • In some cases automatic translation provides performance similar to hand-written STAPL code (about 5% deterioration)

  20. Performance: p_inner_product Experimental results on HP V2200

  21. pTree (figure: the base of the tree is atomic; subtrees assigned to processors P1, P2, P3 are parallel)
  • Parallel tree supports bulk commutative operations in parallel
  • Each processor is assigned a set of subtrees to maintain
  • Operations on the base are atomic; operations on subtrees are parallel
  Example: parallel insertion algorithm. Each processor is given a set of elements:
  • Each processor creates local buckets corresponding to the subtrees
  • Each processor collects the buckets that correspond to its subtrees
  • Elements in the subtree buckets are inserted into the tree in parallel

  22. pTree • Basis for STAPL pSet, pMultiSet, pMap, pMultiMap containers • Covers all remaining STL containers • Results are sequentially consistent although internal structure may vary • Requires negligible additional memory • pTrees can be used either sequentially or in parallel in the same execution • allows switching back and forth between parallel & sequential

  23. Performance: pTree Experimental results on HP V2200

  24. Algorithm Adaptivity • Problem: parallel algorithms are highly sensitive to • Architecture – number of processors, memory interconnection, cache, available resources, etc. • Environment – thread management, memory allocation, operating system policies, etc. • Data characteristics – input type, layout, etc. • Solution: implement a number of different algorithms and adaptively choose the best one at run-time

  25. Adaptive Framework

  26. Case Study - Adaptive Sorting

  27. Performance: Adaptive Sorting (plots: performance on 10 million integers on the HP V2200, SGI Power Challenge, and SGI Origin 2000)

  28. Performance: Run-Time Tests (experimental results on SGI Origin 2000)
    if (data_type == INTEGER)
      radix_sort();
    else if (num_procs < 5)
      merge_sort();
    else
      column_sort();

  29. Performance: Molecular Dynamics* • Discrete-time particle interaction simulation • Written in STL • Time steps calculate system evolution (dependence) • Parallelized within time step • STAPL utilization: • pAlgorithms: p_for_each, p_transform, p_accumulate • pContainers: pVector (push_back) • Automatic vs. Manual (5% performance deterioration) * Code written by Danny Rintoul at Sandia National Labs

  30. Performance: Molecular Dynamics (experimental results on HP V2200)
  Execution time (sec) by number of particles and number of processors:
    Particles |    1 |    4 |   8 |  12 |  16
    108K      | 2815 | 1102 | 546 | 386 | 309
    23K       |  627 |  238 | 132 |  94 |  86
  • 40%-49% of the code parallelized
  • Input sensitive
  • Use pTree on the rest

  31. Performance - Particle Transport* • Generic particle transport solver • Regular and arbitrary grids • Numerically intensive, 25k line, C++ STAPL code • Sweep function unaware of parallel issues • STAPL utilization: • pAlgorithms: p_for_each • pContainers: pVector (for data distribution) • Scheduler: determine grid data dependencies • Executor: satisfy data dependencies * Joint effort between Texas A&M Nuclear Engineering and Computer Science, funded by DOE ASCI

  32. Performance - Particle Transport Profile and Speedups on SGI Origin 2000 using 16 processors

  33. Performance - Particle Transport Experimental results on SGI Origin 2000

  34. Summary • Parallel equivalent to STL • Many codes can immediately utilize STAPL • Automatic translation • Building block library • Portability (layered architecture) • Performance (adaptive) • Automatic recursive parallelism • STAPL performs well in small pAlgorithm test cases and large codes

  35. STAPL Status and Current Work • pAlgorithms - fully implemented • pContainers - pVector, pList, pTree • pRange - mostly implemented • Run-Time • Executor fully implemented • Scheduler fully implemented • Distributor work in progress • Adaptive mechanism (case study – sorting) • OpenMP + MPI (mixed) work in progress • OpenMP version fully implemented • MPI version work in progress

  36. http://www.cs.tamu.edu/research/parasol • Project funded by NSF and DOE
