
Intel ® Threading Building Blocks


Presentation Transcript


  1. Intel® Threading Building Blocks

  2. Agenda • Overview • Intel® Threading Building Blocks • Parallel Algorithms • Task Scheduler • Concurrent Containers • Sync Primitives • Memory Allocator • Summary Intel and the Intel logo are trademarks of Intel Corporation in the United States and other countries

  3. Multi-Core is Mainstream • Gaining performance from multi-core requires parallel programming • Multi-threading is used to: • Reduce or hide latency • Increase throughput

  4. Going Parallel

  5. Intel® Threading Building Blocks
  Generic Parallel Algorithms – An efficient, scalable way to exploit the power of multi-core without having to start from scratch
  Concurrent Containers – Common idioms for concurrent access; a scalable alternative to a serial container with a lock around it
  TBB Flow Graph – New!
  Thread Local Storage – Scalable implementation of thread-local data that supports an unlimited number of TLS slots
  Task Scheduler – The engine that powers the parallel algorithms; employs task stealing to maximize concurrency
  Synchronization Primitives – User-level and OS wrappers for mutual exclusion, ranging from atomic operations to several flavors of mutexes and condition variables
  Miscellaneous – Thread-safe timers
  Threads – OS API wrappers
  Memory Allocation – Per-thread scalable memory manager and false-sharing-free allocators

  6. Intel® Threading Building Blocks: Extend C++ for Parallelism • Portable C++ runtime library that does thread management, letting developers focus on proven parallel patterns • Scalable • Composable • Flexible • Portable • Both GPL and commercial licenses are available. • http://threadingbuildingblocks.org *Other names and brands may be claimed as the property of others

  7. Intel® Threading Building Blocks: Parallel Algorithms

  8. Generic Parallel Algorithms
  Loop parallelization: parallel_for, parallel_reduce, parallel_scan – Load-balanced parallel execution of a fixed number of independent loop iterations
  Parallel algorithms for streams: parallel_do, parallel_for_each, pipeline / parallel_pipeline – Use for an unstructured stream or pile of work
  Parallel function invocation: parallel_invoke – Parallel execution of a number of user-specified functions
  Parallel sort: parallel_sort – Comparison sort with an average time complexity of O(N log N)

  9. Parallel Algorithm Usage Example
  The ChangeArray class defines a for-loop body for parallel_for.

  #include "tbb/blocked_range.h"
  #include "tbb/parallel_for.h"
  using namespace tbb;

  class ChangeArray {
      int* array;
  public:
      ChangeArray(int* a) : array(a) {}
      void operator()(const blocked_range<int>& r) const {
          for (int i = r.begin(); i != r.end(); i++) {
              Foo(array[i]);
          }
      }
  };

  void ChangeArrayParallel(int* a, int n) {
      parallel_for(blocked_range<int>(0, n), ChangeArray(a));
  }

  int main() {
      int A[N]; // initialize array here…
      ChangeArrayParallel(A, N);
      return 0;
  }

  blocked_range – a TBB template representing a 1D iteration space. As usual with C++ function objects, the main work is done inside operator(). The call instantiates the template function parallel_for<Range, Body> with Range = blocked_range and Body = ChangeArray.

  10. parallel_for(Range(Data), Body(), Partitioner());
  The range [Data, Data+N) is split recursively: first into [Data, Data+N/2) and [Data+N/2, Data+N), then into smaller sub-ranges such as [Data, Data+N/k). Intermediate tasks are available to thieves; splitting stops once a sub-range reaches [Data, Data+GrainSize).

  11. Two Execution Orders
  Depth first (stack): small space, excellent cache locality, no parallelism.
  Breadth first (queue): large space, poor cache locality, maximum parallelism.

  12. Work Depth First; Steal Breadth First
  The oldest (breadth-first) task is the best choice for theft: it is a big piece of work, and its data is far from the victim's hot data, which stays in the victim thread's L1 and L2 caches. Newer, deeper tasks are only the second-best choice for a thief.

  13. C++0x Lambda Expression Support
  The parallel_for example transforms into:

  #include "tbb/blocked_range.h"
  #include "tbb/parallel_for.h"
  using namespace tbb;

  void ChangeArrayParallel(int* a, int n) {
      parallel_for(0, n, 1, [=](int i) {
          Foo(a[i]);
      });
  }

  int main() {
      int A[N]; // initialize array here…
      ChangeArrayParallel(A, N);
      return 0;
  }

  parallel_for has an overload that takes start, stop, and step arguments and constructs a blocked_range internally. Capturing variables by value from the surrounding scope completely mimics the non-lambda implementation; note that [&] could be used to capture variables by reference instead. Using lambda expressions lets you implement the body's operator() right inside the call to parallel_for().

  14. Functional parallelism has never been easier

  int main(int argc, char* argv[]) {
      spin_mutex m;
      int a = 1, b = 2;
      parallel_invoke(
          foo,
          [a, b, &m]() { bar(a, b, m); },
          [&m]() {
              for (int i = 0; i < K; ++i) {
                  spin_mutex::scoped_lock l(m);
                  cout << i << endl;
              }
          },
          [&m]() {
              parallel_for(0, N, 1, [&m](int i) {
                  spin_mutex::scoped_lock l(m);
                  cout << i << " ";
              });
          });
      return 0;
  }

  Here foo is a plain function handle (void foo(void)); the call to bar(int, int, spin_mutex&) is wrapped in a lambda expression; a serial thread-safe job is wrapped in a lambda and executed in parallel with the other three functions; and a parallel job is likewise executed in parallel with the others. Now imagine writing all this code with just plain threads.

  15. Strongly-typed parallel_pipeline
  Stage types: (void → float*) get the next element; (float* → float) square it; (float → void) accumulate sum += x.

  float RootMeanSquare(float* first, float* last) {
      float sum = 0;
      parallel_pipeline(
          /*max_number_of_tokens=*/16,
          make_filter<void, float*>(
              filter::serial,
              [&](flow_control& fc) -> float* {
                  if (first < last) {
                      return first++;
                  } else {
                      fc.stop(); // stop processing
                      return NULL;
                  }
              })
          & make_filter<float*, float>(
              filter::parallel,
              [](float* p) { return (*p) * (*p); })
          & make_filter<float, void>(
              filter::serial,
              [&sum](float x) { sum += x; }));
      /* sum = (*first)^2 + (*(first+1))^2 + … + (*(last-1))^2, computed in parallel */
      return sqrt(sum);
  }

  Call the function tbb::parallel_pipeline to run the pipeline stages (filters). Create each stage object with tbb::make_filter<InputDataType, OutputDataType>(mode, body). A pipeline stage's mode can be serial, parallel, serial_in_order, or serial_out_of_order.

  16. Intel® Threading Building Blocks: Task Scheduler

  17. Task Scheduler • The task scheduler is the engine driving Intel® Threading Building Blocks • Manages the thread pool, hiding the complexity of native thread management • Maps logical tasks to threads • The parallel algorithms are built on the task scheduler interface • The task scheduler is designed to address common performance issues of parallel programming with native threads

  18. Logical task – it is just a C++ class • Derive from the tbb::task class • Implement the execute() member function • Create and spawn a root task and your tasks • Wait for tasks to finish

  #include "tbb/task_scheduler_init.h"
  #include "tbb/task.h"
  using namespace tbb;

  class ThisIsATask : public task {
  public:
      task* execute() {
          WORK();
          return NULL;
      }
  };

  19. Task Tree Example
  The root task spawns child1 and child2 across Thread 1 and Thread 2; yellow arrows show the creation sequence and black arrows show task dependencies. Intel® TBB wait calls such as wait_for_all() don't block the calling thread – they block only the task; an Intel TBB worker thread keeps stealing tasks while it waits.

  20. Intel® Threading Building Blocks: Concurrent Containers

  21. Concurrent Containers • Intel® TBB provides highly concurrent containers • STL containers are not concurrency-friendly: attempts to modify them concurrently can corrupt the container • Wrapping a lock around an STL container turns it into a serial bottleneck, and still does not always guarantee thread safety – STL containers are inherently not thread-safe • Intel TBB provides fine-grained locking or lockless implementations: worse single-thread performance, but better scalability • Can be used with the library, OpenMP*, or native threads *Other names and brands may be claimed as the property of others

  22. Concurrent Containers Key Features
  concurrent_hash_map<Key, T, Hasher, Allocator> – Models a hash table of std::pair<const Key, T> elements
  concurrent_unordered_map<Key, T, Hasher, Equality, Allocator> – Permits concurrent traversal and insertion (no concurrent erasure); requires no visible locking and looks similar to the STL interfaces
  concurrent_vector<T, Allocator> – Dynamically growable array of T: grow_by and grow_to_at_least
  concurrent_queue<T, Allocator> – For a single-threaded run, concurrent_queue supports regular first-in-first-out ordering; if one thread pushes two values and another thread pops those two values, they come out in the order they were pushed
  concurrent_bounded_queue<T, Allocator> – Similar to concurrent_queue, but allows specifying a capacity; once the capacity is reached, push waits until other elements are popped before it can continue
  concurrent_priority_queue<T, Compare, Allocator> – Similar to std::priority_queue, with scalable push and pop operations

  23. Hash-map Examples

  #include <map>
  typedef std::map<std::string, int> StringTable;
  for (std::string* p = range.begin(); p != range.end(); ++p) {
      tbb::spin_mutex::scoped_lock lock(global_lock);
      table[*p] += 1;
  }

  #include "tbb/concurrent_hash_map.h"
  typedef tbb::concurrent_hash_map<std::string, int> StringTable;
  for (std::string* p = range.begin(); p != range.end(); ++p) {
      StringTable::accessor a; // local lock
      table.insert(a, *p);
      a->second += 1;
  }

  #include "tbb/concurrent_unordered_map.h"
  typedef tbb::concurrent_unordered_map<std::string, tbb::atomic<int> > StringTable;
  for (std::string* p = range.begin(); p != range.end(); ++p) {
      table[*p] += 1; // similar to STL, but the value is tbb::atomic<int>
  }

  24. Intel® Threading Building Blocks: Sync Primitives

  25. Synchronization Primitives Features • Atomic operations • High-level abstractions • Exception-safe locks • spin_mutex is VERY FAST in lightly contended situations; use it if you need to protect very few instructions • Use queuing_rw_mutex when scalability and fairness are important • Use recursive_mutex when your threading model requires that one thread can re-acquire a lock (all acquisitions must be released by that thread before another thread can get the lock) • Use a reader-writer mutex to allow non-blocking reads for multiple threads • Portable condition variables

  26. Example: spin_rw_mutex

  #include "tbb/spin_rw_mutex.h"
  using namespace tbb;

  spin_rw_mutex MyMutex;

  int foo() {
      // Construction of 'lock' acquires 'MyMutex'
      spin_rw_mutex::scoped_lock lock(MyMutex, /*is_writer=*/false);
      …
      if (!lock.upgrade_to_writer()) {
          …
      } else {
          …
      }
      return 0;
      // Destructor of 'lock' releases 'MyMutex'
  }

  • If an exception occurs within the protected code block, the destructor automatically releases the lock if it is held, avoiding a deadlock
  • Any reader lock may be upgraded to a writer lock; the return value of upgrade_to_writer indicates whether the lock had to be released before it was upgraded

  27. Intel® Threading Building Blocks: Scalable Memory Allocator

  28. Scalable Memory Allocation
  Problem: memory allocation is a bottleneck in a concurrent environment – threads acquire a global lock to allocate and deallocate memory from the global heap.
  Solution: Intel® Threading Building Blocks provides a tested, tuned, and scalable memory allocator optimized for all object sizes, offering: manual and automatic replacement of memory management calls; a C++ interface for use as the underlying allocator of C++ objects (e.g. STL containers); and scalable memory pools.

  29. Memory API Calls Replacement
  Manual • Change your code to call Intel® TBB scalable_malloc/scalable_free instead of malloc and free • Use the scalable_* API to implement operators new and delete • Use tbb::scalable_allocator<T> as the underlying allocator for C++ objects (e.g. STL containers)
  Automatic (Windows* and Linux*) • Requires no code changes – just re-link your binaries using the proxy libraries • Linux*: libtbbmalloc_proxy.so.2 or libtbbmalloc_proxy_debug.so.2 • Windows*: tbbmalloc_proxy.dll or tbbmalloc_debug_proxy.dll

  30. C++ Allocator Template
  Use tbb::scalable_allocator<T> as the underlying allocator for C++ objects.
  Example:
  // STL container used with the Intel® TBB scalable allocator
  std::vector<int, tbb::scalable_allocator<int> > v;

  31. Scalable Memory Pools

  Allocate memory from a pool:
  #include "tbb/memory_pool.h"
  …
  tbb::memory_pool<std::allocator<char> > my_pool;
  void* my_ptr = my_pool.malloc(10);
  void* my_ptr_2 = my_pool.malloc(20);
  …
  my_pool.recycle(); // destructor also frees everything

  Allocate and free from a fixed-size buffer:
  #include "tbb/memory_pool.h"
  …
  char buf[1024*1024];
  tbb::fixed_pool my_pool(buf, 1024*1024);
  void* my_ptr = my_pool.malloc(10);
  my_pool.free(my_ptr);

  32. Scalable Memory Allocator Structure

  33. Intel® TBB Memory Allocator Internals
  • Small blocks: per-thread memory pools
  • Large blocks: memory is treated as "objects" of fixed size, not as ranges of address space – typically several dozen (or fewer) object sizes are in active use
  • Released memory objects are kept in a pool and reused when an object of that size is requested
  • Pooled objects "age" over time; the cleanup threshold varies for different object sizes
  • Low fragmentation is achieved using segregated free lists
  The Intel TBB scalable memory allocator is designed for multi-threaded apps and optimized for multi-core.

  34. Intel® Threading Building Blocks
  Generic Parallel Algorithms – An efficient, scalable way to exploit the power of multi-core without having to start from scratch
  Concurrent Containers – Common idioms for concurrent access; a scalable alternative to a serial container with a lock around it
  TBB Flow Graph
  Thread Local Storage – Scalable implementation of thread-local data that supports an unlimited number of TLS slots
  Task Scheduler – The engine that powers the parallel algorithms; employs task stealing to maximize concurrency
  Synchronization Primitives – User-level and OS wrappers for mutual exclusion, ranging from atomic operations to several flavors of mutexes and condition variables
  Miscellaneous – Thread-safe timers
  Threads – OS API wrappers
  Memory Allocation – Per-thread scalable memory manager and false-sharing-free allocators

  35. Supplementary Links
  • Commercial product web page: www.intel.com/software/products/tbb
  • Open source web portal: www.threadingbuildingblocks.org
  • Knowledge base, blogs, and user forums:
  http://software.intel.com/en-us/articles/intel-threading-building-blocks/all/1
  http://software.intel.com/en-us/blogs/category/osstbb/
  http://software.intel.com/en-us/forums/intel-threading-building-blocks
  • Technical articles:
  "Demystify Scalable Parallelism with Intel Threading Building Blocks' Generic Parallel Algorithms" – http://www.devx.com/cplus/Article/32935
  "Enable Safe, Scalable Parallelism with Intel Threading Building Blocks' Concurrent Containers" – http://www.devx.com/cplus/Article/33334
  • Industry articles:
  Product review: Intel Threading Building Blocks – http://www.devx.com/go-parallel/Article/33270
  "The Concurrency Revolution", Herb Sutter, Dr. Dobb's, 1/19/2005 – http://www.ddj.com/dept/cpp/184401916
