Intel Threading Building Blocks TBB
Intel Threading Building Blocks TBB. http://www.threadingbuildingblocks.org. Overview. TBB enables you to specify tasks instead of threads TBB targets threading for performance TBB is compatible with other threading packages TBB emphasize scalable, data parallel programming
Intel Threading Building Blocks TBB
E N D
Presentation Transcript
Intel Threading Building Blocks TBB http://www.threadingbuildingblocks.org
Overview • TBB enables you to specify tasks instead of threads • TBB targets threading for performance • TBB is compatible with other threading packages • TBB emphasize scalable, data parallel programming • TBB relies on generic programming
Example 1: Compute Average of 3 numbers void SerialAverage(float *output,float *input, size_t) { for(size_t i=0 ; I < n ; i++) { output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.0f); } }
#include "tbb/parallel_for.h" #include "tbb/blocked_range.h" #include "tbb/task_scheduler_init.h" using namespace tbb; class Average { public: float* input; float* output; void operator()( const blocked_range<int>& range ) const { for( int i=range.begin(); i!=range.end(); ++i ) output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.0f); } }; // Note: The input must be padded such that input[-1] and input[n] // can be used to calculate the first and last output values. void ParallelAverage( float* output, float* input, size_t n ) { Average avg; avg.input = input; avg.output = output; parallel_for( blocked_range<int>( 0, n, 1000 ), avg ); }
//! Problem size const int N = 100000; int main( int argc, char* argv[] ) { float output[N]; float raw_input[N+2]; raw_input[0] = 0; raw_input[N+1] = 0; float* padded_input = raw_input+1; task_scheduler_init ; ............ ............ ParallelAverage(output, padded_input, N); }
Serial Parallel
Effect of Grain Size on A[i]=B[i]*c Computation (one million indices)
Reduction Serial Parallel
Parallel Scan with Partitioner • Parallel_scan, breaks a range into subranges and computes a • partial result in each subrange in parallel. • Then, the partial result for subrange k is used to update the • information in subrange k+1, starting from k=0 and • proceeding sequentially up to the last subrange. • Finally, each subrange uses its updated information to compute its • final result in parallel with all the other subranges.
Linked List Example • Assume Foo takes at least a few thousand instructions to run, • then it is possible to get speedup by parallelizing
Parallel_while • Requires two user-defined objects • Object that defines the stream of objects • - must have pop_if_present • - pop_if_present need not be thread safe • - nonscalable, since fetching is serialized • - possible to get useful speedup • 2. Object that defines the loop body i.e. the operator()
Parallelized Foo Acting on Linked List • Note: the body of parallel_while can add more work by calling • w.add(item)
Pipelining • Single pipeline
Pipelining • Parallel pipeline
Trick soln: Use Topologically Sorted Pipeline • Note that the latency is increased