Analysis of Cilk
This analysis delves into the formal model of Cilk, a framework for parallel computing. It defines threads, processes, and dependencies through a directed acyclic graph (DAG) structure. Key metrics like work (T1), critical-path length (T∞), and parallel time (TP) are explored, establishing performance bounds. The paper discusses greedy scheduling, work stealing, and the importance of managing critical paths. Cilk's guarantees and features such as inlets and memory models are examined for their implications on performance and correctness. Lastly, it raises open questions regarding future improvements.
A Formal Model for Cilk
• A thread: maximal sequence of instructions in a procedure instance (at runtime!) not containing spawn, sync, return
• For a given computation, define a dag:
• Threads are vertices
• Continuation edges within procedures
• Spawn edges
• Initial & final threads (in main)
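To make the thread decomposition concrete, here is the canonical Cilk-5 fib procedure with its threads marked in comments (the thread numbering is my own annotation, not from the slides):

    cilk int fib(int n)
    {
        int x, y;
        if (n < 2) return n;     /* thread 1: procedure entry up to the first spawn */
        x = spawn fib(n - 1);    /* spawn edge to the child; continuation edge onward */
        y = spawn fib(n - 2);    /* thread 2: between the two spawns */
        sync;                    /* thread 3: from the second spawn to the sync */
        return x + y;            /* thread 4: after the sync, ends at the return */
    }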
Work & Critical Path
• Threads are sequential: work = running time
• Define TP = running time on P processors
• Then T1 = work in the computation
• And T∞ = critical-path length, the longest path in the dag
Lower Bounds on TP
• TP ≥ T1/P (no miracles in the model, but they do happen occasionally)
• TP ≥ T∞ (dependencies limit parallelism)
• Speedup is T1/TP
• (Asymptotic) linear speedup means Θ(P)
• Parallelism is T1/T∞ (average work available at every step along the critical path)
Greedy Schedulers
• Execute at most P threads in every step
• Choice of threads is arbitrary (greedy)
• At most T1/P complete steps (everybody busy)
• At most T∞ incomplete steps:
• All threads with in-degree = 0 are executed
• Therefore, critical-path length is reduced by 1
• Theorem: TP ≤ T1/P + T∞
• Linear speedup when P = O(T1/T∞)
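A quick worked instance of the theorem (illustrative numbers, not from the paper): with T1 = 10⁸ and T∞ = 10⁴, the parallelism T1/T∞ is 10⁴. On P = 100 processors a greedy schedule needs at most 10⁶ + 10⁴ steps, within 1% of perfect linear speedup; on P = 10⁴ processors the bound only promises 2×10⁴ steps, a factor of 2 away from T1/P.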
Cilk Guarantees
• TP ≤ T1/P + O(T∞) expected running time
• Randomized greedy scheduler
• The developers claim: TP ≈ T1/P + T∞ in practice
• This implies near-perfect speedup when P << T1/T∞
But Which Greedy Schedule to Choose?
• Busy-leaves property: some processor executes every leaf in the dag
• Busy leaves controls space consumption; can show SP = O(P S1)
• Without busy leaves, the worst case is SP = Θ(T1), and it is not hard to reach
• A processor that spawns a procedure executes it immediately; but another processor may steal the caller & execute it
Work Stealing
• Idle processors search for work on other processors at random
• When a busy victim is found, the thief steals the top activation frame on the victim's stack
• (In practice, the stack is maintained as a deque)
• Why work stealing?
• Almost no overhead when everybody is busy
• Most of the overhead is incurred by thieves
• Work-first principle: little impact on the T1/P term
Some Overhead Remains
• To achieve portability, the spawn stack is maintained as an explicit data structure
• Thieves steal from this deque (a structural sketch follows below)
• Stealing directly from the machine stack is possible, but more complex and nonportable
• Main consequence:
• Spawns are more expensive than function calls
• Must do significant work at the bottom of recursions to hide this overhead
• The same is true for normal function calls, but to a lesser extent
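A structural sketch of the spawn deque (shape only; the types are invented, and the real Cilk runtime synchronizes owner and thief with its THE protocol, which is omitted here):

    /* Hypothetical frame and deque types, for illustration only. */
    typedef struct frame { struct frame *caller; /* ... saved state ... */ } Frame;

    typedef struct {
        Frame *slot[1024];   /* fixed capacity keeps the sketch simple */
        int head;            /* thieves steal here: the oldest frame   */
        int tail;            /* the owner works here: the newest frame */
    } Deque;

    /* Owner side: push on spawn, pop when returning to the caller. */
    void push_tail(Deque *d, Frame *f)  { d->slot[d->tail++] = f; }
    Frame *pop_tail(Deque *d)  { return d->head < d->tail ? d->slot[--d->tail] : 0; }

    /* Thief side: steal the top (oldest) activation frame. */
    Frame *steal_head(Deque *d)  { return d->head < d->tail ? d->slot[d->head++] : 0; }

This matches the slide's point: the owner's common path stays cheap, and the synchronization cost falls mostly on thieves.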
Inlets
• x += spawn fib(n-1) is equivalent to:

    cilk int fib(int n)
    {
        int x = 0;
        inlet void summer(int result)
        {
            x += result;
        }
        if (n < 2) return n;
        summer( spawn fib(n-1) );
        summer( spawn fib(n-2) );
        sync;
        return x;
    }
Inlet Semantics
• Inlet: an inner function that is called when a child completes its execution
• An inlet is a thread in a procedure (so spawn and sync are not allowed in it)
• All the threads of a procedure instance are executed atomically with respect to one another (i.e., not concurrently)
• Easy to reason about correctness
• x += spawn fib(n-1) is an implicit inlet
Aborting Work
• An abort statement in an inlet aborts the already-spawned children of a procedure
• Useful for aborting speculative searches
• Semantics:
• Children may not abort instantly
• Aborted children do not return values, so don't use these values, as in x = spawn fib(n-1)
• Does not prevent future spawns; be careful with sequences of spawns
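A hedged sketch of the speculative-search pattern (the procedure and inlet names are invented): an inlet inspects each child's result and aborts the remaining children once an answer is found.

    cilk int search(int *a, int n, int key)
    {
        int found = 0;

        inlet void check(int result)
        {
            if (result) {
                found = 1;
                abort;    /* terminate the already-spawned sibling searches */
            }
        }

        if (n == 1) return (a[0] == key);
        check( spawn search(a, n/2, key) );
        check( spawn search(a + n/2, n - n/2, key) );
        sync;
        return found;
    }

Per the semantics above, an aborted child never delivers a value to its inlet, so found is written only by children that actually completed.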
The SYNCHED Built-In Variable
• True only if no children are currently executing
• False if some children may be executing now
• Useful for avoiding the space and work overheads of critical-path-reducing techniques when they are not needed
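One sketch of the idiom (the buffer type and the produce procedure are invented for illustration): pay for private scratch space only when a child might still be running.

    typedef struct { double data[1024]; } Buffer;

    cilk void produce(Buffer *out);   /* some spawned child (assumed) */

    cilk void step(Buffer *scratch)
    {
        Buffer local;
        Buffer *b;

        spawn produce(scratch);
        /* ... other work ... */
        if (SYNCHED)
            b = scratch;   /* the child has provably finished: reuse its buffer */
        else
            b = &local;    /* the child may still be writing: use private space */
        /* ... fill *b ... */
        sync;
    }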
Cilk's Memory Model
• Memory operations of two threads are guaranteed to be ordered only if there is a dependence path between them (ancestor-descendant relationship)
• Unordered threads may see inconsistent views of memory
Locks
• Mutual-exclusion variables
• Memory operations that a thread performs before releasing a lock are seen by other threads after they acquire the lock
• Using locks invalidates all the performance guarantees that Cilk provides
• In short, Cilk supports locks, but don't use them unless you must
Useful but Obsolete
• Cilk as a library
• Can call Cilk procedures from C, C++, Fortran
• Necessary for building general-purpose C libraries
• Cilk on clusters with distributed memory
• Programmer sees the same shared-memory model
• Used an interesting memory-consistency protocol to support the shared-memory view
• Was performance ever good enough?
Some Open Problems
• Perhaps good enough for a thesis
Open Issues in Cilk
• Theoretical question about the distributed-memory version: is performance monotone in the size of local caches?
• Cilk as a library: resurrect
• Distributed-memory version: resurrect; is it fast enough? Can you make it faster?
Parallel Merge Sort

    merge_sort(A, n)
        if (n = 1) return
        spawn merge_sort(A, n/2)
        spawn merge_sort(A+n/2, n-n/2)
        sync
        merge(A, n/2, n-n/2)
Can't Merge In Place!

    merge_sort(A, T, n, AorT)
        if (n = 1) { T[0] = A[0]; return }
        spawn merge_sort(A, T, n/2, !AorT)
        spawn merge_sort(A+n/2, T+n/2, n-n/2, !AorT)
        sync
        if (AorT = A) merge(A, T, n/2, n-n/2)
        if (AorT = T) merge(T, A, n/2, n-n/2)
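In Cilk-5 syntax the two-buffer scheme might look as follows (a sketch: I rename the slide's flag to dest_is_A, copy in the base case only when the destination is T, and assume a sequential merge(dst, a, na, b, nb)):

    /* Sort A[0..n); T is a scratch buffer of the same size.
       dest_is_A says which buffer the sorted output must end up in. */
    cilk void merge_sort(int *A, int *T, int n, int dest_is_A)
    {
        if (n == 1) {
            if (!dest_is_A) T[0] = A[0];    /* deliver the element to T */
            return;
        }
        /* sort the halves into the *other* buffer ... */
        spawn merge_sort(A, T, n/2, !dest_is_A);
        spawn merge_sort(A + n/2, T + n/2, n - n/2, !dest_is_A);
        sync;
        /* ... then merge them into the destination buffer */
        if (dest_is_A)
            merge(A, T, n/2, T + n/2, n - n/2);
        else
            merge(T, A, n/2, A + n/2, n - n/2);
    }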
Analysis
• Merging uses two pointers, moving the smaller element into the sorted array
• T1(n) = 2 T1(n/2) + Θ(n)
• T1(n) = Θ(n log n)
• We fill the output element by element, so:
• T∞(n) = T∞(n/2) + Θ(n)
• T∞(n) = Θ(n)
• Not very parallel . . .
Parallel Merging

    p_merge(A, n, B, m, C)    // C is the output
        swap A and B if A is the smaller array (so that n ≥ m)
        if (m+n = 1) { C[0] = A[0]; return }
        if (n = 1 /* implies m = 1 */) { merge the two elements; return }
        locate A[n/2] between B[j] and B[j+1] (binary search)
        spawn p_merge(A, n/2, B, j, C)
        spawn p_merge(A+n/2, n-n/2, B+j, m-j, C+n/2+j)
        sync
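A Cilk-5 rendering of the slide's parallel merge (a sketch; the base cases are slightly generalized, and j counts the elements of B smaller than A[n/2]):

    /* Merge sorted A[0..n) and B[0..m) into C[0..n+m). */
    cilk void p_merge(int *A, int n, int *B, int m, int *C)
    {
        int i, j, lo, hi, mid;

        if (n < m) {                 /* swap so that A is the larger array */
            int *tp = A; int tn = n;
            A = B; n = m;
            B = tp; m = tn;
        }
        if (m == 0) {                /* B is empty: just copy A */
            for (i = 0; i < n; i++) C[i] = A[i];
            return;
        }
        if (n == 1) {                /* implies m == 1 */
            if (A[0] < B[0]) { C[0] = A[0]; C[1] = B[0]; }
            else             { C[0] = B[0]; C[1] = A[0]; }
            return;
        }
        lo = 0; hi = m;              /* binary search: j = #elements of B < A[n/2] */
        while (lo < hi) {
            mid = (lo + hi) / 2;
            if (B[mid] < A[n/2]) lo = mid + 1; else hi = mid;
        }
        j = lo;
        spawn p_merge(A, n/2, B, j, C);
        spawn p_merge(A + n/2, n - n/2, B + j, m - j, C + n/2 + j);
        sync;
    }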
Analysis of Parallel Merging
• When we merge n elements, both recursive calls merge at most 3n/4 elements
• T∞(n) ≤ T∞(3n/4) + Θ(log n)
• T∞(n) = Θ(log² n)
• Critical path is short!
• But the analysis of work is more complex (extra work due to the binary searches)
• T1(n) = T1(αn) + T1((1−α)n) + Θ(log n), where ¼ ≤ α ≤ ¾
• T1(n) = Θ(n) using substitution (nontrivial)
• Critical path for parallel merge sort:
• T∞(n) = T∞(n/2) + Θ(log² n) = Θ(log³ n)
Analysis of Parallel Merge Sort
• T∞(n) = T∞(n/2) + Θ(log² n) = Θ(log³ n)
• T1(n) = Θ(n) per merge, Θ(n log n) overall
• Impact of the extra work in practice?
• Can find the median of 2 sorted arrays of total size n in Θ(log n) time; this leads to parallel merging and merge sorting with shorter critical paths
• Parallelizing an algorithm can be nontrivial!
Caches
• Store recently-used data
• Not really LRU
• Usually 1-, 2-, or 4-way set associative
• But up to 128-way set associative
• Data is transferred in blocks called cache lines
• Write-through or write-back
• Temporal locality: the same data is used again soon
• Spatial locality: nearby data is used soon
Cache Misses in Merge Sort
• Assume cache-line size = 1, LRU, write-back
• Assume the cache holds M words
• When n ≤ M/2, exactly n reads and n write-backs
• When n > M, at least n cache misses (cover all cases in the proof!)
• The recursion has Θ(log(n/M)) levels whose subproblems exceed the cache, each streaming all n elements
• Therefore, the number of cache misses is Θ(n log(n/M)) = Θ(n(log n − log M))
• We can do much better
The Key Idea
• Merge M/2 sorted runs into one, not 2 into 1
• Keep one element from each run in a heap, together with a run label
• Extract the min, move it to the sorted output, and insert another element from the same run into the heap
• Reading from the sorted runs & writing to the sorted output may evict elements of the heap from the cache, but this cost is O(n) cache misses per pass
• Total: Θ(n log_M n) = Θ(n log n / log M) cache misses
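A sequential sketch of one multi-way merge pass in plain C (the back-to-back, equal-length run layout is an assumption made to keep the bookkeeping short):

    #include <stdlib.h>

    typedef struct { int key; int run; } Entry;   /* heap node: value + run label */

    static void sift_down(Entry *h, int size, int i)
    {
        for (;;) {
            int l = 2*i + 1, r = l + 1, m = i;
            if (l < size && h[l].key < h[m].key) m = l;
            if (r < size && h[r].key < h[m].key) m = r;
            if (m == i) return;
            Entry t = h[i]; h[i] = h[m]; h[m] = t;
            i = m;
        }
    }

    /* Merge k sorted runs of length len, stored back to back in 'in',
       into 'out'. Keeps one element per run in a k-entry min-heap. */
    void kway_merge(const int *in, int *out, int k, int len)
    {
        Entry *heap = malloc(k * sizeof(Entry));
        int *pos = malloc(k * sizeof(int));      /* next index within each run */
        int i, size = k;

        for (i = 0; i < k; i++) {
            heap[i].key = in[i * len];
            heap[i].run = i;
            pos[i] = 1;
        }
        for (i = k/2 - 1; i >= 0; i--) sift_down(heap, size, i);   /* heapify */

        for (i = 0; i < k * len; i++) {
            Entry min = heap[0];
            out[i] = min.key;                    /* extract the minimum */
            if (pos[min.run] < len)              /* refill from the same run */
                heap[0].key = in[min.run * len + pos[min.run]++];
            else                                 /* run exhausted: shrink heap */
                heap[0] = heap[--size];
            sift_down(heap, size, 0);
        }
        free(heap); free(pos);
    }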
This is Poly-Merge Sort
• Optimal in terms of cache misses
• Can adapt to long cache lines, sorting on disks, etc.
• Originally invented for sorting on tapes on a machine with several tape drives
• Often, Θ(n log n / log M) is really Θ(n) in practice
• Example:
• 32 KB cache, 4+4-byte elements
• 4096-way merges
• Can sort 64 MB of data in 1 merge, 256 GB in 2 merges
• But more merge passes with long cache lines
From Quick Sort to Sample Sort
• Plain quick sort incurs the same number of cache misses as plain merge sort
• Key idea:
• Choose a large random sample, Θ(M) elements
• Sort the sample
• Classify all the elements using binary searches (sketched below)
• Determine the size of each interval
• Partition
• Recursively sort the intervals
• Cache-miss count is probably similar to merge sort
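A sketch of the classification step (splitter is assumed to hold the P−1 sorted sample elements that separate the intervals):

    /* Return the interval index in [0, P) for element x,
       given P-1 sorted splitters. Plain binary search. */
    int classify(int x, const int *splitter, int P)
    {
        int lo = 0, hi = P - 1;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (x < splitter[mid]) hi = mid;   /* x falls left of splitter[mid] */
            else lo = mid + 1;                 /* x falls at or right of it */
        }
        return lo;
    }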
Issues in Sample Sort
• Main idea: partition the input into P intervals, classify elements, send the elements in the i-th interval to processor i, sort locally
• Most of the communication happens in one global all-to-all phase
• Load balancing: intervals must be similar in size
• How do we sort the sample?
Balancing the Load
• Select a random sample of sP elements
• OK even if every processor selects s
• The probability of an interval larger than cn/P grows linearly with n and shrinks exponentially with s; a large s virtually ensures uniform intervals
• Example:
• n = 10⁹, s = 256
• Pr[max interval > 2n/P] < 10⁻⁸
Sorting the Sample
• Can't do it recursively!
• Sending the sample to one processor:
• Θ(sP + n/P) communication in that processor
• Θ(sP log sP + (n/P) log(n/P)) work
• Not scalable, but OK for small P
• Alternative: use a different algorithm that works well for small n/P, e.g., radix sort
Distributed-Memory Radix Sort
• Sort blocks of r bits from least significant to most significant; use a stable sort
• Counting sort of one block:
• Sequentially: count occurrences using an array of size 2^r, compute prefix sums, use them as pointers
• In parallel: every processor counts occurrences, then pipelined parallel prefix sums, then send elements to their destination processors
• Θ((b/r)(2^r + n/P)) work & communication per processor, where b is the number of key bits
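A sequential sketch of one counting-sort pass (plain C; the parallel version replaces the single count array with per-processor counts combined by a parallel prefix sum):

    #include <stdlib.h>

    /* Stably sort n keys from 'in' to 'out' by the r-bit digit
       that starts at bit 'shift'. */
    void counting_pass(const unsigned *in, unsigned *out,
                       int n, int r, int shift)
    {
        int buckets = 1 << r;
        int *count = calloc(buckets, sizeof(int));
        int i, sum = 0;

        for (i = 0; i < n; i++)                  /* count occurrences */
            count[(in[i] >> shift) & (buckets - 1)]++;
        for (i = 0; i < buckets; i++) {          /* exclusive prefix sums */
            int c = count[i]; count[i] = sum; sum += c;
        }
        for (i = 0; i < n; i++)                  /* scatter; stable because
                                                    equal digits keep order */
            out[count[(in[i] >> shift) & (buckets - 1)]++] = in[i];
        free(count);
    }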
CPU Utilization Issues
• Avoid conditionals:
• In sorting algorithms, the compiler & processor cannot predict the outcome, so the pipeline stalls
• Example: partitioning in quicksort without conditionals (sketched below)
• Avoid stalling the pipeline:
• Even without conditionals, using a computed value too soon stalls the pipeline
• Same example
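A sketch of one way to partition without data-dependent branches (my reconstruction of a standard trick, not necessarily the exact code the lecture has in mind): write each element to both candidate positions and advance a pointer by the 0/1 comparison result.

    /* Partition in[0..n) around 'pivot' into out[0..n): elements < pivot
       grow from the left end, the rest from the right end. The loop body
       contains no branches for the processor to mispredict. */
    int partition_branchless(const int *in, int *out, int n, int pivot)
    {
        int lo = 0, hi = n - 1;
        int i;
        for (i = 0; i < n; i++) {
            int x = in[i];
            out[lo] = x;               /* speculatively write both ends; */
            out[hi] = x;               /* exactly one write survives     */
            int smaller = (x < pivot);
            lo += smaller;             /* advance using the comparison   */
            hi -= 1 - smaller;         /* result as an integer 0 or 1    */
        }
        return lo;    /* index of the first element >= pivot */
    }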
Dirty Tricks
• Exploit uniform input distributions approximately (quick sort, radix sort)
• Fix mistakes by bubbling
• To avoid conditionals when fixing mistakes:
• Split the array into small blocks
• Use AND's to check for mistakes in a block
• Fix a block only if it contains mistakes
• Some conditionals remain, but not many
• Compute the probability of mistakes to optimize
The Exercise
• Get sort.cilk; more instructions inside
• Convert the sequential merge sort into a parallel merge sort
• Make it as fast as possible as long as it is a parallel merge sort (e.g., make the bottom of the recursion fast)
• Convert the fast sort into the fastest parallel sort you can
• Submit files in your home directory, plus one page with output on 1 and 2 processors + a possible explanation (on one side of the page)