
Principles of Parallel Programming, First Edition, by Calvin Lin and Lawrence Snyder




Presentation Transcript


  1. Chapter 5: Scalable Algorithmic Techniques Principles of Parallel Programming, First Edition, by Calvin Lin and Lawrence Snyder

  2. Scalable program construction • Performance can be improved by increasing the problem size • Focus on data parallelism

  3. Ideal parallel computation • Large blocks of independent computation • BOINC projects at Berkeley, such as the SETI@home project • These kinds of projects are atypical

  4. Important principle Parallel programs are more scalable when they emphasize blocks of computation – typically the larger the block the better – that minimize the inter-thread dependencies.

  5. Schwartz’s algorithm • The tree should be used to connect processes rather than items • Given P < n • Encode as in 1.3 • Each process adds n/P items locally, then the P intermediate sums are combined with a P-leaf tree that connects the processes • All processes are working directly on the problem
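The pattern on this slide can be illustrated with a minimal sequential simulation (names and the chunking scheme are illustrative, not the book's Peril-L code): each of P "processes" sums its own block of roughly n/P items, then the P partial sums are combined pairwise up a P-leaf tree.

```python
# Sketch of the Schwartz summation pattern, simulated sequentially:
# phase 1 sums each process's local block; phase 2 combines the P
# partial sums pairwise, inducing the tree of Figure 5.1.

def schwartz_sum(data, P):
    n = len(data)
    block = n // P
    # Phase 1: each process sums its own block of ~n/P items
    # (the last process also takes any leftover elements).
    partial = [sum(data[p * block : (p + 1) * block if p < P - 1 else n])
               for p in range(P)]
    # Phase 2: pairwise tree combine; process 0 participates at every level.
    while len(partial) > 1:
        partial = [partial[i] + partial[i + 1] if i + 1 < len(partial)
                   else partial[i]
                   for i in range(0, len(partial), 2)]
    return partial[0]
```

The tree has only P leaves, so the combining phase costs O(log P) steps regardless of n; almost all the work is in the independent local sums.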

  6. Figure 5.1 Schwartz -Process-induced tree. Each process computes locally on a sequence of values (heavy lines), and then combines the results pair-wise, inducing a tree; notice that process 0 participates at each level in the tree.

  7. Figure 5.2 Schwartz algorithm inducing the tree of Figure 5.1. Line 8 loads the locally computed value into the tree; line 14 performs the summation when both operands are available. Threads exit when they have nothing left to do.

  8. Advocate use of reduce and scan • Even though they are not built into most programming languages • Code them as functions • High level • Conveys information about program logic

  9. Reduce/Scan are common/important • Reduce • Combines a set of values into a single result • Scan • Parallel prefix • Performs a sequential operation in parts • Carries intermediate results

  10. Kinds of Scans • Given A = {2, 4, 6} • Inclusive • +\A = {2, 6, 12} • Used by Peril-L • Exclusive • +\A = {0, 2, 6} • The first item is the identity element for the operation
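The two scan flavors can be sketched directly on the slide's example A = {2, 4, 6} (a sequential illustration of the semantics, not a parallel implementation):

```python
# Inclusive vs. exclusive +-scan: the inclusive result at position i
# includes element i; the exclusive result covers only elements 0..i-1
# and therefore starts with the identity (0 for +).

def inclusive_scan(a, op=lambda x, y: x + y, identity=0):
    out, acc = [], identity
    for v in a:
        acc = op(acc, v)
        out.append(acc)          # current item is included
    return out

def exclusive_scan(a, op=lambda x, y: x + y, identity=0):
    out, acc = [], identity
    for v in a:
        out.append(acc)          # current item is excluded
        acc = op(acc, v)
    return out
```

So `inclusive_scan([2, 4, 6])` yields {2, 6, 12} and `exclusive_scan([2, 4, 6])` yields {0, 2, 6}, matching the slide.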

  11. Examples of Reduces • 2nd smallest array element • Keep the smallest and 2nd smallest seen so far • If an array value is smaller, update each accordingly • Histogram – compute with k intervals • Use min and max reduces to find the smallest/largest values • Initialize a k-element array, hist, to 0’s • Iterate through the data, counting the interval each value belongs to
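A minimal sequential sketch of these two reduces (names are illustrative; the book's Peril-L versions appear in Figure 5.5): the second-smallest tally is a (smallest, second-smallest) pair with an associative combine, and the histogram tally is a k-element count array whose combine would be element-wise addition.

```python
import math

def second_min_combine(left, right):
    # Merge two (min, second-min) tallies: the two smallest of the
    # four candidate values form the combined tally.
    a = sorted(left + right)
    return (a[0], a[1])

def second_min_reduce(data):
    tally = (math.inf, math.inf)               # init(): nothing seen yet
    for v in data:                             # accum(): fold in one value
        tally = second_min_combine(tally, (v, math.inf))
    return tally[1]                            # reduce-gen(): 2nd smallest

def histogram_reduce(data, k):
    lo, hi = min(data), max(data)              # min/max reduces for the range
    hist = [0] * k                             # init(): k counts, all zero
    width = (hi - lo) / k or 1
    for v in data:                             # accum(): count v's interval
        hist[min(int((v - lo) / width), k - 1)] += 1
    return hist                                # combine() would add arrays
```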

  12. Examples (cont.) • Length of longest run of consecutive 1’s • current = 0, longest = 0 • current is the current run of 1’s • Answer is max(current, longest) • Index of first x • Create a 2-element temp array • temp[0] = x, temp[1] = +∞ • Iterate looking for x, keeping the smaller of the saved index and the found index
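The two continued examples can be sketched sequentially as follows (illustrative names; the +∞ sentinel for "not found yet" follows the slide):

```python
import math

def longest_run_of_ones(bits):
    # current tracks the run in progress; longest tracks the best so far.
    current = longest = 0
    for b in bits:
        current = current + 1 if b == 1 else 0
        longest = max(longest, current)
    return max(current, longest)       # answer per the slide

def index_of_first(data, x):
    tally = math.inf                   # +infinity means "not found yet"
    for i, v in enumerate(data):
        if v == x:
            tally = min(tally, i)      # keep the smaller index
    return tally
```

Both tallies combine associatively (min for the index; the run tally in a parallel version would also carry the prefix and suffix runs of each block so runs spanning block boundaries are counted).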

  13. Basic structure of reduce and scan • Local variable tally stores intermediate results • Functions • init() initializes tally • accum() performs local accumulation • combine() composes intermediate tally results and passes them to the parent • x-gen() takes the global result and generates the final answer • These will vary between scan and reduce

  14. Example +/A (reduce) • init() – tally = 0 • accum(tally, val) – tally = tally + val • combine(left, right) – adds the left and right tally values and passes the tally to the parent • reduce-gen(root) has nothing to do; it returns its argument as the global result • Logic shown on the next slide
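The four-function skeleton from the previous slide, with the +/A components plugged in, can be sketched as a sequential simulation (a real Peril-L version runs accum per thread and combines up a tree with full/empty memory; here the tree is simulated over chunks, and all names are illustrative):

```python
# Generalized-reduce skeleton: init/accum run per "thread" (chunk),
# combine folds tallies up a simulated combining tree, and the
# generator function produces the final answer at the root.

def generalized_reduce(data, init, accum, combine, gen, chunks=4):
    step = max(1, len(data) // chunks)
    tallies = []
    for start in range(0, len(data), step):
        tally = init()
        for v in data[start:start + step]:
            tally = accum(tally, v)          # local accumulation
        tallies.append(tally)
    while len(tallies) > 1:                  # simulated combining tree
        tallies = [combine(tallies[i], tallies[i + 1])
                   if i + 1 < len(tallies) else tallies[i]
                   for i in range(0, len(tallies), 2)]
    return gen(tallies[0])                   # reduce-gen at the root

plus_reduce = lambda a: generalized_reduce(
    a,
    init=lambda: 0,                          # init(): tally = 0
    accum=lambda tally, val: tally + val,    # accum(): tally += val
    combine=lambda l, r: l + r,              # combine(): add child tallies
    gen=lambda root: root)                   # reduce-gen(): identity
```

Swapping in the secondMin or histogram functions from slide 11 reuses the same skeleton unchanged, which is the point of the generalized structure.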

  15. Figure 5.3

  16. Figure 5.4 Peril-L code for the generalized reduce logic. Notice the sites for the four component functions. The tree combining relies on the use of full/empty memory, which drives the tree accumulation. As threads complete their roles in the combining tree, they terminate.

  17. Figure 5.5 The four generalized reduce functions implementing secondMin reduce. The tally is a two-element struct.

  18. Generalized Scan Like reduce, except that after combining, the intermediate results are passed down the combining tree. The value that each process receives from its parent is the tally for the values to the left of the parent’s leftmost leaf.
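This up-then-down structure can be sketched sequentially (the chunks stand in for per-thread blocks; names are illustrative, not the book's Figure 5.7 code): tallies go up as in reduce, then each block receives the combined tally of everything to its left (its ptally) and reprocesses its own data starting from that value.

```python
# Generalized scan sketch: upward pass computes per-block tallies,
# downward pass gives each block the tally of all values to its left,
# and each block then rescans its own data from that starting value.

def fold(op, identity, items):
    acc = identity
    for v in items:
        acc = op(acc, v)
    return acc

def generalized_scan(data, op, identity, chunks=4):
    step = max(1, len(data) // chunks)
    blocks = [data[i:i + step] for i in range(0, len(data), step)]
    tallies = [fold(op, identity, b) for b in blocks]   # upward pass
    ptally, acc = [], identity
    for t in tallies:                   # downward pass: left-of-block tally
        ptally.append(acc)
        acc = op(acc, t)
    out = []
    for p, b in zip(ptally, blocks):    # reprocess block starting at ptally
        acc = p
        for v in b:
            acc = op(acc, v)
            out.append(acc)             # inclusive scan result
    return out
```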

  19. Examples of Scan Team Standings Keep the longest sequence of 1s Index of Last Occurrence

  20. Figure 5.6

  21. Figure 5.7 Generalized scan program. The down sweep of the tally values, beginning on line 35, distributes intermediate results to all threads to compute the final result (line 44).

  22. Figure 5.7 Generalized scan program. The down sweep of the tally values, beginning on line 35, distributes intermediate results to all threads to compute the final result (line 44). (cont.)

  23. Examples of Scan • Given an array A of ints 1, …, k • lastOccurrence \ A returns at position i the index of the most recent occurrence of A[i] • accum stores i in tally[j], the last occurrence • combine takes the max of each element • The scan generator reprocesses the block of data using ptally as its initial value
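A sequential sketch of this scan, read here as "the most recent occurrence of A[i] strictly before position i" (the nontrivial reading; inclusively, the answer at i would always be i itself). The tally is a k-element array of last-seen indices, and a parallel combine would take the element-wise max of two tallies, as the slide says.

```python
# lastOccurrence scan sketch for values drawn from 1..k: tally[j]
# holds the last index at which value j+1 was seen; -1 means unseen.

def last_occurrence_scan(a, k):
    tally = [-1] * k
    result = []
    for i, v in enumerate(a):
        result.append(tally[v - 1])    # last index of A[i] before i
        tally[v - 1] = i               # accum: store i in tally[v-1]
    return result
```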

  24. Figure 5.8 Customized scan functions to return the index of the last occurrence of the element in the ith operand position; the tally is a globally allocated array of k elements.

  25. Assigning work to processes statically • Block allocations • Exploit locality • Better than complete rows • Yield less communication • 4×4 block => 16 edge elements • 16-element row => 2*16 = 32 edge elements
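The edge counts above follow from the block's perimeter: under a 4-point stencil, each cell on each side of the block references one neighbor owned by another process. A quick check (note the slide's 32 for the row counts only the top and bottom edges; including the two ends gives 34, so the block's advantage is if anything larger):

```python
# Count neighbor references that cross a block's boundary under a
# 4-point stencil (ignoring the array's outer border): one reference
# per cell on each of the four sides of the block.

def boundary_references(rows, cols):
    return 2 * rows + 2 * cols

block = boundary_references(4, 4)    # 4x4 block of 16 elements
row = boundary_references(1, 16)     # 1x16 row of 16 elements
```

The same 16 elements incur half (or less) the communication when shaped as a square block, which is why block allocations beat complete rows.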

  26. Figure 5.9

  27. Overlap Regions Stencil computations reference neighboring elements. Allocate extra space to hold copies of the neighbors.
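A minimal sketch of the extra space (often called a halo or ghost region; names here are illustrative): each process allocates one extra row and column on every side, copies its owned data into the interior, and fills the border from its neighbors so the stencil never reaches off-process.

```python
# Allocate a block with a one-cell overlap region on every side.
# The interior holds the process's own data; the border would be
# filled by exchanging edge values with neighboring processes.

def with_halo(block):
    rows, cols = len(block), len(block[0])
    halo = [[0.0] * (cols + 2) for _ in range(rows + 2)]
    for r in range(rows):
        for c in range(cols):
            halo[r + 1][c + 1] = block[r][c]
    return halo

def stencil_average(halo, r, c):
    # 4-point stencil over haloed coordinates: all four neighbor
    # reads are local, even for cells on the block's edge.
    return (halo[r - 1][c] + halo[r + 1][c] +
            halo[r][c - 1] + halo[r][c + 1]) / 4.0
```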

  28. Figure 5.10

  29. Cyclic and block cyclic allocations Static block allocations may result in poor load balance when work is not proportional to the amount of data. Processes that own the black and white portions have less work to do. After 25% of the computation is done, 7 processes have nothing to do; the last 25% is done by P15 alone.

  30. Figure 5.11

  31. Solution – use cyclic distribution • Allocate elements to processes in a round-robin fashion • Cyclic allocation balances hot spots • A small block size incurs communication overhead with neighbors • Small blocks do not exploit locality • The block size must be carefully chosen
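The round-robin assignment reduces to a simple owner function, sketched here for one dimension (a 2D block-cyclic allocation like Figure 5.13 applies the same formula independently to row and column block indices):

```python
# Owner functions for cyclic and block-cyclic allocations: elements
# (or fixed-size blocks of elements) are dealt to P processes in
# round-robin order, spreading late-stage hot spots across processes.

def cyclic_owner(i, P):
    return i % P                     # element i goes to process i mod P

def block_cyclic_owner(i, block, P):
    return (i // block) % P          # deal whole blocks round-robin
```

Block size trades off the two concerns on this slide: block = 1 gives pure cyclic allocation (best balance, worst locality), while a very large block degenerates to the static block allocation (best locality, worst balance).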

  32. Figure 5.12 Illustration of a cyclic distribution of an 8 × 8 array onto five processes.

  33. Figure 5.13 Block-cyclic allocation of 3 × 2 blocks to a 14 × 14 array distributed to four processes (colors).

  34. Figure 5.14 The block-cyclic allocation of Figure 5.13 midway through the computation; the blocks to the right summarize the active values for each process.

  35. Julia sets need load balancing • z_{n+1} = z_n^2 + c • c is a complex coefficient that determines the shape
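A minimal escape-time sketch of the iteration shows why static allocations balance poorly here (the iteration limit and escape bound are conventional choices, not from the slides): points that stay bounded cost the full iteration budget, while points that escape cost almost nothing, so the work per pixel varies wildly across the image.

```python
# Iterate z_{n+1} = z_n^2 + c and report how many iterations a point
# survives before escaping; cost per point is data-dependent.

def julia_iterations(z, c, limit=100, bound=2.0):
    for n in range(limit):
        if abs(z) > bound:
            return n          # escaped quickly: a cheap point
        z = z * z + c
    return limit              # stayed bounded: an expensive point
```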

  36. Figure 5.15 Julia set generated from the site http://aleph0.clarku.edu/~djoyce.

  37. Figure 5.16 Example of an unstructured grid representing the pressure distribution on two airfoils. Image from http://fun3d.larc.nasa.gov/example-24.html.

  38. Assigning work dynamically • Work queue • Data structure for dynamically assigning work to threads or processes • Tasks are added at one end and removed from the other • Example is the Collatz Conjecture (in text) • Example of producer/consumer
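A small sketch of the work-queue pattern driving a Collatz computation (this computes sequence lengths rather than the text's expansion factor, and the thread count is an illustrative choice): starting values are produced into a shared queue, and worker threads consume from the other end, so each thread picks up new work as soon as it finishes, whatever each task costs.

```python
# Producer/consumer work queue: tasks go in at one end, worker
# threads take them from the other, balancing uneven task costs.
import queue
import threading

def collatz_length(n):
    steps = 0
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        steps += 1
    return steps

def run_workers(values, nthreads=4):
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()
    for v in values:
        tasks.put(v)                     # producer end

    def worker():
        while True:
            try:
                v = tasks.get_nowait()   # consumer end
            except queue.Empty:
                return                   # nothing left: thread exits
            length = collatz_length(v)
            with lock:
                results[v] = length

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```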

  39. Figure 5.17 Code for computing the expansion factor for the Collatz Conjecture.

  40. Figure 5.17 Code for computing the expansion factor for the Collatz Conjecture (cont.).

  41. Hoard memory allocator • Solves the problem of memory allocated in one process and freed in another • Principles • Limit local memory usage • Manage memory in large blocks (p. 139)

  42. Trees • Challenges • Local pointers • Trees are dynamic, which may cause communication issues • Irregular structure challenges reasoning about communication and load balancing

  43. Figure 5.18 Cap allocation for a binary tree on P = 8 processes. Each process is allocated one of the leaf subtrees, along with a copy of the cap (shaded).

  44. Figure 5.19 Logical tree representations: (a) a binary tree where P = 8; (b) a binary tree where P = 6.

  45. Figure 5.20 Enumerating the Tic-Tac-Toe game tree; a process is assigned to search the games beginning with each of the four initial move sequences.
