
Gather/Scatter, Parallel Scan, and Applications

Learn the fundamentals of scatter, gather, and parallel scan algorithms, their implementation in various DSLs, and practical applications. Explore how to build complex algorithms from these parallel building blocks.


Presentation Transcript


  1. Gather/Scatter, Parallel Scan, and Applications. Lee Barford, v2011-09-15

  2. Outline • Continue to add parallel algorithm building blocks • Scatter and gather algorithms: array dereferencing in parallel • Parallel scan algorithm: what it does • How it is implemented will be covered in a paper next week • Other algorithms built on parallel scan & scatter/gather • Demonstrate usefulness of parallel scan and scatter/gather • Illustrations of how to “think in parallel” • Building more complex parallel algorithms from simpler building blocks

  3. Scatter: Parallel assignment to array elements • Vector-indexed assignment appears in a number of DSLs: • Matlab, Scilab, Numeric Python, R, Fortran 90 • A reasonable definition of assignment to a vector, indexed by a vector: • x[v] = y • x, y vectors of compatible types; v vector of index type; v & y same length • x[v[0]] = y[0]; x[v[1]] = y[1]; x[v[2]] = y[2]; … • Undefined result if v[i]==v[j] for any i ≠ j • Corresponding abstract parallel operation is called “scatter”: • scatter(v, y, x)
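The definition of x[v] = y above can be written out as a sequential reference. This is only a sketch in plain C++ (no Thrust); the name `scatter_ref` and the bounds checks are assumptions made for illustration, not part of the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential reference for scatter(v, y, x): x[v[i]] = y[i] for every i.
// A parallel version handles each i independently, which is why duplicate
// indices in v leave the result undefined.
template <typename T>
void scatter_ref(const std::vector<std::size_t>& v,
                 const std::vector<T>& y,
                 std::vector<T>& x) {
    assert(v.size() == y.size());   // v and y must have the same length
    for (std::size_t i = 0; i < v.size(); ++i) {
        assert(v[i] < x.size());    // every index must fall inside x
        x[v[i]] = y[i];
    }
}
```

Because no iteration reads what another iteration writes, the loop body can be handed to one thread per element as long as the indices in v are distinct.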

  4. Scatter: Code example
     #include <thrust/scatter.h>
     #include <thrust/device_vector.h>
     int values[10] = {1, 0, 1, 0, 1, 0, 1, 0, 1, 0};
     int map[10]    = {0, 5, 1, 6, 2, 7, 3, 8, 4, 9};
     // copy the host arrays to the device
     thrust::device_vector<int> d_values(values, values + 10);
     thrust::device_vector<int> d_map(map, map + 10);
     thrust::device_vector<int> d_output(10);
     // d_output[d_map[i]] = d_values[i]
     thrust::scatter(d_values.begin(), d_values.end(), d_map.begin(), d_output.begin());
     // d_output is now {1, 1, 1, 1, 1, 0, 0, 0, 0, 0}

  5. Scatter_if: Scatter when predicate array element is true
     #include <thrust/scatter.h>
     int V[8] = {10, 20, 30, 40, 50, 60, 70, 80};  // values
     int M[8] = {0, 5, 1, 6, 2, 7, 3, 4};          // map (destination indices)
     int S[8] = {1, 0, 1, 0, 1, 0, 1, 0};          // stencil (predicate array)
     int D[8] = {0, 0, 0, 0, 0, 0, 0, 0};          // destination
     // D[M[i]] = V[i] only where S[i] is true
     thrust::scatter_if(V, V + 8, M, S, D);
     // D contains [10, 30, 50, 70, 0, 0, 0, 0]

  6. Gather: Build a vector from a vector-indexed vector • In matrix/vector DSLs: y = x[v] • y[0] = x[v[0]]; y[1] = x[v[1]]; … • x and y must be compatible types • v must be an index type • y must be able to store at least length(v) elements • Corresponding abstract parallel operation is called “gather”: • gather(v, x, y)
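The definition of y = x[v] above can likewise be written as a sequential reference. This is a sketch in plain C++; `gather_ref` is a hypothetical name used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential reference for gather(v, x, y): y[i] = x[v[i]] for every i.
// Every output element is written exactly once, so repeated indices in v
// are harmless here.
template <typename T>
std::vector<T> gather_ref(const std::vector<std::size_t>& v,
                          const std::vector<T>& x) {
    std::vector<T> y(v.size());     // y holds exactly length(v) elements
    for (std::size_t i = 0; i < v.size(); ++i) {
        assert(v[i] < x.size());    // every index must fall inside x
        y[i] = x[v[i]];
    }
    return y;
}
```

Gather reads from computed locations and writes to consecutive ones, so it has no write conflicts to worry about.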

  7. Gather: Code example
     #include <thrust/gather.h>
     #include <thrust/device_vector.h>
     // mark even indices with a 1; odd indices with a 0
     int values[10] = {1, 0, 1, 0, 1, 0, 1, 0, 1, 0};
     int map[10]    = {0, 2, 4, 6, 8, 1, 3, 5, 7, 9};
     // copy the host arrays to the device
     thrust::device_vector<int> d_values(values, values + 10);
     thrust::device_vector<int> d_map(map, map + 10);
     thrust::device_vector<int> d_output(10);
     // d_output[i] = d_values[d_map[i]]
     thrust::gather(d_map.begin(), d_map.end(), d_values.begin(), d_output.begin());
     // d_output is now {1, 1, 1, 1, 1, 0, 0, 0, 0, 0}

  8. Other forms of scatter and gather in Thrust • Scatter_if, where predicate is a unary function <T,bool> • Gather_if, where predicate is an array • Gather_if, where predicate is a unary function <T,bool>

  9. GPU Skeptics Say: • “GPUs are massively data parallel. Therefore GPUs are no good on algorithms with extensive dependencies among the input data.” • “My problem is inherently serial” • Partial Answer: Parallel scan is good at handling many such dependencies in parallel. • “Parallel scan” is also called “parallel prefix”

  10. Parallel scan
      scan(+, [1 2 3 -2]) = [1  1+2  1+2+3  1+2+3+(-2)] = [1 3 6 4]
      Implementation may non-deterministically evaluate partial sums with any possible parenthesization:
      1+(2+(3+(-2)))   (1+2)+(3+(-2))   1+((2+3)+(-2))   (1+(2+3))+(-2)   …
      • Properties of integers under + required:
      • Closed: a, b are integers ⇒ a+b is an integer
      • Associative: (a+b)+c = a+(b+c)
      • Can use parallel scan for any set & operation w/these properties (that is, any semigroup)
      I will only use the word “semigroup” for “closed and associative.”
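To make the "any semigroup" point concrete, here is a minimal sequential sketch of an inclusive scan that takes the operator as a parameter. It uses `std::partial_sum`, which always evaluates left to right; a parallel implementation is free to regroup the operations, which is exactly why associativity is required. The wrapper name `scan` mirrors the slide's notation and is not a library function.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Inclusive scan of `in` under a closed, associative operator `op`.
// Sequential stand-in for a parallel scan: same result, fixed grouping.
template <typename T, typename Op>
std::vector<T> scan(Op op, const std::vector<T>& in) {
    std::vector<T> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin(), op);
    // An inclusive scan never changes the first element.
    assert(in.empty() || out.front() == in.front());
    return out;
}
```

Passing `std::plus<int>()` gives running sums; passing an operator that returns the larger of its two arguments gives a running maximum, since max is also closed and associative on the integers.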

  11. Example of a finite semigroup
      Q: How do I know it’s associative? A: Method of exhaustion: I tried all 27 possibilities.
      scan(⊕₁, [I I L I I I H H I I I]) = [I I L L L L H H H H H]
      • Replaces an I with the last L or H
      • “Remembers” whether L or H was last
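The operator table for this slide did not survive in the transcript, so the sketch below assumes a definition inferred from the two bullets: the right operand wins unless it is I, in which case the left operand is kept. Under that assumption, the exhaustive check over all 3 × 3 × 3 = 27 triples and the scan itself look like this (`op1`, `op1_is_associative`, and `scan_op1` are illustrative names).

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>

// Assumed operator on {I, L, H}: keep the left operand only when the
// right operand is I; otherwise the right operand (L or H) wins.
char op1(char a, char b) { return b == 'I' ? a : b; }

// Method of exhaustion: test (a op b) op c == a op (b op c) on all 27 triples.
bool op1_is_associative() {
    const char elems[3] = {'I', 'L', 'H'};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                if (op1(op1(elems[i], elems[j]), elems[k]) !=
                    op1(elems[i], op1(elems[j], elems[k])))
                    return false;
    return true;
}

// Sequential inclusive scan under op1 over a string of I/L/H symbols.
std::string scan_op1(const std::string& in) {
    for (std::size_t i = 0; i < in.size(); ++i)
        assert(in[i] == 'I' || in[i] == 'L' || in[i] == 'H');
    std::string out(in.size(), 'I');
    std::partial_sum(in.begin(), in.end(), out.begin(), op1);
    return out;
}
```

With this operator, each output symbol is the most recent L or H seen so far, or I if none has occurred yet.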

  12. Parallel scan on finite semigroups computes the same things as finite sequential machines do
      [Figure: maps f and g between Machine(·) and scan(⊕₁, ·), annotated “Linear # states” and “Exp # elements”; state diagram with transitions labelled I/I, L/L, H/H]
      scan(⊕₁, v) = Machine(v)
      Since all computers are finite sequential machines, parallel scan is a “universal” parallel algorithm. BUT it is not always practical.

  13. What about parallel scan over “infinite” semigroups? • “Infinite”: integers, floats • Can do more than with finite sets • I don’t know of a nice characterization like in the finite case

  14. Unique
      Problem: Pack the array x so that it contains the unique values in x. x is assumed to be sorted.
      unique(x)
      index i:                                  0   1   2   3   4   5   6   7
      Input x:                                 -7   3   7   7  10  10  10  11
      b = map(i → x[i]!=x[i-1], x), b[0] = 1:   1   1   1   0   1   0   0   1
      ix = scan(+, b):                          1   2   3   3   4   4   4   5   (last entry = output length)
      ix’ = map(i → i-1, ix):                   0   1   2   2   3   3   3   4
      scatter_if(ix’, b, x, y):
      index:                                    0   1   2   3   4
      Output y:                                -7   3   7  10  11
      Output is still sorted
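The map, scan, and scatter_if steps of unique can be sketched sequentially in plain C++. The flag rule used here (b[i] is 1 when x[i] differs from its predecessor, and b[0] is 1) is an assumption chosen to match the b row of the worked example; `unique_by_scan` is an illustrative name, not the Thrust function.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// unique() on a sorted vector, built from map + scan + scatter_if.
std::vector<int> unique_by_scan(const std::vector<int>& x) {
    const std::size_t n = x.size();
    if (n == 0) return std::vector<int>();
    // map: b[i] = 1 where x[i] starts a new run of equal values
    std::vector<int> b(n);
    for (std::size_t i = 0; i < n; ++i)
        b[i] = (i == 0 || x[i] != x[i - 1]) ? 1 : 0;
    // scan: ix[i] = number of flagged elements in x[0..i]
    std::vector<int> ix(n);
    std::partial_sum(b.begin(), b.end(), ix.begin());
    assert(ix.back() >= 1);                 // b[0] is always 1
    std::vector<int> y(ix.back());          // last scan entry = output length
    // scatter_if: flagged elements go to position ix[i] - 1
    for (std::size_t i = 0; i < n; ++i)
        if (b[i]) y[ix[i] - 1] = x[i];
    return y;
}
```

Each of the three loops is independent across i once the scan is available, which is what makes the whole algorithm parallel when the scan is.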

  15. Copy_if (a/k/a stream compaction)
      Problem: Make a packed array y containing the values of the elements x[i] for which pred(x[i]) is true.
      copy_if(x, y, pred)
      index i:                  0   1   2   3   4   5   6   7
      Input x:                 -7   3  17   0  10  13   1   9
      Input pred:               0   0   1   0   1   1   0   0
      ix = scan(+, pred):       0   0   1   1   2   3   3   3   (last entry = output length)
      ix’ = map(i → i-1, ix):  -1  -1   0   0   1   2   2   2
      scatter_if(ix’, pred, x, y):
      index:                    0   1   2
      Output y:                17  10  13
      copy_if is stable: Relative order in y is the same as in x
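Stream compaction follows the same pattern with a caller-supplied predicate. The sketch below is a sequential plain-C++ illustration of the steps; `copy_if_by_scan` is a hypothetical name and this is not how Thrust implements `copy_if` internally.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// copy_if built from map + scan + scatter_if. Stable: kept elements
// appear in y in the same relative order as in x.
template <typename Pred>
std::vector<int> copy_if_by_scan(const std::vector<int>& x, Pred pred) {
    const std::size_t n = x.size();
    if (n == 0) return std::vector<int>();
    // map: evaluate the predicate once per element
    std::vector<int> flag(n);
    for (std::size_t i = 0; i < n; ++i) flag[i] = pred(x[i]) ? 1 : 0;
    // scan: ix[i] = number of kept elements in x[0..i]
    std::vector<int> ix(n);
    std::partial_sum(flag.begin(), flag.end(), ix.begin());
    assert(ix.back() >= 0);
    std::vector<int> y(ix.back());          // last scan entry = output length
    // scatter_if: kept elements go to position ix[i] - 1
    for (std::size_t i = 0; i < n; ++i)
        if (flag[i]) y[ix[i] - 1] = x[i];
    return y;
}
```

Positions where the flag is 0 would map to index ix[i] - 1 of an unrelated element (or to -1), which is why the scatter must be conditional.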

  16. Other similar algorithms in Thrust • unique_by_key(k, v): k=keys, v=values • Retains those k[i]’s & v[i]’s that have unique k[i]’s, and packs k & v • remove_if(x, pred): Remove x[i] from x if pred(x[i]), and pack • remove(x, z): Remove x[i] from x if x[i]==z, and pack • unique_copy(), unique_by_key_copy(), remove_if_copy(), remove_copy(): • Same as unique(), unique_by_key(), remove_if(), remove() except put output into another output iterator instead of overwriting the first input

  17. Segmented parallel scan
      Begin the sum over again when indicated by a predicate or predicate vector:
      x = [1 7 2 3 2 8 13]
      s = [0 0 1 0 0 1 0]
      segmented_scan(+, x, s) = [1 8 2 5 7 8 21]
      segmented_scan(⊕, x, s) = scan(⊕₂, ((x1,s1), (x2,s2), …))
      (⊕ associative ⇒ ⊕₂ is associative.)
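The slide does not spell out the lifted operator on (value, flag) pairs. A commonly used choice, assumed here, is: if the right operand starts a segment, the result is the right operand's value; otherwise the values are combined; the flags are OR-ed. The sketch below applies that operator with an ordinary sequential scan (`Flagged`, `op2`, and `segmented_scan` are illustrative names).

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

typedef std::pair<int, int> Flagged;   // (value, segment-start flag)

// Assumed lifted operator: restart at a flagged right operand,
// otherwise add; the resulting flag is the OR of both flags.
Flagged op2(const Flagged& a, const Flagged& b) {
    return b.second ? Flagged(b.first, 1)
                    : Flagged(a.first + b.first, a.second);
}

// Segmented inclusive sum: an ordinary scan over pairs under op2.
std::vector<int> segmented_scan(const std::vector<int>& x,
                                const std::vector<int>& s) {
    assert(x.size() == s.size());
    std::vector<Flagged> pairs(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        pairs[i] = Flagged(x[i], s[i] != 0);
    std::partial_sum(pairs.begin(), pairs.end(), pairs.begin(), op2);
    std::vector<int> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = pairs[i].first;
    return out;
}
```

The point of the construction is that the segmented problem needs no special algorithm: any scan implementation that accepts a user-defined associative operator can run it.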

  18. Segmented parallel scan: Applications • Sparse matrix algorithms • Graph algorithms

  19. Inclusive vs exclusive scan
      Inclusive scan: the first output equals the first input (what we’ve used so far).
      scan(+, [1 2 3 -2]) = [1  1+2  1+2+3  1+2+3+(-2)] = [1 3 6 4]
      Exclusive scan starts with an initial value.
      scan(+, [1 2 3 -2], init=0) = [0  0+1  0+1+2  0+1+2+3] = [0 1 3 6]
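The difference between the two variants is only where the running value is recorded relative to the update. A minimal sequential sketch in plain C++ follows; the function names are illustrative and deliberately avoid the C++17 `std::inclusive_scan` / `std::exclusive_scan` so that the code builds with older compilers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive: out[i] includes in[i]. Update first, then record.
std::vector<int> inclusive_scan_ref(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        running += in[i];
        out[i] = running;
    }
    return out;
}

// Exclusive: out[i] excludes in[i]. Record first, then update.
std::vector<int> exclusive_scan_ref(const std::vector<int>& in, int init) {
    std::vector<int> out(in.size());
    int running = init;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;
        running += in[i];
    }
    return out;
}

// Small self-check on the input used in the slide.
void demo_inclusive_vs_exclusive() {
    const std::vector<int> in = {1, 2, 3, -2};
    assert((inclusive_scan_ref(in) == std::vector<int>{1, 3, 6, 4}));
    assert((exclusive_scan_ref(in, 0) == std::vector<int>{0, 1, 3, 6}));
}
```

With init = 0, the exclusive result is the inclusive result shifted right by one position, and the final total is not part of the exclusive output.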

  20. Scans in Thrust Plus others where the Associative Operator defaults to +.

  21. Segmented scans in Thrust + 6 other versions with other parameter combinations

  22. Transformed scans in Thrust: Increases computation per memory read

  23. Thrust set operations • Operate on sorted inputs • Produce sorted outputs • Union, intersection, difference • Symmetric difference(A,B) = {e | e in A or e in B but not both}

  24. Summary • Scatter and gather: • Parallel assignment • Parallel rearrangement • Parallel removal of irrelevant entries • Scan & segmented scan: • One way to parallelize many (but not all) seemingly difficult to parallelize algorithms • Parsing, graph algorithms, sparse matrix algorithms, … • Can take some creativity to develop an associative operation (semigroup) for a particular algorithm • All of the algorithms described are in Thrust

  25. BACKUP SLIDES

  26. Theorem: There is a computable, 1-1 correspondence f between finite semigroups (S, ⊕₁) and XXXXXX finite state machines so that (S, ⊕₁) and f((S, ⊕₁)) always compute the same output from the same input.

  27. Parallel CPU Scan Algorithm
      scan(+, [0 1 2 3 -2 7]) = [0 1 3 6 4 11]
      input:      0       1      |  2       3      |  -2        7
      pass 1:     0+0=0   0+1=1  |  2       2+3=5  |  -2        -2+7=5
      recursion:  scan(+, [1, 5, 5]) = [1, 6, 11]
      pass 2:     0+0=0   0+1=1  |  1+2=3   3+3=6  |  6+(-2)=4  4+7=11
      • Only properties of integers under + used:
      • Left identity, 0: 0+a = a
      • Associativity: (a+b)+c = a+(b+c)
      • Can use parallel scan for any set & operation w/these properties
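The two-pass structure above can be sketched in plain C++ as follows. The blocks are processed in ordinary loops here; in a real implementation each iteration of the two block loops would run on its own core. The name `blocked_scan` and the explicit block-size parameter are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Two-pass blocked inclusive sum. Pass 1 and pass 2 are independent
// across blocks; only the short scan of block totals is sequential.
std::vector<int> blocked_scan(const std::vector<int>& in, std::size_t block) {
    assert(block > 0);
    const std::size_t n = in.size();
    const std::size_t nblocks = (n + block - 1) / block;
    std::vector<int> out(n);
    std::vector<int> totals(nblocks);
    // Pass 1: each block reduces its own elements to one total.
    for (std::size_t b = 0; b < nblocks; ++b) {
        const std::size_t lo = b * block;
        const std::size_t hi = std::min(n, lo + block);
        totals[b] = std::accumulate(in.begin() + lo, in.begin() + hi, 0);
    }
    // Recursion step: scan the block totals (sequential in this sketch).
    std::partial_sum(totals.begin(), totals.end(), totals.begin());
    // Pass 2: each block re-reads its input, seeded with the cumulative
    // total of all preceding blocks (0, the left identity, for block 0).
    for (std::size_t b = 0; b < nblocks; ++b) {
        const std::size_t lo = b * block;
        const std::size_t hi = std::min(n, lo + block);
        int running = (b == 0) ? 0 : totals[b - 1];
        for (std::size_t i = lo; i < hi; ++i) {
            running += in[i];
            out[i] = running;
        }
    }
    return out;
}
```

The input is read twice, once per pass, and the only extra storage is one total per block.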

  28. Gather: Example use of scan(+,.) (a/k/a stream compaction)
      Problem: Make a packed array containing the elements that are >= 10
      index i:             0   1   2   3   4   5   6   7
      Input x:            -7   3  17   0  10  13   1   9
      b = map(x[i]>=10):   0   0   1   0   1   1   0   0
      ix = scan(+, b):     0   0   1   1   2   3   3   3   (last entry = output length)
      map(if b[i] then y[ix[i]-1] = x[i])
      index:               0   1   2
      Output y:           17  10  13

  29. Application of segmented scan: Sparse matrix-vector multiplication (SpMV)
      • One of the most important operations in scientific code
      • Serial: typically graph-like, using adjacency lists
      • Sparse ⇒ matrix has cN non-zeros, for some c << N
      Example matrix:
      [ 1 2 0 0 ]
      [ 0 3 9 0 ]
      [ 0 1 4 0 ]
      • A  = [ 1 2 3 9 1 4 ]   The nonzero elements
      • SA = [ 1 0 1 0 1 0 ]   1 at the start of a new row
      • JA = [ 0 1 1 2 1 2 ]   JA[i] is the column index for A[i]

  30. SpMV as a segmented scan (a, v) ⊕ (a’, v’) = (a+a’*v’, v’)
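One way to see SpMV as a segmented scan, using the A, SA, JA arrays of the previous slide, is: gather the vector entries x[JA[i]], multiply them by A[i], run a segmented sum with SA as the segment-start flags, and keep the last value of each segment. The sketch below follows that simpler formulation, not the pair operator shown on this slide; it is sequential plain C++, and `spmv_segmented` is an illustrative name. It assumes every row has at least one stored element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y = M * x for a sparse matrix M stored as nonzeros A, segment-start
// flags SA (1 at the first nonzero of each row), and column indices JA.
std::vector<int> spmv_segmented(const std::vector<int>& A,
                                const std::vector<int>& SA,
                                const std::vector<int>& JA,
                                const std::vector<int>& x) {
    assert(A.size() == SA.size() && A.size() == JA.size());
    const std::size_t n = A.size();
    std::vector<int> seg(n);
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t col = static_cast<std::size_t>(JA[i]);
        assert(col < x.size());
        const int product = A[i] * x[col];           // gather + map
        seg[i] = (i == 0 || SA[i]) ? product          // segment restart
                                   : seg[i - 1] + product;
    }
    // The last element of each segment is that row's dot product.
    std::vector<int> y;
    for (std::size_t i = 0; i < n; ++i)
        if (i + 1 == n || SA[i + 1]) y.push_back(seg[i]);
    return y;
}
```

The work per nonzero is constant, so the total cost is proportional to the number of nonzeros rather than to the full matrix size.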

  31. Parallel scan: summary • Way to parallelize many algorithms that seem at first to have difficult dependencies among the inputs • Using parallel scan effectively can take some forethought • “Semigroup design” is not necessarily straightforward

  32. Other Applications of Scan • Sorting • Sparse matrix multiply, iterative solve • ODEs & PDEs: CAD & simulation: EE, MechE, physics, chem/biochem • Optimization • Linear programming • Nonlinear programming • Other graph algorithms

  33. Conclusion • Scan is one tool that can be used to parallelize lots of algorithms that seem at first to be inherently serial • Usually, other patterns (map, reduce, …) are combined with scan to create a complete algorithm • Understanding use & performance of scan is one key to judging applicability and performance of GPUs to non-graphics problems

  34. Parallel Transition Localization
      Samples = vector of numbers → Vector s of S’s
      • Set S with an operator ⊕
      • 0⊕ ∈ S such that 0⊕ ⊕ a = a
      • ⊕ is associative
      parallel_scan(⊕, [s0, s1, s2, …]) = [c0, c1, c2, …] = c
      Identify transitions in c → Transitions
      Barford, “Parallel Transition Localization,” Proc. IEEE I2MTC, 2010.

  35. Implementation
                  Core 1            Core 2                 Core 3
      samples:    x0   x1        |  x2      x3          |  x4      x5
      pass 1:     s0   s0⊕s1     |  s2      s2⊕s3       |  s4      s4⊕s5
      recursion:  cumsum(⊕, [s0⊕s1, s2⊕s3, s4⊕s5]) = [c1, c3, c5]
      pass 2:     s0   s0⊕s1     |  c1⊕s2   c1⊕s2⊕s3    |  c3⊕s4   c3⊕s4⊕s5
      After pass 2, each of the six outputs is tested: Transition?
      • Samples read twice
      • Other memory needed: constant amount per core (no temp arrays)
