
High-Performance Grid Computing and Research Networking



Presentation Transcript


  1. High-Performance Grid Computing and Research Networking Classic Examples of Shared Memory Program Presented by Yuming Zhang Instructor: S. Masoud Sadjadi http://www.cs.fiu.edu/~sadjadi/Teaching/ sadjadi At cs Dot fiu Dot edu

  2. Acknowledgements • The content of many of the slides in these lecture notes has been adapted from online resources prepared by the people listed below. Many thanks! • Henri Casanova • Principles of High Performance Computing • http://navet.ics.hawaii.edu/~casanova • henric@hawaii.edu

  3. Domain Decomposition • Now that we know how to create and manage threads, we need to decide which thread does what • This is really the art of parallel computing • Fortunately, in shared memory, it is often quite simple • We’ll look at three examples • “Embarrassingly” parallel application • load-balancing issue • “Non-embarrassingly parallel” application • thread synchronization issue • Shark & Fish simulation • load-balancing AND thread synchronization issue

  4. Embarrassingly Parallel • Embarrassingly parallel applications • Consist of a set of elementary computations • These computations can be done in any order • They are said to be “independent” • Sometimes referred to as “pleasantly” parallel • Trivial example: compute all values of a function of two variables over a 2-D domain • function f(x,y) = <requires many flops> • domain = (0,10] × (0,10] • domain resolution = 0.001 • number of points = (10 / 0.001)² = 10⁸ • number of processors and of threads = 4 • each thread performs 25×10⁶ function evaluations • No need for critical sections • No shared output
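
A minimal OpenMP sketch of this example (the actual f is not given in the slides, so a placeholder function is used): each iteration writes a distinct array element, so no critical section is needed, and a static split of the outer loop gives each of the 4 threads its 25×10⁶ evaluations.

```c
#include <math.h>
#include <stdlib.h>

/* Placeholder for the expensive function of two variables
   (assumption: the real f is not specified in the slides). */
static double f(double x, double y) {
    return sin(x) * cos(y) + exp(-x * y);
}

#define STEPS 10000                      /* 10 / 0.001 points per dimension */

double *evaluate_grid(void) {
    double *result = malloc((size_t)STEPS * STEPS * sizeof(double));
    if (!result) return NULL;

    /* Every (i,j) evaluation is independent and writes its own element,
       so no synchronization is required. */
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < STEPS; i++)
        for (int j = 0; j < STEPS; j++) {
            double x = (i + 1) * 0.001;  /* covers the domain (0,10] x (0,10] */
            double y = (j + 1) * 0.001;
            result[(size_t)i * STEPS + j] = f(x, y);
        }
    return result;
}
```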

  5. Mandelbrot Set • In many cases, the “cost” of computing f varies with its input • Example: Mandelbrot • For each complex number c • Define the series • Z₀ = 0 • Zₙ₊₁ = Zₙ² + c • If the series converges, put a black dot at point c • i.e., if it hasn’t diverged after many iterations • If one partitions the domain in 4 squares among 4 threads, some of the threads will have much more work to do than others
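
For reference, a sketch of the per-point computation (the iteration cap and the escape radius of 2 are conventional choices, not taken from the slides); the running time clearly depends on how quickly the series diverges for a given c, which is what causes the load imbalance discussed next.

```c
#include <complex.h>

/* Returns how many iterations were performed before |Z| exceeded 2, or
   max_iter if the series never diverged (the point is then drawn black). */
static int mandelbrot_point(double complex c, int max_iter) {
    double complex z = 0.0;                    /* Z_0 = 0 */
    int n = 0;
    while (n < max_iter && cabs(z) <= 2.0) {   /* |Z| > 2 implies divergence */
        z = z * z + c;                         /* Z_{n+1} = Z_n^2 + c */
        n++;
    }
    return n;
}
```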

  6. Mandelbrot and Load Balancing • The problem with partitioning the domain into 4 identical tiles is that it leads to load imbalance • i.e., suboptimal use of the hardware resources • Solution: • do not partition the domain into as many tiles as there are threads • instead, use many more tiles than threads • Then have each thread operate as follows • compute a tile • when done, “request” another tile • until there are no tiles left to compute • This is called a “master-worker” execution • confusing terminology that will make more sense when we do distributed memory programming

  7. Mandelbrot implementation • Conceptually very simple, but how do we write code to do it? • Pthreads • Use some shared (protected) counter that keeps track of the next tile • the “keeping track” can be easy or difficult depending on the shape of the tiles • Threads read and update the counter each time • When the counter goes past the last tile, terminate • OpenMP • Could be done in the same way • But OpenMP provides tons of convenient ways to do parallel loops • including “dynamic” scheduling strategies, which do exactly what we need! • Just write the code as a loop over the tiles • Add the proper pragma • And you’re done (see the sketch below)
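
A sketch of the OpenMP approach described above (the tile layout and the mandelbrot_tile helper are assumed for illustration): schedule(dynamic) hands tiles to threads as they become free, which is exactly the master-worker behavior. A Pthreads version would instead keep a shared "next tile" counter protected by a mutex, with each thread fetching and incrementing it until all tiles are done.

```c
#define NUM_TILES 256    /* many more tiles than threads */

/* Hypothetical helper: computes all the pixels of one tile of the image. */
void mandelbrot_tile(int tile, int max_iter);

void render_mandelbrot(int max_iter) {
    /* Tiles have very different costs, so let each idle thread grab the
       next uncomputed tile as soon as it finishes its current one. */
    #pragma omp parallel for schedule(dynamic, 1)
    for (int t = 0; t < NUM_TILES; t++)
        mandelbrot_tile(t, max_iter);
}
```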

  8. Dependent Computations • In many applications, things are not so simple: elementary computations may not be independent • otherwise parallel computing would be pretty easy • A common example: • Consider a (1-D, 2-D, ...) domain that consists of “cells” • Each cell holds some “state”, for example: • temperature, pressure, humidity, wind velocity • RGB color value • The application consists of rule(s) that must be applied to update the cell states • possibly over-and-over in an iterative fashion • CFD, game of life, image processing, etc. • Such applications are often termed Stencil Applications • We have already talked about one example: Heat Transfer

  9. Dependent Computations • Really simple: • Cell values: one floating point number • Program written with two arrays: • f_old • f_new • One simple loop: f_new[i] = f_old[i] + ... • In more “real” cases, the domain is 2-D (or worse), there are more terms, and the values on the right-hand side can be at time step m+1 as well • Example from: http://ocw.mit.edu/NR/rdonlyres/Nuclear-Engineering/22-00JIntroduction-to-Modeling-and-SimulationSpring2002/55114EA2-9B81-4FD8-90D5-5F64F21D23D0/0/lecture_16.pdf
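
A minimal sketch of this 1-D case, assuming an explicit heat-diffusion-style right-hand side (the exact terms are elided in the slide, so the coefficient alpha and the stencil below are only illustrative); the two-array scheme guarantees that every update reads only time-step m values.

```c
/* One time step of a 1-D stencil: f_new is computed entirely from f_old.
   alpha is an assumed diffusion coefficient; boundary cells are left as is. */
void stencil_step(const double *f_old, double *f_new, int n, double alpha) {
    for (int i = 1; i < n - 1; i++)
        f_new[i] = f_old[i]
                 + alpha * (f_old[i - 1] - 2.0 * f_old[i] + f_old[i + 1]);
}
```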

  10. Wavefront Pattern • Data elements are laid out as a multidimensional grid representing a logical plane or space. • The dependencies between the elements, often formulated by dynamic programming, result in computations known as a wavefront. • [Figure: a 2-D domain and example stencil shapes; cell (i,j) depends on (i-1,j-1), (i-1,j), and (i,j-1).]

  11. The Longest-Common-Subsequence Problem • LCS • Given two sequences A = <a₁, a₂, …, aₙ> and B = <b₁, b₂, …, bₘ>, find the longest sequence that is a subsequence of both A and B. • If A = <c,a,d,b,r,z> and B = <a,s,b,z>, the longest common subsequence of A and B is <a,b,z>. • A valuable tool for extracting information from amino acid sequences in biological genes. • Determine F[n, m] • Let F[i, j] be the length of the longest common subsequence of the first i elements of A and the first j elements of B.
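
The slide does not spell out the recurrence, but the standard dynamic-programming formulation of F (which produces exactly the dependencies shown on the next slide) is:

```latex
F[i,j] =
\begin{cases}
0, & i = 0 \text{ or } j = 0,\\
F[i-1,j-1] + 1, & a_i = b_j,\\
\max\bigl(F[i-1,j],\, F[i,j-1]\bigr), & \text{otherwise.}
\end{cases}
```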

  12. LCS Wavefront • [Figure: cell F[i,j] depends on F[i-1,j-1], F[i-1,j], and F[i,j-1].] • The computation starts from F[0,0] and fills out the memoization table diagonally.

  13. One example • Computing the LCS of the amino acid sequences <H, E, A, G, A, W, G, H, E, E> and <P, A, W, H, E, A, E>. • F[n, m] = 5 is the answer.

  14. Wavefront computation • How can we parallelize a wavefront computation? • We have seen that the computation consists of computing the 2n-1 antidiagonals, one after the other. • Computations within each antidiagonal are independent, and can be done in a multithreaded fashion • Algorithm: • for each antidiagonal • use multiple threads to compute its elements • one may need to use a variable number of threads, because some diagonals are very small while others can be large • can be implemented with a single array
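
A sketch of this antidiagonal scheme for the LCS table, written with OpenMP (a plain (n+1)×(m+1) table is used here rather than the single-array variant mentioned above):

```c
#include <string.h>

/* F is an (n+1) x (m+1) table stored row-major: F[i][j] -> F[i*(m+1) + j]. */
void lcs_wavefront(const char *a, int n, const char *b, int m, int *F) {
    memset(F, 0, (size_t)(n + 1) * (m + 1) * sizeof(int));

    /* n+m-1 antidiagonals (2n-1 for a square table), processed in sequence. */
    for (int d = 2; d <= n + m; d++) {           /* antidiagonal: i + j = d */
        int ilo = (d - m > 1) ? d - m : 1;
        int ihi = (d - 1 < n) ? d - 1 : n;

        /* Cells on one antidiagonal depend only on earlier antidiagonals,
           so they can all be computed concurrently. */
        #pragma omp parallel for
        for (int i = ilo; i <= ihi; i++) {
            int j = d - i;
            if (a[i - 1] == b[j - 1])
                F[i * (m + 1) + j] = F[(i - 1) * (m + 1) + (j - 1)] + 1;
            else {
                int up   = F[(i - 1) * (m + 1) + j];
                int left = F[i * (m + 1) + (j - 1)];
                F[i * (m + 1) + j] = (up > left) ? up : left;
            }
        }
    }
}
```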

  15. Wavefront computation • What about cache efficiency? • After all, reading only one element from an antidiagonal at a time is probably not good • They are not contiguous in memory! • Solution: blocking • Just like matrix multiply • [Figure: the table partitioned into blocks, with block columns assigned to threads p0–p3.]

  16.-19. Wavefront computation (animation) • The previous slide repeated as an animation: with blocking, the blocks are computed in antidiagonal order by threads p0–p3, and a block labeled k is computed during phase k (phases 1 through 7 for a 4×4 grid of blocks).

  20. Workload Partitioning • First, the matrix is divided into groups of adjacent columns, one group per cluster. • Then, the part within each cluster is partitioned further, and the computation is performed in the same way.

  21. Performance Modeling • One thing we’ll need to do often in HPC is building performance models • Given simple assumptions regarding the underlying architecture • e.g., ignore cache effects • Come up with an analytical formula for the parallel speed-up • Let’s try it on this simple application • Let N be the (square) matrix size • Let p be the number of threads/cores, which is fixed

  22. Performance Modeling • [Figure: the table partitioned into p × p blocks among threads T0–T3.] • What if we use p² blocks? • We assume that p divides N (N > p) • Then the computation proceeds in 2p-1 phases • each phase lasts as long as the time to compute one block (because of concurrency), Tb • Therefore • Parallel time = (2p-1) Tb • Sequential time = p² Tb • Parallel speedup = p² / (2p-1) • Parallel efficiency = p / (2p-1) • Example: • p=2, speedup = 4/3, efficiency = 66% • p=4, speedup = 16/7, efficiency = 57% • p=8, speedup = 64/15, efficiency = 53% • Asymptotically: efficiency = 50%

  23. Performance Modeling • [Figure: block columns distributed among threads 0–3.] • What if we use (b×p)² blocks? • b is some integer between 1 and N/p • We assume that p divides N (N > p) • But performance modeling becomes more complicated • The computation still proceeds in 2bp-1 phases • But a thread can have more than one block to compute during a phase! • During phase i, there are • i blocks to compute for i = 1, ..., bp • 2bp-i blocks to compute for i = bp+1, ..., 2bp-1 • If there are x (> 0) blocks to compute in a phase, then the execution time for that phase is ⌈x/p⌉ = ⌊(x-1)/p⌋ + 1 • Assuming Tb = 1 • Therefore, the parallel execution time is (next slide)

  24. Performance Modeling
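
The formula that belongs on this slide is not in the transcript. A reconstruction from the phase counts on the previous slide (and consistent with the b = N/p limit quoted two slides later), taking Tb = 1 as the time per block:

```latex
T_{\mathrm{par}}(b,p)
  = \sum_{i=1}^{bp} \left\lceil \tfrac{i}{p} \right\rceil
  + \sum_{i=bp+1}^{2bp-1} \left\lceil \tfrac{2bp-i}{p} \right\rceil
  = b\,(bp + p - 1),
\qquad
\mathrm{speedup}(b,p) = \frac{(bp)^2}{T_{\mathrm{par}}(b,p)} = \frac{b\,p^2}{bp + p - 1}.
```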

  25. Performance Modeling • Example: N=1000, p = 4
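
The plot for this example did not survive the transcript. Plugging the numbers into the reconstructed model above (an assumption), the speedup for N = 1000 and p = 4 grows with b toward p: speedup(1) = 16/7 ≈ 2.3, speedup(10) = 160/43 ≈ 3.7, and at the largest block count b = N/p = 250, speedup = 4000/1003 ≈ 3.99.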

  26. Performance Modeling • When b gets larger, speedup increases and tends to p • Since b <= N/p, best speed-up: Np / (N + p -1) • When N is large compared to p, speedup is very close to p • Therefore, use a block size of 1, meaning no blocking! • We’re back to where we started because our performance model ignores cache effects! • Trade-off: • From a parallel efficiency perspective: small block size • From a cache efficiency perspective: big block size • Possible rule of thumb: use the biggest block size that fits in the L1 cache (L2 cache?) • Lesson: full performance modeling is difficult • We could add the cache behavior, but think of a dual-core machine with shared L2 cache, etc. • In practice: do performance modeling for asymptotic behaviors, and then do experiments to find out what works best

  27. Sharks and Fish • Simulation of a population of prey and predators • Each entity follows some behavior • Prey move and breed • Predators move, hunt, and breed • Given the initial populations and the nature of the entity behaviors (e.g., probability of breeding, probability of successful hunting), what do the populations look like after some time? • This is something computational ecologists do all the time to study ecosystems

  28. Sharks and Fish • There are several possibilities to implement such a simulation • A simple one is to do something that looks like “the game of life” • A 2-D domain, with NxN cells (each cell can be described by many environmental parameters) • Each cell in the domain can hold a shark or a fish • The simulation is iterative • There are several rules for movement, breeding, preying • Why do it in parallel? • Many entities • Entity interactions may be complex • How can one write this in parallel with threads and shared memory?

  29. Space partitioning • One solution is to divide the 2-D domain among the threads • Each thread deals with the entities in its domain

  30. Space partitioning (continued) • [Figure: the 2-D domain divided into one region per thread, for 4 threads.]

  31. Move conflict? • Threads can make decisions that will lead to conflicts!


  33. Dealing with conflicts • Concept of shadow cells: only entities in the red (boundary) regions may cause a conflict • One possible implementation • Each thread deals with its green region • Thread 1 deals with its red region • Thread 2 deals with its red region • Thread 3 deals with its red region • Thread 4 deals with its red region • Repeat • This will still prevent some types of moves • e.g., no swapping of locations between two entities • The implementer must make choices
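
A sketch of the phase structure this implies, using Pthreads barriers (the region bookkeeping and the two update functions are placeholders, not from the slides): all threads first update their interior "green" cells concurrently, then the boundary "red" regions are handled one thread at a time so that two threads can never move entities into the same cell.

```c
#include <pthread.h>

#define NUM_THREADS 4

/* Assumed to be initialized once with
   pthread_barrier_init(&phase_barrier, NULL, NUM_THREADS). */
static pthread_barrier_t phase_barrier;

/* Hypothetical helpers operating on the cells owned by thread `id`. */
void update_green_region(int id);   /* interior cells: no conflict possible  */
void update_red_region(int id);     /* boundary cells shared with neighbors  */

/* One simulation iteration, executed by every thread. */
void simulation_step(int id) {
    update_green_region(id);              /* all threads work in parallel */
    pthread_barrier_wait(&phase_barrier);

    /* Red regions are processed one thread per phase, separated by barriers,
       so conflicting moves across region boundaries cannot occur. */
    for (int turn = 0; turn < NUM_THREADS; turn++) {
        if (turn == id)
            update_red_region(id);
        pthread_barrier_wait(&phase_barrier);
    }
}
```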

  34. Load Balancing • What if all the fish end up in the same region? • because they move • because they breed • Then one thread has much more work to do than the others • Solution: dynamic repartitioning • Modify the partitioning so that the load is balanced • But perhaps a good idea would be to not do domain partitioning at all! • What about entity partitioning instead? • Better load balancing, but more difficult to deal with conflicts • One may use locks, but they incur high overhead

  35. Conclusion • Main lessons • There are many classes of applications, with many domain partitioning schemes • Performance modeling is fun but inherently limited • It’s all about trade-offs • overhead vs. load balancing • parallelism vs. cache usage • etc. • Remember, this is the easy side of parallel computing • Things will become much more complex in distributed memory programming
