Designing Parallel Programs

Presentation Transcript


  1. Designing Parallel Programs David Rodriguez-Velazquez CS-6260 Spring-2009 Dr. Elise de Doncker

  2. Manual vs. Automatic Parallelization • Designing and developing parallel programs has historically been a very MANUAL process • The programmer is responsible for both identifying and implementing parallelism • Manually developing parallel code is: • Time-consuming • Complex • Error-prone • An iterative process

  3. Outline • Parallelization • Partitioning • Communication • Efficiency • Synchronization • Data Dependency • Load Balancing • Granularity • I/O • Amdahl’s Law • Complexity • Portability • Resource Requirements • Scalability • MPI demo • Matrix shared memory • Matrix multiplication • Alltoall • Heat Equation

  4. Parallelizing Compiler (Pre-Processor) • The most common type of tool used to automatically convert a serial program into a parallel program • A parallelizing compiler works in 2 different ways: • Fully Automatic • Programmer Directed

  5. Parallelizing Compiler (Fully Automatic) • The compiler analyzes the source code and identifies opportunities for parallelism • The analysis includes: • Identifying inhibitors to parallelism • Possibly a cost weighting on whether or not the parallelism would actually improve performance • Loops (do, for) are the most frequent target for automatic parallelization

  6. Parallelizing Compiler (Programmer Directed) • Using “compiler directives” or possibly compiler flags, the programmer explicitly tells the compiler how to parallelize the code • May be used in conjunction with some degree of automatic parallelization
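
As an illustration of the directive-based approach, here is a minimal sketch (not from the original slides) using an OpenMP directive; the array, its size, and the scaling operation are made up for the example.

    #include <omp.h>

    #define N 1000000

    /* The pragma tells the compiler to split the loop iterations among threads. */
    void scale(double *a, double s) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] *= s;
    }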

  7. Automatic Parallelization (Caveats) • Wrong results may be produced • Performance may actually degrade • Much less flexible than manual parallelization • Limited to a subset (mostly loops) of code • May actually not parallelize code if the analysis suggests there are inhibitors or the code is too complex

  8. Understand the Problem & the Program • The first step in developing parallel software is to understand the problem that you wish to solve in parallel (if starting from a serial program, you also need to understand the existing code) • Before spending time, determine whether or not the problem is one that can actually be parallelized • Identify the program’s hotspots (know where most of the real work is being done; performance analysis tools can help here) • Identify bottlenecks (I/O is usually something that slows a program down; change algorithms to reduce or eliminate unnecessary slow areas) • Investigate other algorithms • Investigate inhibitors to parallelism. One common class of inhibitor is data dependence

  9. Examples (Parallelizable?) • Example of a parallelizable problem • Calculate the potential energy for each of several thousand independent conformations of a molecule. When done, find the minimum energy conformation • Each molecular conformation is independently determinable. The calculation of the minimum energy conformation is also a parallelizable problem • Example of a non-parallelizable problem • Calculation of the Fibonacci series (1, 1, 2, 3, 5, 8, 13, 21, ...) with F(K + 2) = F(K + 1) + F(K) • The calculation of the Fibonacci sequence as shown entails dependent calculations rather than independent ones: the K + 2 value uses both the K + 1 and K values. These terms cannot be calculated independently and therefore not in parallel
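
A minimal C sketch contrasting the two examples; energy(), the conformation array, and the sizes are hypothetical placeholders, not part of the slides.

    /* Parallelizable: each conformation's energy is independent of the others. */
    extern double energy(double conformation);   /* hypothetical routine */

    void both_cases(double *conf, double *e, int nconf) {
        for (int c = 0; c < nconf; c++)
            e[c] = energy(conf[c]);       /* iterations can run in any order, even concurrently */

        /* Not parallelizable as written: each term needs the two previous terms. */
        long f[20] = {1, 1};
        for (int k = 0; k + 2 < 20; k++)
            f[k + 2] = f[k + 1] + f[k];   /* loop-carried dependence forces serial execution */
    }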

  10. Partitioning • Partitioning • Break the problem into discrete “chunks” of work that can be distributed to multiple tasks • Domain decomposition & Functional decomposition

  11. Partition • Domain Decomposition: the data associated with a problem is decomposed. Each parallel task then works on a portion of the data.
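
One common way to realize domain decomposition is a block distribution of the data across MPI ranks; the following helper is a sketch under that assumption (the data array itself and its size are illustrative).

    #include <mpi.h>

    /* Compute the half-open index range [lo, hi) of the data owned by this rank. */
    void my_block(int n, int *lo, int *hi) {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int chunk = n / size, rem = n % size;
        *lo = rank * chunk + (rank < rem ? rank : rem);
        *hi = *lo + chunk + (rank < rem ? 1 : 0);
    }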

  12. Partition • Functional Decomposition: In this approach, the focus is on the computation that is to be performed rather than on the data manipulated by the computation. The problem is decomposed according to the work that must be done. Each task then performs a portion of the overall work.
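
A sketch of functional decomposition, where each task runs a different component of the overall computation; the component routines are hypothetical placeholders (e.g., the pieces of an ecosystem-style model), not something defined in the slides.

    #include <mpi.h>

    void atmosphere_model(void);   /* placeholder components, assumed to exist elsewhere */
    void ocean_model(void);
    void land_model(void);

    void run_components(void) {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)      atmosphere_model();   /* work is split by function, not by data */
        else if (rank == 1) ocean_model();
        else                land_model();
    }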

  13. Partition (Functional Decomposition)

  14. Communications • Who needs communications? • You don’t need them: • Some types of problems can be decomposed and executed in parallel with virtually no inter-task communication. These are called embarrassingly parallel • Very little inter-task communication is required • E.g., an image processing operation where every pixel in a black-and-white image needs to have its color reversed • You do need them: • Most parallel applications do require tasks to share data with each other (e.g., an ecosystem simulation)
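
The pixel-inversion example is embarrassingly parallel; a minimal OpenMP sketch (the flat pixel layout and 8-bit grayscale range are assumptions made for the example):

    #include <omp.h>

    /* Each pixel is processed independently, so no inter-task communication is needed. */
    void invert(unsigned char *pixels, int npixels) {
        #pragma omp parallel for
        for (int i = 0; i < npixels; i++)
            pixels[i] = 255 - pixels[i];
    }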

  15. Communications (Factors to Consider) • There are a number of important factors to consider when designing a program’s inter-task communications: • Cost of communications • Latency vs. Bandwidth • Visibility of communications • Synchronous vs. Asynchronous communication • Scope of communications • Efficiency of communications

  16. Communications (Cost) • Inter-task communication virtually always implies overhead • Machine cycles and resources that could be used for computation are instead used to package and transmit data. • Communications frequently require some type of synchronization between tasks, which can result in tasks spending time "waiting" instead of doing work. • Competing communication traffic can saturate the available network bandwidth, further aggravating performance problems

  17. Communications (Latency vs. Bandwidth) • Latency is the time it takes to send a minimal (0 byte) message from point A to point B. Commonly expressed in microseconds • Bandwidth is the amount of data that can be communicated per unit of time. Commonly expressed in megabytes/sec or gigabytes/sec • Sending many small messages can cause latency to dominate communication overheads. Often it is more efficient to package small messages into a larger message, thus increasing the effective communications bandwidth
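
A hedged MPI sketch of the aggregation idea, where the per-message latency is paid once instead of n times (the buffer, count, and destination are illustrative):

    #include <mpi.h>

    void send_results(double *vals, int n, int dest) {
        /* Many 1-element sends would pay the latency cost n times:
         *   for (int i = 0; i < n; i++)
         *       MPI_Send(&vals[i], 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
         * One aggregated send pays it once and lets bandwidth do the rest: */
        MPI_Send(vals, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    }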

  18. Communications (Visibility) • Message passing Model: communications are explicit (under control of the programmer) • Data Parallel Model: communications occur transparently to the programmer, usually on distributed memory architectures.

  19. Communications (Synchronous vs. Asynchronous) • Synchronous communication requires some type of “handshaking” between tasks that are sharing data • Synchronous: blocking communications • Asynchronous communication allows tasks to transfer data independently from one another • Asynchronous: non-blocking communications • The greatest benefit of asynchronous communication is the ability to interleave computation with communication
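
A sketch of the two styles in MPI terms (the peer rank, tags, and buffer are illustrative); the non-blocking version lets computation overlap the transfer.

    #include <mpi.h>

    void send_styles(double *buf, int n, int peer) {
        /* Synchronous/blocking: the call returns only when buf is safe to reuse. */
        MPI_Send(buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);

        /* Asynchronous/non-blocking: start the send, compute, then complete it. */
        MPI_Request req;
        MPI_Isend(buf, n, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD, &req);
        /* ... useful computation that does not touch buf ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }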

  20. Communications (Scope) • Knowing which tasks must communicate with each other is critical during the design stage of a parallel code • Both scopes below can be implemented synchronously or asynchronously • Point-to-point: 2 tasks (a sender/producer of data and a receiver/consumer) • Collective: data sharing between more than two tasks
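
A brief MPI sketch of the two scopes (the ranks, tag, and buffer are illustrative):

    #include <mpi.h>

    void scope_demo(double *data, int n, int rank) {
        /* Point-to-point: exactly one sender and one receiver. */
        if (rank == 0)
            MPI_Send(data, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(data, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Collective: every rank in the communicator participates. */
        MPI_Bcast(data, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    }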

  21. Communications (Scope-Collective)

  22. Efficiency of communications • Very often, the programmer will have a choice with regard to factors that can affect communications performance • Which implementation for a given model should be used? (e.g., one MPI implementation may be faster on a given hardware platform than another) • What type of communication operations should be used? (e.g., asynchronous communication operations can improve overall program performance) • Network media - some platforms may offer more than one network for communications. Which one is best?

  23. Synchronization (Types) • Barrier • All tasks are involved • Each task performs its work; when the last task reaches the barrier, all tasks are synchronized • Lock / semaphore • Typically used to serialize access to global data or a section of code. Tasks must wait to use the code or data • Synchronous communication operations • Involve only those tasks executing a communication operation (handshaking)
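
An OpenMP sketch of two of these forms, a lock-style critical section and a barrier (the shared sum and per-thread work array are illustrative; the array is assumed to have one entry per thread):

    #include <omp.h>

    double global_sum = 0.0;

    void sync_demo(double *work) {
        #pragma omp parallel
        {
            double local = work[omp_get_thread_num()];   /* each thread does its own work */

            #pragma omp critical    /* lock/semaphore style: serialize access to global data */
            global_sum += local;

            #pragma omp barrier     /* no thread continues until all have reached this point */
        }
    }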

  24. Data Dependencies • A dependence exists between program statements when the order of statement execution affects the results of the program • A data dependence results from multiple uses of the same location(s) in storage by different tasks • Dependencies are important to parallel programming because they are one of the primary inhibitors to parallelism

  25. Data Dependencies • Loop carried data dependence (most important):

    DO J = MYSTART, MYEND
       A(J) = A(J-1) * 2.0
    END DO

  • The value of A(J-1) must be computed before the value of A(J); therefore A(J) exhibits a data dependence on A(J-1), and parallelism is inhibited • If task 2 owns A(J) and task 1 owns A(J-1), computing the correct value of A(J) requires that (1) task 1 calculates A(J-1) and (2) task 2 then gets that value • Loop independent data dependence:

    task 1        task 2
    X = 2         X = 4
    Y = X**2      Y = X**3

  • As with the previous example, parallelism is inhibited. The value of Y depends on which task last stores the value of X (shared memory) or on if/when the value of X is communicated between the tasks (distributed memory)

  26. Data Dependencies • How to handle data dependencies: • Distributed memory architectures - communicate required data at synchronization points • Shared memory architectures - synchronize read/write operations between tasks
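
For the loop-carried example above on distributed memory, "communicate required data at synchronization points" might look like the following sketch (the array layout, bounds, and neighbor pattern are assumptions; the dependence still serializes the ranks, the sketch only shows where the data exchange happens).

    #include <mpi.h>

    void update(double *a, int mystart, int myend, int rank, int size) {
        if (rank > 0)   /* receive a[mystart-1] from the left neighbor first */
            MPI_Recv(&a[mystart - 1], 1, MPI_DOUBLE, rank - 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        for (int j = mystart; j <= myend; j++)
            a[j] = a[j - 1] * 2.0;

        if (rank < size - 1)   /* pass a[myend] on to the right neighbor */
            MPI_Send(&a[myend], 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);
    }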

  27. Load Balancing • Refers to the practice of distributing work among tasks so that all tasks are kept busy all of the time • It can be considered a minimization of task idle time • Important for performance reasons

  28. Load Balancing • How to achieve • Equally partition the work each task receives • Use dynamic work assignment

  29. How to Achieve (Load Balancing) • Equally partition the work each task receives • For array/matrix operations where each task performs similar work, evenly distribute the data set among the tasks. • For loop iterations where the work done in each iteration is similar, evenly distribute the iterations across the tasks.

  30. How to Achieve (Load Balancing) • Use dynamic work assignment • When the amount of work each task will perform is intentionally variable, or is unable to be predicted, it may be helpful to use a scheduler / task-pool approach: as each task finishes its work, it queues to get a new piece of work (see the sketch below) • It may become necessary to design an algorithm which detects and handles load imbalances as they occur dynamically within the code • Sparse arrays: some tasks end up with mostly zeros to work on • Adaptive grid methods: some tasks need to refine their mesh while others do not
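
A minimal sketch of the scheduler / task-pool idea using OpenMP's dynamic schedule (the per-row work routine is a hypothetical callback): each thread grabs the next piece of work as soon as it finishes its current one.

    #include <omp.h>

    void process_rows(int nrows, void (*do_row)(int)) {
        /* schedule(dynamic, 1): idle threads pull the next row from a shared pool,
           so uneven rows do not leave some threads waiting. */
        #pragma omp parallel for schedule(dynamic, 1)
        for (int i = 0; i < nrows; i++)
            do_row(i);
    }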

  31. How to Achieve (Load Balancing)

  32. Granularity (Computation / Communication Ratio) • Granularity is a qualitative measure of the ratio of computation to communication • Periods of computation are typically separated from periods of communication by synchronization events • Two types: • Fine-grain parallelism • Coarse-grain parallelism

  33. Granularity (Fine-grain Parallelism) • Relatively small amounts of computational work are done between communication events • Low computation to communication ratio • Implies high communication overhead • If granularity is too fine it is possible that the overhead required for communications and synchronization between tasks takes longer than the computation

  34. Granularity (Coarse-grain Parallelism) • Relatively large amounts of computational work are done between communication/synchronization events • High computation to communication ratio • Implies more opportunity for performance increase • Harder to load balance efficiently

  35. Granularity (What is Best?) • The most efficient granularity depends on the algorithm and the hardware environment in which it runs • In most cases the overhead associated with communication and synchronization is high relative to execution speed, so it is advantageous to have coarse granularity • Fine-grain parallelism can help reduce overheads due to load imbalance; it facilitates load balancing

  36. I/O • I/O operations are inhibitors to parallelism • Parallel I/O systems may be immature or not available for all platforms • If all of the tasks see the same file space, WRITE operations can result in file overwriting • Read operations can be affected by the file server’s ability to handle multiple read requests at the same time • I/O over networks can cause bottlenecks or even crash file servers

  37. Amdahl’s Law • States that: “Potential program speedup is defined by the fraction of code (P) that can be parallelized” Speedup = 1 / (1 – P) • If P = 0 then speedup = 1 (no code parallelized) • If P = 1 then speedup is infinite (all code parallelized) • If P = .5 then speedup is 2 (50% of the code parallelized) meaning the code will run twice as fast.

  38. Amdahl’s Law • Introducing the number of processors performing the parallel fraction of work Speedup = 1 / ((P / N) + S) P = parallel fraction, N = number of processors S = serial fraction
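
A short worked example (numbers chosen for illustration, not from the slides): with P = 0.9, S = 0.1 and N = 8, Speedup = 1 / (0.9/8 + 0.1) = 1 / 0.2125 ≈ 4.7. Even on 8 processors the 10% serial fraction limits the speedup well below 8, and no matter how many processors are added it can never exceed 1 / S = 10.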

  39. Complexity • Parallel applications are much more complex than corresponding serial applications. • Cost of complexity is measured in programmer time in every aspect of the software development cycle • Design, Coding, Debugging, Tuning, Maintenance

  40. Portability • There is standardization in some APIs, such as MPI • Implementations will differ in a number of details, requiring code modifications • Hardware architectures can affect portability • Operating systems can play a key role in code portability issues • All of the portability issues associated with serial programs apply to parallel programs

  41. Resource Requirements • The goal of parallel programming is to decrease execution wall-clock time, but more CPU time is required; e.g., a parallel code that runs 1 hour on 8 processors actually uses 8 hours of CPU time • The amount of memory required can be greater in parallel • For short parallel codes a decrease in performance is even possible, due to the overhead of setting up the parallel environment, task creation/termination, and communication

  42. Scalability • The result of a number of interrelated factors • Adding more machines is rarely the answer • At some point, adding more resources causes performance to decrease • Hardware factors play a significant role in scalability: • Communications network bandwidth • Amount of memory available on any one machine • Parallel support libraries and subsystems can also limit scalability

  43. References • Author: Blaise Barney, Livermore Computing • A search on the WWW for "parallel programming" or "parallel computing" will yield a wide variety of information • "Designing and Building Parallel Programs". Ian Foster. http://www-unix.mcs.anl.gov/dbpp/ • "Introduction to Parallel Computing". Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar. http://www-users.cs.umn.edu/~karypis/parbook/

  44. Question • Mention 5 communication factors to consider when designing a parallel program • Cost of communication • Latency, Bandwidth • Visibility • Synchronous, Asynchronous • Scope
