
CS6068 Parallel Computing: Fall 2015 - Week 2 Parallel Computing Patterns




  1. CS6068 Parallel Computing: Fall 2015 - Week 2 Parallel Computing Patterns

  2. Readings - Topics
  • The Landscape of Parallel Computing (UC Berkeley perspective)
  • http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf
  • The Berkeley Par Lab Book
  • http://research.microsoft.com/en-us/collaboration/parlab/
  • Part II: Parallel Design Patterns and Algorithms
  • Chapter 8: Introduction to Design Patterns for Parallel Computing

  3. 7 Questions for Parallel Computing
  • Applications:
  • 1. What are the apps?
  • 2. What are the kernels of the apps?
  • Hardware:
  • 3. What are the HW building blocks?
  • 4. How to connect them?
  • Programming Model & Systems Software:
  • 5. How to describe apps and kernels?
  • 6. How to program the HW?
  • Evaluation:
  • 7. How to measure success?

  4. What are the apps? Today we look at two of increasing importance:
  • Complex systems simulation: complex systems have many interconnected components, and simulating and predicting their outcomes is important and computationally difficult, e.g., climate change and responses to earthquakes and disease epidemics.
  • Content-based image retrieval: consumer image databases are growing so dramatically that they require automated search instead of manual labeling. Low error rates require processing very high-dimensional feature spaces, and current image classifiers are too slow to deliver adequate response times.

  5. Structural Software Patterns 1
  • MapReduce: problems where the same function is independently applied to many pieces of data. The main issue is how best to exploit the computational efficiency latent in this structure. The solution is to structure the program as two or three distinct phases: map, shuffle, and reduce. The map phase applies the function to distributed data, the shuffle reorganizes the mapped results, and the reduce may be a summary computation or merely a data reduction.
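
  As a concrete illustration (a minimal Clojure sketch, not from the readings; the function names are illustrative), a word count where group-by stands in for the shuffle phase:

  (require '[clojure.string :as str])

  (defn map-phase [docs]
    ;; apply the same function independently to every document: emit [word 1] pairs
    (mapcat (fn [doc] (map (fn [w] [w 1]) (str/split doc #"\s+"))) docs))

  (defn shuffle-phase [pairs]
    ;; reorganize the mapped pairs by key
    (group-by first pairs))

  (defn reduce-phase [grouped]
    ;; summary computation: sum the counts for each word
    (into {} (map (fn [[w ps]] [w (reduce + (map second ps))]) grouped)))

  (reduce-phase (shuffle-phase (map-phase ["the cat" "the dog"])))
  ;; -> {"the" 2, "cat" 1, "dog" 1}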

  6. Structural Software Patterns 2
  • Pipe-and-filter: problems characterized by data flowing through modular phases of computation. The solution constructs the program as filters (computational elements) connected by pipes (data communication channels). Alternatively, it can be viewed as a graph with computations as vertices and communication along the edges. Data flows through the succession of stateless filters; each takes input only from its input pipe(s), transforms that data, and passes the output to the next filter via its output pipe.
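
  As a small illustration (not from the readings; the filter names are made up), stateless Clojure functions can act as the filters, with the ->> threading macro standing in for the pipes:

  ;; Each filter is a stateless function of the data arriving on its input pipe.
  (defn parse-filter  [lines] (map #(Long/parseLong %) lines))
  (defn square-filter [nums]  (map #(* % %) nums))
  (defn sum-filter    [nums]  (reduce + nums))

  ;; The ->> threading macro plays the role of the pipes connecting the filters.
  (->> ["1" "2" "3"]
       parse-filter
       square-filter
       sum-filter)
  ;; -> 14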

  7. Structural Software Patterns 3
  • Agents and Repositories: problems organized as a collection of data elements that are modified at irregular times by a flexible set of distinct operations. The solution is to structure the computation in terms of a centrally managed data repository, a collection of autonomous agents that operate upon the data, and a manager that schedules and coordinates the agents' access to the repository and enforces consistency of updates.

  8. Structural Software Patterns 4
  • Iterative refinement: a class of problems where a set of operations is applied over and over to a system until a predefined goal is realized or a constraint is met. The number of iterations through the loop may not be statically determinable. The solution is to wrap a flexible iterative framework around the operations: the iterative computation is performed; the results are checked against a termination condition; and, depending on the result of the check, the computation either completes or proceeds to the next iteration.
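
  A minimal Clojure sketch of this skeleton (illustrative names; Newton's method for a square root is used as the refinement step):

  (defn iterative-refine [refine converged? initial]
    (loop [x initial]
      (let [x-next (refine x)]
        (if (converged? x x-next)   ; check against the termination condition
          x-next
          (recur x-next)))))        ; otherwise proceed to the next iteration

  ;; Example: refine an estimate of sqrt(2); the iteration count is not known statically.
  (iterative-refine (fn [x] (/ (+ x (/ 2.0 x)) 2.0))
                    (fn [x x-next] (< (Math/abs (- x x-next)) 1e-10))
                    1.0)
  ;; -> 1.4142135623730951 (approximately)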

  9. Parallel Algorithm Strategy Patterns
  • Data parallelism: some problems are best understood as uniform parallel operations on the elements of a data structure.
  • Task parallelism: these problems are characterized in terms of a collection of activities or tasks. The solution is to schedule the tasks for execution in a way that keeps the work balanced between the processing elements of the parallel computer and manages any dependencies between tasks.
  • Divide and conquer: a problem is solved by splitting it into a number of smaller subproblems, solving them independently in parallel, and merging the subsolutions into a solution for the whole problem (a sketch follows this list).
  • Geometric decomposition: an algorithm is organized by (1) dividing data into regular chunks and (2) updating each chunk in parallel. Typically, communication occurs at chunk boundaries, so an algorithm breaks down into three components: (1) exchange boundary data, (2) update the interior of each chunk, and (3) update the boundary regions.
  • Pipeline: concurrency grows up to the number of stages in the pipeline (the so-called depth of the pipeline).
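
  The divide-and-conquer sketch referenced above, as a minimal Clojure example (names and the threshold are illustrative):

  (defn parallel-sum [v threshold]
    (if (<= (count v) threshold)
      (reduce + v)                                            ; small enough: solve directly
      (let [[left right] (split-at (quot (count v) 2) v)
            lf (future (parallel-sum (vec left) threshold))   ; solve the halves in parallel
            rf (future (parallel-sum (vec right) threshold))]
        (+ @lf @rf))))                                        ; merge the subsolutions

  (parallel-sum (vec (range 1000)) 100)
  ;; -> 499500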

  10. Implementation Strategy Patterns
  • Single-Program Multiple Data (SPMD): a popular pattern where a process/thread ID is used to index data. A bulk-synchronous orchestration can hide memory latency.
  • Fork/join: the problem is defined in terms of a set of functions or tasks that execute within a shared address space. The solution is to logically create threads (fork), carry out concurrent computations, and then terminate after possibly combining results from the computations (join); see the sketch after this list.
  • Kernel parallel: a generalization of the data parallel pattern, in which an index space is defined and the problem's data structures are mapped onto it. This is a fine-grained expression of the SPMD pattern tuned to data parallel algorithms, used in GPGPU environments.
  • Loop-level parallelism: the problem is expressed in terms of a modest number of compute-intensive loops. The loop iterations can be transformed so they can safely execute independently.
  • Actors: an important class of object-oriented programs represents the state of the computation in terms of a set of persistent objects. These objects encapsulate the state of the computation and include the fundamental operations to solve the problem as methods of the objects.
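
  The fork/join sketch referenced above, in Clojure (a toy example; futures fork the tasks and dereferencing joins them):

  (def shared-data (vec (range 10)))                ; shared address space

  (let [evens (future (filter even? shared-data))   ; fork
        odds  (future (filter odd?  shared-data))]  ; fork
    {:evens @evens :odds @odds})                    ; join by dereferencing and combine
  ;; -> {:evens (0 2 4 6 8), :odds (1 3 5 7 9)}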

  11. Concurrency Control
  The Agents-and-Repositories, Task Parallelism, Actors, Agents, and Fork/Join patterns all require concurrency control.
  Purpose of concurrency control:
  • To enforce isolation (through mutual exclusion)
  • To preserve consistency
  • To resolve read-write and write-write conflicts
  Example: in a concurrent execution environment, if thread T1 conflicts with thread T2 over a data item A, then concurrency control decides whether T1 or T2 gets access to A and whether the other transaction is rolled back or waits.

  12. Concurrency Programming Models
  Shared Memory Concurrency Options:
  Pessimistic Approach
  • Locks (popular in Java, C, etc.)
  • Numerous low-level programming issues
  Optimistic Approach
  • Transactional Memory - an idea from databases: ACID
  • Generally easier to implement than locks
  • Identify sections of code requiring a consistent view of data
  • No deadlocks, no race conditions
  Software Transactional Memory
  • Multi-version concurrency control: maintain multiple versions of objects with commit timestamps
  • Snapshot isolation: transactions operate on a snapshot of memory taken when they started
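
  To make the contrast concrete, here is a small sketch (not from the slides; names are illustrative) of the same two-account transfer written first with a pessimistic lock and then with Clojure's optimistic STM:

  ;; Pessimistic approach: an explicit lock around the critical section.
  (def account-lock (Object.))
  (def balance-a (atom 100))
  (def balance-b (atom 0))

  (defn transfer-locked [amount]
    (locking account-lock
      (swap! balance-a - amount)
      (swap! balance-b + amount)))

  ;; Optimistic approach: refs + dosync give an isolated, consistent transaction.
  (def ref-a (ref 100))
  (def ref-b (ref 0))

  (defn transfer-stm [amount]
    (dosync
      (alter ref-a - amount)
      (alter ref-b + amount)))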

  13. Shared Memory Parallel Programming
  • Find independent tasks in the algorithm
  • Map tasks to execution units (e.g., threads)
  • Define and implement synchronization among tasks
  • Avoid races and deadlocks, address memory-model issues, ...
  • Compose parallel tasks
  • Recover from errors
  • Ensure scalability
  • Manage locality
  • ...
  Capture all of this in Transactional Memory.

  14. Clojure's STM and Persistent Data Structures
  Clojure's design helps avoid the problem of data being modified by concurrently running threads by providing immutable data structures. Working with immutable data means creating new data structures that are similar to existing ones - for example, a list with a new item added at one end, or a hash-map with a new key/value pair added - and these new structures must share the old structure. Structure sharing in this manner is called persistent data, and Clojure implements all of its collections this way with high performance.
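
  For instance (a small REPL sketch), "adding" to a Clojure vector or hash-map yields a new persistent structure that shares internals with the original, which stays unchanged:

  (def v [1 2 3])
  (def v2 (conj v 4))       ; a new vector with an item added at one end
  v    ;; -> [1 2 3]        ; the original is unchanged (immutable)
  v2   ;; -> [1 2 3 4]      ; shares structure with v internally

  (def m {:a 1})
  (def m2 (assoc m :b 2))   ; a new hash-map with a new key/value pair
  m    ;; -> {:a 1}
  m2   ;; -> {:a 1, :b 2}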

  15. Atoms, Agents, and Refs in Clojure
  Reference: http://clojure-doc.org/articles/language/concurrency_and_parallelism.html
  Atoms, refs, and agents help programs manage concurrent access to shared structures (memory state). Atoms are references that change atomically: changes become immediately visible to all threads.
  The swap! function atomically swaps the value of an atom to be the value of (apply f current-value-of-atom args) for a supplied function f. Note that f may be called multiple times - it can fail and be retried - and thus should be free of side effects.

  16. # Example of usage of atoms and swap!
  (def counter (atom 0))
  ;; @counter -> 0
  (swap! counter inc)
  ;; @counter -> 1

  ;; Create 2 threads that each increment counter two times:
  (let [n 2]
    (future (dotimes [_ n] (swap! counter inc)))
    (future (dotimes [_ n] (swap! counter inc))))
  ;; @counter -> 5

  17. # Example of usage of atoms and swap! with side effects
  (defn inc-print [val thd-id]
    (println "thread" thd-id "counter:" val)
    (inc val))

  Create three threads with printing side effects. We will see extra print lines when swap! needed to retry because another thread modified the value before it could set it.

  (def counter (atom 0))
  (let [n 2]
    (future (dotimes [_ n] (swap! counter inc-print :a)))
    (future (dotimes [_ n] (swap! counter inc-print :b)))
    (future (dotimes [_ n] (swap! counter inc-print :c))))
  ;; @counter -> 6

  18. Agents
  Agents are references that are updated asynchronously: updates happen at a later, unknown point in time, in a thread pool. Agents are identities that implement uncoordinated, asynchronous updates. They can be used for anything that does not require strict consistency for reads, e.g., counters (such as message rates in event processing) and collections (such as recently processed events). Agents can also be used for offloading arbitrary computations to a thread pool. The state of an agent is always immediately available for reading by any thread.
  Agent action dispatches take the form (send-off agent fn args*). At some point later, in another thread, the given fn will be applied to the state of the agent and the args, if any were supplied.
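
  A small example in the style of the atom examples above (the counter is illustrative):

  (def msg-count (agent 0))        ; e.g., a message-rate counter

  (send-off msg-count + 1)         ; dispatched asynchronously to a thread pool
  (send-off msg-count + 1)

  (await msg-count)                ; for the demo, block until pending actions have run
  @msg-count  ;; -> 2              ; the state is always readable without blocking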

  19. Refs and Transactions
  Refs are coordinated reference types. They help ensure that multiple identities can be modified consistently within transactions, where either all refs are modified or none are. There are no race conditions when altering refs and no deadlocks. Refs guarantee ACI (not D):
  • Atomic: within the transaction, the updates occur to all the refs, or, if something goes wrong, none of them are updated and the transaction is retried.
  • Consistent: an optional validator function can be used with the refs to check values before the transaction commits.
  • Isolated: a transaction has its own isolated view of the world. If another transaction is running at the same time, the current transaction will not see any effects from it.

  20. # Example of usage of refs and dosync for creating transactions which apply alter to update refs
  (def alice-height (ref 1))
  (def right-hand-bites (ref 10))

  (defn eat-from-right-hand []
    (dosync
      (when (pos? @right-hand-bites)
        (alter right-hand-bites dec)
        (alter alice-height #(* % 2)))))

  (let [n 2]
    (future (dotimes [_ n] (eat-from-right-hand)))
    (future (dotimes [_ n] (eat-from-right-hand)))
    (future (dotimes [_ n] (eat-from-right-hand))))

  @alice-height      ;; -> 64
  @right-hand-bites  ;; -> 4

  21. # Example of usage of commute to improve performance
  (def alice-height (ref 1))
  (def right-hand-bites (ref 10))

  (defn eat-from-right-hand []
    (dosync
      (when (pos? @right-hand-bites)
        (commute right-hand-bites dec)
        (commute alice-height #(* % 2)))))

  (let [n 3]
    (future (dotimes [_ n] (eat-from-right-hand)))
    (future (dotimes [_ n] (eat-from-right-hand)))
    (future (dotimes [_ n] (eat-from-right-hand))))

  @alice-height      ;; -> 512
  @right-hand-bites  ;; -> 1

  22. An Agents-and-Repositories Structural Pattern: Design of a Large-Scale Complex Simulation - an Ant Colony
  • https://www.youtube.com/watch?v=dGVqrGmwOAw
  • http://limist.com/coding/an-example-of-literate-programming-in-clojure-using-emacsorg.html
  Software simulation details:
  • Ant agents are deployed and store the state of the current location of one ant; they require uncoordinated, asynchronous updates.
  • Each cell in a 2-D grid stores food and pheromone (pher) levels and whether an ant is present; if so, it indicates the ant's direction and whether it carries food. These cells are our refs: coordinated reference types that enforce consistent, mutually exclusive updates.

  23. An ant's behavior may affect or "alter" the food and pheromone levels of its current cell, as well as the direction and has-food state of the ant present there, so we must enforce mutually exclusive updates of mutated cell data. Transactions coordinate updates and verify properties:
  • Turn-ant: coordinated change of direction toward home or a nearby cell
  • Move-ant: verifies the way is clear
  • Take-food: verifies food exists in the cell
  • Drop-food: verifies the ant has food
  Ant agents track the location of an ant and control the behavior of the ant at that location.

  24. Ant Simulation Transactions + Demo
  (defn turn [loc amt]
    (dosync
      (let [p (place loc)
            ant (:ant @p)]
        (alter p assoc :ant (assoc ant :dir (bound 8 (+ (:dir ant) amt))))))
    loc)

  (defn move [startloc]
    (let [oldp (place startloc)
          ant (:ant @oldp)
          newloc (delta-location startloc (:dir ant))
          newp (place newloc)]
      (alter newp assoc :ant ant)
      (alter oldp dissoc :ant)
      ;; leave pheromone trail at oldp
      (when-not (:home @oldp)
        (alter oldp assoc :pher (inc (:pher @oldp))))
      newloc))
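
  The take-food and drop-food transactions mentioned on the previous slide are not shown above; in the same style, they might look like the following sketch (the :food fields on the cell and on the ant are assumed to match the demo code):

  (defn take-food [loc]
    ;; must be called in a transaction that has verified food exists in the cell
    (let [p (place loc)
          ant (:ant @p)]
      (alter p assoc
             :food (dec (:food @p))
             :ant  (assoc ant :food true))
      loc))

  (defn drop-food [loc]
    ;; must be called in a transaction that has verified the ant has food
    (let [p (place loc)
          ant (:ant @p)]
      (alter p assoc
             :food (inc (:food @p))
             :ant  (assoc ant :food false))
      loc))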

  25. App 2: Content-Based Image Retrieval (Pipe-and-Filter Structural Pattern)
  The key elements of the application are the feature-extractor, trainer, and classifier components. Given a set of new images, the feature extractor collects features of the images. Given the features of the new images, chosen examples, and some classified new images from user feedback, the trainer trains the parameters needed by the classifier. Given the parameters from the trainer, the classifier classifies the new images based on their features. The user can classify some of the resulting images and give feedback to the trainer repeatedly in order to increase the accuracy of the classifier.

  26. The feature extractor, trainer, and classifier are filters, or computational elements, connected by pipes. Data flows through the succession of filters, which do not share state and take input only from their input pipe(s). Since each of the filters of CBIR is a complex computation, it can be further decomposed. In our CBIR application we use a support-vector machine (SVM) classifier; for simplicity we ignore the feature extractor and trainer for now. SVM is widely used in many classification tasks such as image recognition, bioinformatics, and text processing. SVM finds the "widest street" separating the classes. The structure and computations in the SVM classifier are illustrated in the next slides.

  27. Details on the SVM classifier filter
  The SVM classifier evaluates the decision function
  f(z) = sgn( Σ_i α_i y_i Φ(x_i, z) + b )
  where x_i is the i-th support vector, z is the query vector, Φ is the kernel function, α_i is the weight, y_i in {-1, 1} is the label attached to support vector x_i, b is a bias parameter, and sgn is the sign function.
  Ref: historical background and math (Vapnik, 1960s). See the MIT SVM lecture on YouTube: https://www.youtube.com/watch?v=_PwhiWxHK8o

  28. Details on the SVM classifier filter (continued)
  The basic structure of the classifier filter is itself a simple pipe-and-filter structure with two filters:
  • The first filter takes the test data and the support vectors (obtained from training) and calculates the dot products between the test data and each support vector. This dot-product computation is naturally performed using the dense linear algebra computational pattern.
  • The second filter takes the resulting dot products, computes the kernel values (usually a simple transformation of dot products and Euclidean distances), sums up all the kernel values, and scales the final result if necessary. The structural pattern associated with these computations is MapReduce: the same computation is mapped to different non-overlapping partitions of the state set, and the results of these computations are then gathered, or reduced. We define the MapReduce structure in terms of a concurrent algorithmic strategy.
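
  A hedged Clojure sketch of the two filters (names are illustrative, and a linear kernel, i.e., the dot product itself, is assumed for simplicity):

  ;; Filter 1: dense linear algebra -- the dot product of the query z with each support vector.
  (defn dot [x z] (reduce + (map * x z)))

  ;; Filter 2: map the kernel over the dot products, then reduce (sum), add the bias, and apply sgn.
  (defn classify [support-vectors alphas ys b z]
    (let [dots  (map #(dot % z) support-vectors)            ; map phase
          terms (map (fn [a y d] (* a y d)) alphas ys dots) ; weight each kernel value
          score (+ (reduce + terms) b)]                     ; reduce phase, plus the bias b
      (if (pos? score) 1 -1)))                              ; sgn

  ;; Example: two support vectors in 2-D.
  (classify [[1 0] [0 1]] [0.5 0.5] [1 -1] 0.0 [2 1])
  ;; -> 1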

  29. Algorithmic Strategies
  The natural choices for algorithmic strategies are the data parallelism and geometric decomposition patterns. Using data parallelism, we can compute the kernel value of each dot product in parallel. Alternatively, using geometric decomposition, we can divide the dot products into regular chunks of data, apply the dot products locally on each chunk, and then apply a global reduce to compute the summation over all chunks for the final result. We are interested in designs that can utilize large numbers of cores. Since the solution based on the data parallelism pattern exposes more concurrent tasks (due to the large number of dot products) than the more coarse-grained geometric decomposition solution, we choose the data parallelism pattern for implementing the map-reduce computation.

  30. Summary of the computation of the SVM classifier
  This design was the composition of the Pipe-and-Filter, Dense-Linear-Algebra, and MapReduce patterns. To parallelize the MapReduce computation, we used the Data-Parallelism pattern. To implement the Data-Parallelism algorithmic strategy, both the Strict-Data-Parallel and Loop-Parallel patterns are applicable. We chose the Strict-Data-Parallel pattern, since it seemed the more natural choice given that we wanted to expose large amounts of concurrency for use on many-core chips with large numbers of cores, such as GPUs. It is common for reductions and barriers to be provided as part of a parallel programming environment; hence, a programmer needs to be aware of these constructs and what they provide, but need not explore their implementation in full detail.

  31. Homework #2 (due in 2 weeks)
  • Describe in detail a new ant-simulation behavior rule or property to enhance the ant-colony simulation.
  • Describe an algorithmic strategy for implementing your new rule, and discuss the computational cost and potential concurrency overhead.
  • Provide pseudocode for your rule.
  • Add an implementation of your rule to the existing code base.
  Possible examples:
  • Modify the pheromone trail rule to take account of local conditions; for example, an ant might release more or less pheromone depending on how crowded the neighborhood is.
  • Include 2 or more classes of ants and provide distinctive behavior rules for each class.
