Parallel and Concurrent Programming

Parallel and Concurrent Programming • Motivation (General): • Multicore Architectures • Systems thatrequire High Performance Computing (HPC) services • Systems thatprovide HPC services • Science and Engineering movingtowards simulation (requiring HPC) • Motivation (Software Engineering): • Understanding the interaction between hardware and software is key to making architectural tradeoffs during design (for HPC) • It is also important to understand the need to balance potential gains in performance versus additional programming effort involved. MSc SAI - Parallel & Concurrent (J Paul Gibson)

Multi-Core MSc SAI - Parallel & Concurrent (J Paul Gibson)

Multi-Core Multi-core processors are MIMD: Different cores execute different threads (Multiple Instructions), operating on different parts of memory (Multiple Data). Multi-core is a shared memory multiprocessor: All cores share the same memory MSc SAI - Parallel & Concurrent (J Paul Gibson)

Multi-Core • Interaction with the Operating System: • • OS perceiveseachcore as a separate processor • • OS schedulermaps threads/processes to differentcores • • Most major OS support multi-core today: Windows, Linux, Mac OS X, … • Why multi-core ? • • Difficult to make single-coreclockfrequenciesevenhigher • • Deeplypipelined circuits: • – heatproblems • – speed of light problems • – difficult design and verification • – large design teams necessary • – server farmsneedexpensive air-conditioning • • Many new applications are multithreaded • • General trend in computer architecture (shift towards more parallelism) MSc SAI - Parallel & Concurrent (J Paul Gibson)

Multi-Core • Instruction-levelparallelism: • • Parallelism at the machine-instruction level • • The processor can re-order, pipeline instructions, split themintomicroinstructions, do aggressivebranchprediction, etc. • • Instruction-levelparallelismenabledrapidincreases in processor speeds over the last 15 years • Thread-levelparallelism (TLP): • • This is parallelism on a more coarser scale • • Server can serve each client in a separate thread (Web server, database server) • • A computer gamecan do AI, graphics, and physics in three separate threads • • Single-coresuperscalar processors cannotfully exploit TLP • • Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP MSc SAI - Parallel & Concurrent (J Paul Gibson)

Multi-Core Simultaneous multithreading (SMT): • Permits multiple independent threads to execute SIMULTANEOUSLY on the SAME core • Weavingtogether multiple “threads” on the samecore MSc SAI - Parallel & Concurrent (J Paul Gibson)

Multi-Core • Combining Multi-core and SMT • • Cores can be SMT-enabled (or not) • • The differentcombinations: • – Single-core, non-SMT: standard uniprocessor • – Single-core, with SMT • – Multi-core, non-SMT • – Multi-core, with SMT: • • The number of SMT threads: • 2, 4, or sometimes 8 simultaneous threads • • Intel calls them “hyper-threads” MSc SAI - Parallel & Concurrent (J Paul Gibson)

Parallel and Concurrent Programming • There is a confusing use of terminology: • Parallel – "The simultaneous use of more than one computer to solve a problem“ • Concurrent – "Concurrent computing is a form of computing in which programs are designed as collections of interacting computational processes that may be executed in parallel " • Distributed – "A collection of (probably heterogeneous) automata whose distribution is transparent to the user so that the system appears as one local machine." • Cluster – "Multiple servers providing the same service" • Grid – “ A form of distributed computing whereby a “super virtual computer” is composed of many networked loosely coupled computers acting together to perform very large tasks. " • Cloud – "System providing access via the Internet to processing power, storage, software or other computing services." • Multitasking– "sharing a single processor between several independent jobs" • Multithreading - "a kind of multitasking with low overheads and no protection of tasks from each other, all threads share the same memory." MSc SAI - Parallel & Concurrent (J Paul Gibson)

Parallel and Concurrent Programming Some Quick RevisionTopics • DynamicLoadBalancing • Combinational Circuits • Interconnection Networks • Shared Memory • Message Passing • Classification of Parallel Architectures • Introducing MPI • Sequential to Parallel • MathematicalAnalysis - Amdahl’s Law • Compiler Techniques • Development Tools/Environments/Systems MSc SAI - Parallel & Concurrent (J Paul Gibson)

Dynamic Load Balancing • The primary sources of inefficiency in parallel code: • Poor single processor performance • Typically in the memory system • Toomuchparallelismoverhead • Thread creation, synchronization, communication • Loadimbalance • Different amounts of work across processors • Computation and communication • Different speeds (or available resources) for the processors • Possibly due to load on the machine • How to recognizing load imbalance • Time spent at synchronization is high and is uneven across • processors, but not always so simple … MSc SAI - Parallel & Concurrent (J Paul Gibson)

Dynamic Load Balancing Static load balancing --- when the amount of work allocated to each processor is calculated in advance. Dynamic load balancing --- when the loads are re-distributed at run-time. The static method is simpler to implement and is suitable when the underlying processor architecture is static. The dynamic method is more difficult to implement but is necessary when the architecture can change during run-time. When it is difficult to analyse the processing requirements of an algorithm in advance then the static method becomes less feasible. When processor speeds (allocated to the algorithm) can vary dynamically then the static approach may be very inefficient … depending on variation types. MSc SAI - Parallel & Concurrent (J Paul Gibson)

Dynamic Load Balancing • Load balancing differs with properties of the tasks: • • Taskscosts • • Do all tasks have equal costs? • • If not, when are the costs known? • Before starting, when task created, or only when task ends • • Taskdependencies • • Can all tasks be run in any order (including parallel)? • • If not, when are the dependencies known? • Before starting, when task created, or only when task ends • • Locality • • Is it important for some tasks to be scheduled on the same • processor (or nearby) to reduce communication cost? • • When is the information about communication known? MSc SAI - Parallel & Concurrent (J Paul Gibson)

Dynamic Load Balancing Task Cost Assumptions MSc SAI - Parallel & Concurrent (J Paul Gibson)

Dynamic Load Balancing Task Dependencies MSc SAI - Parallel & Concurrent (J Paul Gibson)

Dynamic Load Balancing Task Locality and Communication MSc SAI - Parallel & Concurrent (J Paul Gibson)

Dynamic Load Balancing Load balancing is well understood for parallel systems (message passing and shared memory) and there exists a wide range of solutions (both specific and generic). You should know (at the minimum) about the simplest solutions One of the most common applications of load balancing is to provide a single Internet service from multiple servers, sometimes known as a server farm. Commonly, load-balanced systems include popular web sites, large Internet Relay Chat networks, high-bandwidth File Transfer Protocol sites, Network News Transfer Protocol (NNTP) servers and Domain Name System (DNS) servers. There are many open questions concerning load balancing for the cloud and for grids. MSc SAI - Parallel & Concurrent (J Paul Gibson)

Static Load Balancing Problems --- Example 1 • There is a 4 processor system where you have no prior knowledge of processor speeds. • You have a problem which is divided into 160 equivalent tasks. • Initial load balancing: distribute tasks evenly among processors. • After 10 seconds: • Processor 1 (P1) has finished • Processor P2 has 20 tasks completed • Processor P3 has 10 tasks completed • Processor P4 has 5 tasks complete Question: what should we do? MSc SAI - Parallel & Concurrent (J Paul Gibson)

Example 1 continued ... • We can do nothing --- • Advantage: the simplest approach, just wait until all tasks are complete • Disadvantage: P1 will remain idle until all other tasks are complete (and other processes may become idle) • Rebalance by giving some of the remaining tasks to P1 --- • Advantage: P1 will no longer be idle • Disadvantage: How do we rebalance in the best way? Note: this question is not as simple as it first seems MSc SAI - Parallel & Concurrent (J Paul Gibson)

Example 1 continued … some analysis • If we do not rebalance then we can predict execution time (time to complete all tasks) using the information we have gained through analysis of the execution times of our processors --- • P4 appears to be the slowest processor and data suggests that it completes 1 task every 2 seconds Without re-balancing, we have to wait until the slowest processor (P4) has finished … 80 seconds in total. Question: what fraction of total execution time is idle time? Note: Without re-balancing we have too much idle time and have not reached optimum speed-up MSc SAI - Parallel & Concurrent (J Paul Gibson)

Example 1 continued … some more analysis The simplest re-balance: when 1 processor has become idle then evenly distribute all tasks amongst all processors So, in our example, after 10 seconds there are 85 tasks left to be completed (P2 has 20, P3 has 30, P4 has 35). We divide evenly (or as evenly as possible) --- 85 = 4*21 +1 Thus, 3 processes take 21 tasks and 1 process takes 22 tasks. Question: if we re-balance this time (but no other time) then what is the total execution time? MSc SAI - Parallel & Concurrent (J Paul Gibson)

Example 1 continued … some more analysis • The simplest re-balance is therefore an improvement. However, we should be able to do better: • Why redistribute evenly? • The processor speeds may vary greatly over time • The calculation is simple and no resources needed to store processor history • Why not rebalance more than once? • Re-balancing usually costs something • When only a few tasks are left its not worth the effort Question: in the example, assuming the processors continue at the same speed, what is total execution time if we keep on re-balancing evenly when P1 becomes idle? MSc SAI - Parallel & Concurrent (J Paul Gibson)

Re-balance Costs … example 1 revisited. Re-balancing is an intuitive concept: if it is cheap do it, if it is expensive then don’t bother. It is open to rigorous mathematical analysis: formalising the notion of cheap and expensive! • Question: If re-balancing costs: • a) 50 seconds • b) 20 seconds • c) 5 seconds • then how many re-balancing operations should be performed in order to maximise the speed-up? MSc SAI - Parallel & Concurrent (J Paul Gibson)

General re-balancing decision procedure Based on a re-balancing graph, it is simple to automate the decision making process: Tasks remaining Here, p1 is the fastest processor and we perform 3 re-balances, which are very cheap! We stop when re-balancing costs more than the time we gain! p1,p2,p3 p1,p2,p3 p1,p2,p3 p1 p1 p1 time balance balance balance balance MSc SAI - Parallel & Concurrent (J Paul Gibson)

The simplest analysis for a re-balance decision rebalance Slowest T1 Tasks Fastest X Y TimeR Y = time if re-balancing not carried out = T1/slowest rate X = time if re-balancing carried out once = ??? Rebalance if X <Y … if TimeR < ??? MSc SAI - Parallel & Concurrent (J Paul Gibson)

The complete analysis for a re-balance decision no rebalance rebalance Slowest T1 rebalance Tasks Fastest X Y TimeR Y = time if re-balancing not carried out = T1/slowest rate X = time if re-balancing carried out until (number of tasks < number of processes) = ??? Rebalance if X <Y … if TimeR < ??? MSc SAI - Parallel & Concurrent (J Paul Gibson)

Dynamic Load Balancing: Some further reading G. Cybenko. 1989. Dynamic load balancing for distributed memory multiprocessors. J. Parallel Distrib. Comput. 7, 2 (October 1989) M. H. Willebeek-LeMair and A. P. Reeves. 1993. Strategies for Dynamic Load Balancing on Highly Parallel Computers. IEEE Trans. Parallel Distrib. Syst. 4, 9 (September 1993) VipinKumar, Ananth Y. Grama, and Nageshwara Rao Vempaty. 1994. Scalableloadbalancing techniques for parallel computers. J. ParallelDistrib. Comput. 22, 1 (July 1994) ValeriaCardellini, MicheleColajanni, and Philip S. Yu. 1999. DynamicLoadBalancing on Web-Server Systems. IEEE Internet Computing 3, 3 (May 1999) MSc SAI - Parallel & Concurrent (J Paul Gibson)

Parallelism Using Combinational Circuits • A combinational circuit is a family of models of computation – • Number of inputs at one end • Number of outputs at the other end • Internally – a number of interconnected components arranged in columns called stages • Each component can be viewed as a single processor with constant fan-in and constant fan-out. • Components synchronise their computations (input to output) in a constant time unit (independent of the input values) • Computations are usually simple logical operations (directly implementable in hardware for speed!) • There must be no feedback MSc SAI - Parallel & Concurrent (J Paul Gibson)

Parallelism Using Combinational Circuits Combinational Circuits For List Processing The best known examples of CCs are those for direct hardware implementation of list processing functions. Fundamental operations of these hardware computers correspond to fundamental components. Processing tasks which are non-fundamental on a standard single processor architecture can be parallelised (to reduce their complexity) by implementing them on a different parallel machine using a number of components set up in a combinational circuit. Classic processing examples – searching, sorting, permuting, …. But what are the useful components for implementation in a CC? MSc SAI - Parallel & Concurrent (J Paul Gibson)

Parallelism Using Combinational Circuits Parallel Design --- list operations • Compositional Analysis --- use the analysis of each component to construct analysis – of speedup and efficiency - of the design. • Advantage --- re-use of already done analysis • Requires --- complexity analysis for each component. • For example, consider the following fundamental(?) operations: • (BI)PARTITION(list1) ---- constant time (no need to parallelise) • APPEND(list1,list2) ---- constant time (no need to parallelise) • and the following non-fundamental non-optimal (?) operations: • SPLIT(list1,property) ---- O(size(list1)) • SEARCH(key,directory) ---- O(size(directory)) • MERGE(list1,list2) ---- O (max(size(list1),size(list2))) • SORT (list1) ---- O(size(list1)^2) What can we do here to attack the complexity? MSc SAI - Parallel & Concurrent (J Paul Gibson)

Parallelism Using Combinational Circuits Parallel Design ---the split operation Question: how to parallelise the split operation? Answer: depends if the property is structured! Consider: M1 . . Mp L1 . . . Ln • Where: split partitions L into M and N - • Forall Mx, Property(Mx) • Forall Ny, Not(Property(Ny)) • Append(M,N) is a permutation of L SPLIT(property) N1 . . Nq Question: Can we use the structure in the Property to help parallelise the design? EXAMPLE: A ^ B, A v B, ‘any boolean expression’ MSc SAI - Parallel & Concurrent (J Paul Gibson)

Parallelism Using Combinational Circuits Example:Splitting on property A ^ B Question: what do/could we gain? append bipartition SPLIT(A) SPLIT(B) app app BIP SPLIT(A) SPLIT(B) app app Question: what about splitting on property AVB? MSc SAI - Parallel & Concurrent (J Paul Gibson)

Parallelism Using Combinational Circuits Example:Splitting on property A ^ B NEED TO DO PROBABILISTIC ANALYSIS: Typical gain when P(A) = 0.5 and P(B) = 0.5 n/4 SPLIT(A) SPLIT(B) app n/2 n n/4 app BIP SPLIT(A) n/4 SPLIT(B) n/2 app app n/4 Depth of circuit is 1+ (n/2) + (n/4) +1+1 = 3 + (3n/4) QUESTION: is this better than before? MSc SAI - Parallel & Concurrent (J Paul Gibson)

Parallelism Using Combinational Circuits Split example on non-structured property EXAMPLE: Split an input integer list into evens and odds Question: what is average speedup for the following design? SPLIT app BIP SPLIT app MSc SAI - Parallel & Concurrent (J Paul Gibson)

Parallelism Using Combinational Circuits Split example on non-structured property Question: what is average speedup for the following design? ANSWER: do Probabilistic Analysis as before … Depth = 2+ n/2 n/4 n/2 n/2 SPLIT (even/odd) app n n/4 BIP n/4 n/2 SPLIT (even/odd) app n/2 n/4 MSc SAI - Parallel & Concurrent (J Paul Gibson)

Parallelism Using Combinational Circuits Parallel Design --- the search operation Question: why is searching fundamentally different from the other components? Answer: the structure to be used for parallelisation is found in the component (directory) and not in the input data. Thus, we need to be able to cut up state and not just communication channels. Also, we need some sort of synchronisation mechanism. SEARCH (directory) SEARCH (?) data key ?? key data SEARCH (?) MSc SAI - Parallel & Concurrent (J Paul Gibson)

Parallelism Using Combinational Circuits Sorting byMerging • A good example of recursively constructing combinational circuits(CC) • The same technique can be applied to all CC’s synthesis and analysis • Requires understanding of a standard non-parallel (sequential) algorithm • Shows that some sequential algorithms are better suited to parallel implementation than others • Best suited to formal reasoning (preconditions, invariants, induction …) MSc SAI - Parallel & Concurrent (J Paul Gibson)

Parallelism Using Combinational Circuits Merging --- the base case Merge 2 sorted sequences of equal length m = 2^n. Base case, n=0 => m = 1. Precondition is met since a list with only 1 element is already sorted! The component required is actually a comparison operator Merge(1) = Compare M1 X= [x1] [min (x1,y1)] C (or CAE) Y = [y1] [max (x1,y1)] Uesful Measures: Width = 1 Depth = 1 Size = 1 MSc SAI - Parallel & Concurrent (J Paul Gibson)

Parallelism Using Combinational Circuits Merge --- the first recursive composition • QUESTION: • Using only component M1 (the comparison C), how can we construct a circuit for merging lists of length 2 (M2)? • ANALYSIS: • How many M1s … the size … are needed in total? • What is the complexity … based on the depth? • During execution what is our most efficient use of parallel resources … based on width? MSc SAI - Parallel & Concurrent (J Paul Gibson)

Parallelism Using Combinational Circuits Merge --- the first recursive composition – building M2 from a number of M1s M2 x1 M1 z1 C x2 X = [x1,x2] M1 z2 C z3 Y = [y1,y2] M1 y1 y2 C z4 Useful Measures: Width = 2 Depth = 2 Size = 3 MSc SAI - Parallel & Concurrent (J Paul Gibson)

Parallelism Using Combinational Circuits Proving M2 to be correct Validation ---We can test the circuit with different input values for X and Y But, this does not prove that the circuit is correct for all possible cases Clearly, there are equivalence classes of tests We want to identify all such classes and prove correctness for the classes. As the number of classes are finite, we can use an automated prover to do this Complete proof of all equivalence classes => system is verified. Here we have 6 equivalence classes (or 3,if we note the symmetry in swapping X and Y) DISJOINT OVERLAP CONTAINMENT x2 x1 x1 x2 x1 y1 y2 x2 y1 y2 y1 y2 MSc SAI - Parallel & Concurrent (J Paul Gibson)

Parallelism Using Combinational Circuits The next recursive step --- M4 The circuit for M2 is very easy to understand. It uses two M1s to initially merge the odd and even elements of the inputs, respectively. Then, it uses a another M1 to compare the middle values. This 2 layer architecture can be used for constructing M4: from a number of M2s and M1s… and consequently we can say M4 is constructed just from M1s!!! M4 M2 X C C Y M2 C QUESTION: what are size, width and depth?? MSc SAI - Parallel & Concurrent (J Paul Gibson)

Parallelism Using Combinational Circuits The next recursive step --- M4 M4 M2 X C C Y M2 C Depth (M4) = Depth (M2) +1 Width (M4) = Max (2*Width(M2), 3) Size (M4) = 2*Size(M2) + 3 Depth = 3 Width = 4 Size = 9 MSc SAI - Parallel & Concurrent (J Paul Gibson)

Parallelism Using Combinational Circuits The general recursive construction • We have seen how to construct M4 and M2 from M1’s, and prove the construction correct. Now we consider the general case: • Given any number of Mms how do we construct an M2m? M2m x1 C Mm x2m C y1 Mm C y2m 2m-1 C’s MSc SAI - Parallel & Concurrent (J Paul Gibson)

Parallelism Using Combinational Circuits Merge --- general analysis We can now perform a recursive analysis on the general merge circuit Mm: • Width --- Width (Mm) = 2 * width (Mm/2) = … = M • Depth --- Let d(2m) = depth of M2m, • now d(2m) = 1 + d(m), for m>1 and d(1) = 1 • => … => d(2m) = 1 + log(m) • Size ---Let s(2m) = size of M2m, • now s(2m) = 2s(m) = (m-1), for m>1 and s(1) = 1 • => … => s(2m) = 1 + mlog(m) MSc SAI - Parallel & Concurrent (J Paul Gibson)

Parallelism Using Combinational Circuits Sorting by Merging We can use the merge circuits to sort arrays --- For example, sorting an array of 8 numbers: S8 M1 M4 M2 M1 M1 M2 M1 Proof of correctness --- try to sketch the proof in your own time MSc SAI - Parallel & Concurrent (J Paul Gibson)

Parallelism Using Combinational Circuits Sorting by Merging – the analysis • Analyse the base case for sorting a 2 integer list (S2). • Synthesise and analyse S4 • What are the width, depth and size of Sn? • What about cases when n is not a power of 2? Question: is there a more efficient means of sorting using the merge components? If so, why? To DO: Look for information on parallel sorting on the web MSc SAI - Parallel & Concurrent (J Paul Gibson)

Parallelism Using Combinational Circuits Permutation Circuits An important function in computer science is to apply an arbitrary permutation to an array. We consider arrays of length m (m = 2^n) and perform a recursive composition. First, an example: Permute x1,x2,x3,x4,x5,x6,x7,x8 to x5,x4,x3,x1,x8,x7,x6,x2 The circuit can be shown as the following box: x1 x5 x4 x3 x1 x8 x7 x6 x2 Question: what goes inside? x8 MSc SAI - Parallel & Concurrent (J Paul Gibson)

Parallelism Using Combinational Circuits The simplest permutation --- a switch The base case is to permute an input array of 2 elements (a wire suffices for 1 element!) A switch has two possible states --- off or on. SWITCH A switch is therefore a programmable permutation circuit for input arrays of length 2. We use the notation P2 to denote a switch. Question: how can we re-use the switch to produce a a P4? Question: how can we re-use a Pn to produce a P2n? MSc SAI - Parallel & Concurrent (J Paul Gibson)

Parallel and Concurrent Programming