Parallel Algorithms

Parallel Algorithms CET306 Harry R. Erwin University of Sunderland

Roadmap • Theoretical Models • Turing Machine (TM) • Von Neumann Machine (VNM) • Random Access Machine (RAM) • Parallel Random Access Machine (PRAM) • Policies • Shared-Memory Programming • Distributed-Memory Programming • Portable Libraries • PVM • MPI • Critical Comparison • Parallel Patterns

Texts • Clay Breshears (2009) The Art of Concurrency: A Thread Monkey's Guide to Writing Parallel Applications, O'Reilly Media, Pages: 304. • Mordechai Ben-Ari (2006) Principles of Concurrent and Distributed Programming, Addison-Wesley.

Theoretical Models • Turing Machine (TM) • Von Neumann Machine (VNM) • Random Access Machine (RAM) • Parallel Random Access Machine (PRAM)

Turing Machine (TM) (From Wikipedia) • Turing wrote that the Turing machine, here called a Logical Computing Machine, consisted of: • “...an infinite memory capacity obtained in the form of an infinite tape marked out into squares, on each of which a symbol could be printed. At any moment there is one symbol in the machine; it is called the scanned symbol. The machine can alter the scanned symbol and its behaviour is in part determined by that symbol, but the symbols on the tape elsewhere do not affect the behaviour of the machine. However, the tape can be moved back and forth through the machine, this being one of the elementary operations of the machine. Any symbol on the tape may therefore eventually have an innings.” (Turing 1948, p. 61)
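The tape-and-scanned-symbol description above can be sketched as a few lines of Python. This is an illustrative sketch only: the function name `run_turing_machine`, the rule-table layout, the `"halt"` state name, and the bit-flipping example machine are all assumptions made for this example, not part of Turing's text. The step budget is there because, in general, we cannot know in advance whether a machine will stop.

```python
from collections import defaultdict

def run_turing_machine(rules, tape, state="start", blank="_", max_steps=10_000):
    """Run a one-tape machine until it enters the (assumed) state 'halt'.

    rules maps (state, scanned_symbol) -> (symbol_to_write, move, next_state),
    where move is -1 (tape head goes left) or +1 (tape head goes right).
    """
    # Unbounded tape: any square not yet visited reads as the blank symbol.
    cells = defaultdict(lambda: blank, enumerate(tape))
    head = 0
    steps = 0
    while state != "halt":
        if steps >= max_steps:
            raise RuntimeError("step budget exhausted; the machine may never halt")
        # Behaviour depends only on the current state and the scanned symbol.
        write, move, state = rules[(state, cells[head])]
        cells[head] = write
        head += move
        steps += 1
    if not cells:
        return ""
    used = sorted(cells)
    return "".join(cells[i] for i in range(used[0], used[-1] + 1)).strip(blank)

# Hypothetical example machine: flip every bit, halt at the first blank square.
FLIP_RULES = {
    ("start", "0"): ("1", +1, "start"),
    ("start", "1"): ("0", +1, "start"),
    ("start", "_"): ("_", +1, "halt"),
}

print(run_turing_machine(FLIP_RULES, "1011"))  # prints 0100
```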

Commentary • You can think of a Turing Machine as automating what a mathematician does in proving a statement. • The tape is the current state of a proof, and the question is whether the Turing Machine ever stops (having successfully proven the statement). In general, that question (the halting problem) is provably undecidable. • Any Turing Machine can be simulated by a Universal Turing Machine (UTM), with a ‘program’ at the beginning of the tape, followed by the statement to be proven. • All digital computer programs are special cases of this. • Analogue computers have been proposed as a basis for Super-Turing Machines.

Von Neumann Machine (VNM) or Architecture (VNA) • (Wikipedia) “This describes a design architecture for an electronic digital computer with subdivisions of a central arithmetic part, a central control part, a memory to store both data and instructions, external storage, and input and output mechanisms. The meaning of the phrase has evolved to mean a stored-program computer in which an instruction fetch and a data operation cannot occur at the same time because they share a common bus. This is referred to as the Von Neumann bottleneck and often limits the performance of the system.”

Commentary • (Wikipedia) “The design of a Von Neumann architecture is simpler than the more modern Harvard architecture which is also a stored-program system but has one dedicated set of address and data buses for memory, and another set of address and data buses for fetching instructions.” • “A stored-program digital computer is one that keeps its programmed instructions, as well as its data, in read-write, random-access memory (RAM). In the vast majority of modern computers, the same memory is used for both data and program instructions.” • (Backus quoted on Wikipedia) “The shared bus between the program memory and data memory leads to the Von Neumann bottleneck, the limited throughput (data transfer rate) between the CPU and memory compared to the amount of memory.”

Harvard Architecture • (Wikipedia) “The Harvard architecture is a computer architecture with physically separate storage and signal pathways for instructions and data. The term originated from the Harvard Mark I relay-based computer, which stored instructions on punched tape (24 bits wide) and data in electro-mechanical counters. These early machines had data storage entirely contained within the central processing unit, and provided no access to the instruction storage as data. Programs needed to be loaded by an operator; the processor could not boot itself.” • “Today, most processors implement such separate signal pathways for performance reasons but actually implement a Modified Harvard architecture, so they can support tasks such as loading a program from disk storage as data and then executing it.” • Security suggests data and instructions should be stored in separate areas, and instructions should be non-modifiable. The Modified Harvard architecture needs to be modified further to support this.

Random Access Machine (RAM) • Simplified Von Neumann Machine • Can be given multiple storage levels • Note that the difference between a Von Neumann Machine and a Harvard Machine is ignored. • Components: CPU, Input, Random Access Memory (also abbreviated RAM), Output

Parallel Random Access Machine (PRAM) • Pronounced “P-ram” • At its simplest, consists of multiple CPUs accessing a common memory of unlimited size. • Shared clock—one instruction per cycle • Memory access performance among the CPUs is identical.

PRAM Models • Concurrent Read, Concurrent Write (CRCW) • Multiple threads can read or write a common memory location in the same cycle. • Concurrent Read, Exclusive Write (CREW) • Multiple threads can read a common memory location in the same cycle, but only one thread can write it. • Exclusive Read, Concurrent Write (ERCW) • Only one thread can read a common memory location in a given cycle, but multiple threads can write it. • Exclusive Read, Exclusive Write (EREW) • Only one thread can read or write a given memory location in a given cycle. • Policies • The model only states which accesses are allowed. The PRAM algorithm has to sort out the interaction.

Policies • Who actually gets access during exclusive read or write operations. • What gets written in a concurrent write operation. The usual options are: • Ensure the same value is written by every writer • Random choice among the writers • Some logical, arithmetic, or illogical combination of the values being written.
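The concurrent-write policies can be made concrete with a small simulation of one memory cell in one cycle. This is a sketch under assumptions: the function name `resolve_concurrent_write` and the policy labels are invented for this example, and the "priority" rule (lowest-numbered processor wins) is a commonly described policy that is not on the slide.

```python
import random

def resolve_concurrent_write(values, policy, rng=random):
    """Decide what one memory cell holds after a concurrent write.

    values lists what each processor (in processor-number order) tries to
    write to the same cell in the same cycle. Policy names are illustrative.
    """
    if not values:
        raise ValueError("at least one processor must be writing")
    if policy == "common":
        # Every writer must agree; otherwise the algorithm is in error.
        if any(v != values[0] for v in values):
            raise ValueError("common policy requires all writers to agree")
        return values[0]
    if policy == "arbitrary":
        # Random choice: the algorithm must be correct whoever wins.
        return rng.choice(values)
    if policy == "priority":
        # Assumption, not on the slide: lowest-numbered processor wins.
        return values[0]
    if policy == "sum":
        # One example of an arithmetic combination.
        return sum(values)
    if policy == "max":
        # Another example of a combining rule.
        return max(values)
    raise ValueError(f"unknown policy: {policy}")

print(resolve_concurrent_write([4, 4, 4], "common"))  # prints 4
print(resolve_concurrent_write([1, 2, 3], "sum"))     # prints 6
```

An algorithm written for the "common" policy is the most portable, because it gives the same answer under any of the other selection rules.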

Programming • Shared-Memory Programming • Distributed-Memory Programming • Portable Libraries • PVM • MPI • Critical Comparison of programming models

Shared-Memory Programming • Scaling of bus-based shared-memory machines petered out in 1985-95 at a limit of about 32 processors, due to bus contention.

Distributed-Memory Programming • Some of the memory in the system is allocated to individual processors and some is shared. • The processors need to collaborate; this is mostly handled by message-passing. • PVM • MPI • Beowulf clusters showed how to combine PCs using MPI to get high performance. We have a cluster at Sunderland, and C. Panchev knows this area.

Critical Comparison • Features common to shared-memory and distributed-memory programming • There is no free lunch. Some parts of your program will have to run serially. • Management is unavoidable. The work has to be divided up. You can exploit data parallelism, or you can split the parts of the job among processors. • Data have to be shared. Live with it. • You can allocate work on the fly or you can plan it.
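The "no free lunch" point can be quantified with Amdahl's law, which the slide does not name but which is the standard way to express it: if a fraction s of the run time must stay serial, the speedup on p processors is at most 1 / (s + (1 - s) / p). A minimal sketch, with the function name `amdahl_speedup` chosen for this example:

```python
def amdahl_speedup(serial_fraction, processors):
    """Upper bound on speedup when serial_fraction of the work cannot be parallelised."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

# With 10% serial work, 32 processors give less than an 8x speedup,
# and no number of processors can ever exceed 10x.
for p in (2, 8, 32, 1024):
    print(p, round(amdahl_speedup(0.10, p), 2))
```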

Shared Memory Issues • Threads will need their private memory areas. Usually you can do this by allocating thread-local memory. This can be for a given method execution, or you can use thread-local storage that stays with a thread. • Performance of data access will be an issue. Think about storage conflicts and data races. • Communication in memory involves synchronisation. • You will need mutual exclusion or synchronisation primitives. Learn about them. • Learn about producer/consumer or boss/worker protocols. • Learn about reader/writer locks.
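Several of the items above (thread-local storage, mutual exclusion, and a producer/consumer protocol) fit into one short sketch. The function name `run_producer_consumer`, the squared-sum workload, and the use of `None` as an end-of-work marker are assumptions made for illustration; the items are assumed to be numbers, never `None`.

```python
import queue
import threading

SENTINEL = None  # assumed end-of-work marker; real items must not be None

def run_producer_consumer(items, n_consumers=3):
    """Sum the squares of items using one producer and several consumers."""
    work = queue.Queue(maxsize=4)   # bounded buffer: the producer blocks when it is full
    total = 0                       # shared result, protected by total_lock
    total_lock = threading.Lock()   # mutual exclusion for the shared result
    local = threading.local()       # each thread sees its own attributes

    def producer():
        for item in items:
            work.put(item)
        for _ in range(n_consumers):
            work.put(SENTINEL)      # one marker per consumer so each one stops

    def consumer():
        nonlocal total
        local.subtotal = 0          # thread-local: no other thread can touch it
        while True:
            item = work.get()       # blocks until the producer supplies something
            if item is SENTINEL:
                break
            local.subtotal += item * item
        with total_lock:            # only one thread at a time updates the total
            total += local.subtotal

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(n_consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total

print(run_producer_consumer(range(10)))  # prints 285
```

Accumulating into a private subtotal and taking the lock once per thread keeps contention on the shared total low.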

Pattern Languages • Alexander (1977) invented pattern languages as practical tools for describing architectural expertise in some domain. • The elements of a pattern language are patterns. Each pattern describes a problem that occurs over and over again and the core of the solution to that problem in such a way that it can be reused many times, never once the same way. • A pattern isn’t considered proven until it has been used at least three times in real applications.

Design Patterns • The four essential elements (Gamma et al., 1995) of a design pattern are: • A descriptive name • A problem description that shows when to apply the pattern and in what contexts. The description also explains how it helps to complete larger patterns. • A solution that abstractly describes the constituent elements, their relationships, responsibilities, and collaborations. • The results and trade-offs that should be taken into account when applying the pattern.

Pattern Resources • Gamma, Helm, Johnson, and Vlissides, 1995, Design Patterns, Addison-Wesley. • The Portland Pattern Repository: http://c2.com/ppr/ • Resources on Parallel Patterns: http://www.cs.uiuc.edu/homes/snir/PPP/ • Visual Studio 2010 and the Parallel Patterns Library: http://msdn.microsoft.com/en-us/magazine/dd434652.aspx http://www.microsoft.com/download/en/details.aspx?id=19222 http://msdn.microsoft.com/en-us/library/dd492418.aspx • Alexander, 1977, A Pattern Language: Towns, Buildings, Construction, Oxford University Press. (For historical interest.)

Some Parallel Patterns • Source: Williams, A (2011) “Picking Patterns for Parallel Programs (Part 1)”, Overload, 105, 15-17. • Loop Parallelism • Fork/Join • Pipelines • Actor • Speculative Execution

Loop Parallelism • Problem • There is a for loop that operates on many independent data items. • Solution • Parallelise the for loop. The operation should depend only on the loop counter, and the individual loop iterations should not interact. • Positives • Scales very nicely. • Very common. • Negatives • Overhead of setting up the threads. • Avoid if there is interaction, as the individual iterations may execute in any order.
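A minimal sketch of Loop Parallelism using Python's standard `concurrent.futures` module. The helper name `parallel_for` is an assumption for this example. Note that in the standard CPython interpreter, threads mainly help when the loop body waits on I/O; for CPU-bound bodies, `ProcessPoolExecutor` is the usual substitute.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_for(body, n, max_workers=4):
    """Run body(i) for i in range(n) on a pool of threads.

    body must depend only on the loop counter i and must not interact with
    other iterations, because they may execute in any order. The results
    are still returned in loop-counter order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(body, range(n)))

def square(i):
    return i * i

print(parallel_for(square, 5))  # prints [0, 1, 4, 9, 16]
```

Reusing a pool of worker threads, rather than starting one thread per iteration, is one way to limit the set-up overhead listed under Negatives.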

Fork/Join • Problem • The task can be broken into two or more parts that can be run in parallel. • Solution • Use a thread for each part. This can also be recursive. • Positives • Handles part interaction better than Loop Parallelism. • Works best at the top level of the application. • Negatives • Needs to be managed centrally so that hardware parallelism is utilised efficiently. • Overhead of threads. • Bursty parallelism. • Uneven workloads.
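The recursive form of Fork/Join can be sketched as a divide-and-conquer sum. The function name `fork_join_sum` and the `threshold` parameter are assumptions for this example; the threshold stops the recursion from creating more threads than the work justifies, which is one simple answer to the thread-overhead negative.

```python
import threading

def fork_join_sum(data, lo=0, hi=None, threshold=1000):
    """Sum data[lo:hi] by recursively splitting the range in two."""
    if hi is None:
        hi = len(data)
    if hi - lo <= threshold:
        return sum(data[lo:hi])          # small enough: just do it serially
    mid = (lo + hi) // 2
    result = {}

    def left_half():
        result["left"] = fork_join_sum(data, lo, mid, threshold)

    worker = threading.Thread(target=left_half)
    worker.start()                                     # fork: left half runs in a new thread
    right = fork_join_sum(data, mid, hi, threshold)    # this thread takes the right half
    worker.join()                                      # join: wait for the left half
    return result["left"] + right

print(fork_join_sum(list(range(10_000))))  # prints 49995000
```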

Pipelines • Problem • You have a set of tasks to be applied in turn to data. First-in, first-out. • This problem shows up in sensor data processing a lot. • Solution • Set up the tasks to run in parallel. • Fill the input queue. • Positives • Adapts well to heterogeneous hardware configurations. • Negatives • Setting it up. • Ensuring that the tasks have similar durations to avoid a rate-limiting step. • Cache interaction during transfers between pipeline stages.
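A pipeline can be sketched with one thread per task and a queue between each pair of neighbouring stages. The names `stage`, `run_pipeline`, and the `DONE` marker are assumptions for this example. Using exactly one thread per stage is what keeps the first-in, first-out order.

```python
import queue
import threading

DONE = object()  # assumed end-of-stream marker, passed down the pipeline

def stage(func, inbox, outbox):
    """Apply func to every item from inbox and pass the result to outbox."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)   # tell the next stage that the stream has ended
            return
        outbox.put(func(item))

def run_pipeline(funcs, items):
    """Push items through funcs in order, each func running in its own thread."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for item in items:         # fill the input queue
        queues[0].put(item)
    queues[0].put(DONE)
    results = []
    while True:                # drain the output queue
        out = queues[-1].get()
        if out is DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results

print(run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3]))  # prints [4, 6, 8]
```

Throughput is set by the slowest stage, which is why the slide asks for tasks of similar duration.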

Actor • Problem • Message-passing object-orientation with concurrency. • Message sending is asynchronous. • Response processing uses call-backs. • Solution • Objects communicating (only) via message queues • Positives • Actors can be analysed independently. • Avoids data races. • Negatives • Setup and queue management overhead. • Not good for short-lived threads. • Not an ideal communications mechanism. • Limited scalability.
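A minimal actor sketch: one object, one mailbox, one thread that is the only code ever touching the object's state. The class name `CounterActor` and the message names `"increment"`, `"get"`, and `"stop"` are invented for this example.

```python
import queue
import threading

class CounterActor:
    """An actor that owns a counter and is reachable only through its mailbox."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0  # touched only by the actor's own thread, so no data races
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message, callback=None):
        """Asynchronous send: returns at once, without waiting for processing."""
        self._mailbox.put((message, callback))

    def stop(self):
        """Ask the actor to finish after the messages already queued."""
        self.send("stop")
        self._thread.join()

    def _run(self):
        while True:
            message, callback = self._mailbox.get()
            if message == "stop":
                return
            if message == "increment":
                self._count += 1
            elif message == "get" and callback is not None:
                callback(self._count)  # the response is delivered by call-back

def demo():
    replies = []
    actor = CounterActor()
    for _ in range(3):
        actor.send("increment")
    actor.send("get", callback=replies.append)
    actor.stop()  # messages are handled in order, so the reply has arrived by now
    return replies

print(demo())  # prints [3]
```

The call-back runs on the actor's thread, so anything it touches must itself be safe to use from another thread.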

Speculative Execution • Problem • There’s an optional path that may be required for a solution, but it takes a lot of time. • Solution • Start it early and cancel it if it’s not needed. • This is part of how BI works. • Part of why time travel implies P==NP. • Positives • Exploits parallelism. • Likely to improve performance. • Negatives • Wastes energy and resources. • Interferes with other use of parallelism.
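"Start it early and cancel it if it's not needed" can be sketched with a future plus a cooperative cancellation flag. The names `expensive_path` and `solve`, and the toy workload, are assumptions for this example. The flag is needed because `Future.cancel()` in Python's `concurrent.futures` only succeeds if the task has not started running yet.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def expensive_path(cancelled, steps=50):
    """Stand-in for the slow optional computation; checks for cancellation as it goes."""
    total = 0
    for i in range(steps):
        if cancelled.is_set():
            return None        # abandon the work: nobody wants the result
        total += i
    return total

def solve(needs_expensive_path):
    """needs_expensive_path stands in for a decision the main path reaches later."""
    cancelled = threading.Event()
    with ThreadPoolExecutor(max_workers=1) as pool:
        speculative = pool.submit(expensive_path, cancelled)  # start it early
        # ... the main path would run here and work out whether it needs the result ...
        if needs_expensive_path:
            return speculative.result()   # the speculation paid off
        speculative.cancel()              # only works if the task never started
        cancelled.set()                   # otherwise ask the running task to stop
        return "cheap answer"
    # Leaving the with-block waits for the worker thread to finish.

print(solve(True))   # prints 1225
print(solve(False))  # prints cheap answer
```

The cancelled branch still burned some processor time, which is the "wastes energy and resources" negative in miniature.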

Conclusion • We’ve explored some of the concepts of shared memory and distributed memory programming. • I’ve also introduced patterns. • The tutorial is about the dining philosophers problem. There’s a lot on the web, including a few C# versions. Try to solve it on your own first.
