
Lecture 3 (Complexities of Parallelism)




Presentation Transcript


  1. Programming Multi-Core Processors based Embedded Systems: A Hands-On Experience on Cavium Octeon based Platforms Lecture 3 (Complexities of Parallelism)

  2. Course Outline • Introduction • Multi-threading on multi-core processors • Multi-core applications and their complexities • Multi-core parallel applications • Complexities of multi-threading and parallelism • Application layer computing on multi-core • Performance measurement and tuning 3-2

  3. Agenda for Today • Multi-core parallel applications space • Scientific/engineering applications • Commercial applications • Complexities due to parallelism • Threading related issues • Memory consistency and cache coherence • Synchronization 3-3

  4. Parallel Applications • Scientific/engineering, general-purpose, and desktop applications • David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998

  5. Parallel Application Trends • There is an ever-increasing demand for high performance computing in a number of application areas • Scientific and engineering applications: • Computational fluid dynamics • Weather modeling • Number of applications from physics, chemistry, biology, etc. • General-purpose computing applications • Video encoding/decoding, graphics, games • Database management • Networking applications 3-5

  6. Application Trends (2) • Demand for cycles fuels advances in hardware, and vice versa • This cycle drives the exponential increase in microprocessor performance • Drives parallel architecture harder: most demanding applications • Range of performance demands • Need a range of system performance with progressively increasing cost • Platform pyramid • Goal of applications in using multi-core machines: speedup • Speedup (p cores) = Performance (p cores) / Performance (1 core) • For a fixed problem size (input data set), performance = 1/time • Speedup fixed problem (p cores) = Time (1 core) / Time (p cores) 3-6
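A quick worked example with hypothetical numbers: if a fixed-size job takes 80 s on 1 core and 25 s on 4 cores, then Speedup fixed problem (4 cores) = Time (1 core) / Time (4 cores) = 80 / 25 = 3.2, somewhat below the ideal speedup of 4 because of parallel overheads.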

  7. Scientific Computing Demand 3-7

  8. Engineering Application Demands • Large parallel machines are a mainstay in many industries • Petroleum (reservoir analysis) • Automotive (crash simulation, drag analysis, combustion efficiency) • Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism) • Computer-aided design • Pharmaceuticals (molecular modeling) • Visualization • in all of the above • entertainment (films like Toy Story) • architecture (walk-throughs and rendering) 3-8

  9. Application Trends Example: ASCI • Accelerated Strategic Computing Initiative (ASCI) is a US DoE program that proposes the use of high performance computing for 3-D modeling and simulation • Promised to provide 5 orders of magnitude greater computing power in 8 years (1996 to 2004) than state-of-the-art (1 GFlops to 100 Tflops) 3-9

  10. Application Trends Example (2) • Platforms • ASCI Red • 3.1 TOPs peak performance • Developed by Intel with 4,510 nodes • ASCI Blue Mountain • 3 TOPs peak performance • Developed by SGI as 48 128-node Origin 2000s • ASCI White • 12 TOPs peak performance • Developed by IBM as a cluster of SMPs 3-10

  11. Commercial Applications • Databases, online transaction processing, decision support, data mining • Also rely on parallelism at the high end • Scale not so large, but use much more widespread • High performance means performing more work (transactions) in a fixed time 3-11

  12. Commercial Applications (2) • TPC benchmarks (TPC-C order entry, TPC-D decision support) • Explicit scaling criteria provided • Size of enterprise scales with size of system • Problem size no longer fixed as p increases, so throughput is used as a performance measure (transactions per minute or tpm) • Desktop applications • Video applications • Secure computing and web services 3-12

  13. Parallel Applications Landscape • HPCC (science/engineering) • Data center applications (search, e-commerce, enterprise, SOA) • Desktop applications (WWW browser, office, multimedia applications) • Embedded applications (wireless and mobile devices, PDAs, consumer electronics) 3-13

  14. Summary of Application Trends • The transition to parallel computing has occurred for scientific and engineering computing • Rapid progress is underway in commercial computing • Desktop also uses multithreaded programs, which are a lot like parallel programs • Demand for improving throughput on sequential workloads • Greatest use of small-scale multiprocessors • Currently employ multi-core processors • Solid application demand exists and will increase 3-14

  15. Solutions to Common Parallel Programming Problems Using Multiple Threads Chapter 7 Shameem Akhter and Jason Roberts, Multi-Core Programming, Intel Press, 2006

  16. Common Problems • Too many threads • Data races, deadlocks, and livelocks • Heavily contended locks • Non-blocking algorithms • Thread-safe functions and libraries • Memory issues • Cache-related issues • Pipeline stalls • Data organization 3-16

  17. Too Many Threads • If a little threading is good, many threads will be great? • Not always true • Excessive threading can degrade performance • Two types of impact of excessive threads • Too little work per thread • Overhead of starting and maintaining threads dominates • Fine granularity of work hides any performance benefit • Excessive contention for hardware resources • OS uses time-slicing for fair scheduling • May result in excessive context-switching overhead • Thrashing at the virtual memory level 3-17
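A rough sketch of sizing threads to hardware rather than to work items, assuming a Linux/pthreads target; the work array, chunking, and constants below are hypothetical placeholders.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NUM_ITEMS   1024   /* hypothetical amount of work */
    #define MAX_THREADS 64

    static int  data[NUM_ITEMS];
    static long nthreads;

    static void *worker(void *arg)
    {
        long id = (long)arg;
        /* Each thread processes one contiguous chunk of the items. */
        for (long i = id * NUM_ITEMS / nthreads; i < (id + 1) * NUM_ITEMS / nthreads; i++)
            data[i] *= 2;
        return NULL;
    }

    int main(void)
    {
        long ncores = sysconf(_SC_NPROCESSORS_ONLN);   /* cores actually available */
        if (ncores < 1) ncores = 1;
        nthreads = (ncores < MAX_THREADS) ? ncores : MAX_THREADS;

        pthread_t tid[MAX_THREADS];
        for (long i = 0; i < nthreads; i++)            /* one thread per core, not per item */
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (long i = 0; i < nthreads; i++)
            pthread_join(tid[i], NULL);

        printf("processed %d items with %ld threads\n", NUM_ITEMS, nthreads);
        return 0;
    }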

  18. Data Races, Deadlocks, and Livelocks • Race condition • Due to unsynchronized accesses to shared data • Program results are non-deterministic • Depend on relative timing of threads • Can be handled through locking • Deadlock • A problem due to incorrect locking • Results from a cyclic dependence that stops forward progress by threads • Livelock • Threads continuously conflict with each other and back off • No thread makes any progress • Solution: back off with release of acquired locks to allow at least one thread to make progress 3-18
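A minimal pthreads sketch of a race condition and its locking fix; the counter and iteration count are illustrative. Without the mutex the two threads' read-modify-write sequences interleave and the final value is non-deterministic; with it the result is always 2,000,000.

    #include <pthread.h>
    #include <stdio.h>

    static long            counter = 0;
    static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *increment(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&counter_lock);    /* remove the lock/unlock pair and the */
            counter++;                            /* unsynchronized read-modify-write    */
            pthread_mutex_unlock(&counter_lock);  /* becomes a data race                 */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, increment, NULL);
        pthread_create(&t2, NULL, increment, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);       /* always 2000000 with the lock held */
        return 0;
    }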

  19. Races among Unsynchronized Threads 3-19

  20. Race Conditions Hiding Behind Language Syntax 3-20

  21. A Higher-Level Race Condition Example • Race conditions are possible even with synchronization • if synchronization is applied at too low a level • higher-level operations may still have data races • Example • Each key should occur only once in the list • Individual list operations have locks • Problem: two threads may each find that a key does not exist and then insert the same key, one after the other • Solution: lock at the level of the whole list operation (the check plus the insert), not just the individual operations, to protect against key repetition 3-21
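A sketch of the check-then-insert problem, using a deliberately simple array-backed "list"; all names here are hypothetical. Even if find_key() and insert_key() were individually locked, only the list-level mutex around the combined operation prevents duplicate keys.

    #include <pthread.h>
    #include <stdbool.h>

    #define MAX_KEYS 128                  /* illustrative fixed-size "list" */

    static int             keys[MAX_KEYS];
    static int             nkeys;
    static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

    static bool find_key(int key)         /* could be internally locked and it would */
    {                                     /* still not make insert_unique() correct  */
        for (int i = 0; i < nkeys; i++)
            if (keys[i] == key) return true;
        return false;
    }

    static void insert_key(int key)
    {
        if (nkeys < MAX_KEYS) keys[nkeys++] = key;
    }

    void insert_unique(int key)
    {
        pthread_mutex_lock(&list_lock);   /* the check and the insert form ONE     */
        if (!find_key(key))               /* critical section; without it two      */
            insert_key(key);              /* threads can both see "not found" and  */
        pthread_mutex_unlock(&list_lock); /* insert the same key twice             */
    }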

  22. Deadlock Caused by Cycle 3-22

  23. Conditions for a Deadlock Deadlock can occur only if the following four conditions are all true: • Access to each resource is exclusive; • A thread is allowed to hold one resource while requesting another; • No thread is willing to relinquish a resource that it has acquired; and • There is a cycle of threads trying to acquire resources, where each resource is held by one thread and requested by another 3-23

  24. Locks Ordered by their Addresses • Consistent ordering of lock acquisition • Prevents deadlock 3-24
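A sketch of address-ordered lock acquisition; the account type and transfer function are hypothetical. Both locks are always taken lower-address first, so every thread acquires the pair in the same global order and no cycle of waiting threads can form.

    #include <pthread.h>
    #include <stdint.h>

    struct account {
        long            balance;
        pthread_mutex_t lock;
    };

    /* Always acquire the two locks in a fixed global order: lower address first. */
    static void lock_pair(struct account *a, struct account *b)
    {
        if ((uintptr_t)a < (uintptr_t)b) {
            pthread_mutex_lock(&a->lock);
            pthread_mutex_lock(&b->lock);
        } else {
            pthread_mutex_lock(&b->lock);
            pthread_mutex_lock(&a->lock);
        }
    }

    void transfer(struct account *from, struct account *to, long amount)
    {
        lock_pair(from, to);              /* same order no matter how callers pass them */
        from->balance -= amount;
        to->balance   += amount;
        pthread_mutex_unlock(&from->lock);
        pthread_mutex_unlock(&to->lock);
    }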

  25. Try and Backoff Logic • One reason for deadlocks: no thread willing to give up a resource • Solution: thread gives up resource if it cannot acquire another one 3-25
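A pthreads sketch of the try-and-back-off idea; the function name is illustrative. pthread_mutex_trylock() attempts the second lock, and on failure the thread releases the lock it already holds, yields, and retries, so it never holds one resource while waiting indefinitely for another.

    #include <pthread.h>
    #include <sched.h>

    /* Acquire both locks or neither: if the second lock is busy, release the
     * first one, yield, and try again. */
    void lock_both_with_backoff(pthread_mutex_t *first, pthread_mutex_t *second)
    {
        for (;;) {
            pthread_mutex_lock(first);
            if (pthread_mutex_trylock(second) == 0)
                return;                    /* got both locks */
            pthread_mutex_unlock(first);   /* give up what we hold ... */
            sched_yield();                 /* ... back off, then retry  */
        }
    }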

  26. Heavily Contended Locks • Locks ensure correctness • By preventing race conditions • By preventing deadlocks • Performance impact • When locks become heavily contended among threads • Threads try to acquire the lock at a rate faster than the rate at which a thread can execute the corresponding critical section • If the thread holding the lock is suspended (e.g., preempted), all the threads waiting for the lock have to wait for it 3-26

  27. Priority Inversion Scenario 3-27

  28. Solution: Spreading out Contention 3-28

  29. Hash Table with Fine-Grained Locking • Mutexes protecting each bucket 3-29
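A sketch of a hash table with fine-grained locking; the bucket count and node layout are illustrative. With one mutex per bucket, threads serialize only when they hash to the same bucket, which spreads out the contention.

    #include <pthread.h>
    #include <stdlib.h>

    #define NBUCKETS 64                    /* illustrative bucket count */

    struct node { int key; int value; struct node *next; };

    static struct node     *buckets[NBUCKETS];
    static pthread_mutex_t  bucket_lock[NBUCKETS];

    void table_init(void)
    {
        for (int i = 0; i < NBUCKETS; i++)
            pthread_mutex_init(&bucket_lock[i], NULL);
    }

    void table_put(int key, int value)
    {
        unsigned b = (unsigned)key % NBUCKETS;
        struct node *n = malloc(sizeof *n);
        if (n == NULL) return;
        n->key = key;
        n->value = value;

        pthread_mutex_lock(&bucket_lock[b]);   /* threads contend only when they */
        n->next    = buckets[b];               /* hash to the same bucket        */
        buckets[b] = n;
        pthread_mutex_unlock(&bucket_lock[b]);
    }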

  30. Non-Blocking Algorithms • How about not using locks at all! • To resolve the locking problems • Such algorithms are called non-blocking • Stopping one thread does not prevent the rest of the system from making progress • Non-blocking guarantees: • Obstruction freedom: a thread makes progress as long as there is no contention; livelock is possible, so exponential backoff is used to avoid it • Lock freedom: the system as a whole makes progress • Wait freedom: every thread makes progress even when faced with contention; practically difficult to achieve 3-30
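A minimal lock-free sketch using C11 atomics, assuming a toolchain that provides <stdatomic.h>: a compare-and-swap retry loop pushes onto a shared stack with no mutex, so a stalled thread cannot block the others, although an individual push may retry under contention.

    #include <stdatomic.h>
    #include <stdlib.h>

    struct node { int value; struct node *next; };

    static _Atomic(struct node *) top;     /* shared stack head, starts NULL */

    /* Lock-free push. (A matching pop would also have to deal with the ABA
     * problem, which is omitted here.) */
    void push(int value)
    {
        struct node *n = malloc(sizeof *n);
        if (n == NULL) return;
        n->value = value;

        struct node *old = atomic_load(&top);
        do {
            n->next = old;                 /* if another thread changed top, 'old' is */
        } while (!atomic_compare_exchange_weak(&top, &old, n));  /* refreshed; retry  */
    }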

  31. Thread-Safe Functions • A function is thread-safe when it behaves correctly even while being called concurrently by multiple threads (for example, on different objects) • The implementer should ensure thread safety of any hidden shared state 3-31
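A familiar POSIX illustration of the hidden-state point: strtok() stores its position in static data and is therefore not thread-safe, whereas strtok_r() makes the caller supply that state and can be used concurrently. The tokenizing function below is a hypothetical example.

    #include <stdio.h>
    #include <string.h>

    /* Thread-safe tokenizing: the hidden static state of strtok() is replaced
     * by a caller-owned pointer. */
    void print_words(char *line)
    {
        char *saveptr;                                    /* per-call tokenizer state */
        for (char *tok = strtok_r(line, " ", &saveptr);
             tok != NULL;
             tok = strtok_r(NULL, " ", &saveptr))
            printf("%s\n", tok);
    }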

  32. Memory Issues • Speed disparity • Processing is fast • Memory access is slow • Multiple cores can exacerbate the problem • Specific memory issues • Bandwidth • Working in the cache • Memory contention • Memory consistency 3-32

  33. Bandwidth 3-33

  34. Working in the Cache 3-34

  35. Memory Contention • Types of memory accesses • Between a core and main memory • Between two cores • Two types of data dependences: • Read-write dependency: a core writes a cache line and then a different core reads it • Write-write dependency: a core writes a cache line and then a different core writes it • Interactions among cores • Consume bandwidth • Are avoided when multiple cores only read from cache lines • Can be avoided by minimizing shared locations 3-35

  36. False Sharing • Cache blocks may also introduce artifacts • Two distinct variables in the same cache block interfere even though they are logically unrelated • Technique: allocate the data used by each processor contiguously, or at least avoid interleaving it in memory • Example problem: an array of ints, one written frequently by each processor (many ints per cache line) 3-36
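A sketch of padding to avoid false sharing; the 64-byte line size and thread count are assumptions, and the target processor's actual cache line size should be used. Each per-thread counter gets its own cache line, so one core's writes no longer invalidate the line other cores are using.

    #define NUM_THREADS 8     /* illustrative */
    #define CACHE_LINE  64    /* assumed cache line size in bytes */

    /* Prone to false sharing: several counters share one cache line. */
    long hot_counters[NUM_THREADS];

    /* Padded and aligned: each counter owns a full cache line
     * (aligned() is a GCC/Clang attribute). */
    struct padded_counter {
        long value;
        char pad[CACHE_LINE - sizeof(long)];
    } __attribute__((aligned(CACHE_LINE)));

    struct padded_counter counters[NUM_THREADS];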

  37. Performance Impact of False Sharing 3-37

  38. What is Memory Consistency? 3-38

  39. Itanium Architecture 3-39

  40. Shared Memory without a Lock 3-40

  41. Memory Consistency and Cache Coherence David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998 (Advanced Topics—can be skipped)

  42. Memory Consistency for Multi-Core Architectures • Memory consistency issue • Programs are written for a conceptual sequential machine with memory • Programs for parallel architectures: • Written for multiple concurrent instruction streams • Memory accesses may occur in any order • May result in incorrect computation • This is a well-known problem • Traditional parallel architectures deal with it • Multi-core architectures inherit this complexity • Presented in this section for the sake of completeness • More relevant for HPCC applications • Not as complex for multi-threading, where thread-level solutions exist 3-42

  43. Memory Consistency • Consistency requirement: writes to a location become visible to all processes in the same order • But when does a write become visible? • How to establish order between a write and a read by different processes? • Typically use event synchronization • By using more than one location 3-43

  44. Memory Consistency (2) • Example (assume the initial value of A and flag is 0):
      P1:  A = 1;  flag = 1;
      P2:  while (flag == 0);  /* spin idly */  print A;
  • Intuitively, P2 should print 1: we expect memory to respect the order between accesses to different locations issued by a given processor • and to preserve orders among accesses to the same location by different processes • Coherence doesn't help: it pertains only to a single location 3-44
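On hardware that does not preserve this ordering by itself, the same flag pattern can be written with C11 release/acquire atomics; a sketch, assuming <stdatomic.h> is available:

    #include <stdatomic.h>

    static int        A;          /* ordinary data */
    static atomic_int flag;       /* publication flag, starts at 0 */

    void producer(void)           /* P1 */
    {
        A = 1;
        atomic_store_explicit(&flag, 1, memory_order_release);   /* publishes A = 1 */
    }

    int consumer(void)            /* P2 */
    {
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;                     /* spin idly until the flag is set */
        return A;                 /* the release/acquire pair guarantees this reads 1 */
    }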

  45. An Example of Orders • Example (assume the initial values of A and B are 0):
      P1:  (1a) A = 1;  (1b) B = 2;
      P2:  (2a) print B;  (2b) print A;
  • We need an ordering model with clear semantics • across different locations as well • so programmers can reason about which results are possible (here, printing B = 2 and A = 0 should be impossible, since A = 1 precedes B = 2 in P1's program order) • This is the memory consistency model 3-45

  46. Memory Consistency Model • Specifies constraints on the order in which memory operations (from any process) can appear to execute with respect to one another • What orders are preserved? • Given a load, constrains the possible values returned by it • Without it, we can't say much about the execution of a shared-address-space (SAS) program 3-46

  47. Memory Consistency Model (2) • Implications for both programmer and system designer • Programmer uses to reason about correctness and possible results • System designer can use to constrain how much accesses can be reordered by compiler or hardware • Contract between programmer and system 3-47

  48. Sequential Consistency • (as if there were no caches, and a single memory) 3-48

  49. Sequential Consistency (2) • Total order achieved by interleaving accesses from different processes • Maintains program order, and memory operations from all processes appear to [issue, execute, complete] atomically with respect to one another • Programmer's intuition is maintained • “A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.” [Lamport, 1979] 3-49

  50. What Really is Program Order? • Intuitively, order in which operations appear in source code • Straightforward translation of source code to assembly • At most one memory operation per instruction • But not the same as order presented to hardware by compiler • So which is program order? • Depends on which layer, and who’s doing the reasoning • We assume order as seen by programmer 3-50
