Programming Multi-Core Systems Feng Zhou (周枫), NetEase (网易公司), 2007-12-28, Tsinghua University (清华大学) http://zhoufeng.net
Last Week • A trend in software: “The Movement to Safe Languages” • Improving dependability of system software without using safe languages
A Trend in Hardware “The Movement to Multi-core Processors” • Originates from inability to increase processor clock rate • Profound impact on architecture, OS and applications Topic today: How to program multi-core systems effectively?
Why Multi-Core? • Sources of CPU performance improvement over the last 30 years: • 1. Clock speed • 2. Execution optimization (lowering Cycles-Per-Instruction) • 3. Cache • All three are concurrency-agnostic • Now: 1 has disappeared, 2 is slowing down, 3 is still going strong • 1: No 4 GHz Intel CPUs • 2: Still improves with new micro-architectures (e.g. Core vs. NetBurst)
CPU Speed History From “Computer Architecture: A Quantitative Approach”, 4th edition, 2007
Reality Today • Near-term performance drivers • Multicore • Hardware threads (a.k.a. Simultaneous Multi-Threading) • Cache • 1 and 2 useful only when software is concurrent “Concurrency is the next revolution in how we write software” --- “The Free Lunch is Over”
Costs/Problems of Concurrency • Overhead of locks, message passing… • Not all programs are parallelizable • Programming concurrently is HARD • Complex concepts: mutex, read-write lock, queue… • Correct synchronization: races, deadlocks… • Getting actual speed-up ← our focus today
Roadmap • Overview of multi-core computing • Overview of transactional programming • Case: Transactions for safe/managed languages • Case: Transactions for languages like C
Status Quo in Synchronization • Current mechanism: manual locking • Organization: a lock for each shared structure • Usage: (block to) acquire → access → release • Correctness issues • Under-locking → data races • Acquires in different orders → deadlock • Performance issues • Difficult to find the right granularity • Overhead of acquiring vs. allowed concurrency
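To make the deadlock issue concrete, here is a minimal sketch of the standard manual workaround: always acquire locks in a fixed global order. The `account_t`/`transfer` names are illustrative, not from the slides; the same discipline is what Autolocker later automates.

    /* Sketch: avoiding deadlock by acquiring locks in a fixed global order.
     * Two concurrent transfers in opposite directions would deadlock if each
     * locked its "from" account first; ordering by id prevents that. */
    #include <assert.h>
    #include <pthread.h>

    typedef struct { pthread_mutex_t lock; int balance; int id; } account_t;

    static void transfer(account_t *from, account_t *to, int amount) {
        /* Always lock the account with the smaller id first. */
        account_t *first  = from->id < to->id ? from : to;
        account_t *second = from->id < to->id ? to : from;
        pthread_mutex_lock(&first->lock);
        pthread_mutex_lock(&second->lock);
        from->balance -= amount;
        to->balance   += amount;
        pthread_mutex_unlock(&second->lock);
        pthread_mutex_unlock(&first->lock);
    }

    int main(void) {
        account_t a = { PTHREAD_MUTEX_INITIALIZER, 100, 0 };
        account_t b = { PTHREAD_MUTEX_INITIALIZER, 100, 1 };
        transfer(&a, &b, 30);
        transfer(&b, &a, 10);
        assert(a.balance == 80 && b.balance == 120);
        return 0;
    }

Note how the correctness burden (choosing and remembering the order) falls entirely on the programmer, which is exactly the problem the slides go on to address.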
Transactions / Atomic Sections • Databases have provided automatic concurrency control for 30 years: ACID transactions • Vision: • Atomicity • Isolation • Serialization only on conflicts • (optional) Rollback/abort support Question: Is it possible to provide database transaction semantics to general programming?
Transactions vs. Manual Locks • Manual locking issues: • Under-locking • Acquires in different orders • Blocking • Conservative serialization • How transactions help: • No explicit locks • No ordering • Can cancel transactions • Serialization only on conflicts Transactions: Potentially simpler and more efficient
Design Space • Hardware Transactional Memory vs. software TM • Granularity: object, word, block • Update method • Deferred: discard private copy on abort • Direct: control access to data, undo updates on abort • Concurrency control • Pessimistic: prevent conflicts by locking • Optimistic: assume no conflicts and retry if one occurs • …
Hard Issues • Communications, or side-effects • File I/O • Database accesses • Interaction with other abstractions • Garbage collection • Virtual memory • Work with existing languages, synchronization primitives, …
Why software TM? • More flexible • Easier to modify and evolve • Integrate better with existing systems and languages • Not limited to fixed-size hardware structures, e.g. caches
Proposed Language Support

  // Insert into a doubly-linked list atomically
  atomic {
    newNode->prev = node;
    newNode->next = node->next;
    node->next->prev = newNode;
    node->next = newNode;
  }

  Guard condition:
  atomic (queueSize > 0) {
    // remove item from queue and use it
  }
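As a point of comparison, here is a sketch of what the atomic insert above amounts to today with a single global mutex — race-free but maximally coarse-grained, serializing all "atomic" sections. The sentinel-based list and names are illustrative assumptions.

    /* Sketch: emulating the atomic block with one global lock. The lock
     * stands in for "atomic { ... }"; every section contends on it. */
    #include <assert.h>
    #include <pthread.h>

    typedef struct node { struct node *prev, *next; int val; } node_t;
    static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

    static void insert_after(node_t *node, node_t *newNode) {
        pthread_mutex_lock(&list_lock);   /* "atomic {" */
        newNode->prev = node;
        newNode->next = node->next;
        node->next->prev = newNode;
        node->next = newNode;
        pthread_mutex_unlock(&list_lock); /* "}" */
    }

    int main(void) {
        node_t head = { &head, &head, 0 };           /* circular sentinel */
        node_t a = { NULL, NULL, 1 }, b = { NULL, NULL, 2 };
        insert_after(&head, &a);  /* head <-> a */
        insert_after(&a, &b);     /* head <-> a <-> b */
        assert(head.next == &a && a.next == &b && b.next == &head);
        assert(head.prev == &b && b.prev == &a && a.prev == &head);
        return 0;
    }

The TM proposals that follow aim to keep this programming model while serializing only on actual conflicts.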
Intro • Optimizing Memory Transactions, Harris et al., PLDI ’06 • STM system from MSR Cambridge, for MSIL (.NET) • Design features • Object-level granularity • Direct update • Version numbers to track reads • Two-phase locking to track writes • Compiler optimizations
First-gen STM • Very expensive, e.g. Harris+Fraser (OOPSLA ’03): • Every load and store instruction logs information into a thread-local log • A store instruction writes only to the log • A load instruction consults the log first • At the end of the block: validate the log and atomically commit it to shared memory
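The deferred-update data flow described above can be sketched as a tiny write-buffer: stores go to a thread-local log, loads consult the log before shared memory, and commit writes the log back. Single-threaded, with validation omitted; all names (`tx_log_t`, `tx_store`, …) are invented for illustration.

    /* Toy sketch of first-generation (deferred-update) STM logging. */
    #include <assert.h>

    #define LOG_MAX 16
    typedef struct { int *addr; int val; } log_entry_t;
    typedef struct { log_entry_t entries[LOG_MAX]; int n; } tx_log_t;

    static void tx_store(tx_log_t *log, int *addr, int val) {
        for (int i = 0; i < log->n; i++)      /* overwrite a prior entry */
            if (log->entries[i].addr == addr) { log->entries[i].val = val; return; }
        log->entries[log->n++] = (log_entry_t){ addr, val };
    }

    static int tx_load(tx_log_t *log, int *addr) {
        for (int i = 0; i < log->n; i++)      /* see our own earlier writes */
            if (log->entries[i].addr == addr) return log->entries[i].val;
        return *addr;                         /* else read shared memory */
    }

    static void tx_commit(tx_log_t *log) {    /* validation omitted */
        for (int i = 0; i < log->n; i++)
            *log->entries[i].addr = log->entries[i].val;
        log->n = 0;
    }

    int main(void) {
        int x = 1;
        tx_log_t log = { .n = 0 };
        tx_store(&log, &x, tx_load(&log, &x) + 1); /* x++ inside the tx */
        assert(x == 1);                            /* shared memory untouched */
        assert(tx_load(&log, &x) == 2);            /* tx sees its own write */
        tx_commit(&log);
        assert(x == 2);                            /* visible after commit */
        return 0;
    }

The linear log search on every load is exactly the per-access overhead that makes this scheme "very expensive" and motivates direct update.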
Direct Update STM • Augment objects with (i) a lock, (ii) a version number • Transactional write: • Lock objects before they are written to (abort if another thread holds that lock) • Log the overwritten data – we need it to restore the heap in case of retry, transaction abort, or a conflict with a concurrent thread • Make the update in place to the object
Transactional read: • Log the object’s version number • Read from the object itself • Commit: • Check the version numbers of objects we’ve read • Increment the version numbers of objects we’ve written, unlocking them
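The read/write/commit protocol above can be sketched with a toy object carrying a lock flag and a version number. This is a sequential simulation of two interleaved transactions (no real threads), just to show how a stale version is detected at commit; `stm_obj_t` and the field names are invented.

    /* Sketch of the direct-update protocol: a writer locks, logs the old
     * value, updates in place, and bumps the version at commit; a reader
     * logs the version it saw and validates it at commit. */
    #include <assert.h>
    #include <stdbool.h>

    typedef struct { int version; bool locked; int val; } stm_obj_t;

    int main(void) {
        stm_obj_t c1 = { 100, false, 10 };

        int t1_read_version = c1.version;   /* T1: open c1 for read */
        int t1_val = c1.val;

        c1.locked = true;                   /* T2: open c1 for update */
        int t2_undo = c1.val;               /* T2: log overwritten data */
        c1.val = t1_val + 1;                /* T2: update in place */
        c1.version++;                       /* T2: commit – install new version */
        c1.locked = false;                  /*     and unlock */
        (void)t2_undo;                      /* undo log unused on success */

        bool t1_valid = (c1.version == t1_read_version);
        assert(!t1_valid);                  /* T1 must abort and re-run */
        assert(c1.val == 11 && c1.version == 101);
        return 0;
    }

This is the same interleaving the worked example on the next slides walks through with c1 at version 100.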
Example: contention between transactions

  Thread T1:                     Thread T2:
    int t = 0;                     atomic {
    atomic {                         t = c1.val;
      t += c1.val;                   t++;
      t += c2.val;                   c1.val = t;
    }                              }

  Initially: c1 has ver = 100, val = 10; c2 has ver = 200, val = 40

  1. T1 reads from c1: logs that it saw version 100
  2. T2 also reads from c1: logs that it saw version 100
  3. T1 now reads from c2, sees it at version 200
  4. Before updating c1, thread T2 must lock it, recording the old version number (100) in its log
  5. Before updating c1.val, T2 logs the data it is about to overwrite (c1.val = 10), then makes its update in place: c1.val = 11
  6. T2 commits: it checks that the version it locked matches the version it previously read, then unlocks c1, installing the new version number (101)
  7. T1 attempts to commit and checks that the versions it read are still up-to-date. Object c1 was updated from version 100 to 101, so T1’s transaction is aborted and re-run.
Compiler integration • We expose decomposed log-writing operations in the compiler’s internal intermediate code (no change to MSIL) • OpenForRead – before the first time we read from an object (e.g. c1 or c2 in the examples) • OpenForUpdate – before the first time we update an object • LogOldValue – before the first time we write to a given field

  Source code:
    atomic {
      …
      t += n.value;
      n = n.next;
      …
    }

  Basic intermediate code:
    OpenForRead(n);
    t = n.value;
    OpenForRead(n);
    n = n.next;

  Optimized intermediate code (redundant open removed):
    OpenForRead(n);
    t = n.value;
    n = n.next;
Runtime integration – garbage collection

  Example heap state: obj1.field = old, obj2.ver = 100, obj3 locked @ 100

  1. GC runs while some threads are inside atomic blocks
  2. GC visits the heap as normal – retaining objects that are needed if the blocks succeed
  3. GC also visits objects reachable from references overwritten in LogForUndo entries – retaining objects needed if any block rolls back
  4. Discard log entries for unreachable objects: they’re dead whether or not the block succeeds
Results: Against Previous Work • Workload: operations on a red-black tree, 1 thread, 6:1:1 lookup:insert:delete mix with keys 0..65535 • Normalised execution time:

  • Sequential baseline: 1.00x
  • Coarse-grained locking: 1.13x
  • Fine-grained locking: 2.57x
  • Harris+Fraser WSTM: 5.69x
  • Direct-update STM: 2.04x
  • Direct-update STM + compiler integration: 1.46x (scalable to multicore)
Scalability (µ-benchmark) • Chart: microseconds per operation vs. #threads, comparing coarse-grained locking, fine-grained locking, WSTM (atomic blocks), DSTM (API), OSTM (API), and direct-update STM + compiler integration
Results: long running tests • Chart: normalised execution time on tree, skip, go, merge-sort and xlisp for the original application (no tx), direct-update STM, with run-time filtering, and with compile-time optimizations (slowdown factors of 162x, 73x and 10.8x appear in the chart)
Summary • A pure software implementation can perform well and scale to large transactions • Direct update • Pessimistic (writes) & optimistic (reads) concurrency control • Compiler support & optimizations • Still need a better understanding of realistic workload distributions
Intro • Autolocker: Synchronization Inference for Atomic Sections, Bill McCloskey, Feng Zhou, David Gay, Eric Brewer, POPL 2006 • Question: can we have language support for atomic sections in languages like C? • No object meta-data • No type safety • No garbage collection • Answer: let the programmer help a bit (with annotations) • And provide simpler semantics: NO aborts!
Autolocker: C + Atomic Sections • Shared data is protected by annotated locks:

    mutex m;
    int shared_var protected_by(m);

  • Threads access shared data in atomic sections:

    atomic {
      ...
      x = shared_var;
      ...
    }

  • Threads never deadlock (due to Autolocker) • Threads never race for protected data • How can we implement these semantics? Code runs as if a single lock protects all atomic sections
Autolocker Transformation • Autolocker is a source-to-source transformation:

  C code:
    mutex m1, m2;
    int x protected_by(m1);
    int y protected_by(m2);

    atomic {
      x = 3;
      y = 2;
    }

  Autolocker code:
    int m1, m2;
    int x;
    int y;

    begin_atomic();
    acquire(m1);
    x = 3;
    acquire(m2);
    y = 2;
    end_atomic();

  • Atomic sections can be nested arbitrarily; the nesting level is tracked at runtime
  • Locks are acquired as needed; lock acquisitions are reentrant
  • Locks are released when the outermost section ends; strict two-phase locking guarantees atomicity
  • Locks are acquired in a global order; acquiring locks in order will never lead to deadlock
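A minimal sketch of the runtime these emitted calls could target: `acquire()` is reentrant, and locks are held until the outermost atomic section ends (strict two-phase locking). This is a single-threaded illustration under invented names; the real Autolocker runtime and its ordering machinery are more involved, and the global acquisition order is guaranteed by the compiler, not checked here.

    /* Sketch of begin_atomic / acquire / end_atomic semantics. */
    #include <assert.h>
    #include <pthread.h>

    #define MAX_HELD 8
    static pthread_mutex_t *held[MAX_HELD];
    static int n_held = 0;
    static int nesting = 0;              /* atomic-section nesting depth */

    static void begin_atomic(void) { nesting++; }

    static void acquire(pthread_mutex_t *m) {
        for (int i = 0; i < n_held; i++)
            if (held[i] == m) return;    /* reentrant: already held */
        pthread_mutex_lock(m);           /* compiler emits acquires in a
                                            global order, so no deadlock */
        held[n_held++] = m;
    }

    static void end_atomic(void) {
        if (--nesting == 0)              /* release only at outermost end */
            while (n_held > 0)
                pthread_mutex_unlock(held[--n_held]);
    }

    static pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;
    static int x, y;

    int main(void) {
        begin_atomic();
        acquire(&m1); x = 3;
        begin_atomic();                  /* nested section */
        acquire(&m1);                    /* reentrant acquire: no-op */
        acquire(&m2); y = 2;
        end_atomic();                    /* inner end: nothing released */
        assert(n_held == 2);             /* still holding m1 and m2 */
        end_atomic();                    /* outermost end: release all */
        assert(n_held == 0 && x == 3 && y == 2);
        return 0;
    }

Holding every lock until the outermost `end_atomic()` is what makes the sections atomic without needing aborts.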
Outline • Introduction • semantics • usage model and benefits • Autolocker algorithm • match locks to data • order lock acquisitions • insert lock acquisitions • Related work • Experimental evaluation • Conclusion
Autolocker Usage Model • Typical process for writing threaded software: • Start here: one coarse-grained lock → lots of contention, little parallelism • Finish here: many fine-grained locks → low contention, high parallelism • The Linux kernel evolved to SMP this way • Autolocker helps you program in this style
Granularity and Autolocker • In Autolocker, annotations control performance:

    int shared_var protected_by(kernel_lock);
    int shared_var protected_by(fs->lock);

  • Simpler than fixing all uses of shared_var • Changing annotations won’t introduce bugs • no deadlocks • no new race conditions