Optimizing memory transactions

Tim Harris, Mark Plesko, Avi Shinnar, David Tarditi

The Big Question: are atomic blocks feasible?
• Atomic blocks may be great for the programmer, but can they be implemented with acceptable performance?
• At first, atomic blocks look insanely expensive. A recent implementation (Harris+Fraser, OOPSLA ’03):
  • Every load and store instruction logs information into a thread-local log
  • A store instruction writes the log only
  • A load instruction consults the log first
  • At the end of the block: validate the log, then atomically commit it to shared memory
• Assumptions throughout this talk:
  • Reads outnumber writes (3:1 or more)
  • Conflicts are rare
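The deferred-update scheme described above can be sketched as follows. This is a minimal single-threaded illustration, not the Harris+Fraser implementation: the class name `DeferredTx`, the dictionary standing in for shared memory, and the per-location version counters used for validation are all assumptions made for the example.

```python
class DeferredTx:
    """Buffered-write transaction over a dict standing in for shared memory."""

    def __init__(self, heap, versions):
        self.heap = heap          # address -> value (hypothetical shared memory)
        self.versions = versions  # address -> version counter (assumed scheme)
        self.read_log = {}        # address -> version seen at first read
        self.write_log = {}       # address -> pending value

    def load(self, addr):
        # A load consults the log first so it sees this transaction's own stores.
        if addr in self.write_log:
            return self.write_log[addr]
        self.read_log.setdefault(addr, self.versions[addr])
        return self.heap[addr]

    def store(self, addr, value):
        # A store writes the log only; shared memory is untouched until commit.
        self.write_log[addr] = value

    def commit(self):
        # Validate: every location read must still be at the version we saw.
        for addr, seen in self.read_log.items():
            if self.versions[addr] != seen:
                return False
        # Publish the buffered stores. A real STM must make validation and
        # publication atomic; this sketch runs on one thread only.
        for addr, value in self.write_log.items():
            self.heap[addr] = value
            self.versions[addr] += 1
        return True
```

The cost the slide points at is visible here: every `load` pays for a lookup in `write_log` before it can touch memory, even when the transaction has written nothing.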

State of the art ~ 2003
Normalised execution time:
• Sequential baseline (1.00x)
• Coarse-grained locking (1.13x)
• Fine-grained locking (2.57x)
• Harris+Fraser WSTM (5.69x)
Workload: operations on a red-black tree, 1 thread, 6:1:1 lookup:insert:delete mix with keys 0..65535

Our new techniques prototyped in Bartok
• Direct-update STM
  • Allow transactions to make updates in place in the heap
  • Avoids reads needing to search the log to see earlier writes that the transaction has made
  • Makes successful commit operations faster at the cost of extra work on contention or when a transaction aborts
• Compiler integration
  • Decompose the transactional memory operations into primitives
  • Expose the primitives to compiler optimization (e.g. to hoist concurrency control operations out of a loop)
• Runtime system integration
  • Integration with the garbage collector or runtime system components to scale to atomic blocks containing 100M memory accesses

Results: concurrency control overhead
Normalised execution time:
• Sequential baseline (1.00x)
• Coarse-grained locking (1.13x)
• Direct-update STM + compiler integration (1.46x)
• Direct-update STM (2.04x)
• Fine-grained locking (2.57x)
• Harris+Fraser WSTM (5.69x)
Workload: operations on a red-black tree, 1 thread, 6:1:1 lookup:insert:delete mix with keys 0..65535
Scalable to multicore

Results: scalability
[Figure: microseconds per operation against #threads for coarse-grained locking, fine-grained locking, WSTM (atomic blocks), DSTM (API), OSTM (API), and direct-update STM + compiler integration]

Direct update STM
• Augment objects with (i) a lock, (ii) a version number
• Transactional write:
  • Lock objects before they are written to (abort if another thread has that lock)
  • Log the overwritten data: we need it to restore the heap in case of retry, transaction abort, or a conflict with a concurrent thread
  • Make the update in place to the object
• Transactional read:
  • Log the object’s version number
  • Read from the object itself
• Commit:
  • Check the version numbers of objects we’ve read
  • Increment the version numbers of objects we’ve written, unlocking them
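The protocol above can be sketched in a few dozen lines. This is a single-threaded model under stated assumptions, not the Bartok implementation: the names `Obj`, `DirectTx` and `Abort` are hypothetical, the lock and version are kept in separate fields rather than one header word, and a read of an object locked by another transaction is simply treated as a conflict.

```python
class Abort(Exception):
    """Raised when a transaction meets an object locked by another one."""


class Obj:
    def __init__(self, val, ver=0):
        self.val = val
        self.ver = ver
        self.owner = None  # transaction currently holding the lock, if any


class DirectTx:
    def __init__(self):
        self.read_log = []  # (object, version seen)
        self.locked = []    # objects this transaction has locked
        self.undo_log = []  # (object, overwritten value)

    def read(self, obj):
        if obj.owner is not None and obj.owner is not self:
            raise Abort()  # simplification: treat as an immediate conflict
        if obj.owner is None:
            self.read_log.append((obj, obj.ver))  # log the version number
        return obj.val  # read from the object itself

    def write(self, obj, value):
        if obj.owner is not self:
            if obj.owner is not None:
                raise Abort()  # another transaction has that lock
            obj.owner = self
            self.locked.append(obj)
        self.undo_log.append((obj, obj.val))  # log the overwritten data
        obj.val = value  # make the update in place

    def commit(self):
        # Check the version numbers of the objects we have read.
        for obj, seen in self.read_log:
            if obj.ver != seen or obj.owner not in (None, self):
                self.abort()
                return False
        # Increment the versions of the objects we wrote, unlocking them.
        for obj in self.locked:
            obj.ver += 1
            obj.owner = None
        self.read_log, self.locked, self.undo_log = [], [], []
        return True

    def abort(self):
        # Restore overwritten values in reverse order, then release locks.
        for obj, old in reversed(self.undo_log):
            obj.val = old
        for obj in self.locked:
            obj.owner = None
        self.read_log, self.locked, self.undo_log = [], [], []
```

A caller that catches `Abort` is expected to call `abort()` before re-running the block, so that any in-place updates are rolled back and locks are released.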

Example: contention between transactions
Thread T1: int t = 0; atomic { t += c1.val; t += c2.val; }
Thread T2: atomic { t = c1.val; t ++; c1.val = t; }
c1: ver = 100, val = 10
c2: ver = 200, val = 40
T1’s log: (empty)
T2’s log: (empty)

Example: contention between transactions
c1: ver = 100, val = 10
c2: ver = 200, val = 40
T1’s log: c1.ver=100
T2’s log: (empty)
T1 reads from c1: logs that it saw version 100

Example: contention between transactions
c1: ver = 100, val = 10
c2: ver = 200, val = 40
T1’s log: c1.ver=100
T2’s log: c1.ver=100
T2 also reads from c1: logs that it saw version 100

Example: contention between transactions
c1: ver = 100, val = 10
c2: ver = 200, val = 40
T1’s log: c1.ver=100, c2.ver=200
T2’s log: c1.ver=100
Suppose T1 now reads from c2, sees it at version 200

Example: contention between transactions
c1: locked:T2, val = 10
c2: ver = 200, val = 40
T1’s log: c1.ver=100, c2.ver=200
T2’s log: c1.ver=100, lock: c1, 100
Before updating c1, thread T2 must lock it: record old version number

Example: contention between transactions
c1: locked:T2, val = 11
c2: ver = 200, val = 40
T1’s log: c1.ver=100, c2.ver=200
T2’s log: c1.ver=100, lock: c1, 100, c1.val=10
(1) Before updating c1.val, thread T2 must log the data it’s going to overwrite
(2) After logging the old value, T2 makes its update in place to c1

Example: contention between transactions
c1: ver = 101, val = 11
c2: ver = 200, val = 40
T1’s log: c1.ver=100, c2.ver=200
T2’s log: c1.ver=100, lock: c1, 100, c1.val=10
(1) Check the version we locked matches the version we previously read
(2) T2’s transaction commits successfully. Unlock the object, installing the new version number

Example: contention between transactions
c1: ver = 101, val = 11
c2: ver = 200, val = 40
T1’s log: c1.ver=100, c2.ver=200
T2’s log: (empty)
(1) T1 attempts to commit. Check the versions it read are still up-to-date.
(2) Object c1 was updated from version 100 to 101, so T1’s transaction is aborted and re-run.
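The final step of the walkthrough, T1's commit-time check, comes down to comparing each logged version with the version now in the heap. A small sketch, in which the function name `stale_reads` and the use of plain dictionaries are illustrative assumptions:

```python
def stale_reads(read_log, current_versions):
    """Return the objects whose version changed since the transaction read them."""
    return [name for name, seen in read_log.items()
            if current_versions[name] != seen]


# State at the end of the example: T2 has committed its update to c1.
t1_read_log = {"c1": 100, "c2": 200}
heap_versions = {"c1": 101, "c2": 200}

stale = stale_reads(t1_read_log, heap_versions)
must_abort = bool(stale)  # any stale read forces T1 to abort and re-run
```

Here `stale` contains only `c1`, which is why T1 is aborted even though its read of c2 is still valid.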

Compiler integration
• We expose decomposed log-writing operations in the compiler’s internal intermediate code (no change to MSIL)
  • OpenForRead: before the first time we read from an object (e.g. c1 or c2 in the examples)
  • OpenForUpdate: before the first time we update an object
  • LogOldValue: before the first time we write to a given field
Source code:
  atomic { … t += n.value; n = n.next; … }
Basic intermediate code:
  OpenForRead(n);
  t = n.value;
  OpenForRead(n);
  n = n.next;
Optimized intermediate code:
  OpenForRead(n);
  t = n.value;
  n = n.next;
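Once the open operations are explicit in the intermediate code, removing the second OpenForRead(n) is an ordinary redundancy elimination. The sketch below shows one way such a pass could look for straight-line code. The tuple encoding of the intermediate code and the function name are assumptions made for illustration, not Bartok's representation, and the pass tracks variable names only, with no alias analysis or control flow.

```python
def remove_redundant_opens(ir):
    """Drop an OpenForRead on a variable that is already open.

    `ir` is a list of tuples: ("OpenForRead", var) or ("assign", dest, expr).
    """
    opened = set()
    out = []
    for op in ir:
        if op[0] == "OpenForRead":
            if op[1] in opened:
                continue  # already opened and not reassigned since
            opened.add(op[1])
        elif op[0] == "assign":
            # The destination may now refer to a different object,
            # so a later read through it needs a fresh open.
            opened.discard(op[1])
        out.append(op)
    return out


basic = [
    ("OpenForRead", "n"),
    ("assign", "t", "n.value"),
    ("OpenForRead", "n"),
    ("assign", "n", "n.next"),
]
optimized = remove_redundant_opens(basic)
```

Note that the assignment to `n` invalidates the earlier open: on the next loop iteration `n` names a different node, so its OpenForRead must be kept.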

Compiler integration: avoiding upgrades
Source code:
  atomic { … c1.val ++; … }
Compiler’s intermediate code:
  OpenForRead(c1);
  temp1 = c1.val;
  temp1 ++;
  OpenForUpdate(c1);
  LogOldValue(&c1.val);
  c1.val = temp1;
Optimized intermediate code:
  OpenForUpdate(c1);
  temp1 = c1.val;
  temp1 ++;
  LogOldValue(&c1.val);
  c1.val = temp1;
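The transformation shown above replaces a read-open that is inevitably followed by an update-open with a single OpenForUpdate. A minimal sketch for straight-line code follows. The tuple encoding and the function name `avoid_upgrades` are hypothetical, and the sketch assumes the update is unconditional; with branches, a real compiler would need to establish that the update happens on every path.

```python
def avoid_upgrades(ir):
    """Turn OpenForRead(x) ... OpenForUpdate(x) into one early OpenForUpdate(x).

    `ir` is a list of tuples whose first element is the operation name and
    whose second element is the variable or location it applies to.
    """
    out = []
    pending_read = {}  # variable -> index in `out` of its OpenForRead
    for op in ir:
        kind = op[0]
        if kind == "OpenForRead":
            pending_read[op[1]] = len(out)
        elif kind == "OpenForUpdate" and op[1] in pending_read:
            # Upgrade the earlier open in place and drop this one.
            out[pending_read.pop(op[1])] = ("OpenForUpdate", op[1])
            continue
        elif kind == "assign":
            # A reassigned variable no longer names the opened object.
            pending_read.pop(op[1], None)
        out.append(op)
    return out


before = [
    ("OpenForRead", "c1"),
    ("assign", "temp1", "c1.val"),
    ("assign", "temp1", "temp1 + 1"),
    ("OpenForUpdate", "c1"),
    ("LogOldValue", "c1.val"),
    ("store", "c1.val", "temp1"),
]
after = avoid_upgrades(before)
```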

Compiler integration: other optimizations
• Hoist OpenFor* and Log* operations out from loops
• Avoid OpenFor* and Log* operations on objects allocated inside atomic blocks (these objects must be thread local)
• Move OpenFor* operations from methods to their callers
• Further decompose operations to allow logging-space checks to be combined
• Expose OpenFor* and Log*’s implementation to inlining and further optimization

What about… version wrap-around
Timeline:
• Open for update, see version 16
• Commit, set version 17
• Open obj1 for read, see version 17
• Open for update, see version 17
• Commit, set version 18
• …
• Commit: obj1 back to version 17 (oops)
Solution: validate read log at each GC, force GC at least once every #versions transactions
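The hazard is plain modular arithmetic: with a version field of b bits, a reader that saw version v will see v again after exactly 2^b commits to the same object. The sketch below counts those commits. The field widths used are illustrative assumptions and not the width of the actual header word.

```python
def commits_until_false_match(seen_version, version_bits):
    """Count commits to one object until its version equals `seen_version` again."""
    modulus = 1 << version_bits
    start = seen_version % modulus
    version = start
    commits = 0
    while True:
        version = (version + 1) % modulus  # each commit increments the version
        commits += 1
        if version == start:
            return commits


# With a hypothetical 5-bit version field, a reader that logged version 17
# would validate incorrectly after 32 intervening commits.
window = commits_until_false_match(17, 5)
```

This is why validating the read log at least once per window of that many transactions, as the slide proposes, is enough to catch the stale read before the counter can come back around.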

Runtime integration: garbage collection
Example log contents for a thread inside atomic { … }: obj1.field = old, obj2.ver = 100, obj3.locked @ 100
1. GC runs while some threads are in atomic blocks
2. GC visits the heap as normal, retaining objects that are needed if the blocks succeed
3. GC visits objects reachable from refs overwritten in LogForUndo entries, retaining objects needed if any block rolls back
4. Discard log entries for unreachable objects: they’re dead whether or not the block succeeds
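The four steps above can be modelled on a toy heap. In this sketch the heap is a dictionary from object name to the names it references, and the function name, the log layouts, and the fixed-point loop over undo entries are all assumptions made for illustration. The loop is there because an object kept alive only for rollback may itself have undo entries.

```python
def collect(heap, roots, undo_log, read_log):
    """Return (live objects, trimmed undo log, trimmed read log).

    heap:     name -> list of referenced names
    undo_log: list of (object, overwritten reference or None)
    read_log: list of (object, version seen)
    """
    live = set()

    def mark(name):
        stack = [name]
        while stack:
            cur = stack.pop()
            if cur is None or cur in live:
                continue
            live.add(cur)
            stack.extend(heap[cur])

    # Step 2: visit the heap as normal from the roots.
    for root in roots:
        mark(root)

    # Step 3: also retain whatever the overwritten references lead to,
    # but only for entries whose object is itself live.
    changed = True
    while changed:
        changed = False
        for obj, old_ref in undo_log:
            if obj in live and old_ref is not None and old_ref not in live:
                mark(old_ref)
                changed = True

    # Step 4: entries for unreachable objects are dead either way.
    kept_undo = [entry for entry in undo_log if entry[0] in live]
    kept_read = [entry for entry in read_log if entry[0] in live]
    return live, kept_undo, kept_read


heap = {"a": ["b"], "b": [], "old": [], "junk": []}
live, kept_undo, kept_read = collect(
    heap,
    roots=["a"],
    undo_log=[("a", "old"), ("junk", "b")],
    read_log=[("b", 100), ("junk", 7)],
)
```

Trimming the logs this way is what keeps log size bounded by live data rather than by the number of accesses a long-running atomic block has made.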

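Results: long running tests
[Figure: normalised execution time for the benchmarks tree, skip, go, merge-sort and xlisp, comparing the original application (no tx), direct-update STM, run-time filtering, and compile-time optimizations. Bar labels recoverable from the chart: 10.8, 73, 162]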

Conclusions
• Atomic blocks and transactional memory provide a fundamental improvement to the abstractions for building shared-memory concurrent data structures
  • Only one part of the concurrency landscape
  • But we believe no one solution will solve all the challenges there
• Our experience with compiler integration shows that a pure software implementation can perform well for short transactions and scale to vast transactions (many accesses & many locations)
• We still need a better understanding of realistic workload distributions

Backup slides • Backup slides beyond this point

‘Parallelism preserving’ design
• Any updates must pull data into the local cache in exclusive mode
• Even if an operation is ‘read only’, acquiring a multi-reader lock will involve fetching a line in exclusive mode
• Our optimistic design lets data shared by multiple cores remain cached by all of them
• Scalability at the cost of wasted work when optimism does not pay off
[Diagram: four cores (Core 1 to Core 4), each with its own L1 cache, above shared L2 caches and an L3. Legend: S = data held in shared mode in multiple caches; E = data held in exclusive mode in a single cache]

Compilation
MSIL + atomic blocks boundaries:
  try { … if (node.Right != this.sentinelNode) … } catch (AtomicFakeException) { }
IR + cloned atomic code:
  l_1 = ObjectField_open_read<Right>(t_0)
  t_1 = ObjectField_open_read<sentinelNode>(l_0)
  t_2 = Neq<bool>(l_1, t_1)
IR + explicit STM operations:
  OpenObjForRead(t_0)
  l_1 = ObjectField<Right>(t_0)
  OpenObjForRead(l_0)
  t_1 = ObjectField<sentinelNode>(l_0)
  t_2 = Neq(l_1, t_1)
IR + low-level STM operations:
  t = GetCurrentThread
  m = ObjectField<TryAllManager>(t)
  ReadLogReserve(m, 2)
  OpenObjForReadFast(m, t_0)
  l_1 = ObjectField<Right>(t_0)
  OpenObjForReadFast(m, l_0)
  t_1 = ObjectField<sentinelNode>(l_0)
  t_2 = Neq(l_1, t_1)

Some examples (xlisp garbage collector)
  /* follow the left sublist if there is one */
  if (livecar(xl_this)) {
      xl_this.n_flags |= (byte)LEFT;
      tmp = prev;
      prev = xl_this;
      xl_this = prev.p;
      prev.p = tmp;
  }
Annotation (at the read of prev.p): Open ‘prev’ for update here to avoid an inevitable upgrade

Some examples (othello)
  public int PlayerPos (int xm, int ym, int opponent, bool upDate) {
      int rotate; // 8 degrees of rotation
      int x, y;
      bool endTurn;
      bool plotPos;
      int turnOver = 0;
      // initial checking !
      if (this.Board[ym,xm] != BInfo.Blank)
          return turnOver; // can't overwrite player
Annotation: Calls to PlayerPos must open ‘this’ for read: do this in the caller

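Basic design: open for read
[Diagram: an object laid out as a header word holding the transactional version number (shown as 0, tag bits 00), then the vtable, then fields…, next to the read objects log]
1. Store obj ref in the read objects log
2. Copy meta data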

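Basic design: open for update
[Diagram: an object laid out as a header word holding the transactional version number (values shown: 0 with tag bits 00, and 1 with tag bits 00), then the vtable, then fields…, next to the updated objects log]
1. Store obj ref in the updated objects log
2. Copy meta data
3. CAS to acquire object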

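Multi-use header word
[Diagram: the object header word, placed before the vtable and fields…, carries a two-bit tag (00, 01, 10, 11) and is shared between the transactional version number, the hash code, and the lock word. Labels in the diagram: Version, Version+1, Hashcode, Lock word]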

Filtering duplicate log entries
• Per-thread table of recently logged objects / addresses
• Fast O(1) logical clearing by embedding transaction sequence numbers in entries
[Diagram: a table entry stores Tag and Hash value combined with the sequence number (Hash value ^ seq); a lookup compares the entry with the new Tag’ and Hash value ^ seq]
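One way to realise the "logical clearing" idea is to XOR each stored address with the current transaction sequence number, so that bumping the number invalidates every entry at once. The sketch below is an illustration under assumptions, not the Bartok layout: the class name, the table size, and the rule that the sequence number stays below the table size (with a physical clear when it wraps) are choices made here so that a stale entry can never be mistaken for a current one.

```python
class LogFilter:
    """Per-thread table remembering which addresses were logged recently."""

    def __init__(self, slots=256):
        assert slots >= 2 and slots & (slots - 1) == 0, "slots must be a power of two"
        self.slots = slots
        self.table = [None] * slots
        self.seq = 1  # current transaction sequence number

    def first_time(self, addr):
        """Return True if `addr` should be logged, False if it is a known duplicate."""
        slot = addr & (self.slots - 1)
        tagged = addr ^ self.seq  # embed the sequence number in the entry
        if self.table[slot] == tagged:
            return False
        self.table[slot] = tagged  # may evict another address: that is safe,
        return True                # it only causes a redundant log entry later

    def new_transaction(self):
        # O(1) logical clear: old entries no longer match `addr ^ seq`.
        self.seq += 1
        if self.seq == self.slots:
            # Sequence numbers are about to reach the slot-index bits,
            # so clear physically and start again.
            self.table = [None] * self.slots
            self.seq = 1
```

The filter is deliberately conservative: it may report "not seen" for an address that was logged and then evicted, which costs a duplicate log entry, but it does not report "seen" for an address that was not logged in the current transaction.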

Semantics of atomic blocks • I/O • Details • Workloads

Challenges & opportunities
• Moving away from locks as a programming abstraction lets us re-think much of the concurrent programming landscape
[Diagram: software stack, top to bottom]
• Application software
  • Atomic blocks for synchronization and shared state access
  • Explicit threads only for explicit concurrency
  • CILK-style parallel loops and calls
• Data-parallel libraries
• Managed runtime & STM implementation
  • Re-use STM mechanisms for TLS and optimistic (in the technical sense) parallel loops etc
• Multi-core / many-core hardware
  • H/W TM or TM-acceleration
