Compiler and Runtime Support for Efficient Software Transactional Memory

Vijay Menon, Programming Systems Lab
Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian R. Murphy, Bratin Saha, Tatiana Shpeisman

Motivation
• Locks are hard to get right
  - Programmability vs. scalability
• Transactional memory is an appealing alternative
  - Simpler programming model
  - Stronger guarantees
    - Atomicity, Consistency, Isolation
    - Deadlock avoidance
  - Closer to programmer intent
  - Scalable implementations
• Questions
  - How to lower TM overheads, particularly in software?
  - How to balance granularity / scalability?

Our System
• Java Software Transactional Memory (STM) system
  - Pure software implementation (McRT-STM, PPoPP ’06)
  - Language extensions in Java (Polyglot)
  - Integrated with JVM & JIT (ORP & StarJIT)
• Novel features
  - Rich transactional language constructs in Java
  - Efficient, first-class nested transactions
  - Complete GC support
  - RISC-like STM API / IR
  - Compiler optimizations
  - Per-type word- and object-level conflict detection

Transactional Java → Java
Transactional Java
    atomic {
        S;
    }
Other language constructs (built on prior research)
• retry (STM Haskell, …)
• orelse (STM Haskell)
• tryatomic (Fortress)
• when (X10, …)
Standard Java + STM API
    while(true) {
        TxnHandle th = txnStart();
        try {
            S’;
            break;
        } finally {
            if(!txnCommit(th))
                continue;
        }
    }
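The retry loop above can be exercised with a minimal sketch. Everything below other than the loop shape is an assumption made for illustration: TxnHandle, txnStart, and txnCommit are toy stand-ins (the real runtime entry points are not shown on the slide), and the first commit is forced to fail so that the re-execution path is visible.

```java
// Hypothetical stand-ins for the STM API named on the slide (txnStart,
// txnCommit, TxnHandle); the real runtime entry points are not shown there.
final class AtomicLoopSketch {
    static int counter = 0;   // shared data touched by the transaction
    static int attempts = 0;  // how many times the body has been started

    static final class TxnHandle {
        final int attempt;
        final int savedCounter;  // toy undo information
        TxnHandle(int attempt, int savedCounter) {
            this.attempt = attempt;
            this.savedCounter = savedCounter;
        }
    }

    static TxnHandle txnStart() {
        attempts++;
        return new TxnHandle(attempts, counter);
    }

    // Assumption for the demo: the first attempt always "conflicts".
    static boolean txnCommit(TxnHandle th) {
        if (th.attempt > 1) {
            return true;
        }
        counter = th.savedCounter;  // undo the in-place update
        return false;
    }

    // Expansion of: atomic { counter = counter + 1; }
    static void atomicIncrement() {
        while (true) {
            TxnHandle th = txnStart();
            try {
                counter = counter + 1;  // S'
                break;
            } finally {
                // A failed commit discards the pending break and retries.
                if (!txnCommit(th)) continue;
            }
        }
    }

    public static void main(String[] args) {
        atomicIncrement();
        System.out.println("counter=" + counter + " attempts=" + attempts);
    }
}
```

In this toy run the body executes twice and the increment takes effect once, which is the behavior the while/try/finally shape is meant to give: a `continue` inside `finally` replaces the pending `break` whenever the commit fails.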

Tight integration with JVM & JIT
• StarJIT & ORP
  - On-demand cloning of methods (Harris ’03)
  - Identifies transactional regions in Java+STM code
  - Inserts read/write barriers in transactional code
  - Maps STM API to first-class opcodes in StarJIT IR (STIR)
Good compiler representation → greater optimization opportunities

Representing Read/Write Barriers
Source:
    atomic {
        a.x = t1
        a.y = t2
        if(a.z == 0) {
            a.x = 0
            a.z = t3
        }
    }
With traditional barriers:
    …
    stmWr(&a.x, t1)
    stmWr(&a.y, t2)
    if(stmRd(&a.z) == 0) {
        stmWr(&a.x, 0);
        stmWr(&a.z, t3)
    }
Traditional barriers hide redundant locking/logging

An STM IR for Optimization
Source:
    atomic {
        a.x = t1
        a.y = t2
        if(a.z == 0) {
            a.x = 0
            a.z = t3
        }
    }
Redundancies exposed:
    txnOpenForWrite(a)
    txnLogObjectInt(&a.x, a)
    a.x = t1
    txnOpenForWrite(a)
    txnLogObjectInt(&a.y, a)
    a.y = t2
    txnOpenForRead(a)
    if(a.z == 0) {
        txnOpenForWrite(a)
        txnLogObjectInt(&a.x, a)
        a.x = 0
        txnOpenForWrite(a)
        txnLogObjectInt(&a.z, a)
        a.z = t3
    }

Optimized Code
Source:
    atomic {
        a.x = t1
        a.y = t2
        if(a.z == 0) {
            a.x = 0
            a.z = t3
        }
    }
Optimized:
    txnOpenForWrite(a)
    txnLogObjectInt(&a.x, a)
    a.x = t1
    txnLogObjectInt(&a.y, a)
    a.y = t2
    if(a.z == 0) {
        a.x = 0
        txnLogObjectInt(&a.z, a)
        a.z = t3
    }
Fewer & cheaper STM operations
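To make "fewer STM operations" concrete, here is a small sketch that counts runtime calls for the two barrier sequences shown on these slides. The three txn* methods are hypothetical stubs that only bump a counter (Java has no field addresses, so the logged field is passed as a name); they do no locking or logging.

```java
final class BarrierCountSketch {
    static final class A { int x, y, z; }

    static int stmOps = 0;  // number of STM runtime calls executed

    // Hypothetical stubs: they only count calls.
    static void txnOpenForWrite(A a) { stmOps++; }
    static void txnOpenForRead(A a) { stmOps++; }
    static void txnLogObjectInt(A a, String field) { stmOps++; }

    // One open + one log per write, one open per read.
    static void unoptimized(A a, int t1, int t2, int t3) {
        txnOpenForWrite(a); txnLogObjectInt(a, "x"); a.x = t1;
        txnOpenForWrite(a); txnLogObjectInt(a, "y"); a.y = t2;
        txnOpenForRead(a);
        if (a.z == 0) {
            txnOpenForWrite(a); txnLogObjectInt(a, "x"); a.x = 0;
            txnOpenForWrite(a); txnLogObjectInt(a, "z"); a.z = t3;
        }
    }

    // Repeated opens of `a` and the second log of a.x are removed.
    static void optimized(A a, int t1, int t2, int t3) {
        txnOpenForWrite(a);
        txnLogObjectInt(a, "x"); a.x = t1;
        txnLogObjectInt(a, "y"); a.y = t2;
        if (a.z == 0) {
            a.x = 0;
            txnLogObjectInt(a, "z"); a.z = t3;
        }
    }

    // Runs one variant on a fresh object (z == 0, so the branch is taken).
    static int countOps(boolean useOptimized) {
        stmOps = 0;
        A a = new A();
        if (useOptimized) optimized(a, 1, 2, 3); else unoptimized(a, 1, 2, 3);
        return stmOps;
    }

    public static void main(String[] args) {
        System.out.println(countOps(false) + " vs " + countOps(true));
    }
}
```

With the branch taken, the unoptimized sequence makes nine runtime calls and the optimized one makes four, while both leave the object in the same final state.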

Compiler Optimizations for Transactions
• Standard optimizations
  - CSE, dead-code elimination, …
  - Careful IR representation exposes opportunities and enables optimizations with almost no modifications
  - Subtle in presence of nesting
• STM-specific optimizations
  - Immutable field / class detection & barrier removal (vtable/String)
  - Transaction-local object detection & barrier removal
  - Partial inlining of STM fast paths to eliminate call overhead

McRT-STM
• PPoPP 2006 (Saha et al.)
• C / C++ STM
• Pessimistic writes:
  - Strict two-phase locking
  - Update in place
  - Undo on abort
• Optimistic reads:
  - Versioning
  - Validation before commit
• Benefits
  - Fast memory accesses (no buffering / object wrapping)
  - Minimal copying (no cloning for large objects)
  - Compatible with existing types & libraries
Similar STMs: Ennals (FastSTM), Harris et al. (PLDI ’06)
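The "update in place, undo on abort" policy can be sketched in a few lines. This is a single-threaded illustration only: the Account class and the txnWrite/txnAbort/txnCommit names are hypothetical, and the two-phase locking that would guard each write in a real STM is left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class UndoLogSketch {
    static final class Account { int balance; }

    // One undo entry: which object was written and its previous value.
    static final class UndoEntry {
        final Account target;
        final int oldValue;
        UndoEntry(Account target, int oldValue) {
            this.target = target;
            this.oldValue = oldValue;
        }
    }

    static final Deque<UndoEntry> undoLog = new ArrayDeque<>();

    // Log the old value first, then write directly into the object.
    static void txnWrite(Account a, int newValue) {
        undoLog.push(new UndoEntry(a, a.balance));
        a.balance = newValue;
    }

    // Abort: restore old values, newest entry first.
    static void txnAbort() {
        while (!undoLog.isEmpty()) {
            UndoEntry e = undoLog.pop();
            e.target.balance = e.oldValue;
        }
    }

    // Commit: the data is already in place, so just drop the log.
    static void txnCommit() {
        undoLog.clear();
    }

    public static void main(String[] args) {
        Account a = new Account();
        a.balance = 100;
        txnWrite(a, 50);
        txnWrite(a, 20);
        txnAbort();
        System.out.println(a.balance);
    }
}
```

The trade-off this shape illustrates is that reads and committed writes touch the object directly with no buffering, and the extra work is paid only when a transaction aborts.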

STM Data Structures
• Per-thread:
  - Transaction Descriptor
    - Per-thread info for version validation, acquired locks, rollback
    - Maintained in Read / Write / Undo logs
  - Transaction Memento
    - Checkpoint of logs for nesting / partial rollback
• Per-data:
  - Transaction Record
    - Pointer-sized field guarding a set of shared data
    - Transactional state of data
      - Shared: version number (odd)
      - Exclusive: owner’s transaction descriptor (even / aligned)
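The odd/even encoding of the transaction record can be illustrated with a single atomic word. The sketch below is an assumption-laden model, not the McRT-STM layout: a `long` stands in for the pointer-sized field, an even `ownerId` stands in for an aligned descriptor pointer, and all method names are invented for the example.

```java
import java.util.concurrent.atomic.AtomicLong;

final class TxnRecordSketch {
    // Odd value: shared, the value is a version number.
    // Even value: exclusively owned, the value identifies the owner.
    private final AtomicLong word = new AtomicLong(1L);

    static boolean isShared(long value) {
        return (value & 1L) == 1L;
    }

    long current() {
        return word.get();
    }

    // Optimistic read: returns the version to validate later,
    // or -1 if another transaction currently owns the record.
    long openForRead() {
        long v = word.get();
        return isShared(v) ? v : -1L;
    }

    // Pessimistic write: try to swap the version for an (even) owner id.
    // Returns the displaced version on success, -1 on failure.
    long openForWrite(long ownerId) {
        long v = word.get();
        if (!isShared(v)) {
            return -1L;
        }
        return word.compareAndSet(v, ownerId) ? v : -1L;
    }

    // Commit-time release: publish the next odd version number.
    void releaseAfterCommit(long previousVersion) {
        word.set(previousVersion + 2L);
    }

    // Read validation: the record must still hold the version we saw.
    boolean validate(long versionSeen) {
        return word.get() == versionSeen;
    }
}
```

A reader that recorded version 1 fails validation once a writer has committed and published version 3, which is how optimistic reads detect a conflicting update without ever taking a lock.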

Mapping Data to Transaction Record
• Every data item has an associated transaction record
Object granularity
• Transaction record embedded in object
    class Foo { int x; int y; }
    [Figure: object layout with vtbl, TxR, x, y]
Word granularity
• Object words hash into table of TxRs
• Hash is f(obj.hash, offset)
    class Foo { int x; int y; }
    [Figure: object layout with vtbl, hash, x, y; words map into table TxR1, TxR2, TxR3, …, TxRn]
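The slide gives the word-granularity mapping only as f(obj.hash, offset). A minimal sketch of one possible f follows; the mixing function and the table size are assumptions chosen for illustration, not the function the system uses.

```java
final class TxnRecordTableSketch {
    // Assumed table size; a power of two so the index is a cheap mask.
    static final int TABLE_SIZE = 1 << 10;

    // Hypothetical f(obj.hash, offset): combine the object's hash with the
    // byte offset of the accessed word, then fold into the table range.
    static int recordIndex(int objHash, int fieldOffset) {
        int h = objHash * 31 + fieldOffset;
        h ^= (h >>> 16);
        return h & (TABLE_SIZE - 1);
    }

    public static void main(String[] args) {
        int objHash = 12345;
        System.out.println("x -> TxR[" + recordIndex(objHash, 0) + "]");
        System.out.println("y -> TxR[" + recordIndex(objHash, 4) + "]");
    }
}
```

Because the offset feeds the hash, two fields of the same object usually land on different records, though unrelated words can still collide in a finite table.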

Granularity of Conflict Detection
• Object-level
  - Cheaper operation
  - Exposes CSE opportunities
  - Lower overhead on 1P
• Word-level
  - Reduces false sharing
  - Better scalability
• Mix & match
  - Per-type basis
  - E.g., word-level for arrays, object-level for non-arrays
    // Thread 1
    a.x = …
    a.y = …
    // Thread 2
    … = … a.z …
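The two-thread example on this slide can be checked with a toy conflict test. The sketch below models a transaction record as a string key, which is purely an illustrative assumption: at object level every field of `a` shares one key, at word level each field has its own.

```java
import java.util.HashSet;
import java.util.Set;

final class GranularitySketch {
    enum Granularity { OBJECT, WORD }

    // Hypothetical record identity for an access to obj.field.
    static String recordKey(Granularity g, String obj, String field) {
        return g == Granularity.OBJECT ? obj : obj + "." + field;
    }

    // True if any field read by one thread maps to a record that
    // the other thread has written.
    static boolean conflicts(Granularity g, String obj,
                             String[] writtenFields, String[] readFields) {
        Set<String> writtenRecords = new HashSet<>();
        for (String f : writtenFields) {
            writtenRecords.add(recordKey(g, obj, f));
        }
        for (String f : readFields) {
            if (writtenRecords.contains(recordKey(g, obj, f))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] thread1Writes = {"x", "y"};
        String[] thread2Reads = {"z"};
        System.out.println(conflicts(Granularity.OBJECT, "a", thread1Writes, thread2Reads));
        System.out.println(conflicts(Granularity.WORD, "a", thread1Writes, thread2Reads));
    }
}
```

Object-level detection reports a conflict here even though the threads touch disjoint fields (false sharing), while word-level detection does not.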

Experiments
• 16-way 2.2 GHz Xeon with 16 GB shared memory
  - L1: 8 KB, L2: 512 KB, L3: 2 MB, L4: 64 MB (per four)
• Workloads
  - Hashtable, binary tree, OO7 (OODBMS)
  - Mix of gets, in-place updates, insertions, and removals
• Object-level conflict detection by default
  - Word / mixed where beneficial

Effect of Compiler Optimizations
• 1P overheads over thread-unsafe baseline
• Prior STMs typically incur ~2x on 1P
• With compiler optimizations:
  - < 40% over no concurrency control
  - < 30% over synchronization

Scalability: Java HashMap Shootout
• Unsafe (java.util.HashMap)
  - Thread-unsafe w/o concurrency control
• Synchronized
  - Coarse-grain synchronization via SynchronizedMap wrapper
• Concurrent (java.util.concurrent.ConcurrentHashMap)
  - Multi-year effort: JSR 166 -> Java 5
  - Optimized for concurrent gets (no locking)
  - For updates, divides bucket array into 16 segments (size / locking)
• Atomic
  - Transactional version via “AtomicMap” wrapper
• Atomic Prime
  - Transactional version with minor hand optimization
  - Tracks size per segment à la ConcurrentHashMap
• Execution
  - 10,000,000 operations / 200,000 elements
  - Defaults: load factor, threshold, concurrency level
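The three non-transactional contenders can be built from standard library classes, as the sketch below shows. The workload driver is an assumption made for illustration (single-threaded, seeded random keys, a gets/updates mix) and is far smaller than the benchmark described on the slide; the transactional "AtomicMap" wrapper is not part of the JDK and is not reproduced here.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;

final class MapShootoutSketch {
    // Unsafe: plain java.util.HashMap, no concurrency control.
    static Map<Integer, Integer> unsafeMap() {
        return new HashMap<>();
    }

    // Synchronized: one coarse lock around every operation.
    static Map<Integer, Integer> synchronizedMap() {
        return Collections.synchronizedMap(new HashMap<Integer, Integer>());
    }

    // Concurrent: initial capacity 16, load factor 0.75, concurrency level 16.
    static Map<Integer, Integer> concurrentMap() {
        return new ConcurrentHashMap<>(16, 0.75f, 16);
    }

    // Hypothetical driver: pre-populate, then mix gets and in-place updates.
    // Returns a checksum of all values read.
    static long run(Map<Integer, Integer> map, int elements, int ops, int percentGets) {
        for (int i = 0; i < elements; i++) {
            map.put(i, i);
        }
        Random rng = new Random(42);  // fixed seed: same sequence for every map
        long checksum = 0;
        for (int op = 0; op < ops; op++) {
            int key = rng.nextInt(elements);
            if (rng.nextInt(100) < percentGets) {
                checksum += map.get(key);   // get
            } else {
                map.put(key, op);           // in-place update
            }
        }
        return checksum;
    }

    public static void main(String[] args) {
        System.out.println(run(unsafeMap(), 2000, 100000, 80));
        System.out.println(run(synchronizedMap(), 2000, 100000, 80));
        System.out.println(run(concurrentMap(), 2000, 100000, 80));
    }
}
```

Run single-threaded, all three maps see the same operation sequence and produce the same checksum; the scalability differences only appear once multiple threads share one map.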

Scalability: 100% Gets
• Atomic wrapper is competitive with ConcurrentHashMap
• Effect of compiler optimizations scales

Scalability: 20% Gets / 80% Updates
• ConcurrentHashMap thrashes on 16 segments
• Atomic still scales

20% Inserts and Removes
• Atomic conflicts on entire bucket array
  - The array is an object

20% Inserts and Removes: Word-Level
• We still conflict on the single size field in java.util.HashMap

20% Inserts and Removes: Atomic Prime
• Atomic Prime tracks size per segment, lowering the bottleneck
• No degradation, modest performance gain
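The per-segment size idea can be sketched as a striped counter. The class below is a hypothetical illustration, not the Atomic Prime code: the segment count of 16 mirrors the ConcurrentHashMap default mentioned earlier, the segment choice is a simple mask of the hash, and the class is not thread-safe by itself since it is meant to run inside transactions.

```java
final class SegmentedSizeSketch {
    static final int SEGMENTS = 16;  // assumed, matching the 16 segments above

    // One counter per segment instead of a single shared size field.
    private final int[] segmentSizes = new int[SEGMENTS];

    // Hypothetical segment choice: low bits of the key's hash.
    static int segmentFor(int hash) {
        return hash & (SEGMENTS - 1);
    }

    void recordInsert(int hash) {
        segmentSizes[segmentFor(hash)]++;
    }

    void recordRemove(int hash) {
        segmentSizes[segmentFor(hash)]--;
    }

    int segmentSize(int segment) {
        return segmentSizes[segment];
    }

    // Total size is the sum over all segments.
    int size() {
        int total = 0;
        for (int s : segmentSizes) {
            total += s;
        }
        return total;
    }
}
```

With word-level conflict detection on the counter array, inserts whose keys fall in different segments update different words, so they no longer collide on one size field; only `size()` has to read every counter.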

20% Inserts and Removes: Mixed-Level
• Mixed-level preserves wins & reduces overheads
  - Word-level for arrays
  - Object-level for non-arrays

Key Takeaways
• Optimistic reads + pessimistic writes is a nice sweet spot
• Compiler optimizations significantly reduce STM overhead
  - 20-40% over thread-unsafe
  - 10-30% over synchronized
• Simple atomic wrappers are sometimes good enough
• Minor modifications give performance competitive with complex fine-grain synchronization
• Word-level conflict detection is crucial for large arrays
• Mixed-level conflict detection provides the best of both

Novel Contributions
• Rich transactional language constructs in Java
• Efficient, first-class nested transactions
• Complete GC support
• RISC-like STM API
• Compiler optimizations
• Per-type word- and object-level conflict detection
