Compiler and Runtime Support for Efficient Software Transactional Memory
This paper discusses the implementation of Java Software Transactional Memory (STM) to improve the performance and programmability of concurrent software. It highlights the challenges of locking mechanisms and presents STM as an appealing alternative with simpler programming models and stronger guarantees such as atomicity, consistency, and isolation. The proposed STM system integrates with the Java Virtual Machine (JVM) and Just-In-Time (JIT) compilation, featuring efficient language constructs, compiler optimizations, and transaction management. It also examines the balance between granularity and scalability in achieving efficient transactional operations.
Presentation Transcript
Compiler and Runtime Support for Efficient Software Transactional Memory • Vijay Menon, Programming Systems Lab • Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian R. Murphy, Bratin Saha, Tatiana Shpeisman
Motivation • Locks are hard to get right • Programmability vs scalability • Transactional memory is an appealing alternative • Simpler programming model • Stronger guarantees • Atomicity, Consistency, Isolation • Deadlock avoidance • Closer to programmer intent • Scalable implementations • Questions • How to lower TM overheads – particularly in software? • How to balance granularity / scalability?
Our System • Java Software Transactional Memory (STM) System • Pure software implementation (McRT-STM – PPoPP ’06) • Language extensions in Java (Polyglot) • Integrated with JVM & JIT (ORP & StarJIT) • Novel Features • Rich transactional language constructs in Java • Efficient, first class nested transactions • Complete GC support • Risc-like STM API / IR • Compiler optimizations • Per-type word and object level conflict detection
Transactional Java → Java
Transactional Java:
  atomic { S; }
Standard Java + STM API:
  while(true) {
    TxnHandle th = txnStart();
    try {
      S’;
      break;
    } finally {
      if(!txnCommit(th)) continue;
    }
  }
Other Language Constructs • Built on prior research • retry (STM Haskell, …) • orelse (STM Haskell) • tryatomic (Fortress) • when (X10, …)
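The retry loop above can be sketched as a runnable toy. Only the names txnStart, txnCommit and TxnHandle come from the slide; ToyCell, atomicUpdate, the fields of TxnHandle and the version check are assumptions made for illustration, and the real McRT-STM signatures are not shown in this transcript. The sketch buffers the new value and validates a version number at commit, which is simpler than the in-place updates the system actually uses, and it replaces the try/finally form with a plain commit check.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical handle: remembers what the transaction saw when it started.
final class TxnHandle {
    final int seenVersion;
    final int seenValue;
    int newValue;

    TxnHandle(int seenVersion, int seenValue) {
        this.seenVersion = seenVersion;
        this.seenValue = seenValue;
        this.newValue = seenValue;
    }
}

// Toy STM over a single int cell (an assumption, not the McRT-STM design).
final class ToyCell {
    private int value;
    private int version;
    private int attempts;

    synchronized TxnHandle txnStart() {
        attempts++;
        return new TxnHandle(version, value);
    }

    // Commit succeeds only if nobody committed since txnStart.
    synchronized boolean txnCommit(TxnHandle th) {
        if (th.seenVersion != version) {
            return false;
        }
        value = th.newValue;
        version++;
        return true;
    }

    // Desugared form of: atomic { value = body(value); }
    void atomicUpdate(IntUnaryOperator body) {
        while (true) {
            TxnHandle th = txnStart();
            th.newValue = body.applyAsInt(th.seenValue); // S'
            if (txnCommit(th)) {
                break; // committed
            }
            // Commit failed: loop and re-execute the body.
        }
    }

    synchronized int get() { return value; }

    synchronized int attempts() { return attempts; }
}
```

A body that is interrupted by a conflicting commit simply runs again against the fresh value, which is the behavior the while(true) loop on the slide provides.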
Tight integration with JVM & JIT • StarJIT & ORP • On-demand cloning of methods (Harris ’03) • Identifies transactional regions in Java+STM code • Inserts read/write barriers in transactional code • Maps STM API to first class opcodes in StarJIT IR (STIR) Good compiler representation → greater optimization opportunities
Representing Read/Write Barriers
Java source:
  atomic {
    a.x = t1
    a.y = t2
    if(a.z == 0) {
      a.x = 0
      a.z = t3
    }
  }
With traditional barriers:
  …
  stmWr(&a.x, t1)
  stmWr(&a.y, t2)
  if(stmRd(&a.z) == 0) {
    stmWr(&a.x, 0);
    stmWr(&a.z, t3)
  }
Traditional barriers hide redundant locking/logging
An STM IR for Optimization
Java source:
  atomic {
    a.x = t1
    a.y = t2
    if(a.z == 0) {
      a.x = 0
      a.z = t3
    }
  }
STM IR, redundancies exposed:
  txnOpenForWrite(a)
  txnLogObjectInt(&a.x, a)
  a.x = t1
  txnOpenForWrite(a)
  txnLogObjectInt(&a.y, a)
  a.y = t2
  txnOpenForRead(a)
  if(a.z == 0) {
    txnOpenForWrite(a)
    txnLogObjectInt(&a.x, a)
    a.x = 0
    txnOpenForWrite(a)
    txnLogObjectInt(&a.z, a)
    a.z = t3
  }
Optimized Code
Java source:
  atomic {
    a.x = t1
    a.y = t2
    if(a.z == 0) {
      a.x = 0
      a.z = t3
    }
  }
After optimization:
  txnOpenForWrite(a)
  txnLogObjectInt(&a.x, a)
  a.x = t1
  txnLogObjectInt(&a.y, a)
  a.y = t2
  if(a.z == 0) {
    a.x = 0
    txnLogObjectInt(&a.z, a)
    a.z = t3
  }
Fewer & cheaper STM operations
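The elimination shown on this slide can be approximated with a small pass over a straight-line list of operations. This is a sketch under stated assumptions, not the StarJIT implementation: operations are plain strings, the class BarrierElim is hypothetical, the branch is flattened because every earlier operation dominates it, and nested transactions (where the slides note the optimization becomes subtle) are ignored.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class BarrierElim {
    // Text between the first '(' and the final ')'.
    private static String arg(String op) {
        return op.substring(op.indexOf('(') + 1, op.length() - 1);
    }

    // Drops STM operations already covered by an earlier, dominating operation.
    static List<String> optimize(List<String> ops) {
        Set<String> openForWrite = new HashSet<>();
        Set<String> openForRead = new HashSet<>();
        Set<String> logged = new HashSet<>();
        List<String> out = new ArrayList<>();
        for (String op : ops) {
            boolean keep;
            if (op.startsWith("txnOpenForWrite(")) {
                // A second open-for-write of the same object is redundant.
                keep = openForWrite.add(arg(op));
            } else if (op.startsWith("txnOpenForRead(")) {
                // Open-for-read is redundant if the object is already open
                // for write or for read.
                String obj = arg(op);
                keep = !openForWrite.contains(obj) && openForRead.add(obj);
            } else if (op.startsWith("txnLogObjectInt(")) {
                // The old value of a field only needs to be logged once.
                keep = logged.add(arg(op));
            } else {
                keep = true; // ordinary loads and stores are untouched
            }
            if (keep) {
                out.add(op);
            }
        }
        return out;
    }
}
```

Feeding it the thirteen operations from the unoptimized IR leaves one txnOpenForWrite, three txnLogObjectInt calls and the four stores, matching the optimized code on the slide.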
Compiler Optimizations for Transactions • Standard optimizations • CSE, Dead-code-elimination, … • Careful IR representation exposes opportunities and enables optimizations with almost no modifications • Subtle in presence of nesting • STM-specific optimizations • Immutable field / class detection & barrier removal (vtable/String) • Transaction-local object detection & barrier removal • Partial inlining of STM fast paths to eliminate call overhead
McRT-STM • PPoPP 2006 (Saha et al.) • C / C++ STM • Pessimistic Writes: • strict two-phase locking • update in place • undo on abort • Optimistic Reads: • versioning • validation before commit • Benefits • Fast memory accesses (no buffering / object wrapping) • Minimal copying (no cloning for large objects) • Compatible with existing types & libraries • Similar STMs: Ennals (FastSTM), Harris et al. (PLDI ’06)
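The "update in place, undo on abort" scheme can be illustrated with a minimal undo log. The class UndoLog and its methods are hypothetical names; the sketch models memory as an int array, assumes the write locks are already held, and leaves out the locking and read validation that the real system performs.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class UndoLog {
    // One saved old value per in-place write.
    private static final class Entry {
        final int[] target;
        final int index;
        final int oldValue;

        Entry(int[] target, int index, int oldValue) {
            this.target = target;
            this.index = index;
            this.oldValue = oldValue;
        }
    }

    private final Deque<Entry> entries = new ArrayDeque<>();

    // Record the old value, then update memory in place.
    void write(int[] target, int index, int newValue) {
        entries.push(new Entry(target, index, target[index]));
        target[index] = newValue;
    }

    // Abort: restore old values, newest first.
    void abort() {
        while (!entries.isEmpty()) {
            Entry e = entries.pop();
            e.target[e.index] = e.oldValue;
        }
    }

    // Commit: the in-place values stay; the log is discarded.
    void commit() {
        entries.clear();
    }
}
```

Because writes land directly in memory, reads inside the transaction need no lookup in a write buffer, which is the "fast memory accesses" benefit listed on the slide.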
STM Data Structures • Per-thread: • Transaction Descriptor • Per-thread info for version validation, acquired locks, rollback • Maintained in Read / Write / Undo logs • Transaction Memento • Checkpoint of logs for nesting / partial rollback • Per-data: • Transaction Record • Pointer-sized field guarding a set of shared data • Transactional state of data • Shared: Version number (odd) • Exclusive: Owner’s transaction descriptor (even / aligned)
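The odd/even encoding of the transaction record can be sketched with a single atomic word. TxnRecord, its method names, and the use of a long owner id in place of an aligned descriptor pointer are assumptions for illustration; only the convention (odd means shared version number, even means exclusive owner) comes from the slide.

```java
import java.util.concurrent.atomic.AtomicLong;

final class TxnRecord {
    // Odd: shared, the word is a version number.
    // Even: exclusive, the word identifies the owning transaction descriptor.
    private final AtomicLong word = new AtomicLong(1);

    static boolean isVersion(long w) {
        return (w & 1L) == 1L;
    }

    long read() {
        return word.get();
    }

    // Writer: try to take ownership. ownerId must be even and nonzero.
    // Returns the version that was replaced, or 0 if the record is owned.
    long acquire(long ownerId) {
        long w = word.get();
        if (isVersion(w) && word.compareAndSet(w, ownerId)) {
            return w;
        }
        return 0L;
    }

    // Writer: release at commit, publishing the next (still odd) version.
    void release(long ownerId, long previousVersion) {
        word.compareAndSet(ownerId, previousVersion + 2);
    }

    // Reader: the value read is valid only if the version is unchanged.
    boolean validate(long seenVersion) {
        return word.get() == seenVersion;
    }
}
```

A reader records the version it saw in its read log and calls validate before commit; any intervening writer changes the word, so the check fails and the reader aborts.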
Mapping Data to Transaction Record • Every data item has an associated transaction record • Object granularity • Transaction record embedded in object • class Foo { int x; int y; } laid out as vtbl, TxR, x, y • Word granularity • Object words hash into table of TxRs (TxR1, TxR2, TxR3, …, TxRn) • Hash is f(obj.hash, offset) • class Foo { int x; int y; } laid out as vtbl, hash, x, y
Granularity of Conflict Detection • Object-level • Cheaper operation • Exposes CSE opportunities • Lower overhead on 1P • Word-level • Reduces false sharing • Better scalability • Mix & Match • Per type basis • E.g., word-level for arrays, object-level for non-arrays • Example: // Thread 1: a.x = … a.y = … // Thread 2: … = … a.z …
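The two granularities and the per-type choice can be sketched as a mapping from (object, offset) to a record id. Everything here is hypothetical: the class TxrMapping, the table size, and in particular the hash f, which the slides only describe as f(obj.hash, offset). The object hash is passed in explicitly to keep the example deterministic, and distinct objects are assumed to have distinct hashes.

```java
final class TxrMapping {
    enum Granularity { OBJECT, WORD }

    static final int TABLE_SIZE = 1024; // assumed power-of-two table size

    // Which transaction record guards the word at wordOffset inside the object.
    static int recordFor(Granularity g, int objHash, int wordOffset) {
        if (g == Granularity.OBJECT) {
            // One record per object: the offset does not matter.
            return objHash;
        }
        // Word level: hash object and offset into the shared table.
        // This particular f is an assumption made for the sketch.
        return (objHash + wordOffset) & (TABLE_SIZE - 1);
    }

    // Two accesses can conflict only if the same record guards both.
    static boolean mayConflict(Granularity g, int hashA, int offA,
                               int hashB, int offB) {
        return recordFor(g, hashA, offA) == recordFor(g, hashB, offB);
    }

    // Mix & match policy from the slide: word level for arrays,
    // object level for everything else.
    static Granularity forType(Class<?> type) {
        return type.isArray() ? Granularity.WORD : Granularity.OBJECT;
    }
}
```

With object-level detection, Thread 1 writing a.x and Thread 2 reading a.z map to the same record and conflict falsely; with word-level detection they map to different table slots. The price is that unrelated words can occasionally alias in the table.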
Experiments • 16-way 2.2 GHz Xeon with 16 GB shared memory • L1: 8KB, L2: 512 KB, L3: 2MB, L4: 64MB (per four) • Workloads • Hashtable, Binary tree, OO7 (OODBMS) • Mix of gets, in-place updates, insertions, and removals • Object-level conflict detection by default • Word / mixed where beneficial
Effectiveness of Compiler Optimizations • 1P overheads over thread-unsafe baseline • Prior STMs typically incur ~2x on 1P • With compiler optimizations: • < 40% over no concurrency control • < 30% over synchronization
Scalability: Java HashMap Shootout
• Unsafe (java.util.HashMap)
  • Thread-unsafe w/o concurrency control
• Synchronized
  • Coarse-grain synchronization via SynchronizedMap wrapper
• Concurrent (java.util.concurrent.ConcurrentHashMap)
  • Multi-year effort: JSR 166 -> Java 5
  • Optimized for concurrent gets (no locking)
  • For updates, divides bucket array into 16 segments (size / locking)
• Atomic
  • Transactional version via “AtomicMap” wrapper
• Atomic Prime
  • Transactional version with minor hand optimization
  • Tracks size per segment à la ConcurrentHashMap
• Execution
  • 10,000,000 operations / 200,000 elements
  • Defaults: load factor, threshold, concurrency level
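The three non-transactional contenders can be constructed with standard library calls, as in the single-threaded sketch below. The class MapShootout, the run driver, the seed, and the even split of updates between put and remove are assumptions; the slides do not give the benchmark harness. The Atomic and Atomic Prime variants need the atomic construct and cannot be written in standard Java, so they are omitted.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;

final class MapShootout {
    // Unsafe: no concurrency control at all.
    static Map<Integer, Integer> unsafe() {
        return new HashMap<>();
    }

    // Synchronized: one coarse lock around every operation.
    static Map<Integer, Integer> synchronizedMap() {
        return Collections.synchronizedMap(new HashMap<Integer, Integer>());
    }

    // Concurrent: initial capacity 16, load factor 0.75, concurrency level 16.
    static Map<Integer, Integer> concurrent() {
        return new ConcurrentHashMap<>(16, 0.75f, 16);
    }

    // Runs a seeded mix of gets and updates; returns the number of get hits.
    static int run(Map<Integer, Integer> map, int ops, int keyRange,
                   int getPercent, long seed) {
        Random rnd = new Random(seed);
        int hits = 0;
        for (int i = 0; i < ops; i++) {
            int key = rnd.nextInt(keyRange);
            int roll = rnd.nextInt(100);
            if (roll < getPercent) {
                if (map.get(key) != null) {
                    hits++;
                }
            } else if ((roll & 1) == 0) {
                map.put(key, i);
            } else {
                map.remove(key);
            }
        }
        return hits;
    }
}
```

Measuring scalability would mean calling run from several threads against one shared map, which is only safe for the synchronized and concurrent variants.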
Scalability: 100% Gets • Atomic wrapper is competitive with ConcurrentHashMap • Effect of compiler optimizations scales
Scalability: 20% Gets / 80% Updates ConcurrentHashMap thrashes on 16 segments Atomic still scales
20% Inserts and Removes Atomic conflicts on entire bucket array - The array is an object
20% Inserts and Removes: Word-Level We still conflict on the single size field in java.util.HashMap
20% Inserts and Removes: Atomic Prime Atomic Prime tracks size / segment – lowering bottleneck No degradation, modest performance gain
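The per-segment size tracking behind Atomic Prime can be sketched as a striped counter. SegmentedSize, segmentFor and the use of AtomicLongArray are assumptions for a plain-Java illustration; in the transactional version the counts would be ordinary array elements updated inside the atomic block, and the segment selection shown here is a simplification.

```java
import java.util.concurrent.atomic.AtomicLongArray;

final class SegmentedSize {
    private static final int SEGMENTS = 16; // same count as the 16 segments above
    private final AtomicLongArray counts = new AtomicLongArray(SEGMENTS);

    // Assumed segment choice: low bits of the key's hash.
    static int segmentFor(int hash) {
        return hash & (SEGMENTS - 1);
    }

    void onInsert(int hash) {
        counts.incrementAndGet(segmentFor(hash));
    }

    void onRemove(int hash) {
        counts.decrementAndGet(segmentFor(hash));
    }

    long segmentCount(int segment) {
        return counts.get(segment);
    }

    // Total size is the sum over all segments.
    long size() {
        long total = 0;
        for (int i = 0; i < SEGMENTS; i++) {
            total += counts.get(i);
        }
        return total;
    }
}
```

Inserts and removes that land in different segments now touch different words, so under word-level conflict detection they no longer collide on one shared size field.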
20% Inserts and Removes: Mixed-Level • Mixed-level preserves wins & reduces overheads • word-level for arrays • object-level for non-arrays
Key Takeaways • Optimistic reads + pessimistic writes is a nice sweet spot • Compiler optimizations significantly reduce STM overhead • 20-40% over thread-unsafe • 10-30% over synchronized • Simple atomic wrappers are sometimes good enough • Minor modifications give performance competitive with complex fine-grain synchronization • Word-level conflict detection is crucial for large arrays • Mixed-level conflict detection provides the best of both
Novel Contributions • Rich transactional language constructs in Java • Efficient, first class nested transactions • Complete GC support • Risc-like STM API • Compiler optimizations • Per-type word and object level conflict detection