Eliminating Read Barriers through Procrastination and Cleanliness

KC Sivaramakrishnan, Lukasz Ziarek, Suresh Jagannathan

MultiMLton • Deterministic Parallelism • Effect Isolation • Asynchronous CML (ACML) • Parasitic Threads • GC? • MLton for many-cores • Standard ML – functional PL with side-effects • Goals – Safe and Scalable programs

MultiMLton - Runtime System • User-level threads • Preemptive scheduling • Work-pushing Asynchronous CML Scheduler Substrate SML One-shot continuations Parasitic Threads VProc VProc VProc VProc C

Stop-the-world, Serial GC • MLton GC → MultiMLton GC quickly • Sansom’s “Dual-mode garbage collection” • Dynamically switch between 2-space and 1-space • Cheney’s copying ↔ Jonkers’ sliding mark-compact • No fragmentation • Bump-pointer allocation • Appel’s Generational collection • Adding multicore support • Memory allocator modified for local allocation • GC is still stop-the-world serial

How did we do?

Many-core architectural trends AMD “MagnyCours” 48-cores Tilera Tile64 64-cores Intel SCC 48-cores • Many-core architectural trends • NUMA effects • Cache coherence

Local collector Shared Heap Local Heap Local Heap Local Heap Local Heap VProc VProc VProc VProc

Local collector Shared Heap Local Heap Local Heap Local Heap Local Heap VProc VProc VProc VProc • No synchronization for local allocation/collection! • Local collection is Sansom’s dual-mode • Shared heap is not the nth generation

Thread-local collectors • D. Doligez et al. (POPL ’93) – SML with threads • R. Jones et al. (SCAM ’05) – Java • B. Steensgaard (ISMM ’00) – subset of Java • T. Anderson (ISMM ’10) – a variant of MIT’s pH • S. Marlow et al. (ISMM ’11) – GHC • S. Auhagen et al. (MSPC ’11) – Manticore

Write Barrier Shared Heap r := x r Target Exporting writes Local Heap Source x

Write Barrier Shared Heap r := x r x Transitive closure of x Local Heap

Write Barrier Shared Heap r := x r x Transitive closure of x Local Heap x

Write Barrier Shared Heap r := x r x Local Heap FWD Mutator needs read barrier Mutations <<< Reads
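
The exporting write shown on the last four slides can be sketched in C as follows. This is a minimal illustration, not MultiMLton's runtime code: the `Node` layout, the `in_shared` flag, the `lift` and `export_write` names, and the use of `malloc` as a stand-in for shared-heap allocation are all assumptions made for the example. It copies the transitive closure of `x` and leaves a forwarding pointer in each local original, which is exactly why a mutator holding an old local pointer would need a read barrier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical object layout used only for this sketch. */
typedef struct Node {
    int in_shared;             /* 1 if the node lives in the simulated shared heap */
    struct Node *forward;      /* non-NULL once the node has been copied (FWD) */
    struct Node *left, *right; /* outgoing references */
    int value;
} Node;

/* Copy the closure reachable from x into the "shared heap",
 * installing forwarding pointers in the local originals. */
Node *lift(Node *x) {
    if (x == NULL || x->in_shared) return x;   /* nothing to copy */
    if (x->forward != NULL) return x->forward; /* already copied: keeps sharing, ends cycles */
    Node *copy = malloc(sizeof *copy);         /* stand-in for shared-heap allocation */
    assert(copy != NULL);
    copy->in_shared = 1;
    copy->forward = NULL;
    copy->value = x->value;
    x->forward = copy;                         /* install before recursing */
    copy->left = lift(x->left);
    copy->right = lift(x->right);
    return copy;
}

/* Exporting write r := x, where r is assumed to live in the shared heap. */
void export_write(Node **r, Node *x) {
    *r = lift(x);
}
```

After `export_write(&r, &root)`, `root.forward` points at the shared copy, and any stale local reference to `root` has to follow that pointer to see the up-to-date object.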

Read Barrier Overheads 20.1 % 15.3 % 21.3 %

Read Barrier Statistics pointer readBarrier (pointer *p) { if (getHeader(p) == FORWARDED) return *(pointer*)p; return p; } Checks Forwarded
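
A typed variant of the conditional read barrier on this slide may make the cost easier to see. The sketch below assumes a hypothetical `Object` layout with a one-word header and an explicit forwarding field; the tag values `NORMAL` and `FORWARDED` and the helper `read_payload` are illustrative names, not MLton's actual object representation. Every field access goes through the header check, even though forwarded objects are rare.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { NORMAL = 0, FORWARDED = 1 };   /* assumed header tag values */

/* Hypothetical object layout used only for this sketch. */
typedef struct Object {
    uintptr_t header;        /* NORMAL or FORWARDED */
    struct Object *forward;  /* meaningful only when header == FORWARDED */
    int payload;
} Object;

/* Conditional read barrier: follow the forwarding pointer if one is present. */
Object *read_barrier(Object *p) {
    assert(p != NULL);
    if (p->header == FORWARDED)
        return p->forward;
    return p;
}

/* Every mutator read pays for the check, forwarded or not. */
int read_payload(Object *p) {
    return read_barrier(p)->payload;
}
```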

Eliminate read barriers? • No need for read barriers if mutator can never witness forwarded objects • Do a local GC every time you export • Slower than with read barriers • Dynamically ensure mutators never get to see forwarded objects. • Procrastination: Exploit program concurrency to delay exporting writes • Cleanliness: Object closure cleanliness

New idea: Procrastination T1 T2 Shared Heap r1 r2 → r1 := x1 r2 := x2 Local Heap x1 x2 T → T is running T → T is suspended T → T is blocked

Procrastination T1 T2 Shared Heap r1 r2 r1 := x1 → r2 := x2 Control switches to T2 Local Heap x1 x2 T → T is running Delayed write list → T → T is suspended T → T is blocked

Procrastination T1 T2 Shared Heap r1 r2 r1 := x1 r2 := x2 Local Heap x1 x2 T → T is running Delayed write list → T → T is suspended T → T is blocked

Procrastination T1 T2 Shared Heap r1 x1 r2 x2 r1 := x1 r2 := x2 Local Heap x1 T → T is running Delayed write list → T → T is suspended T → T is blocked

Procrastination T1 T2 Shared Heap r1 x1 r2 x2 r1 := x1 → … r2 := x2 Local Heap Force local GC T → T is running Delayed write list → T → T is suspended T → T is blocked
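
The bookkeeping behind the delayed write list might look roughly like the sketch below. It is only an illustration under stated assumptions: `Core`, `DelayedWrite`, `delay_write`, and `force_local_gc` are hypothetical names, the list has a fixed capacity, and the "local GC" merely performs the pending writes. A real collector would also copy each value's closure to the shared heap and fix up every local reference, so that no forwarding pointers remain, before resuming the suspended threads.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DELAYED 16   /* arbitrary capacity chosen for the sketch */

/* One pending exporting write r := x issued by a suspended thread. */
typedef struct {
    int **ref;       /* target reference (in the shared heap) */
    int  *val;       /* value still living in the local heap */
    int   thread_id; /* thread to resume once the write is done */
} DelayedWrite;

/* Per-core state; a stand-in for a VProc. */
typedef struct {
    DelayedWrite pending[MAX_DELAYED];
    size_t count;
    int local_gcs;
} Core;

/* Record the write and report success; the caller then suspends the thread. */
int delay_write(Core *c, int **ref, int *val, int thread_id) {
    if (c->count == MAX_DELAYED) return 0;   /* list full */
    c->pending[c->count++] = (DelayedWrite){ ref, val, thread_id };
    return 1;
}

/* Called when no runnable thread is left: stands in for the forced local GC
 * that completes all delayed writes at once. Returns the number completed. */
size_t force_local_gc(Core *c) {
    size_t n = c->count;
    for (size_t i = 0; i < n; i++) {
        assert(c->pending[i].ref != NULL);
        *c->pending[i].ref = c->pending[i].val;  /* real GC: lift, then write */
    }
    c->count = 0;
    c->local_gcs++;
    return n;
}
```

Batching is the point of the design: two delayed writes from T1 and T2 cost one local collection instead of two.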

Is Procrastination alone enough? • Procrastination depends on the availability of runnable threads at the time of an exporting write • Runnable threads << total threads (thread density) • Eager exporting writes that preserve the “mutator never sees forwarding pointers” invariant

Exporting write characteristics • Sources of exporting writes • Immutable >> Mutable • Tend to be young • References rarely from outside the closure (other than stacks) • Object closure cleanliness • Heap Sessions (Young objects) • Reference counts (Safety of eager export)

Heap Session Local Heap Previous Session Current Session Free SessionStart Frontier • Sessions closed/started after a • User-level thread switch • Exporting write • Local GC
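
A heap session can be modeled as a window over a bump-pointer heap. The sketch below is an assumption-laden illustration: the `LocalHeap` struct and the function names are hypothetical, alignment is ignored, and only the `SessionStart` and `Frontier` pointers from the slide are represented. An object belongs to the current session when its address lies between the two.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical local heap with bump-pointer allocation. */
typedef struct {
    char *start;          /* beginning of the heap */
    char *session_start;  /* SessionStart on the slide */
    char *frontier;       /* next free byte */
    char *limit;          /* end of the heap */
} LocalHeap;

void heap_init(LocalHeap *h, char *buf, size_t size) {
    assert(buf != NULL);
    h->start = h->session_start = h->frontier = buf;
    h->limit = buf + size;
}

/* Bump-pointer allocation; returns NULL when the heap is exhausted. */
void *heap_alloc(LocalHeap *h, size_t n) {
    if ((size_t)(h->limit - h->frontier) < n) return NULL;
    void *p = h->frontier;
    h->frontier += n;
    return p;
}

/* An object is in the current session iff SessionStart <= p < Frontier. */
int in_current_session(const LocalHeap *h, const void *p) {
    const char *c = p;
    return c >= h->session_start && c < h->frontier;
}

/* Close the current session and open a new one, e.g. after a user-level
 * thread switch, an exporting write, or a local GC. */
void start_new_session(LocalHeap *h) {
    h->session_start = h->frontier;
}
```

Closing a session is a single pointer assignment, which is what makes the session test cheap enough to use inside the write barrier.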

Reference Counting Local heap allocated object Object in current session Count number of references to current session objects Does not consider references from stacks or registers Count is one of ZERO, ONE, LOCAL_MANY, GLOBAL
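
The four-valued count can be read as a small saturating state machine. The sketch below is one plausible encoding, not the runtime's: it assumes that `GLOBAL` means at least one reference originates outside the current session and that counts are never decremented, neither of which the slide spells out. The function name `add_reference` is likewise hypothetical.

```c
#include <assert.h>

typedef enum { ZERO, ONE, LOCAL_MANY, GLOBAL } RefCount;

/* Update the count of an object when a new heap reference to it is created.
 * source_in_current_session says where the referring object lives.
 * Assumption: any reference from outside the current session makes the
 * count GLOBAL, and GLOBAL is absorbing. */
RefCount add_reference(RefCount rc, int source_in_current_session) {
    assert(rc >= ZERO && rc <= GLOBAL);
    if (!source_in_current_session) return GLOBAL;
    switch (rc) {
    case ZERO:       return ONE;
    case ONE:        return LOCAL_MANY;
    case LOCAL_MANY: return LOCAL_MANY;  /* saturates */
    default:         return GLOBAL;
    }
}
```

Because the count saturates after two references, it fits in two header bits under this encoding.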

Cleanliness • An object closure is said to be clean, if for each object O in the closure • O is immutable or is in the shared heap. Or, • O is the root, and has ZERO references. Or, • O is not the root, and has ONE reference. Or, • O is not the root, has LOCAL_MANY references, and is in the current session.

Cleanliness • An object closure is said to be clean, if for each object O in the closure • O is immutable or is in the shared heap. Or, • O is the root, and has ZERO references. Or, • O is not the root, and has ONE reference. Or, • O is not the root, has LOCAL_MANY references, and is in the current session. • Boils down to 2 cases: • Tree-structured closure • Arbitrary Graph
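
The definition above translates fairly directly into a closure walk. The following sketch makes several assumptions for the sake of a self-contained example: the `Obj` layout, the two-child limit, the `visited` mark bit, and the function names are all hypothetical, and the `needsFixup` output of the real `isClean` is omitted.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ZERO, ONE, LOCAL_MANY, GLOBAL } RefCount;

/* Hypothetical object layout used only for this sketch. */
typedef struct Obj {
    int immutable;
    int in_shared;
    int in_current_session;
    int visited;            /* mark bit for the traversal */
    RefCount rc;
    struct Obj *kids[2];
} Obj;

/* The per-object condition from the slide. */
static int object_ok(const Obj *o, int is_root) {
    if (o->immutable || o->in_shared) return 1;
    if (is_root) return o->rc == ZERO;
    if (o->rc == ONE) return 1;
    return o->rc == LOCAL_MANY && o->in_current_session;
}

/* Depth-first walk; the mark bit keeps shared or cyclic structure finite. */
static int walk(Obj *o, int is_root) {
    if (o == NULL || o->visited) return 1;
    o->visited = 1;
    if (!object_ok(o, is_root)) return 0;
    for (int i = 0; i < 2; i++)
        if (!walk(o->kids[i], 0)) return 0;
    return 1;
}

/* Reset the mark bits set by walk. */
static void unmark(Obj *o) {
    if (o == NULL || !o->visited) return;
    o->visited = 0;
    for (int i = 0; i < 2; i++) unmark(o->kids[i]);
}

/* A closure is clean iff every object in it satisfies object_ok. */
int is_clean(Obj *root) {
    assert(root != NULL);
    int ok = walk(root, 1);
    unmark(root);
    return ok;
}
```

In the tree-structured case every non-root object has count `ONE`, so the walk never revisits a node; the mark bit only matters for the session-based graph case.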

Tree-structured closure

Graph – Session Based Trace current session

Write Barrier
Val writeBarrier (Ref r, Val v) {
  /* Only an exporting write (shared-heap target, local-heap value) needs work. */
  if (isInSharedHeap (r) && isInLocalHeap (v)) {
    needsFixup = False;
    if (isClean (v, &needsFixup))
      v = lift (v, needsFixup);       // lift eagerly
    else
      v = suspendTillGCAndLift (v);   // delay write
  }
  return v;
}

Write Barrier • Summary • Read barriers are expensive in MultiMLton • Eliminate read barriers by preventing the mutator from ever witnessing forwarding pointers

Benchmark Characteristics Lots of concurrency Low sharing

Performance on AMD • At 3X: RB+ 32%, STW 106%, BDW 584%

Performance on AZUL • At 3X: RB+ 30%

MultiMLton - SCC implementation • No cache-cache coherence • Cluster-on-chip architecture • Private off-die DRAM regions (one per core): caches enabled; one Linux instance per core; local heaps reside here • Shared / global off-die DRAM region: caches disabled by default; shared heap resides here • Shared on-die MPB regions: cached in L1, L2 bypass; fast L1 invalidation for MPB data; coordinating VProcs

Performance on SCC • At 3X: RB+ 20%

Cleanliness Impact (1)

Cleanliness Impact (2) Low thread density

Session Impact

Conclusion • Local collectors seem to be a good choice for many-core architectures • Better Cache Behavior • Minimize NUMA effects • Overcome cache coherence issues (partially) • Read barriers in local collectors can be expensive • Eliminate them through procrastination and cleanliness

Backup slides

MLton Heap Layout From Space (major) Nursery Heap To Space (major) Old Gen To Space (minor) Nursery

MLton GC – Minor Collection To Space (major) Old Gen To Space (minor) Nursery To Space (major) Old Gen To Space (minor) Nursery

MLton GC – Major Copying Collection To Space (major) Old Gen Old Gen To Space (minor) Nursery To Space (major) From Space

MLton GC – Major Mark-Compact Old Gen Free Old Gen To Space (minor) Nursery

Read Barrier Unconditional (Brooks style) From From To To Conditional (Baker Style)

Read Barrier • Unconditional (Brooks style): pointer readBarrier (pointer *p) { return *(pointer*)(p - IND_OFF); } • Needs extra header word • Conditional (Baker style): pointer readBarrier (pointer *p) { if (*(Header*)(p - HD_OFF) == F) return *(pointer*)p; return p; } • Has conditional check
