Fence Scoping
Fence Scoping Changhui Lin†, Vijay Nagarajan*, Rajiv Gupta† † University of California, Riverside * University of Edinburgh
Reordering in Uniprocessors
• Memory operations are reordered to improve performance
  • Hardware (e.g., store buffer, reorder buffer)
  • Compiler (e.g., code motion, caching a value in a register)
• No harm as long as dependences are respected
  Program order:  a1: St x   a2: Ld y
  Reordered:      a2: Ld y   a1: St x
Reordering in Multiprocessors
• Counter-intuitive program behavior
  Initially x = y = 0
  P1:  a1: x = 1;    a2: y = 1;
  P2:  b1: Ry = y;   b2: Rx = x;
• Intuitively, y = 1 implies x = 1, so Ry = 1 implies Rx = 1
• Interleavings that preserve each processor's program order give (Rx=0, Ry=0), (Rx=1, Ry=0), or (Rx=1, Ry=1)
• Once memory operations are reordered, (Rx=0, Ry=1) also becomes possible
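The outcome table on this slide can be checked by brute force. The following is a minimal sketch, not taken from the presentation: it enumerates every interleaving of P1's two stores with P2's two loads and collects the (Rx, Ry) results. The function name `outcomes` and the string encoding of P1's store order are hypothetical choices made for illustration.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <utility>

// One observed result: (Rx, Ry).
using Outcome = std::pair<int, int>;

// Enumerates all interleavings of P1 (two stores) and P2 (b1: Ry = y; b2: Rx = x).
// p1_order is the order in which P1's stores take effect:
//   "xy" = program order (a1 then a2), "yx" = the stores were reordered.
std::set<Outcome> outcomes(const std::string& p1_order) {
    std::set<Outcome> seen;
    // 'a' = P1 performs its next store, 'b' = P2 performs its next load.
    // The distinct permutations of "aabb" are exactly the six interleavings.
    std::string schedule = "aabb";
    do {
        int x = 0, y = 0, Rx = -1, Ry = -1;
        int next_store = 0, next_load = 0;
        for (char who : schedule) {
            if (who == 'a') {
                if (p1_order[next_store++] == 'x') x = 1; else y = 1;
            } else {
                if (next_load++ == 0) Ry = y; else Rx = x;
            }
        }
        seen.insert(Outcome{Rx, Ry});
    } while (std::next_permutation(schedule.begin(), schedule.end()));
    return seen;
}
```

Calling `outcomes("xy")` yields the three intuitive results, while `outcomes("yx")` additionally contains (Rx=0, Ry=1), which is the counter-intuitive case.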
Reordering in Multiprocessors
• Counter-intuitive program behavior
  Initially p = NULL, flag = false
  P1:  p = new A(…);   flag = true;
  P2:  if (flag) a = p->var;
• flag is supposed to be set after p is allocated
Fence Instructions
• Memory Consistency Models
  • Specify what reordering is allowed
  • e.g., SC, TSO (x86, SPARC), RMO (ARM, PowerPC)
• Fence Instructions (Fences/Memory barriers)
  • Selectively override the default relaxed memory order
  • Order memory operations before and after the fence
  P1:  p = new A(…);   FENCE   flag = true;
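As an illustration of the fenced P1 code, here is a minimal C++ sketch of the same publication pattern using `std::atomic_thread_fence`. It is an assumption-laden translation, not the authors' code: the struct `A`, the value 42, and the helper `run_once` are hypothetical, the reader spins on `flag` instead of testing it once so that the result is deterministic, and the reader side adds an acquire fence because the C++ memory model requires one.

```cpp
#include <atomic>
#include <thread>

struct A { int var; };              // hypothetical payload type

A* p = nullptr;                     // ordinary pointer, published through flag
std::atomic<bool> flag{false};

// P1: allocate, fence, then set the flag.
void producer() {
    p = new A{42};
    std::atomic_thread_fence(std::memory_order_release);   // FENCE
    flag.store(true, std::memory_order_relaxed);
}

// P2: wait for the flag, then read through the pointer.
int consumer() {
    while (!flag.load(std::memory_order_relaxed)) { }      // spin until published
    std::atomic_thread_fence(std::memory_order_acquire);   // pairs with the release fence
    return p->var;
}

// Runs both sides once and returns the value P2 observed.
int run_once() {
    int a = 0;
    std::thread t2([&a] { a = consumer(); });
    std::thread t1(producer);
    t1.join();
    t2.join();
    delete p;
    p = nullptr;
    flag.store(false);
    return a;
}
```

Without the two fences the store to `flag` could become visible before the initialization of `*p`, which is the failure the slide describes.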
Fence Instructions (continued)
• Inevitable: needed when building concurrent implementations (e.g., mutual exclusion, queues) [Attiya et al., POPL’11]
• Expensive: Cilk-5’s THE protocol spends 50% of its time executing a memory fence [Frigo et al., PLDI’98]
Motivation
• Not all memory orderings enforced by fences are necessary
• A fence is usually meant to order only a few specific memory operations
• Programmers know best how a fence is used, and this knowledge can be conveyed to the hardware
[Figure labels: Control Data Access, Concurrent algorithm, Process Data]
Scoped Fence (S-Fence)
• An S-Fence only orders memory operations within its scope
  • Scope definition (class scope, set scope)
• Bridges the gap between the programmer’s intention and hardware execution
  • Programmers specify the scope
  • Scope information is conveyed to the hardware, which then imposes fewer ordering constraints
• Lightweight hardware and compiler support
Scoped Fence (S-Fence)
• Programming support
  S-FENCE                          global scope
  S-FENCE[class]                   class scope
  S-FENCE[set, {var1, var2, …}]    set scope
Work-Stealing Queue Algorithm
Chase-Lev lock-free concurrent work-stealing queue

  void put(TASK task) {
    tail = TAIL;
    wsq[tail] = task;
    FENCE  // store-store
    TAIL = tail + 1;
  }

  TASK take() {
    tail = TAIL - 1;
    TAIL = tail;
    FENCE  // store-load
    head = HEAD;
    if (tail < head) {
      TAIL = head;
      return EMPTY;
    }
    … …
    return task;
  }

  TASK steal() {
    head = HEAD;
    tail = TAIL;
    … …
    return task;
  }
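For readers who want something compilable, the following is a rough C++ sketch of a queue with the same put/take/steal shape. It is not the slide's code: the parts elided on the slide are filled in from a common formulation of the Chase-Lev deque, which is an assumption here, and the fixed capacity, the `int` task type, and the `EMPTY`/`ABORT` sentinels are hypothetical simplifications. Standard C++ has no scoped fence, so ordinary `std::atomic_thread_fence` calls stand in for the two FENCE instructions.

```cpp
#include <atomic>

constexpr int EMPTY = -1;   // hypothetical sentinel: queue is empty
constexpr int ABORT = -2;   // hypothetical sentinel: lost a race, caller may retry

class WorkStealingQueue {
    static constexpr long CAP = 1024;   // fixed capacity; no resizing or overflow check
    std::atomic<int> wsq[CAP];
    std::atomic<long> HEAD{0};
    std::atomic<long> TAIL{0};

public:
    // Owner only: append a task at the tail.
    void put(int task) {
        long tail = TAIL.load(std::memory_order_relaxed);
        wsq[tail % CAP].store(task, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release);   // FENCE (store-store)
        TAIL.store(tail + 1, std::memory_order_relaxed);
    }

    // Owner only: remove the most recently added task.
    int take() {
        long tail = TAIL.load(std::memory_order_relaxed) - 1;
        TAIL.store(tail, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);   // FENCE (store-load)
        long head = HEAD.load(std::memory_order_relaxed);
        if (tail < head) {                 // queue was empty: undo the decrement
            TAIL.store(head, std::memory_order_relaxed);
            return EMPTY;
        }
        int task = wsq[tail % CAP].load(std::memory_order_relaxed);
        if (tail > head) return task;      // more than one element: no race possible
        // Exactly one element left: compete with thieves for it.
        int result = task;
        if (!HEAD.compare_exchange_strong(head, head + 1,
                                          std::memory_order_seq_cst,
                                          std::memory_order_relaxed))
            result = EMPTY;                // a thief took it first
        TAIL.store(tail + 1, std::memory_order_relaxed);
        return result;
    }

    // Any thread: remove the oldest task.
    int steal() {
        long head = HEAD.load(std::memory_order_acquire);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        long tail = TAIL.load(std::memory_order_acquire);
        if (head >= tail) return EMPTY;
        int task = wsq[head % CAP].load(std::memory_order_relaxed);
        if (!HEAD.compare_exchange_strong(head, head + 1,
                                          std::memory_order_seq_cst,
                                          std::memory_order_relaxed))
            return ABORT;                  // another thread advanced HEAD
        return task;
    }
};
```

The owner works at the tail in LIFO order while thieves take from the head, so the two ends only contend when a single task remains.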
Parallel Spanning Tree
[Figure: panels (a) and (b), with code regions marked ①, ②, ③]

Worker loop:
  task = wsq.take();
  for (each neighbor task’ of task)
    if (task’ is not processed) {
      process(task’);
      wsq.put(task’);
    }

Memory operations performed by take(), process(), and put():
  tail = TAIL - 1;
  TAIL = tail;
  FENCE
  head = HEAD;
  ……
  color[task’] = label;
  parent[task’] = task;
  tail = TAIL;
  wsq[tail] = task’;
  FENCE
  TAIL = tail + 1;
Class Scope
• S-FENCE[class]    class scope
• Uses classes in OO languages to illustrate the concept
• Constrains a fence to the class in which it is used (encapsulation)
• Intuition: member functions operate on the data members of the class
Class Scope
• S-FENCE[class]    class scope

  class A {
    B b;
    int m1, m2;
    void funcA() {
      m1 = val1;
      b.funcB();
      S-FENCE1[class]
      m2 = val2;
    }
  }

  class B {
    int n1, n2;
    void funcB() {
      n1 = val3;
      S-FENCE2[class]
      n2 = val4;
    }
  }

  In scope for S-FENCE1: m1, m2, n1, n2
  In scope for S-FENCE2: n1, n2
Class Scope Semantics
• More details in the paper
Parallel Spanning Tree
[Figure: panels (a) and (b), with code regions marked ①, ②, ③; both fences are now labeled S-FENCE[class]]

Worker loop:
  task = wsq.take();
  for (each neighbor task’ of task)
    if (task’ is not processed) {
      process(task’);
      wsq.put(task’);
    }

Memory operations performed by take(), process(), and put():
  tail = TAIL - 1;
  TAIL = tail;
  FENCE              (S-FENCE[class])
  head = HEAD;
  ……
  color[task’] = label;
  parent[task’] = task;
  tail = TAIL;
  wsq[tail] = task’;
  FENCE              (S-FENCE[class])
  TAIL = tail + 1;
Compiler Support
• ISA extension
  • class-fence
  • fs_start: start of a fence scope
  • fs_end: end of a fence scope
• fs_start and fs_end enclose functions that contain fences
  • They inform the hardware so that it can mark memory operations properly
Hardware Support
[Figure: reorder buffer and store buffer, each entry extended with Fence Scope Bits (FSB)]
• Fence Scope Bits (FSB)
  • Each entry of the ROB and the store buffer is associated with FSB
  • Flag whether a memory operation is in the scope of some fence
• Decoding: memory operations in the scope are marked via FSB
• Fence issue: check the entry for the current scope
Hardware Support
• Setting Fence Bits
  • FSS: stack to record scope
[Figure: instructions I0 to I7 inside an outer scope (fs_start a … fs_end a) that contains a nested inner scope (fs_start b … fs_end b); FSB columns 0 to 3]
Hardware Support (continued)
• Issuing a fence
  • A fence is issued by checking the FSB for the current scope
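The interaction between the scope stack and the scope bits can be sketched as a small software model. This is only an illustration under stated assumptions, not the hardware design from the paper: the class name `FenceScopeModel`, the one-bit-per-scope allocation without recycling, and the single `window` vector standing in for both the ROB and the store buffer are all hypothetical.

```cpp
#include <cstdint>
#include <vector>

// One in-flight memory operation with its fence scope bits.
struct MemOp {
    int id;
    std::uint32_t fsb;   // bit i set = operation belongs to scope i
    bool completed;
};

class FenceScopeModel {
    std::vector<int> fss;        // fence scope stack: FSB bit index of each open scope
    std::vector<MemOp> window;   // stand-in for ROB and store buffer entries
    int next_bit = 0;            // assumption: a fresh bit per scope, never recycled

public:
    void fs_start() { fss.push_back(next_bit++); }
    void fs_end() { fss.pop_back(); }

    // Decoding: mark the operation with the bit of every scope currently open,
    // so an operation in a nested scope also belongs to the enclosing scopes.
    void decode(int id) {
        std::uint32_t bits = 0;
        for (int b : fss) bits |= (1u << b);
        window.push_back({id, bits, false});
    }

    void complete(int id) {
        for (MemOp& op : window)
            if (op.id == id) op.completed = true;
    }

    // A traditional fence waits for every pending operation.
    bool global_fence_can_issue() const {
        for (const MemOp& op : window)
            if (!op.completed) return false;
        return true;
    }

    // A scoped fence waits only for pending operations whose FSB carries
    // the bit of the current scope (top of the FSS).
    bool scoped_fence_can_issue() const {
        if (fss.empty()) return global_fence_can_issue();
        std::uint32_t mask = 1u << fss.back();
        for (const MemOp& op : window)
            if (!op.completed && (op.fsb & mask)) return false;
        return true;
    }
};
```

With nested scopes, a fence in the inner scope ignores pending operations that belong only to the outer scope, while a fence in the outer scope waits for operations from both, matching the S-FENCE1 and S-FENCE2 behavior of the class scope example.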
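Why Does S-Fence Perform Better?
[Figure: timeline (steps 0 to 4) for the sequence St A, St X, FENCE, Ld Y, St B, where St A is a cache miss; each step shows the contents of the store buffer (SB) and the ROB]
• Traditional fence: the pipeline stalls until the store buffer is drained, and only then is the fence issued
• Scoped fence: panel showing Ld Y and St B in the ROB and the store buffer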
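The difference in stall time can be approximated with a back-of-the-envelope model. The sketch below is an assumption, not data from the presentation: the cycle counts are invented for illustration, and it assumes that the missing store St A lies outside the fence's scope while St X lies inside it.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct PendingStore {
    std::string name;
    int done_cycle;   // cycle at which the store leaves the store buffer
    bool in_scope;    // true if its fence scope bit is set for the fence's scope
};

// Number of cycles a fence waits once it is ready to issue at fence_cycle.
// A traditional fence (scoped == false) waits for every pending store;
// a scoped fence waits only for the in-scope ones.
int fence_stall(const std::vector<PendingStore>& sb, int fence_cycle, bool scoped) {
    int last = fence_cycle;
    for (const PendingStore& s : sb)
        if (!scoped || s.in_scope) last = std::max(last, s.done_cycle);
    return last - fence_cycle;
}

// Hypothetical store buffer contents: St A misses in the cache (assumed
// out of scope), St X hits (assumed in scope).
std::vector<PendingStore> example_store_buffer() {
    return { {"St A", 200, false}, {"St X", 5, true} };
}
```

Under these made-up latencies, a fence ready at cycle 2 stalls 198 cycles in the traditional case and 3 cycles in the scoped case, because only the scoped fence can ignore the long-latency miss.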
Set Scope
• Dekker algorithm
  Initially flag1 = flag2 = 0
  P1:  m1 = …   flag1 = 1;   FENCE   if (flag2 == 0) critical section
  P2:  m2 = …   flag2 = 1;   FENCE   if (flag1 == 0) critical section
Set Scope
• Dekker algorithm
  Initially flag1 = flag2 = 0
  P1:  m1 = …   flag1 = 1;   S-FENCE   if (flag2 == 0) critical section
  P2:  m2 = …   flag2 = 1;   S-FENCE   if (flag1 == 0) critical section
  where S-FENCE … is S-FENCE[set, {flag1, flag2}]
Set Scope
• S-FENCE[set, {var1, var2, …}]    set scope
  • Only orders memory accesses to {var1, var2, …}
• Compiler and hardware support
  • Flag memory accesses to the specified variables
  • Set fence scope bits in hardware for flagged memory accesses
  • For simplicity, memory accesses to different sets are not differentiated
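The Dekker-style entry protocol can be written in standard C++ as follows. This is a sketch under assumptions: C++ offers no set-scoped fence, so a full `std::atomic_thread_fence(std::memory_order_seq_cst)` plays the role of the fence, and the helper names `try_enter` and `run_once` are hypothetical. The comments point out which accesses a fence scoped to {flag1, flag2} would need to order.

```cpp
#include <atomic>
#include <thread>

std::atomic<int> flag1{0}, flag2{0};
int m1 = 0, m2 = 0;   // unrelated data, outside the set {flag1, flag2}

// One side of the protocol: returns true if this thread may enter
// the critical section.
bool try_enter(std::atomic<int>& mine, std::atomic<int>& other, int& m) {
    m = 1;                                                 // m1 = … / m2 = … (not in the set)
    mine.store(1, std::memory_order_relaxed);              // flag store (in the set)
    std::atomic_thread_fence(std::memory_order_seq_cst);   // FENCE (full fence here)
    return other.load(std::memory_order_relaxed) == 0;     // flag load (in the set)
}

// Runs P1 and P2 concurrently once and returns how many of them entered.
int run_once() {
    flag1.store(0);
    flag2.store(0);
    bool in1 = false, in2 = false;
    std::thread t1([&in1] { in1 = try_enter(flag1, flag2, m1); });
    std::thread t2([&in2] { in2 = try_enter(flag2, flag1, m2); });
    t1.join();
    t2.join();
    return static_cast<int>(in1) + static_cast<int>(in2);
}
```

Mutual exclusion depends only on the store to one flag being ordered before the load of the other; the write to `m1` or `m2` plays no part in it, which is the ordering a set-scoped fence is meant to leave unconstrained.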
Experimental Evaluation
• Cycle-accurate simulation (SESC)
  • Integrate scoped fence logic
  • RMO memory model
• Benchmarks
  • pst: parallel spanning tree (work-stealing queue, class scope)
  • ptc: parallel transitive closure (work-stealing queue, class scope)
  • barnes: from SPLASH2 (fences inserted for SC, set scope)
  • radiosity: from SPLASH2 (fences inserted for SC, set scope)
Experimental Evaluation
[Figure: traditional fence (T) vs. scoped fence (S) for the class-scope and set-scope benchmarks; annotations: ~13%, fence stall reduced ~50%, ~40-50%]
Conclusion
• Introduce the concept of fence scope
  • Propose class scope and set scope
  • OpenCL 2.0 (sub-group, work-group, device, system)
• Lightweight compiler and hardware support
  • No change in inter-processor communication
• Fence scope should be implemented in some form!