430 likes | 525 Vues
Explore compiler analysis to optimize parallel graph algorithms, focusing on LockSet shape analysis and Boruvka's MST algorithm in Galois system. Discover opportunities to reduce speculation overheads and achieve up to 12x speedup through various optimizations.
E N D
A Shape Analysis for Optimizing Parallel Graph Programs Dimitrios Prountzos1 Keshav Pingali1,2 Roman Manevich2 Kathryn S. McKinley1 1: Department of Computer Science, The University of Texas at Austin 2: Institute for Computational Engineering and Sciences, The University of Texas at Austin
Motivation • Graph algorithms are ubiquitous • Goal: Compiler analysis for optimizationof parallel graph algorithms Computer Graphics Computational biology Social Networks
Organization • Parallelization of graph algorithms in Galois system • Speculative execution • Example: Boruvka MST algorithm • Optimization opportunities • Reduce speculation overheads • Analysis problem: LockSet shape analysis • Lockset shape analysis • Abstract Data Type (ADT) modeling • Hierarchy summarization abstraction • Predicate discovery • Evaluation • Fast and infers all available optimizations • Optimizations give speedup up to 12x
Boruvka’sMinimum Spanning Tree Algorithm 7 1 6 c d e f lt 2 4 4 3 1 6 a b g d e f 5 7 4 4 a,c b g 3 • Build MST bottom-up • repeat { • pick arbitrary node ‘a’ • merge with lightest neighbor ‘lt’ • add edge ‘a-lt’ to MST • } until graph is a single node
Parallelism in Boruvka • Algorithm = repeated application of operator to graph • Active node: • Node where computation is needed • Activity: • Application of operator to active node • Neighborhood: • Sub-graph read/written to perform activity • Unordered algorithms: • Active nodes can be processed in any order • Amorphous data-parallelism • Parallel execution of activities, subject to neighborhood constraints • Neighborhoods are functions of runtime values • Parallelism cannot be uncovered at compile time in general i2 i1 i3
Optimistic Parallelization in Galois • Programming model • Client code has sequential semantics • Library of concurrent data structures • Parallel execution model • Thread-level speculation (TLS) • Activities executed speculatively • Conflict detection • Each node/edge has associated exclusive lock • Graph operations acquire locks on read/written nodes/edges • Lock owned by another thread conflict iteration rolled back • All locks released at the end • Two main overheads • Locking • Undo actions i2 i1 i3
Overheads(I): Locking • Optimizations • Redundant locking elimination • Lock removal for iteration private data • Lock removal for lock domination • ACQ(P): set of definitely acquiredlocks per program point P • Given method call M at P: Locks(M) ACQ(P) Redundant Locking
Overheads (II): Undo actions foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a); } foreach (Node a : wl) { … … } Lockset Grows Failsafe Lockset Stable … Program point Pis failsafe if: Q : Reaches(P,Q) Locks(Q) ACQ(P)
Lockset Analysis • Redundant Locking • Locks(M) ACQ(P) • Undo elimination • Q : Reaches(P,Q) Locks(Q) ACQ(P) • Need to compute ACQ(P) GSet<Node> wl = new GSet<Node>(); wl.addAll(g.getNodes()); GBag<Weight> mst = new GBag<Weight>(); foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a); } Runtime overhead : Runtime overhead
Analysis Challenges • The usual suspects: • Unbounded Memory Undecidability • Aliasing, Destructive updates • Specific challenges: • Complex ADTs: unstructured graphs • Heap objects are locked • Adapt abstraction to ADTs • We use Abstract Interpretation [CC’77] • Balance precision and realistic performance
Organization • Parallelization of graph algorithms in Galois system • Speculative execution • Example: Boruvka MST algorithm • Optimization opportunities • Reduce speculation overheads • Analysis problem: LockSet shape analysis • Lockset shape analysis • Abstract Data Type (ADT) modeling • Hierarchy summarization abstraction • Predicate discovery • Evaluation • Fast and infers all available optimizations • Optimizations give speedup up to 12x
Shape Analysis Overview Predicate Discovery Graph { @rep nodes @rep edges … } Set { @rep cont … } … … HashMap-Graph Set Spec Shape Analysis Graph Spec Tree-based Set Concrete ADT Implementations in Galois library ADTSpecifications Boruvka.java Optimized Boruvka.java
ADT Specification Abstract ADT state by virtual set fields Graph<ND,ED> { @rep set<Node> nodes @rep set<Edge> edges Set<Node> neighbors(Node n); } ... Set<Node> S1 = g.neighbors(n); ... @locks(n + n.rev(src) + n.rev(src).dst + n.rev(dst) + n.rev(dst).src) @op( nghbrs = n.rev(src).dst + n.rev(dst).src , ret = new Set<Node<ND>>(cont=nghbrs) ) Boruvka.java Graph Spec Assumption: Implementation satisfies Spec
Modeling ADTs Graph<ND,ED> { @rep set<Node> nodes @rep set<Edge> edges @locks(n + n.rev(src) + n.rev(src).dst+n.rev(dst) + n.rev(dst).src) @op( nghbrs= n.rev(src).dst+ n.rev(dst).src , ret = new Set<Node<ND>>(cont=nghbrs) ) Set<Node> neighbors(Node n); } c src dst src dst dst src a b Graph Spec
Modeling ADTs Abstract State Graph<ND,ED> { @rep set<Node> nodes @rep set<Edge> edges @locks(n + n.rev(src) + n.rev(src).dst+n.rev(dst) + n.rev(dst).src) @op( nghbrs= n.rev(src).dst+ n.rev(dst).src , ret = new Set<Node<ND>>(cont=nghbrs) ) Set<Node> neighbors(Node n); } nodes edges c src dst ret cont src dst dst src a b Graph Spec nghbrs
Organization • Parallelization of graph algorithms in Galois system • Speculative execution • Example: Boruvka MST algorithm • Optimization opportunities • Reduce speculation overheads • Analysis problem: LockSet shape analysis • Lockset shape analysis • Abstract Data Type (ADT) modeling • Hierarchy summarization abstraction • Predicate discovery • Evaluation • Fast and infers all available optimizations • Optimizations give speedup up to 12x
Abstraction Scheme • Parameterized by set of LockPaths: L(Path)o . o ∊ Path Locked(o) • Tracks subset of must-be-locked objects • Abstract domain elements have the form: Aliasing-configs 2LockPaths … S1 S1 S2 S2 L(S2.cont) L(S2.cont) L(S1.cont) L(S1.cont) (S1 ≠S2) ∧ L(S1.cont) ∧ L(S2.cont) cont cont cont cont
Joining Abstract States ( L(y.nd) ) ( L(y.nd) L(x.rev(src)) ) ( () L(x.nd) ) ( L(y.nd) ) ( () L(x.nd) ) ( L(y.nd) ) ( () L(x.nd) ) Aliasing is crucial for precision May-be-locked does not enable our optimizations #Aliasing-configs: small constant (6)
Example Invariant in Boruvka GSet<Node> wl = new GSet<Node>(); wl.addAll(g.getNodes()); GBag<Weight> mst = new GBag<Weight>(); foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a); } • The immediate neighbors • of a and lt are locked lt a ( a ≠ lt ) ∧ L(a) ∧ L(a.rev(src)) ∧ L(a.rev(dst)) ∧ L(a.rev(src).dst)∧ L(a.rev(dst).src) ∧ L(lt) ∧ L(lt.rev(dst)) ∧ L(lt.rev(src)) ∧ L(lt.rev(dst).src) ∧ L(lt.rev(src).dst) …..
Heuristics for Finding Paths S • Hierarchy Summarization (HS) • x.( fld )* • Type hierarchy graph acyclic bounded number of paths • Preflow-Push: • L(S.cont) ∧ L(S.cont.nd) • Nodes in set S and their data are locked Set<Node> cont Node nd NodeData
Footprint Graph Heuristic • Footprint Graphs (FG)[Calcagno et al. SAS’07] • All acyclic paths from arguments of ADT method to locked objects • x.( fld | rev(fld) )* • Delaunay Refinement: L(S.cont) ∧ L(S.cont.rev(src)) ∧ L(S.cont.rev(dst)) ∧ L(S.cont.rev(src).dst) ∧ L(S.cont.rev(dst).src) • Nodes in set S and all of their immediate neighbors are locked • Composition of HS, FG • Preflow-Push: L(a.rev(src).ed) FG HS
Organization • Parallelization of graph algorithms in Galois system • Speculative execution • Example: Boruvka MST algorithm • Optimization opportunities • Reduce speculation overheads • Analysis problem: LockSet shape analysis • Shape analysis • Abstract Data Type modeling • Hierarchy summarization abstraction • Predicate discovery • Evaluation • Fast and infers all available optimizations • Optimizations give speedup up to 12x
Experimental Evaluation • Implement on top of TVLA • Encode abstraction by 3-Valued Shape Analysis [SRW TOPLAS’02] • Evaluation on 4 Lonestar Java benchmarks • Inferred all available optimizations • # abstract states practically linear in program size
Impact of Optimizations for 8 Threads 8-core Intel Xeon @ 3.00 GHz
Related Work • Safe programmable speculative parallelism [Prabhu et al. PLDI’10] • Focused on value speculation on ordered algorithms • Different rollback freedom condition • Transactional Memory compiler optimizations [Harris et al. PLDI’06, Dragojevic et al. SPAA’09] • Similar optimizations • Don’t target rollback freedom • Imprecise for unbounded data-structures • Optimizations for parallel graph programs [Mendez-Lojo et al. PPOPP’10] • Manual optimizations • Failsafe subsumes cautious • Verifying conformance of ADT implementation to specification • The Jahob project (Kuncak, Rinard, Wies et al.)
Conclusion • New application for static analysis • Optimization of optimistically parallelized graph programs • Novel shape analysis • Utilize observations on the structure of concrete states and programming style • Enables optimizations crucial for performance
Outline of Boruvka MST Code GSet<Node> wl = new GSet<Node>(); wl.addAll(g.getNodes()); GBag<Weight> mst = new GBag<Weight>(); foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a); } Pick arbitrary worklist node Find lightest neighbor Update neighbors of lightest Update worklist and MST
Approximating Sets of Locked Objects Reachability-based Scheme S1 S2 S1 S2 cont cont cont cont Reach(S1.cont) Reach(S2.cont) Reach(S1.cont) Reach(S2.cont) Reach(S2.cont) Reach(S2.cont) Reach(S2.cont)
Approximating Sets of Locked Objects Hierarchy Summarization Scheme S1 S2 (S1 ≠S2) ∧ L(S1.cont) ∧ L(S2.cont) S1 S2 cont cont L(S2.cont) L(S1.cont) cont cont L(Path) o . o ∊ Path Locked(o) L(S2.cont) L(S1.cont)
Enabling Optimizations in Galois • Graph API uses flags to enable/disable locking and storing undo actions • removeEdge(Node src, Node dst, Flag f); ALL LOCKS UNDO Lock(src) Lock(dst) Lock( (src,dst) ) addEdge(src, dst); NONE Challenge: Find minimal flag per ADT method call Solution: Lockset analysis
Optimization Conditions • set of definitely acquiredlocks per program point • Given method call at : • Program point is failsafe if: • Infer from by simple backward analysis
Modeling ADTs Abstract State Graph<ND,ED> { @rep set<Node> ns; // nodes @rep set<Edge> es; // edges @locks(n + n.rev(src) + n.rev(src).dst+n.rev(dst) + n.rev(dst).src) @op( nghbrs= n.rev(src).dst+ n.rev(dst).src , ret = new Set<Node<ND>>(cont=nghbrs) ) Set<Node> neighbors(Node n); } ret ns es nghbrs a b cont 7 5 c d 3 dst src 2 4 ed a b Graph Spec 5
Failsafe Points – Eliminating Undo Actions a lt CAUTIOUS OPERATOR [Mendez et al. PPOPP’10] g.neighbors(lt); GSet<Node> wl = new GSet<Node>(); wl.addAll(g.getNodes()); GBag<Weight> mst = new GBag<Weight>(); foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a); } 7 c d 3 2 4 a b 5 Graph Node Graph Edge Edge Data
Boruvka’sMinimum Spanning Tree Algorithm 7 1 6 c d e f 7 2 4 4 3 a b g 3 5 • Build MST bottom up • repeat { • pick arbitrary active node ‘a’ • merge with lightest neighbor ‘lt’ • add edge ‘a-lt’ to MST • } until graph is a singular node
Failsafe Points – Eliminating Undo Actions a lt CAUTIOUS OPERATOR [Mendez et al. PPOPP’10] g.neighbors(lt); GSet<Node> wl = new GSet<Node>(); wl.addAll(g.getNodes()); GBag<Weight> mst = new GBag<Weight>(); foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a); } Graph Node 7 Graph Edge c d Edge Data 3 2 4 a b 5 New Lock Redundant Acquired
lt Redundant Locking Example a GSet<Node> wl = new GSet<Node>(); wl.addAll(g.getNodes()); GBag<Weight> mst = new GBag<Weight>(); foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a); } Graph Node 7 Graph Edge c d Edge Data 3 2 4 a b 5 New Lock Redundant Acquired
Boruvka’sMinimum Spanning Tree Algorithm GSet<Node> wl = new GSet<Node>(); wl.addAll(g.getNodes()); GBag<Weight> mst = new GBag<Weight>(); foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a); } 7 1 6 c d e f 7 2 4 4 3 a b g 3 5 • Build MST iteratively • Pick random active node • Contract edge with lightest neighbor
Parallelism in Boruvka’s Algorithm GSet<Node> wl = new GSet<Node>(); wl.addAll(g.getNodes()); GBag<Weight> mst = new GBag<Weight>(); foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a); } 7 1 6 c d e f f 3 2 4 4 a b g 5 • Dependences between activities are functions of runtime values • Parallelism cannot be uncovered at compile time in general • Don’t Care Non-Determinism • All produced MSTs correct and optimal
Abstraction of a Single State State after first loop in Boruvka ( a lt ) ∧ L(a) ∧ L(a.rev(src)) ∧ L(a.rev(dst)) ∧ L(a.rev(src).dst) ∧ L(a.rev(dst).src) ∧ UniqueRef(EdgeData) 7 lt c d 3 2 4 a a b 5
Hierarchy Summarization Intuition aNghbrsltNghbrs mst nIter wl g Iterator<Node> Set<Node> Gset<Node> GBag<Weight> Graph all ns cont gcont es bcont past, at,future Edge e,an src,dst Node a, lt, n ed Weight w,wan,minW nd Void