Problem Context #1: The design of efficient multiprocessors

A Unified Framework for Constraint Based Shared Memory Consistency Analysisa presentation in CP+CV’04Yue Yang, Ganesh Gopalakrishnan, Gary Lindstrom, Konrad Slind School of ComputingUniversity of UtahSupported in part by SRC Contract 1031.001 and NSF Grants CCR-0081406, 0219805

Problem Context #1: The design of efficient multiprocessors Efficient Multiprocessors have Efficient Shared Memory Systems ... because CPUs grow faster faster than memory systems

Efficient Shared-memory Multiprocessor Systems employ • Weak memory models • Controlled ways to postpone global view updates . Advanced consistency protocols, advanced OS libraries, ... depend on it Problem: How to specify these weak memory models? How to use the specification in practice to support verification activities?

Problem Context #2: The design of multithreaded software Language-level weak memory models are being studied ... because languages with explicit threading cannot be implemented efficiently on a wide variety of platforms Examples: Java C# OpenMP Re-entrant device drivers OS code that runs very fast with minimal locking ...

Characteristics of language-level memory models • Weak memory models • Controlled ways to postpone global view updates • Encompass compiler optimizations that “make sense” Problem: How to specify language-level memory models? How to use the specification in practice to support verification activities?

Answer: Constraints seem very attractive! • Constraint-based specification of several architecture-level memory models • The use of these specifications to enable formal comparisons among the models • The use of these specs for post-silicon verification (work in collaboration with Intel) (FOCUS OF THIS TALK) • Analysis of proposals for language-level memory models (Yang’s dissertation) • The use of specs of language-level memory models for • Memory-model sensitive program analysis (preliminary work in Yang’s dissertation) • ... for certifying compilers that exploit language-level memory models (future work)

Publications • Yang, Y., Gopalakrishnan, G., Lindstrom, G., and Slind, K., “Analyzing the Intel Itanium Memory Memory Ordering Rulesusing Logic Programming and SAT,” Charme 2003, LNCS 2860, October 2003. • Yang, Y., Gopalakrishnan, G., Lindstrom, G., and Slind, K., “Nemos: A framework for axiomatic and executable specifications of memory consistencymodels,” IPDPS 2004, Santa Fe, NM, April 2004. • Yang, Y., Gopalakrishnan, G., and Lindstrom, G., “A Constraint Based Approach for Specifying Memory Consistency Models,” Journal of Logic Programming (submitted to a special issue on constraints) • Gopalakrishnan, G., Yang, Y., and Hemanthkumar, S., “QB or not QB: An efficient verification tool for memory orderings,” Accepted by CAV’04

Proof of concept Software • Nemos (Yue Yang) • Defines memory models via Constraint Logic Programs • Constraint-Prolog code available for experimentation • DefectFinder (Yue Yang) • Constraint-Prolog code that models race and atomicity analysis • works in the small (no loops at present) • underlying shared memory model given as an explicit parameter • QBF-based Memory Order Checker • written by the PI in Ocaml • Compiles memory order rules written in HOL to QBF • Does not scale yet • Might provide benchmarks to tune QBF-tools • SAT-based Memory Order Checking Tool • Present version written by the PI in Ocaml • Next version expected to be formally derived • Replaces Constraint-Prolog version for Itanium • One “real” execution given by Intel was successfully run

Another effort involving constraints (presented at DCC’04): Limited Observability Run-time Verification (with Ching Tsun Chou)

The rest of the talk: Post-Si verification of MP Orderings using Constraints

Weak memory models allow multiple executions... st c,1 ; st d,2 ld d; ld c CPU CPU Memory One possible execution... st c,1 ; st d,2 ld d, 2; ld c, 0 Impossible under SC Possible under Itanium Another execution... st c,1 ; st d,2 ld d, 2; ld c, 1 Possible under SC and under Itanium

Commercial Weak Memory Model Specs are Complex • Intel’s original specification was pretty voluminous • They later issued a formal spec that clarified many things • Yet, their “formal” spec left many things informal • It was purely “on paper” (no machine-readable formal spec) • Our exercise was to take Intel’s semi-formal spec and capture it in HOL • Result: 36 pages of Intel’s formal spec in 3 pages of HOL spec • Can prove “challenge theorems” (future work)

Basic idea behind Intel’s Formal Spec (which we follow in our formal spec) Make it look like SC so that people have less trouble understanding! legalItanium(ops) = Existsorder. ( requireStrictTotalOrder ops order /\ requireWriteOperationOrder ops order /\ requireItProgramOrder ops order /\ requireMemoryDataDependence ops order /\ requireDataFlowDependence ops order /\ requireCoherence ops order /\ requireAtomicWBRelease ops order /\ requireSequentialUC ops order /\ requireNoUCBypass ops order /\ requireReadValue ops order SC(ops) = Existsorder. ( requireStrictTotalOrder ops order /\ requireProgramOrder ops order /\ requireReadValue ops order Call it “otherOrder”

But, how do we check executions against such specs? legalItanium(ops) = Existsorder. ( requireStrictTotalOrder ops order /\ requireWriteOperationOrder ops order /\ requireItProgramOrder ops order /\ requireMemoryDataDependence ops order /\ requireDataFlowDependence ops order /\ requireCoherence ops order /\ requireAtomicWBRelease ops order /\ requireSequentialUC ops order /\ requireNoUCBypass ops order /\ requireReadValue ops order SC(ops) = Existsorder. ( requireStrictTotalOrder ops order /\ requireProgramOrder ops order /\ requireReadValue ops order Execution 1 Execution 2 st c,1 ; st d,2 ld d, 2; ld c, 0 st c,1 ; st d,2 ld d, 2; ld c, 1 e.g., which execution is legal under which memory model ?

Why care about Execution Validation? • In complex systems, FV helps eliminate (most) bugs • Must verify final silicon also (as far as possible) • (Note: this is different from “fabrication fault” testing) • FV can help immensely during Post-Silicon Verification !! • - This is like “runtime verification” ala. Havelund, Rosu, Lee, ...

Post-Si verification of MP Orderings today (oversimplified) assembly program 1 assembly program n Run repeatedly to catch one interleaving that might reveal bug ... New MP System ... Check every execution against ordering rules for compliance assembly execution 1 assembly execution n * This is done ad-hoc * How to make this formal and efficient ? * How to capitalize on repeated re-runs ?

Initial approach tried and abandoned... requireProgramOrder ops order = Forall i,j : ops ( orderedByAcquire i j \/ orderedByRelease i j \/ orderedByFence i j ) ==> order i j ( % Rule (ACQ): ACQ>>I ..... #\/ % Rule (REL): Op_j #= StRel #/\ ( IsWr_i #==> (WrType_i #= Local #/\ WrType_j #= Local #\/ WrType_i #= Remote #/\ WrType_j #= Remote #/\ WrProc_i #= WrProc_j) ) .... #==> Oij. IMPOSES CONSTRAINT ON MATRIX ENTRY Oij

Our new SAT-based execution formal verification method Execution legalItanium(ops) = Existsorder. ( requireStrictTotalOrder ops order /\ requireWriteOperationOrder ops order /\ requireItProgramOrder ops order /\ requireMemoryDataDependence ops order /\ requireDataFlowDependence ops order /\ requireCoherence ops order /\ requireAtomicWBRelease ops order /\ requireSequentialUC ops order /\ requireNoUCBypass ops order /\ requireReadValue ops order st c,1 ; st d,2 ld d, 2; ld c, 1 Hand-derivation now... ...to be automated Program capturing memory ordering rules SAT instance Sat Solver Find out which instructions violated what ordering rules... Extract Unsat core (currently done using Zcore) SAT UNSAT Explanation (How things may bypass each other...)

Have built tool for tuple-generation that addresses many details: (1) Expansion into tuples with variable address allocation P1: St a,1; Ld r1,a <1>; St b,r1 <1>; P2: Ld.acq r2,b <1>; Ld r3,a <0>; Tuple 1 {id=0; proc=0; pc=0; op= St; var=0; data=1; wrID=0; wrType=Local; wrProc=0; reg=-1; useReg=false}; {id=1; proc=0; pc=0; op= St; var=0; data=1; wrID=0; wrType=Remote; wrProc=0; reg=-1; useReg=false}; {id=2; proc=0; pc=0; op= St; var=0; data=1; wrID=0; wrType=Remote; wrProc=1; reg=-1; useReg=false}; {id=3; proc=0; pc=1; op= Ld; var=0; data=1; wrID=-1; wrType=DontCare; wrProc=-1; reg=0; useReg=true}; {id=4; proc=0; pc=2; op= St; var=1; data=1; wrID=4; wrType=Local; wrProc=0; reg=0; useReg=true}; {id=5; proc=0; pc=2; op= St; var=1; data=1; wrID=4; wrType=Remote; wrProc=0; reg=0; useReg=true}; {id=6; proc=0; pc=2; op= St; var=1; data=1; wrID=4; wrType=Remote; wrProc=1; reg=0; useReg=true}; {id=7; proc=1; pc=0; op= LdAcq; var=1; data=1; wrID=-1; wrType=DontCare; wrProc=-1; reg=1; useReg=true}; {id=8; proc=1; pc=1; op= Ld; var=0; data=0; wrID=-1; wrType=DontCare; wrProc=-1; reg=2; useReg=true} ... Tuple 8

How the SAT encoding is achieved... (actually QBF that’s unrolled...) Example Execution • Store c viewed at P1 for modeling bypassing • Store c viewed at P1 for modeling global visibility • Store c viewed at P2 for modeling global visibility • Store d viewed at P1 for modeling bypassing • Store d viewed at P1 for modeling global visibility • Store d viewed at P2 for modeling global visibility • Ld d viewed at P2 for modeling read value • Ld c viewed at P2 for modeling read value st c,1 ; st d,2 ld d, 2; ld c, 0 Break it down into “tuples” 8 tuples obtained legalItanium(ops) = Exists order. ( requireStrictTotalOrder ops order /\ requireOtherOrderItanium ops order /\ requireReadValue ops order SC(ops) = Exists order. ( requireStrictTotalOrder ops order /\ requireOtherOrderSC ops order /\ requireReadValue ops order

Constraint Encoding Approach #1 • n logn approach (“small domain” encoding) • Attach a word w_t of 2 bits to each tuple t • Tuple i before Tuple j --> Assert wi < wj • StrictTotalOrder --> Assert that the wt words are distinct • Smaller # of Boolean Vars • Much Harder SAT instances (abandoned for now) Illustration on 4 tuples requireStrictTotalOrder ops order requireOtherOrder ops order requireReadValueops order For all i, j: xi1,xi0 != xj1, xj0 x00 x01 x10 x11 A system of constraints with primitive constraint xi1, xi0< xj1, xj0 x20 x21 x30 x31

Constraint Encoding Approach #2 • n n approach (“e_ij” encoding) • Assign a matrix position mij for each pair of tuples ti and tj • Tuple i before Tuple j --> Assert mij true • StrictTotalOrder --> Assert Irreflexitivity, Transitivity, Totality • Larger # of Boolean Vars • Easier SAT instances (being pursued now) Illustration on 4 tuples • Forall i : ~mii • Forall i,j : mij \/ mji • Forall i,j,k : mij /\ mjk • => mik requireStrictTotalOrder ops order requireOtherOrder ops order requireReadValueops order i . . . . j . mij . . . . . . . . . . A system of constraints with primitive constraint mij

Transformation of HOL specs to generate constraints i k atomicWBRelease(ops,order) = forall (i in ops).(j in ops).(k in ops). (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) /\ (i.wrID = k.wrID) /\ order(i,j) /\ order(j,k) ==> (j.wrID = i.wrID) atomicWBRelease(ops,order) = forall (i in ops).(j in ops).(k in ops). (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) /\ (i.wrID = k.wrID) /\ ~(j.wrID = i.wrID) ==> ~(order(i,j) /\ order(j,k)) atomicWBRelease(ops,order) = forall (i in ops). (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) ==> forall (k in ops). (i.wrID = k.wrID) ==> forall (j in ops). ~(j.wrID = i.wrID) ==> ~(order(i,j) /\ order(j,k)) Initial Spec j Applying Contrapositive After Reducing quantifier Scopes

Functional (Ocaml) Program Derivation from HOL Specs: atomicWBRelease(ops,order) = forall (i in ops). (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) ==> forall (k in ops). (i.wrID = k.wrID) ==> forall (j in ops). ~(j.wrID = i.wrID) ==> ~(order(i,j) /\ order(j,k)) atomicWBRelease(ops) = forall(i,ops,wb(i)) wb(i) = if ~((attr_of i.var=WB) & (i.op=StRel) & (i.wrType=Remote) then true else forall(k,ops,wb1(i,k)) wb1(i,k) = if ~(i.wrID=k.wrID) then true else forall(j,ops,wb2(i,k,j)) wb2(i,k,j) = if (j.wrID=i.wrID) then true else ~(order(i,j) & order(j,k)) forall(i,S, e(i)) = for all i in S : e(i) (* foldr( map (fn i -> e(i)) (S) (&), true) *) Transformed Spec Functional Program that generates the constraints (will be automated)

Main Result: Formally hand-derived code worked first time on all 17 of Intel’s litmus tests! Previous ad-hoc Prolog code had to be massively debugged Formal derivation ensures that HOL axioms are preserved in code that generates SAT instances (very error-prone coding otherwise)

Partial Evaluation Approach under “nn” encoding • Can pre-generate these for various ‘n’ and save • We loaded SatZoo with these constraints and • checkpointed its runnable image for various ‘n’ • Incremental SAT solvers can really help! • Forall i : ~mii • Forall i,j : mij \/ mji • Forall i,j,k : mij /\ mjk • => mik requireStrictTotalOrder ops order requireOtherOrder ops order requireReadValueops order i . . . . j . mij . . . . . . . . . . Constraints on mij • These are unit-clause rich, and • hence very easy to re-generate • If *same* test re-run, variation • will only be in ReadValue rule

Recent Practical Test Run Suggests Bounded Cycle Checking: • Forall i : ~mii • Forall i,j : mij \/ mji • Forall i,j,k IN BOUNDED RANGES mij /\ mjk => mik requireStrictTotalOrder ops order requireOtherOrder ops order requireReadValueops order i . . . . j . mij . . . . . . . . . . • Process OtherOrder First • Incremental Constraint Gen • for ReadValue

Latest results • Intel-provided test of 124 instructions • Generated about 246 tuples • Will not finish unless we rework transitivity • Generated SAT instance with transitivity suppressed, and it found the violation (fluke) • 115,637 variables and 164,848 clauses • Zcore found UNSAT core of 9 clauses !! • Better methods to handle transitivity to be implemented • Upper triangular matrix alone will do • Lazy Transitivity introduction • Generate constraints w/o transitivity • If UNSAT, done • Else find SAT instance, force transitivity, re-check...

Gist of results • 1. n n method is superior despite using more bits • 2. Checkpointing method does pay-off upto 64 tuples... • 3. Present approach won’t allow more than 512 tuples • #clauses = 7 * n^3 + ... where n is the number of tuples • #variables = 2 * n^3 + ... • 4. Several solutions to be considered: • Generate upper-triangular alone • Generate w/o transitivity ; lazily introduce it • Look for natural limits due to CPU resources • Exploit barriers in code • Heuristically enumerate cycles of increasing sizes

Table of Results (details in paper) SAT-instance generation time for n logn method Tuples Total Order Other Order 32 0.2 1.6 64 1.2 17.1 128 5.7 179.0 SAT-instance generation time for n n method Tuples Total Order Other Order 32 0.5 0.1 64 4.3 0.9 128 34.2 9.0 SAT-checking times Tuples n logn nn Monolith TotalOrd OtherOrd Monolith TotalOrd OtherOrd 32 9.6 0.6 4.3 0.33 0.69 0.05 64 247.17 29.53 37.6 2.73 6.17 0.5 128 abort 1341 abort 164.8 145.6 351.1

Concluding Remarks • Constraint-based shared memory consistency specification • is advantageous in many ways • Preliminary work was done using Constraint-Prolog • Recent work being done using SAT • Need to focus on methods that can combine static constraint solving • (in program code) and explicit constraint solving • Consider Symbolic litmus-test verification as driving problem • - Application: MP code optimization (synchronization removal) • Non-standard interpretation methods might be able to combine • static constraint evaluation and explicit constraint evaluation

Problem Context #1: The design of efficient multiprocessors