Verification of Shared Memory Consistency Protocols Against Weak Memory Models

Shared Memory Consistency Protocol Verification against Weak Memory Models: Refinement via Model-checking
Prosenjit Chatterjee, Hemanthkumar Sivaraj, Ganesh Gopalakrishnan
School of Computing, University of Utah
http://www.cs.utah.edu/formal_verification/
pchatterjee@nvidia.com, {hemanth, ganesh}@cs.utah.edu
Supported by NSF awards CCR 9987516 and 0081406, and an equipment gift from Intel Corporation.

Shared memory multiprocessors
• Desktop machines [Figure: CPUs and memory on a snoopy bus]
• Servers and Supercomputers [Figure: CPUs attached to directories (dir)]

How is the programmer’s view classically specified?
• Logical view: “sequential consistency” (processors sharing one memory)
• (“Coherence” means “per location SC”)
• One disallowed scenario (initial memory contents = 0):
-- cpu 1: st(b,2); ld(a,0);
-- cpu 2: st(a,1); ld(b,0);
• Peterson? No!
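The disallowed scenario can be checked by brute force. The sketch below is illustrative only and is not taken from the slides: it assumes a toy model in which sequential consistency means "some interleaving of the two programs run against a single memory", and the helper names (`interleavings`, `sc_outcomes`) are hypothetical. It enumerates all interleavings and collects the load values each one produces.

```python
def interleavings(p1, p2):
    """Yield every merge of p1 and p2 that keeps each program's own order."""
    if not p1:
        yield list(p2)
        return
    if not p2:
        yield list(p1)
        return
    for rest in interleavings(p1[1:], p2):
        yield [p1[0]] + rest
    for rest in interleavings(p1, p2[1:]):
        yield [p2[0]] + rest


def sc_outcomes(p1, p2, initial=0):
    """Run every interleaving against one memory and collect the load values."""
    outcomes = set()
    for run in interleavings(p1, p2):
        memory = {}
        loads = {}
        for cpu, op, addr, val in run:
            if op == "st":
                memory[addr] = val
            else:  # a load returns the current memory contents
                loads[(cpu, addr)] = memory.get(addr, initial)
        outcomes.add(tuple(sorted(loads.items())))
    return outcomes


# The two programs from the slide; load values are left open (None).
P1 = [("P1", "st", "b", 2), ("P1", "ld", "a", None)]
P2 = [("P2", "st", "a", 1), ("P2", "ld", "b", None)]

outcomes = sc_outcomes(P1, P2)
both_zero = ((("P1", "a"), 0), (("P2", "b"), 0))
print(both_zero in outcomes)  # False: no SC interleaving lets both loads see 0
```

Under this toy model only three outcomes are reachable, and the one where both loads return 0 is not among them, which is why the slide marks the scenario as disallowed.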

Growing CPU / Memory performance gap necessitates weakenings…
• ‘Bypassing’ (read back own store before others)
• Aggressive load/store reorderings
• Strong orderings only at acquires/releases
• …all that and more!

Overall Features of Weak Memory Models • Support ‘ordinary’ as well as ‘special’ loads and stores • Support fences and synchronization primitives • Orderings may even depend on dynamic context • => Provide a much larger range of load-values Therefore… • Writing a formal specification is highly non-trivial • Writing a spec that supports verification is even trickier

A variety of highly intricate weak memory models exist PowerPC model Alpha model Sparc TSO / PSO / RMO Itanium RC_tso Java mem model

One almost wishes to go back to SC… “sequential consistency IS good” • Simplifies programming • Some hardware tricks to hide latencies It does not seem a realistic goal for now… • Range of such tricks limited • Complexity for end-users is containable

The Verification Problem
Given
• a formal specification of a weak consistency model (SPEC)
• a finite-state model of the shared memory system (IMP)
Verify that
• the executions of IMP ⊆ the executions allowed by SPEC
Our work enables this check to be carried out using finite-state reachability

Related Work
For SC
• Qadeer [CAV’99, SRC TR #176]
• Condon, Hu, et al. [SPAA’01]
• Nalumasu et al. [CAV’98]
• Dist. Computing Special Issue ‘99
• MPV Workshop [Post FMCAD’00]
For Weak Models
• Qadeer [MPV workshop]
• Condon et al. [HPCA’99]
• Ghughal and Gopalakrishnan [FMPPTA’00]

Our Emphasis • Simple and intuitive SPECs • Support a wide range of memory models • Support automated (finite-state) verification • Avoid backtracking search over SPEC’s executions • Avoid bloating state-space beyond that of IMP

Verification Criterion Illustrated…
• Program: P1: st(a,1); ld(a); P2: ld(a);
• Show that: IMP exhibits the execution st(a,1); ld(a,1); ld(a,0) implies SPEC allows the same execution of the same program

Idea: Employ a model-checker to establish refinement
• Run IMP and an executable SPEC side by side on the same events (load, store, load, …) and check that load values agree at every load
• Must do a non-backtracking search over SPEC’s executions
• Therefore, SPEC must be deterministic with respect to recorded events
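A minimal sketch of the lockstep idea, under stated assumptions: this is not the Murphi setup from the talk, and the one-address IMP and SPEC below are hypothetical toys invented for illustration. IMP is explored by breadth-first reachability; for each event IMP emits, SPEC takes the single step that matches it, so nothing is ever searched or backtracked on the SPEC side.

```python
from collections import deque


def check_refinement(imp_init, imp_step, spec_init, spec_step):
    """Explore IMP by reachability while SPEC follows in lockstep.

    imp_step(state)         -> list of (event, next_state) pairs
    spec_step(state, event) -> the single next SPEC state, or None if SPEC
                               cannot perform that event
    Returns None if every IMP event is matched, else a counterexample trace.
    """
    start = (imp_init, spec_init)
    seen = {start}
    queue = deque([(start, [])])
    while queue:
        (imp, spec), trace = queue.popleft()
        for event, imp_next in imp_step(imp):
            spec_next = spec_step(spec, event)  # deterministic: no search here
            if spec_next is None:
                return trace + [event]
            pair = (imp_next, spec_next)
            if pair not in seen:
                seen.add(pair)
                queue.append((pair, trace + [event]))
    return None


def make_imp(invalidate_on_store):
    """Toy IMP (hypothetical): P1 stores 1 once; P2 loads through a cache."""
    def step(state):
        mem, cache, stored = state
        moves = []
        if not stored:
            new_cache = None if invalidate_on_store else cache
            moves.append((("st", 1), (1, new_cache, True)))
        value = mem if cache is None else cache  # a cache hit may be stale
        moves.append((("ld", value), (mem, value, stored)))
        return moves
    return step


def spec_step(mem, event):
    """Toy SPEC: one memory cell; a load must return its current contents."""
    kind, value = event
    if kind == "st":
        return value
    return mem if value == mem else None


IMP_INIT = (0, None, False)  # (memory, P2's cached copy, has P1 stored yet?)
good = check_refinement(IMP_INIT, make_imp(True), 0, spec_step)
bad = check_refinement(IMP_INIT, make_imp(False), 0, spec_step)
print(good)  # None: every event of the invalidating IMP is matched by SPEC
print(bad)   # a trace ending in a stale load that SPEC cannot match
```

The joint state space is the set of reachable (IMP state, SPEC state) pairs. Because `spec_step` returns at most one successor per event, each IMP state is paired with few SPEC states, which is the property the slide asks of SPEC.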

What events do we record? Not just Loads and Stores!
• Program: P1: st(a,1); ld(a); P2: ld(a);
• Imp execution: st(a,1); ld(a,1); ld(a,0)
• SPEC = Carbon-copy of Imp (P1 and P2, each with ld/st ports, LB and SB, sharing memory M)
• SPEC choice: st(a,1); ld(a,1); ld(a,1) (eh?)
-- st(a,1) drained to M
-- ld(a,1), ld(a,1) read from M
• SPEC choice: st(a,1); ld(a,1); ld(a,0) (phew!)
-- st(a,1) in SB; ld(a,1) from SB
-- ld(a,0) from M

Use Visibility Order style SPECs
• Already growing in use (Itanium spec, Neiger, Condon, …)
• Helps export internal events to determinize SPEC’s executions
• Defines Read Values to depend on most recent write
• Choices revealed:
-- Program: P1: st(a,1); ld(a); P2: ld(a);
-- Imp: st_L(a,1); ld(a,1); ld(a,0); st_G(a,1)
-- SPEC (Carbon-copy of Imp): st_L(a,1); ld(a,1); then either st_G(a,1); ld(a,1) or ld(a,0); st_G(a,1)

Example of Visibility Order Spec (Condon, HPCA’99)
In Visibility Order style
• program order
• a total order of LD, ST_L, ST_G
• An execution is in TSO if
-- (Memory order constraints)
-- conditions on split stores
• Read value rule: value of LD ‘X’ ==
-- most recent ST_L, when ST_G is after X (local bypassing)
-- most recent ST_G, otherwise (local bypassing not exercised)
In non-Visibility Order style
• program order
• memory order
• An execution is in TSO if
-- (Memory order constraints)
-- X before Y in program order /\ isLD(X) /\ isST(Y) => X before Y in memory order
-- X MB Y (a memory barrier between X and Y in program order) => X before Y in memory order
• Read value rule: value of LD ‘X’ ==
-- value of closest store ‘Y’ before or after ‘X’ in
-- (local bypassing detail is messy)
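The visibility-order read-value rule can be checked in one linear pass over a recorded event sequence. The sketch below is one reading of the rule, not the authors' generated SPEC: the event encoding and the function name are hypothetical, and it assumes that each processor's ST_G events appear in the same order as its ST_L events (a FIFO store buffer), standing in for the "conditions on split stores" that the slide does not spell out.

```python
def check_tso_visibility_order(order, initial=0):
    """Check load values along one total order of ld / st_L / st_G events.

    Each event is (kind, proc, addr, val). A load by proc returns
      - the most recent st_L by proc to addr whose st_G has not yet
        appeared (local bypassing), otherwise
      - the most recent st_G to addr by any processor (or the initial value).
    """
    memory = {}   # globally visible value per address
    pending = {}  # proc -> list of (addr, val) stored locally, not yet global
    for kind, proc, addr, val in order:
        local = pending.setdefault(proc, [])
        if kind == "st_L":
            local.append((addr, val))
        elif kind == "st_G":
            # Assumption: global stores drain in local-store (FIFO) order.
            if not local or local[0] != (addr, val):
                return False
            local.pop(0)
            memory[addr] = val
        elif kind == "ld":
            bypass = [v for (a, v) in local if a == addr]
            expected = bypass[-1] if bypass else memory.get(addr, initial)
            if expected != val:
                return False
        else:
            raise ValueError("unknown event kind: " + kind)
    return True


# P1: st(a,1); ld(a)   P2: ld(a)   -- P1 bypasses, P2 still sees the old value.
bypassing = [("st_L", "P1", "a", 1), ("ld", "P1", "a", 1),
             ("ld", "P2", "a", 0), ("st_G", "P1", "a", 1)]
# Same program, but P2 claims to read 1 before the store is globally visible.
too_early = [("st_L", "P1", "a", 1), ("ld", "P1", "a", 1),
             ("ld", "P2", "a", 1), ("st_G", "P1", "a", 1)]

print(check_tso_visibility_order(bypassing))  # True
print(check_tso_visibility_order(too_early))  # False
```

Because the split-store events are part of the recorded sequence, the checker never has to guess when a store became globally visible, so no backtracking is needed.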

Our Contributions • Visibility order SPECs for a wide range of mem models • Built executable SPEC generator prototype • (runnable over web) • Verification of refinement using Parallel Murphi • (ported to MPI at Utah) • Verification without bloating IMP’s state-space • and without backtracking on SPEC’s executions • Two snoopy-bus protocols modeled after Alpha and Itanium • Two snoopy protocols where temporal order != visibility order • One directory-based protocol (‘Avalanche’ multiprocessor)

Details of our solution: Addressing the large “abstraction distance” between IMP and SPEC

Approach: Exploit Bug-classification
[Figure: cpu with execution pipeline, ld, LB, SB, L1 cache, L2 cache]
… Inside Directories, Interconnects, … Inside CPU chips (More design groups have control over this) (Fewer design groups have control over this)
So…develop Intermediate Abstraction that Retains External Partition

The Intermediate Abstraction
[Figure: SPEC (Visibility order, Read-value rule), Intermediate Abstraction, and IMP (… dir dir); the two comparison steps are labelled FUTURE WORK and THIS PAPER]
• Retain internal partition
• Simplify external partition

External Partition Replacement Depends on SPEC Memory Model Strong Weak Weakest Hybrid PC PowerPC PRAM Slow Memory Cache C* Causal C* Itanium Weak C* Entry C* Release C* TSO S C* IBM370 PSO RMO Alpha ( ‘C*’ means ‘Consistency’ )

Abstraction Method for External Partition [Figure: events labelled local and global]

Creating the Intermediate Abstraction
• IMP: CPU1 and CPU2 (each with Pipe, ld/st, RB, SB) over a Snoopy-bus or Directory-based Memory Subsystem
• Intermediate Abstraction: CPU1 and CPU2 (each with Pipe, ld/st, RB, SB) over One memory (strong/weak) or One memory per CPU (weakest/hybrid)

Overall approach
[Flowchart with three phases (Phase 1, Phase 2, Phase 3). Nodes: Start; Define Spec; Generate Executable Spec; Run it, and gain understanding; Final Spec; Design Imp; Annotate Imp with events; Annotated Imp; Derive Impabs; Verify Impabs (Success / Failure); Verify against Impabs; Final Imp]

Verification
• Intermediate Abstraction: store in SB (st_L); store in M (st_G); load from M (ld)
• IMP: store in SB (st_L); store in Cache (st_G); load from LB or from Cache (ld)
• At each ld: load values agree?

• Runs on 16 CPU Parallel Murphi ported to MPI at Utah
• Each CPU @ 850 MHz, 256 MB per node (LAN communication)
• Alpha model w/o Barriers and LL/SC
• Itanium w/o weak ld/st
• Semaphores (RC_tso)

Features of Examples
• Examples with Scheurich’s optimization:
-- Logical order != Temporal order
• Directory Protocols:
-- a Migratory directory protocol using PV and SPIN found no errors (parallel search not tried)
• Other directory protocols as well as Itanium (hybrid) memory model soon to be tried

Bugs likely to be caught
• Not just coherence
• SC violations
• Write atomicity violations
• Hybrid memory ordering violations
• Bugs in internal partition: will be caught when the intermediate abstraction is compared against SPEC

How to scale up? • Improve parallel model-checker • Approximate search (e.g., parallel random-walk) • Bounded model-checking (enumerative or SAT) • Exploit data independence • Try many examples, and refine methodology

Conclusions
• Efficient use of reachability analysis to verify IMP against weak memory model SPEC
• Applicable to a whole range of weak models
• Selection of Intermediate Abstraction is systematic
• Annotating Intermediate Abstractions is not hard
• State explosion problem is not worsened
• An easy-to-use verification technique that multiprocessor designers can use readily.

Extra Slides

“Visibility Order” explained using SC
• SC executions have a single visibility order, V
• Stores present in V consistent with prog. order (single store order)
• Loads present in V consistent with prog. order
• Each load to address A returns value D that the most recent store in V to A wrote
• NON-SC: P1: st(b,2); ld(a,0); P2: st(a,1); ld(b,0);
-- candidate V: st(a,1); ld(b,0); st(b,2); ld(a,0) (whoops!)
• SC: P1: ld(q,2); ld(p,1); P2: st(p,1); st(q,2);
-- V: st(p,1); st(q,2); ld(q,2); ld(p,1) (OK!)
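The definition can be tried out directly on small examples. The sketch below is a brute-force illustration of the definition only; it searches over all permutations, which is exactly the kind of search the talk's method avoids, and the function names and event encoding are hypothetical. It looks for a total order V that respects every program order and in which each load returns the most recent store to its address.

```python
from itertools import permutations


def respects_program_order(order, programs):
    """True if each program's events appear in V in their program order."""
    for prog in programs:
        positions = [order.index(event) for event in prog]
        if positions != sorted(positions):
            return False
    return True


def read_values_ok(order, initial=0):
    """True if every load returns the most recent store in V to its address."""
    memory = {}
    for _proc, op, addr, val in order:
        if op == "st":
            memory[addr] = val
        elif memory.get(addr, initial) != val:
            return False
    return True


def find_visibility_order(programs):
    """Return some visibility order V for the execution, or None if none exists."""
    events = [event for prog in programs for event in prog]
    for order in permutations(events):
        if respects_program_order(order, programs) and read_values_ok(order):
            return list(order)
    return None


non_sc = [[("P1", "st", "b", 2), ("P1", "ld", "a", 0)],
          [("P2", "st", "a", 1), ("P2", "ld", "b", 0)]]
sc = [[("P1", "ld", "q", 2), ("P1", "ld", "p", 1)],
      [("P2", "st", "p", 1), ("P2", "st", "q", 2)]]

print(find_visibility_order(non_sc))  # None: no single order explains both loads
print(find_visibility_order(sc))      # st(p,1); st(q,2); ld(q,2); ld(p,1)
```

Events are tagged with their processor so that identical instructions on different processors stay distinct. The search is factorial in the number of events and is meant only for slide-sized examples.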

Writing visibility order specs for weak memory models…
• Can use single or multiple visibility orders [MPV workshop slides, see http://www.cs.utah.edu/mpv]
• Single visibility order for TSO (Split stores into Local and Global; Single Global-store Order)
-- Program (P1 P2): xxx ; ld(a,0); ld(a,1); st(a,1); ld(a,1);
-- Visibility order: st_L(a,1) ld(a,1) ld(a,0) st_G(a,1) ld(a,1)
• Multiple VO needed for some weak mem models…. (Stores kept unsplit)
-- Program (P1 P2 P3): st(p,1); st(p,2); ld(p,2); ld(p,1); ld(p,1); ld(p,2);
-- Visibility order of P1: st(p,1) ld(p,1) st(p,2) ld(p,2)
-- ..of P3: st(p,2) ld(p,2) st(p,1) ld(p,1)

Always use single Visibility Order
Our main idea
• Makes specification more intuitive
• Can annotate Implementation model with coherency events to obtain generated VO
• Can compare against reliable Spec that encompasses all legal VO using reachability analysis
• Single visibility order for Itanium, obtained by splitting every Store into N copies
-- Program (P1 P2 P3): st(p,1); st(p,2); ld(p,2); ld(p,1); ld(p,1); ld(p,2);
-- st_1(p,1) ld(p,1) st_1(p,2) ld(p,2)
-- st_2(p,2) ld(p,2) st_2(p,1) ld(p,1)

Related Work on Verifying Against Weak Memory Models
• Ghughal et al. [FMPPTA’00]:
-- Extension of Collier’s work to weak memory models
-- Finite-state abstraction of “ARCHTESTs” to detect ordering violations
• Condon, Hill, Plakal, Sorin et al. [HPCA’99]:
-- Idea based on “Lamport Clocks”
-- Define “Wisconsin TSO” ordering for execution events
-- Assign Lamport Clock values to coherency events
-- Manual proof that Lamport Ordering (which traces causalities, and hence read values) implies Wisconsin TSO
-- Defines single visibility order idea, but shows it only for subsets of TSO and Alpha
-- Main inspiration for our work

What are the observable effects on programs?
• lost atomicity: ld(a,2); st(b,1); ld(b,1); st(a,2);
• only certain guarantees on executions: st(b,2); ld(a,0); ld.acq(q,2); ld(p,1); st(a,1); ld(b,0); st(p,1); st.rel(q,2);

The Verification Problem • Shared Memory Implementations are very complex • Spec (shared memory consistency models) also highly non-trivial • => Verification engineers face a “double-whammy” • Mini Roadmap: • … Identifying the sources of memory model related bugs • … Related work on verifying against weak memory models • … How to verify against a broad taxonomy of mem models

[Figure: two processors (Proc), each with ld/st ports, RB and SB, connected to a Single Port Memory]

Where are Ordering Relaxations Made?
[Figure: cpu with execution pipeline, ld, LB, SB, L1 cache, L2 cache]
… Inside Directories, Interconnects, … Inside CPU chips (More design groups have control over this) (Fewer design groups have control over this)
Techniques that focus on the “external partition” can still be quite useful…

Intermediate Abstraction Methodology
• Annotate Imp protocol with events of visibility order
-- designer reflects his/her understanding of mem model and Imp
• Replace external partition specific to target memory model
• Annotate intermediate abstraction thus obtained
• Run reachability, matching every visibility event of Imp by one produced by Intermediate Abstraction

Taxonomy of memory models, and external partitions for them (can use these in combination for hybrid models)
• Strong: Write Atomicity, No local bypassing
• Weak: Write Atomicity, Local bypassing
• Weakest: No Write Atomicity, Coherence
• Hybrid: Instructions of many varieties; Fences, Acq / Rel
Pictures of ext partitions as well as brief explanation (pictorial) of how event-splitting is done

One allowed scenario under a weak memory model (e.g. Sparc TSO)
• cpu 1: st(b,2); ld(a,0);
• cpu 2: st(a,1); ld(b,0);
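A small exhaustive exploration shows why this outcome becomes reachable once stores can sit in a buffer. The sketch is a simplified store-buffer machine written for illustration, not the Sparc TSO specification: it assumes one FIFO store buffer per CPU, loads that read their own buffer first, and buffered stores that drain to memory at arbitrary times. The function name and state encoding are hypothetical.

```python
def tso_outcomes(programs, initial=0):
    """Explore all runs of a toy store-buffer machine; collect load results.

    A state is (program counters, store buffers, memory, loads so far).
    Each load is recorded as (cpu index, instruction index, value).
    """
    n = len(programs)
    start = ((0,) * n, ((),) * n, (), ())
    seen, stack, outcomes = {start}, [start], set()
    while stack:
        pcs, bufs, mem, loads = stack.pop()
        if all(pcs[i] == len(programs[i]) for i in range(n)):
            outcomes.add(loads)
        successors = []
        for i in range(n):
            if pcs[i] < len(programs[i]):  # execute the CPU's next instruction
                op, addr, val = programs[i][pcs[i]]
                new_pcs = pcs[:i] + (pcs[i] + 1,) + pcs[i + 1:]
                if op == "st":  # a store only enters the local buffer
                    new_buf = bufs[i] + ((addr, val),)
                    new_bufs = bufs[:i] + (new_buf,) + bufs[i + 1:]
                    successors.append((new_pcs, new_bufs, mem, loads))
                else:  # a load reads its own buffer first, then memory
                    own = [v for (a, v) in bufs[i] if a == addr]
                    value = own[-1] if own else dict(mem).get(addr, initial)
                    new_loads = tuple(sorted(loads + ((i, pcs[i], value),)))
                    successors.append((new_pcs, bufs, mem, new_loads))
            if bufs[i]:  # drain the oldest buffered store to memory
                addr, val = bufs[i][0]
                new_mem = dict(mem)
                new_mem[addr] = val
                new_bufs = bufs[:i] + (bufs[i][1:],) + bufs[i + 1:]
                successors.append(
                    (pcs, new_bufs, tuple(sorted(new_mem.items())), loads))
        for nxt in successors:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return outcomes


programs = [
    [("st", "b", 2), ("ld", "a", None)],  # cpu 0
    [("st", "a", 1), ("ld", "b", None)],  # cpu 1
]
tso = tso_outcomes(programs)
both_zero = ((0, 1, 0), (1, 1, 0))
print(both_zero in tso)  # True: both stores can still be buffered at load time
```

Both loads can execute while the two stores are still in their buffers, so each reads the initial value 0 from memory. The three outcomes reachable under sequential consistency remain reachable here as well.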
