Architecture-aware Analysis of Concurrent Software

Rajeev Alur, University of Pennsylvania
Joint work with Sebastian Burckhardt and Milo Martin
Intel Haifa Symposium, Sept 2009

Challenge: Exploiting Concurrency, Correctly
[Figure labels: Shared-memory Multiprocessor; Multi-threaded Software; Concurrent Executions; Bugs]

[Figure: a Verifier takes a software/model and a correctness specification as input and answers yes/proof or no/bug]
• Correctness is formalized as a mathematical claim to be proved or falsified rigorously
  • always with respect to the given specification
• Is formal verification of “real” software possible?
  • Verification problem is undecidable
  • Even approximate versions are computationally intractable (model checking is PSPACE-hard)
  • Use requires great expertise
  … and many such hurdles

1970s – Program verification
• Proof calculi for proving correctness
• Challenge: Finding invariants
• Current tools: ACL2, PVS, ESC-Java
• Main applications: Microprocessor verification, Correctness of JVM…
1980s – Protocol analysis
• Automated reachability analysis
• Temporal logic model checking
• Challenge: State-space explosion
• Current tools: SPIN, CADP, ...
• Main applications: Distributed algorithms, network protocols
1990s – Symbolic model checking
• Constraint-based analysis of boolean systems using fixpoints
• Efficient data structures (OBDDs)
• Bugs in Verilog/VHDL designs
• Commercial tools and industrial groups (Cadence, NEC, Intel, Motorola, IBM, …)

2000s: Model Checking of C code
Does this code obey the locking spec?
    do {
        KeAcquireSpinLock();
        nPacketsOld = nPackets;
        if (request) {
            request = request->Next;
            KeReleaseSpinLock();
            nPackets++;
        }
    } while (nPackets != nPacketsOld);
    KeReleaseSpinLock();
• Phase 1: Given a program P, build an abstract finite-state (Boolean) model A such that the set of behaviors of P is a subset of those of A (conservative abstraction)
• Phase 2: Model check A wrt specification: this can prove P to be correct, or reveal a bug in P, or suggest inadequacy of A
• Shown to be effective on Windows device drivers in Microsoft Research project SLAM

Software Model Checking
• Tools for verifying source code combine many techniques
  • Program analysis techniques such as slicing, range analysis
  • Abstraction
  • Model checking
  • Refinement from counter-examples
• New challenges for model checking (beyond finite-state reachability analysis)
  • Recursion gives pushdown control
  • Pointers, dynamic creation of objects, inheritance, …
• A very active and emerging research area
  • Abstraction-based tools: SLAM, BLAST, …
  • Direct state encoding: F-SOFT, CBMC, CheckFence, …

Concurrency on Multiprocessors
Initially x = y = 0
thread 1        thread 2
x = 1           print y   → 1
y = 1           print x   → 0
• Output not consistent with any interleaved execution!
  • can be the result of out-of-order stores
  • can be the result of out-of-order loads
  • improves performance, but unintuitive
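The claim that the output (1, 0) matches no interleaved execution can be checked by brute force. The sketch below enumerates every interleaving of the two threads, reading the slide as thread 1 doing both stores and thread 2 doing both prints. The function name sc_allows and the registers r1, r2 (standing for the two printed values) are illustrative and not from the talk.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: returns true if some interleaving of
 *     thread 1: x = 1; y = 1;
 *     thread 2: r1 = y; r2 = x;
 * (both variables initially 0) ends with r1 == want_r1 and r2 == want_r2. */
bool sc_allows(int want_r1, int want_r2)
{
    /* Bit i of mask says which thread executes step i (1 = thread 1).
     * Masks that do not give each thread exactly two steps are skipped. */
    for (unsigned mask = 0; mask < 16; mask++) {
        int x = 0, y = 0, r1 = -1, r2 = -1;
        int pc1 = 0, pc2 = 0;
        bool valid = true;
        for (int step = 0; step < 4; step++) {
            if (mask & (1u << step)) {
                if (pc1 == 0) x = 1;
                else if (pc1 == 1) y = 1;
                else valid = false;      /* thread 1 has only two steps */
                pc1++;
            } else {
                if (pc2 == 0) r1 = y;
                else if (pc2 == 1) r2 = x;
                else valid = false;      /* thread 2 has only two steps */
                pc2++;
            }
        }
        if (valid) {
            assert(pc1 == 2 && pc2 == 2);
            if (r1 == want_r1 && r2 == want_r2)
                return true;
        }
    }
    return false;
}
```

Calling sc_allows(1, 0) returns false, while the other three outcomes are reachable, so the printed result can only come from reordered stores or reordered loads.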

Architectures with Weak Memory Models • A modern multiprocessor does not enforce global ordering of all instructions for performance reasons • Each processor has pipelined architecture and executes multiple instructions simultaneously • Each processor has a local cache, and loads/stores to local cache become visible to other processors and shared memory at different times • Lamport (1979): Sequential consistency semantics for correctness of multiprocessor shared memory (like interleaving) • Considered too limiting, and many “relaxations” proposed • In theory: TSO, RMO, Relaxed … • In practice: Alpha, Intel IA-32, IBM 370, Sun SPARC, PowerPC … • Active research area in computer architecture

Programming with Weak Memory Models
• Concurrent programming is already hard; shouldn’t the effects of weaker models be hidden from the programmer?
• Mostly yes …
  • Safe programming through extensive use of synchronization primitives
  • Use locks for every access to shared data
  • Compilers use memory fences to enforce ordering
• Not always …
  • Non-blocking data structures
  • Highly optimized library code for concurrency
  • Code for lock/unlock instructions

Non-blocking Queue (MS’96)
    boolean_t dequeue(queue_t *queue, value_t *pvalue)
    {
        node_t *head;
        node_t *tail;
        node_t *next;

        while (true) {
            head = queue->head;
            tail = queue->tail;
            next = head->next;
            if (head == queue->head) {
                if (head == tail) {
                    if (next == 0)
                        return false;
                    cas(&queue->tail, (uint32) tail, (uint32) next);
                } else {
                    *pvalue = next->value;
                    if (cas(&queue->head, (uint32) head, (uint32) next))
                        break;
                }
            }
        }
        delete_node(head);
        return true;
    }
• Queue is being possibly updated concurrently
• Atomic compare-and-swap for synchronization
[Figure: linked list of nodes 1, 2, 3 with head and tail pointers]
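The slide does not show how cas is implemented. One way to express an atomic compare-and-swap on a pointer in portable C11 is sketched below. The name cas_ptr, the node layout, and the use of <stdatomic.h> are assumptions made for illustration; the original code casts pointers to uint32 and targets a specific platform.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative node type; the queue's real node_t is not shown on the slide. */
typedef struct node {
    int value;
    struct node *next;
} node_t;

/* Hypothetical stand-in for cas(): atomically replaces *loc with desired
 * if *loc still equals expected, and reports whether the swap happened. */
bool cas_ptr(_Atomic(node_t *) *loc, node_t *expected, node_t *desired)
{
    assert(loc != NULL);
    /* On failure the C11 call writes the current value into its local
     * copy of expected; this wrapper simply discards that value. */
    return atomic_compare_exchange_strong(loc, &expected, desired);
}
```

A caller such as dequeue would retry its loop whenever cas_ptr returns false, because a false result means another thread changed the pointer in the meantime.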

Software Model Checking for Concurrent Code on Multiprocessors Why?: Real bugs in real code • Opportunities • 10s—100s lines of low-level library C code • Hard to design and verify -> buggy • Effects of weak memory models, fences … • Challenges • Lots of behaviors possible: high level of concurrency • How to formalize and reason about weak memory models?

Talk Outline • Motivation • Relaxed Memory Models • CheckFence: Analysis tool for Concurrent Data Types

Hierarchy of Abstractions
[Figure labels: Hardware Verification; Software Verification]
• Programs (multi-threaded)
  -- Synchronization primitives: locks, atomic blocks
  -- Library primitives (e.g. shared queues)
• Application level memory model
• Assembly code
  -- Synchronization primitives: compare&swap, LL/SC
  -- Loads/stores from shared memory
• Architecture level memory model
• Hardware
  -- multiple processors
  -- write buffers, caches, communication bus …

Shared Memory Consistency Models
• Specifies restrictions on what values a read from shared memory can return
• Program Order: x <p y if x and y are instructions belonging to the same thread and x appears before y
• Sequential Consistency (Lamport 79): There exists a global order < of all accesses such that
  • If x <p y then x < y
  • Each load returns the value of the most recent store, according to <, to the same location (or the initial value, if no such store exists)
• Clean abstraction for programmers, but high implementation cost

Effect of Memory Model
Initially flag1 = flag2 = 0
thread 1                            thread 2
1. flag1 = 1;                       1. flag2 = 1;
2. if (flag2 == 0) crit. sect.      2. if (flag1 == 0) crit. sect.
• Ensures mutual exclusion if architecture supports SC memory
• Most architectures do not enforce ordering of accesses to different memory locations
  • Does not ensure mutual exclusion under weaker models
• Ordering can be enforced using “fence” instructions
  • Insert MEMBAR between lines 1 and 2 to ensure mutual exclusion
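In C11 the same idea can be written with explicit atomics and a full fence between the store and the load. This is a minimal sketch, assuming that atomic_thread_fence(memory_order_seq_cst) plays the role of the MEMBAR on the slide; the function names are illustrative.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Shared flags, both initially 0 (static storage is zero-initialized). */
atomic_int flag1, flag2;

/* Entry protocol of thread 1: returns true if it may enter the critical section. */
bool thread1_try_enter(void)
{
    atomic_store_explicit(&flag1, 1, memory_order_relaxed);          /* line 1 */
    atomic_thread_fence(memory_order_seq_cst);                       /* fence standing in for MEMBAR */
    return atomic_load_explicit(&flag2, memory_order_relaxed) == 0;  /* line 2 */
}

/* Entry protocol of thread 2, symmetric to thread 1. */
bool thread2_try_enter(void)
{
    atomic_store_explicit(&flag2, 1, memory_order_relaxed);          /* line 1 */
    atomic_thread_fence(memory_order_seq_cst);                       /* fence standing in for MEMBAR */
    return atomic_load_explicit(&flag1, memory_order_relaxed) == 0;  /* line 2 */
}
```

With the fences in both threads, at most one of the two calls can return true. Without them, the relaxed store and load may be reordered and both threads could enter.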

Relaxed Memory Models
• A large variety of models exist; a good starting point: Adve & Gharachorloo, “Shared Memory Consistency Models: A Tutorial”, IEEE Computer, 1996
• How to relax memory order requirement?
  • Operations of same thread to different locations need not be globally ordered
• How to relax write atomicity requirement?
  • Read may return value of a write not yet globally visible
• Uniprocessor semantics preserved
• Typically defined in architecture manuals (e.g. SPARC manual)

Unusual Effects of Memory Models
Initially A = flag1 = flag2 = 0
thread 1            thread 2
flag1 = 1;          flag2 = 1;
A = 1;              A = 2;
reg1 = A;           reg3 = A;
reg2 = flag2;       reg4 = flag1;
Result: reg1 = 1; reg3 = 2; reg2 = reg4 = 0
• Possible on TSO/SPARC
  • Write to A propagated only to local reads of A
  • Reads of flags can occur before writes to flags
• Not allowed on IBM 370
  • Read of A on a processor waits till write to A is complete

Which Memory Model should a Verifier use?
• Memory models are platform dependent
• We propose a conservative approximation “Relaxed” to capture common effects
• Once code is correct for “Relaxed”, it is correct for many models
• Tool allows user to specify a memory model using axioms
[Figure: hierarchy of memory models: SC, 390, TSO, PSO, RMO, Alpha, IA-32, Relaxed]

Formalization of Relaxed
• Program Order: x <p y if x and y are instructions belonging to the same thread and x appears before y
• Execution over a set X of accesses is correct wrt Relaxed if there exists a total order < over X such that
  • If x <p y, and both x and y are accesses to the same address, and y is a store, then x < y must hold
  • For a load l and a store s visible to l, either s and l have same value, or there exists another store s’ visible to l with s < s’
  • A store s is visible to load l if they are to the same address and either s < l or s <p l (i.e. stores are locally visible)
• Constraint-based specification that can be easily encoded in logical formulas
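The two axioms can be checked directly for one candidate total order. The sketch below takes the accesses already arranged in the order <, so array index stands for position in <. The type access_t and the function relaxed_ok are illustrative names. The rule that a load with no visible store must return the initial value 0 is an added assumption, since the slide does not spell out that case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int thread;     /* issuing thread */
    int po;         /* position in that thread's program order */
    bool is_store;  /* store or load */
    int addr;       /* address accessed */
    int value;      /* value stored, or value returned by the load */
} access_t;

/* x <p y: same thread, and x appears earlier in the program. */
static bool po_before(const access_t *x, const access_t *y)
{
    return x->thread == y->thread && x->po < y->po;
}

/* Store a[s] is visible to load a[l]: same address, and s < l or s <p l. */
static bool visible(const access_t *a, int s, int l)
{
    return a[s].is_store && !a[l].is_store && a[s].addr == a[l].addr &&
           (s < l || po_before(&a[s], &a[l]));
}

/* Checks whether the order given by the array indices witnesses Relaxed. */
bool relaxed_ok(const access_t *a, int n)
{
    assert(a != NULL && n >= 0);
    /* Axiom 1: x <p y, same address, y a store  =>  x < y. */
    for (int x = 0; x < n; x++)
        for (int y = 0; y < n; y++)
            if (po_before(&a[x], &a[y]) && a[x].addr == a[y].addr &&
                a[y].is_store && x > y)
                return false;
    /* Axiom 2: a visible store with a different value must be overwritten
     * by a later (in <) store that is also visible to the load. */
    for (int l = 0; l < n; l++) {
        if (a[l].is_store)
            continue;
        bool any_visible = false;
        for (int s = 0; s < n; s++) {
            if (!visible(a, s, l))
                continue;
            any_visible = true;
            if (a[s].value == a[l].value)
                continue;
            bool overwritten = false;
            for (int t = s + 1; t < n; t++)
                if (visible(a, t, l))
                    overwritten = true;
            if (!overwritten)
                return false;
        }
        if (!any_visible && a[l].value != 0)   /* assumption: initial value is 0 */
            return false;
    }
    return true;
}
```

For the flag1/flag2 mutual-exclusion example, an order that places both loads before both stores passes this check even though each load follows its thread's store in program order. This is exactly the behavior that sequential consistency forbids and Relaxed permits.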

Talk Outline • Motivation • Relaxed memory models • CheckFence: Analysis tool for Concurrent Data Types

CheckFence Focus
Concurrency libraries with lock-free synchronization
... are simple, fast, and safe to use
  • concurrent versions of queues, sets, maps, etc.
  • more concurrency, less waiting
  • fewer deadlocks
... are notoriously hard to design and verify
  • tricky interleavings routinely escape reasoning and testing
  • exposed to relaxed memory models: code needs to contain memory fences for correct operation!

CheckFence Tool
• Concurrent Algorithms: + Lock-free queues, sets, lists
• Computer-Aided Verification: + Model checking C code + Counterexamples
• Architecture: + Multiprocessors + Relaxed memory models
References: [CAV 2006], [PLDI 2007], Burckhardt thesis

Non-blocking Queue
The client program on multiple processors (Processor 1, Processor 2) calls operations:
    ... enqueue(1) ... enqueue(2) ... a = dequeue() b = dequeue()
The implementation
    void enqueue(int val) { ... }
    int dequeue() { ... }
• optimized: no locks
• not race-free
• exposed to memory model


Correctness Condition
Data type implementations must appear sequentially consistent to the client program: the observed argument and return values must be consistent with some interleaved, atomic execution of the operations.
Observation
  thread 1: enqueue(1); dequeue() -> 2
  thread 2: enqueue(2); dequeue() -> 1
Witness Interleaving
  enqueue(1); enqueue(2); dequeue() -> 1; dequeue() -> 2

How To Bound Executions
• Verify individual “symbolic tests”
  • finite number of concurrent threads
  • finite number of operations/thread
  • nondeterministic input values
• Example
  thread 1: enqueue(X)
  thread 2: dequeue() → Y
• User creates suite of tests of increasing size

Why symbolic test programs?
• 1) Make everything finite
  • State is unbounded (dynamic memory allocation) ... is bounded for individual test
  • Checking sequential consistency is undecidable (AMP 96) ... is decidable for individual test
• 2) Gives us finite instruction sequence to work with
  • State space too large for interleaved system model ... can directly encode value flow between instructions
  • Memory model specified by axioms ... can directly encode ordering axioms on instructions

CheckFence
[Figure: Memory Model Axioms feed the CheckFence Bounded Model Checker]
• Pass: all executions of the test are observationally equivalent to a serial execution
• Fail:
• Inconclusive: runs out of time or memory

Tool Architecture
[Figure: inputs C code, Symbolic Test, and Memory model; output Trace]
• Symbolic test gives exponentially many executions (symbolic inputs, dynamic memory allocation, ordering of instructions).
• CheckFence solves for “incorrect” executions.

[Figure: inputs C code, Symbolic Test, and Memory model; output Trace]
• automatic, lazy loop unrolling
• automatic specification mining (enumerate correct observations)
• construct CNF formula whose solutions correspond precisely to the concurrent executions

Specification Mining
thread 1: enqueue(X); enqueue(Y)
thread 2: dequeue() → Z
Possible Operation-level Interleavings
• enqueue(X), enqueue(Y), dequeue() -> Z
• enqueue(X), dequeue() -> Z, enqueue(Y)
• dequeue() -> Z, enqueue(X), enqueue(Y)
For each interleaving, obtain symbolic constraint by encoding corresponding executions in SAT solver
Spec is disjunction of all possibilities: Spec: (Z=X) | (Z=null)
To find bugs, check satisfiability of Phi & ~Spec, where Phi encodes all possible concurrent executions
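The mined specification for this test can be reproduced by running the three operation-level interleavings against a sequential reference queue. The sketch below uses concrete stand-ins for the symbolic values X and Y and a sentinel EMPTY for null, whereas the tool itself encodes the interleavings symbolically in a SAT solver. All names here are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/* Concrete stand-ins for the symbolic inputs X, Y and for "null". */
enum { X = 1, Y = 2, EMPTY = -1 };

/* Tiny sequential reference queue, large enough for this test. */
typedef struct {
    int items[2];
    int head, tail;
} seq_queue_t;

static void enqueue(seq_queue_t *q, int v)
{
    assert(q->tail < 2);
    q->items[q->tail++] = v;
}

static int dequeue(seq_queue_t *q)
{
    return q->head == q->tail ? EMPTY : q->items[q->head++];
}

/* Runs the dequeue after pos enqueues (pos = 0, 1, 2) and returns Z. */
int z_for_interleaving(int pos)
{
    assert(pos >= 0 && pos <= 2);
    const int inputs[2] = { X, Y };
    seq_queue_t q = { {0, 0}, 0, 0 };
    int z = EMPTY;
    for (int i = 0; i <= 2; i++) {
        if (i == pos)
            z = dequeue(&q);
        if (i < 2)
            enqueue(&q, inputs[i]);
    }
    return z;
}

/* Mined spec: an observed Z is allowed iff some interleaving produces it. */
bool spec_allows(int z)
{
    for (int pos = 0; pos <= 2; pos++)
        if (z_for_interleaving(pos) == z)
            return true;
    return false;
}
```

Here spec_allows accepts X and EMPTY and rejects Y, matching Spec: (Z=X) | (Z=null). A concurrent execution in which the dequeue returns Y would therefore be reported as a bug.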

Encoding Memory Order
thread 1: s1 store, s2 store
thread 2: l1 load, l2 load
• Variables for encoding
  • Use boolean vars for relative order (x<y) of memory accesses
  • Use bitvector variables Ax and Dx for address and data values associated with memory access x
• Encode constraints
  • encode transitivity of memory order
  • encode ordering axioms of the memory model
    Example (for SC): (s1<s2) & (l1<l2)
  • encode value flow: “Loaded value must match last value stored to same address”
    Example: value must flow from s1 to l1 under following conditions:
    ((s1<l1) & (As1 = Al1) & ((s2<s1) | (l1<s2) | (As2 != Al1))) -> (Ds1 = Dl1)
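To see what the value-flow constraint says, it helps to evaluate it on concrete values. In the tool the order, address, and data are SAT variables. In this sketch they are plain integers, and the names mem_access_t and flow_constraint_holds are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int pos;        /* position in the memory order < */
    unsigned addr;  /* Ax: address of access x */
    unsigned data;  /* Dx: data value of access x */
} mem_access_t;

/* Evaluates
 *   ((s1<l1) & (As1 = Al1) & ((s2<s1) | (l1<s2) | (As2 != Al1))) -> (Ds1 = Dl1)
 * for one concrete assignment of order, addresses, and data. */
bool flow_constraint_holds(mem_access_t s1, mem_access_t s2, mem_access_t l1)
{
    /* A total order assigns distinct positions to distinct accesses. */
    assert(s1.pos != s2.pos && s1.pos != l1.pos && s2.pos != l1.pos);

    bool s1_is_last_store_before_l1 =
        s1.pos < l1.pos && s1.addr == l1.addr &&
        (s2.pos < s1.pos || l1.pos < s2.pos || s2.addr != l1.addr);

    /* Implication: if s1 is the last matching store, the data must agree. */
    return !s1_is_last_store_before_l1 || s1.data == l1.data;
}
```

The premise holds only when s1 is the most recent store to the loaded address, so the constraint is vacuously satisfied whenever s2 intervenes at the same address.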

Example: Memory Model Bug
Processor 1 links new node into list:
    ...
    node->value = 2;        (access 3)
    ...
    head = node;            (access 1)
    ...
Processor 2 reads value at head of list:
    ...
    value = head->value;    (access 2)
    ...
• Processor 1 reorders the stores! Memory accesses happen in order 1, 2, 3
• --> Processor 2 loads uninitialized value
• Adding a fence between the two stores on Processor 1 (left side of the slide) prevents the reordering
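A C11 rendering of the fix is sketched below. It assumes that a release fence is an acceptable stand-in for the store-store fence on the slide (it is somewhat stronger), and it pairs it with an acquire load on the reader side. The function names publish and read_head are illustrative.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct node {
    int value;
} node_t;

/* Shared list head; NULL until a node has been published. */
static _Atomic(node_t *) head;

/* Processor 1: initialize the node, then link it into the list. */
void publish(node_t *node, int value)
{
    assert(node != NULL);
    node->value = value;                                        /* initialize first */
    atomic_thread_fence(memory_order_release);                  /* keep the two stores in order */
    atomic_store_explicit(&head, node, memory_order_relaxed);   /* then make it reachable */
}

/* Processor 2: returns 1 and fills *out if a node is visible, else 0. */
int read_head(int *out)
{
    node_t *n = atomic_load_explicit(&head, memory_order_acquire);
    if (n == NULL)
        return 0;
    *out = n->value;    /* ordered after the load of head by the acquire */
    return 1;
}
```

Without the fence, the store to head may become visible before the store to node->value, which is the reordering that lets Processor 2 read an uninitialized value.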

Algorithms Analyzed


Results
• snark algorithm has 2 known bugs
• lazy list-based set had an unknown bug (missing initialization; missed by formal correctness proof [CAV 2006] because of hand-translation of pseudocode)
• Many failures on relaxed memory model
  • inserted fences by hand to fix them
  • small testcases sufficient for this purpose
Fence columns on the slide: StoreStore, LoadLoad, DependentLoads, AliasedLoads. Where a row lists fewer than four counts, the source does not show which columns they belong to.
Type      Description           Regular bugs   # Fences inserted
Queue     Two-lock queue                       1 1
Queue     Non-blocking queue                   2 4 1 2
Set       Lazy list-based set   1 unknown      1 3
Set       Nonblocking list                     1 2 3
Deque     original “snark”      2 known
Deque     fixed “snark”                        4 2 4 6
LL/VL/SC  CAS-based                            3
LL/VL/SC  Bounded Tags                         4

Typical Tool Performance
• Very efficient on small testcases (< 100 memory accesses)
  Example (nonblocking queue): T0 = i (e | d), T1 = i (e | e | d | d)
  - find counterexamples within a few seconds
  - verify within a few minutes
  - enough to cover all 9 fences in nonblocking queue
• Slows down with increasing number of memory accesses in test
  Example (snark deque): Dq = pop_l | pop_l | pop_r | pop_r | push_l | push_l | push_r | push_r
  - has 134 memory accesses (77 loads, 57 stores)
  - Dq finds second snark bug within ~1 hour
• Does not scale past ~300 memory accesses

Summary • Software model checking of low-level concurrent software requires encoding of memory models • Challenge for model checking due to high level of concurrency and axiomatic specifications • Opportunity to find bugs in library code that’s hard to design and verify • CheckFence project at Penn • SAT-based bounded model checking for concurrent data types • Bugs in real code with fences

Research Challenges
• What’s the best way to verify C code (on relaxed memory models)?
  • SAT-based encoding seems suitable to capture specifications of memory models, but many opportunities for improvement
  • Can one develop abstract operational models for multiprocessor architectures?
• Proof methods for relaxed memory models
• Language-level memory models: Can we verify Java concurrency libraries using the new Java memory model?
• Hardware support for transactional memory
  • Current interest in industry and architecture research
  • Can formal verification influence designs/standards?

Architecture-aware Concurrency Analysis
[Figure: layered stack, from “Simple, Usable by programmers” at the top to “Complex, Efficient use of parallelism” at the bottom]
• Programs (multi-threaded)
• Application level concurrency model
• System-level code, Concurrency libraries
• Architecture level concurrency model
• Highly parallel hardware -- multicores, SoCs
