Static Path-Aware Analysis of Program Invariants

Murali Krishna Ramanathan, Department of Computer Science, Purdue University (joint work with Suresh Jagannathan and Ananth Grama)

Motivation
[Figure labels: Undocumented Program, Expert Programmer, New Programmer ("How do I use this?"), Tester, BUGS]

Context
• What is a program invariant?
  • A property that must hold across all program executions
• What is a failure?
  • A program run does not satisfy an expected invariant
  • System crashes
  • Logical bugs
  • Performance bugs
• What is a specification?
  • Documentation of intended program invariants
  • e.g., lock must be followed by unlock
  • Unavailable or imprecise

Issues
• Deriving specifications
  • Where do we start?
  • Absence of formal documentation
  • Legacy code
• Identifying the source of failures
  • How do we search?
  • Exponential number of execution paths to explore
  • Representing common information among paths

Specification Inference
• Challenges
  • What to look for?
    • Both relevant and irrelevant information is present in the program source
  • How to be robust in the presence of bugs?
• Assumptions
  • Programs are mostly well tested but can have bugs
  • Transparent: no programmer annotations

Kinds of specifications
• Control-flow preconditions
  • A call to fopen must always precede a call to fgets
• Data-flow preconditions
  • The result of a call to socket must always be checked for error before a call to bind
• Control-flow postconditions
  • A call to fopen is followed either by a call to fclose or by an error
• Control-flow divergence preconditions
  • A call to read can be preceded either by a call to open or by a call to socket
• …

Preconditions
    fp = fopen(…);
    if (fp != NULL)
        fgets(buf, SIZE, fp);
Predicates annotated on the code: fp := fopen(…), fp != null, fopen <- fgets
• Predicate
  • Captures properties associated with variables and procedure calls
• Preconditions for a procedure
  • Composed of predicates that must hold before every call to the procedure

Types of predicates
    fp = fopen(…);
    if (fp != NULL)
        fgets(buf, SIZE, fp);
• Data-flow
  • Capture data-flow properties associated with variables
  • e.g., fp is assigned the return value of fopen; fp is not null
• Control-flow
  • Define precedence properties among procedures
  • e.g., fgets is preceded by fopen

Control-flow preconditions (ICSE 07)
    RI_FKey_check(PG_FUNCTION_ARGS)
    {
        /* Check that RI trigger function was called in expected context */
        ri_CheckTrigger(...);
        /* Get the relation descriptors of the FK and PK tables… */
        pk_rel = heap_open(...);
        /* Convert the MATCH TYPE string into a switchable int */
        match_type = ri_DetermineMatchType(...);
        /* Build up a new hashtable key for a prepared SPI Plan of a constraint trigger of MATCH FULL … */
        ri_BuildQueryKeyFull(...);
    }

Control-flow preconditions
    RI_FKey_check(PG_FUNCTION_ARGS)
    {
        ri_CheckTrigger(...);
        pk_rel = heap_open(...);
        if (TRIGGER_FIRED_BY_UPDATE(...)) ...
        else ...
        if (!HeapTupleSatisfies(...)) ...
        match_type = ri_DetermineMatchType(...);
        if (match_type == RI_MATCH_TYPE_PARTIAL)
            ereport(...);
        ri_BuildQueryKeyFull(...);
    }

Control-flow preconditions
    RI_FKey_check(PG_FUNCTION_ARGS)
    {
        ri_CheckTrigger(...);
        pk_rel = heap_open(...);
        if (tgnargs == 4)
        {
            ri_BuildQueryKeyFull(...);
        }
    }
• ri_BuildQueryKeyFull is not preceded by ri_DetermineMatchType
• Leads to a potential crash

Static Specification Mining
• To generate preconditions for a procedure
  • Generate predicates at each call-site of the procedure
  • Ideally, the predicates common to all call-sites form the preconditions for the procedure
• How to find common predicates?
  • Use mining techniques
  • Construct patterns built from alignments or permutations of predicate sets
• Approximation: patterns appearing in programs denote preconditions

Approach
• Analyze the control-flow graph
• Build the precedence relation (a <- b):
  • A binary relation between procedures a and b
  • A call to b is always preceded by a call to a
• Necessitates an inter-procedural analysis
  • Relations can cross procedure boundaries
  • Convergence requires a fixpoint calculation
  • Procedure signatures
• Frequent subsequence mining
  • Mine the chains formed by precedence relations
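The precedence relation can be illustrated with a small fixpoint computation. The following is a minimal sketch, not the authors' implementation: it assumes a hypothetical four-node control-flow graph (n0..n3) in which each node makes at most one call, and it is intraprocedural only, whereas the slides call for an inter-procedural analysis with procedure signatures. The function name always_preceded_by and the graph encoding are illustrative assumptions.

```python
from collections import defaultdict


def always_preceded_by(cfg, entry, calls):
    """For each node, return the procedures called on EVERY path from
    entry up to (but not including) that node.

    cfg   : dict mapping node -> list of successor nodes
    calls : dict mapping node -> name of the procedure called there
    """
    nodes = set(cfg) | {s for succs in cfg.values() for s in succs}
    preds = defaultdict(set)
    for node, succs in cfg.items():
        for succ in succs:
            preds[succ].add(node)

    universe = set(calls.values())
    # Start from the most optimistic answer and shrink to a fixpoint.
    before = {n: set(universe) for n in nodes}
    before[entry] = set()

    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n == entry:
                continue
            incoming = [
                before[p] | ({calls[p]} if p in calls else set())
                for p in preds[n]
            ]
            new = set.intersection(*incoming) if incoming else set()
            if new != before[n]:
                before[n] = new
                changed = True
    return before


# Hypothetical CFG: n0 calls a, then branches; only one branch calls c;
# both branches join at n3, which calls b.
cfg = {"n0": ["n1", "n2"], "n1": ["n3"], "n2": ["n3"], "n3": []}
calls = {"n0": "a", "n1": "c", "n3": "b"}
before = always_preceded_by(cfg, "n0", calls)
print(sorted(before["n3"]))  # procedures that always precede the call to b
```

In this example a <- b holds because a is called on both paths to n3, while c <- b does not because one branch skips c.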

Path Exploration
[Figure: control-flow graph with nodes q, r and p]
• Path-sensitive exploration: q <- p, q <- r <- p
• Path-insensitive exploration: q, r <- p
• Path-aware exploration: q <- p

Precedes relation
[Figure: control-flow graphs with nodes q, r, t, p and exit, annotated with q <- p]

Inter-procedural Analysis
    h() {
        if (cond)
            lwrap();
        else
            lwrap();
        …
        uwrap();
    }
    lwrap() { init(); }
    uwrap() { access(); }
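The wrapper example shows why precedence has to be tracked across procedure boundaries: init and access never appear in h itself. The sketch below is an illustration under simplifying assumptions, not the signature-based analysis from the talk: callee bodies are treated as straight-line call lists, the two branches of h are written out as explicit paths, and the helper name expand_path is hypothetical.

```python
def expand_path(path, bodies, active=frozenset()):
    """Inline the calls made by each callee, in order.

    path   : list of procedure names called along one path
    bodies : dict mapping a procedure to the calls it makes (straight-line)
    active : procedures currently being expanded (guards against recursion)
    """
    out = []
    for callee in path:
        out.append(callee)
        if callee in bodies and callee not in active:
            out.extend(expand_path(bodies[callee], bodies, active | {callee}))
    return out


# The slide's example: both branches of h call lwrap, then uwrap.
bodies = {"lwrap": ["init"], "uwrap": ["access"]}
h_paths = [["lwrap", "uwrap"], ["lwrap", "uwrap"]]

expanded = [expand_path(p, bodies) for p in h_paths]
init_precedes_access = all(
    "init" in p[: p.index("access")] for p in expanded if "access" in p
)
print(expanded[0], init_precedes_access)
```

After expansion every path through h contains init before access, so init <- access can be derived even though neither call is written in h.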

Procedure Signatures
[Figure: graph with nodes s, u, q, t, r, p, entry and ret; relations s <- t and s <- q <- p <- t]
• Procedure signature for s: q <- p

Mining sequences
• Sequence mining:
  • Input: a set of sequences (I)
  • Output: sequences that occur 'frequently' as subsequences in I
• Use the AprioriAll algorithm [Agrawal and Srikant, Mining Sequential Patterns, ICDE '95]

Motivation for sequence mining
• Control paths:
  • a, b, c, e
  • g, a, d, c, e
  • a, c, e
  • a, c, d, e, f
  • e, f, d, a (faulty path: no call to a or c before e)
• Invariant: a <- c <- e
• Intersection of these paths
  • e is preceded by nothing
• Use mining to overcome the brittleness of path intersection
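The brittleness of intersection can be checked directly on the five control paths. This is a small sketch with hypothetical helper names (preceding_calls, intersect_preconditions); it only looks at the first occurrence of the target call on each path.

```python
def preceding_calls(path, target):
    """Set of calls seen before the first occurrence of target, or None."""
    if target not in path:
        return None
    return set(path[: path.index(target)])


def intersect_preconditions(paths, target):
    """Calls that precede target on every path that reaches target."""
    sets = [preceding_calls(p, target) for p in paths]
    sets = [s for s in sets if s is not None]
    return set.intersection(*sets) if sets else set()


paths = [
    list("abce"),
    list("gadce"),
    list("ace"),
    list("acdef"),
    list("efda"),  # the faulty path
]

with_faulty_path = intersect_preconditions(paths, "e")
without_faulty_path = intersect_preconditions(paths[:4], "e")
print(with_faulty_path, sorted(without_faulty_path))
```

A single faulty path empties the intersection, while the four correct paths agree that a and c precede e. Tolerating a minority of deviating paths is what frequency-based mining adds.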

Sequence Mining - Example
• Input sequences (min frequency: 4/5):
  • a, b, c, e
  • g, a, d, c, e
  • a, c, e
  • a, c, d, e, f
  • e, f, d, a
• Maximal frequent subsequence: a, c, e
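The example can be reproduced with a simplified level-wise search. This sketch is written in the spirit of Apriori-style candidate growth but is not the AprioriAll algorithm cited in the talk; the function names are illustrative, and min_count = 4 encodes the 4/5 minimum frequency.

```python
def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence in order, gaps allowed."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def support(pattern, sequences):
    """Number of input sequences that contain pattern as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in sequences)


def frequent_subsequences(sequences, min_count):
    """Grow frequent patterns one item at a time, level by level."""
    if min_count < 1:
        raise ValueError("min_count must be at least 1")
    items = sorted({x for s in sequences for x in s})
    level = [(x,) for x in items if support((x,), sequences) >= min_count]
    frequent_items = [p[0] for p in level]
    frequent = []
    while level:
        frequent.extend(level)
        candidates = {p + (x,) for p in level for x in frequent_items}
        level = sorted(
            c for c in candidates if support(c, sequences) >= min_count
        )
    return frequent


def maximal(patterns):
    """Keep patterns that are not subsequences of another frequent pattern."""
    return [
        p for p in patterns
        if not any(p != q and is_subsequence(p, q) for q in patterns)
    ]


sequences = [
    list("abce"),
    list("gadce"),
    list("ace"),
    list("acdef"),
    list("efda"),
]
frequent = frequent_subsequences(sequences, min_count=4)
result = maximal(frequent)
print(result)
```

Only a, c and e reach the threshold as single items, and the one maximal pattern built from them is a, c, e, which corresponds to the invariant a <- c <- e despite the faulty fifth sequence.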

Data-flow preconditions (PLDI 07)
• Challenges
  • Data-flow predicates may be aliased
  • No anchors for data-flow predicates
    if (x > 0) f(x);
    if (y > 0) f(y);

    x = g(…); h(x);
    if (x > 0) f(x);

Motivating Example
    main(…) {
        for (ai = options.listen_addrs; …) {
            listen_sock = socket(ai->ai_family, …);
            if (listen_sock < 0) error();
            if (num_listen_socks >= 16) error();
            if ((ret = getnameinfo(…))) …
            if (setsockopt(listen_sock, …) == -1) error();
            if (bind(listen_sock, ai->ai_addr, …) < 0) …
        }
    }
• In a call to bind, the first parameter is always assigned the return value of a call to socket and is checked for error

Generate Predicates
    main(…) {
        for (ai = options.listen_addrs; …) {
            listen_sock = socket(ai->ai_family, …);
            if (listen_sock < 0) error();
            if (num_listen_socks >= 16) error();
            if ((ret = getnameinfo(…))) …
            if (setsockopt(listen_sock, …) == -1) error();
            if (bind(listen_sock, ai->ai_addr, …) < 0) …
        }
    }
• listen_sock: return(socket), (param_1, bind), (param_1, setsockopt), (>=, 0)
• num_listen_socks: (<, 16)
• ret: return(getnameinfo)

Another call-site
    ssh_control_listener(void) {
        if ((control_fd = socket(PF_UNIX, …)) < 0) error();
        old_umask = umask(0177);
        if (bind(control_fd, (struct sockaddr *)&addr, …)) …
    }
• control_fd: return(socket), (param_1, bind), (>=, 0)
• old_umask: return(umask)

Structural Similarity Problem
    listen_sock: return(socket), (param_1, bind), (param_1, setsockopt), (>=, 0)
    num_listen_socks: (<, 16)
    ret: return(getnameinfo)

    control_fd: return(socket), (param_1, bind), (>=, 0)
    old_umask: return(umask)
• How to group the attribute sets that need to be mined together?
  • Find a maximal matching of attribute sets
  • NP-hard
  • Use approximations based on program structure

Approximations
• Type
  • Attribute sets are divided based on the type of the variable
• Parameter
  • Variables supplied as arguments to the same parameter of a given procedure
• Result
  • Variables that are assigned the return values of the same function
• …

Example revisited
    listen_sock: return(socket), (param_1, bind), (param_1, setsockopt), (>=, 0)
    num_listen_socks: (<, 16)
    ret: return(getnameinfo)

    control_fd: return(socket), (param_1, bind), (>=, 0)
    old_umask: return(umask)
• Variable names are not comparable
  • Use positional information
• Different number of attributes
• Interspersed with irrelevant operations

Is intersection robust?
[Figure: attribute sets at three call-sites of bind (sockfd, listen_sock, control_fd), each containing return(socket) and (param_1, bind); (>=, 0) is missing from one of them. Precondition: return(socket), (param_1, bind), (>=, 0)]
• Same limitations as with control-flow preconditions
• Adopt frequent itemset mining
  • Order of events is less critical
  • Aggregate the collection of data-flow facts at call-sites
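Frequent itemset mining over attribute sets can be sketched as follows. This is a brute-force illustration, not the mining algorithm used in the work; the three attribute sets are hypothetical stand-ins modeled on the bind call-sites discussed above, with the sockfd set chosen (as an assumption) to be the one lacking the error check.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_count):
    """Return {itemset: count} for every itemset contained in at least
    min_count transactions (exhaustive search, exponential in the worst case)."""
    items = sorted({i for t in transactions for i in t})
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            count = sum(set(combo) <= t for t in transactions)
            if count >= min_count:
                result[frozenset(combo)] = count
                found = True
        if not found:
            break  # no frequent set of this size, so none larger exists
    return result


# Hypothetical attribute sets for the first argument of bind.
call_sites = {
    "listen_sock": {"return(socket)", "(>=, 0)",
                    "(param_1, setsockopt)", "(param_1, bind)"},
    "control_fd": {"return(socket)", "(>=, 0)", "(param_1, bind)"},
    "sockfd": {"return(socket)", "(param_1, bind)"},  # assumed: check missing
}

found = frequent_itemsets(list(call_sites.values()), min_count=2)
precondition = max(found, key=len)
intersection = set.intersection(*call_sites.values())
deviants = [v for v, attrs in call_sites.items() if not precondition <= attrs]
print(sorted(precondition), sorted(intersection), deviants)
```

Plain intersection drops (>=, 0) because one call-site lacks it, whereas the frequency threshold keeps it in the precondition and flags the deviating call-site as a candidate bug.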

Locality
    main() {
        fp = init_file(…);
        fgets(buf, SIZE, fp);
    }
    init_file(…) {
        fp = fopen(…);
        if (fp != NULL) return fp;
        exit(-1);
    }

    main() {
        fp = fopen(…);
        if (fp != NULL) read_file(fp);
    }
    read_file(FILE *fp) {
        …
        fgets(buf, SIZE, fp);
        …
    }
• Interprocedural analysis captures preconditions that cross procedure boundaries

Example
[Figure: graph over procedures q, r, s and t whose nodes are labeled with predicate sets drawn from p1 and p2; legend: intraprocedural edge, interprocedural edge]

Experiments
• Applied to open source C programs
• Input to the implementation: control-flow graphs
• Control-flow nodes varied from 16K to 958K
• Roughly 2M LoC
• Procedure count varied from 298 to 8568
• Precondition predicates varied from 189 to 5963
• Analysis time varied from 26s to 20m

Experimental Goals
• Path awareness improves precision
• Useful for bug detection
• Generates salient documentation

Effectiveness of path awareness
• Fewer protocols are generated using our approach
• The reduction does not come at the expense of more false negatives
• Reduces false positives

Bug Detection: OpenSSH
• Procedure prime_test in openssh-4.4p1
  • Testing is difficult because it performs Miller-Rabin primality testing
  • The program crashes due to the absence of an error check
    • e.g., BN_mod_word(p, …): if p is null, the program crashes
  • Fixed in openssh-4.5p1
• An error check is not always necessary
  • e.g., BN_is_prime(…, ctx, …): ctx can be either null or pre-allocated

Bug detection
• Case study: Linux
• Hardware bug
  • Difficult to detect using traditional testing techniques
  • Platform-dependent error
  • Transparently identified using our approach
• Performance bug
  • Cache lookup operation was absent
  • Not easily specified as a bug for testing
  • Deviation delays data write flushes
  • Difficult to identify using traditional testing techniques

Change in Confidence
• An increase in confidence reduces the number of predicates

Related Work
• Static techniques
  • Inferring Specifications from Within, Kremenek et al., OSDI 06
  • Bugs as Deviant Behavior, Engler et al., SOSP 01
  • …
• Dynamic techniques
  • Strauss, Ammons et al., POPL 02
  • Daikon, Ernst et al., TSE 01
  • …
• Our approach
  • Path-aware analysis
  • Generates preconditions
  • Predicates of arbitrary size
  • Annotation free

Future Work
• Richer specifications
  • Post-conditions, divergence structures, …
• More sophisticated mining techniques
  • Graph mining, …
• Validating generated specifications
  • Integration with a theorem prover
• Specifications and concurrency
  • Atomicity violations

Other work
• Dynamic analysis
  • Detecting cause of assertion failures (under review)
  • Static path profiles (under review)
  • Impact analysis (ASE 06)
  • Memory aliasing (FASE 06)
  • Test case prioritization (SAC 08)
• Distributed systems
  • Randomized leader election (Distributed Computing 07)
  • Eliminating duplicates in P2P systems (TPDS 07)
  • Search in P2P systems (P2P 05)
  • Efficient tag detection in RFID systems (SECON 05)

Why not mine post-conditions?
    fp = fopen(…);
    if (fp == NULL) exit(-1);
    fclose(…);
• Precedence protocol:
  • A call to fclose is always preceded by a call to fopen
• Successor protocol:
  • A call to fopen is always succeeded by a call to fclose
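The asymmetry between the two protocols can be seen by enumerating the two paths through the fragment. This is an illustrative sketch with hypothetical helper names; it treats the error path as ending in exit and considers only the first occurrence of each call on a path.

```python
def always_preceded(paths, first, second):
    """Every path that calls second has called first before it."""
    return all(first in p[: p.index(second)] for p in paths if second in p)


def always_succeeded(paths, first, second):
    """Every path that calls first goes on to call second."""
    return all(second in p[p.index(first) + 1:] for p in paths if first in p)


# The two paths through the fragment: the error path and the normal path.
paths = [
    ["fopen", "exit"],
    ["fopen", "fclose"],
]

precedence_holds = always_preceded(paths, "fopen", "fclose")
successor_holds = always_succeeded(paths, "fopen", "fclose")
print(precedence_holds, successor_holds)
```

The precedence protocol holds on every path, but the successor protocol fails on the legitimate error path, which is one reason post-conditions need the "fclose or error" form listed under the kinds of specifications.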

Why is parameter tracing insufficient?
    uldap_connection_find(…) { // code fragment from httpd
        if (APR_SUCCESS == apr_thread_mutex_trylock(l->lock)) {
            …
            compare_client_certs(st->client_certs, l->client_certs)
            …
        }
    }
• In a call to compare_client_certs, the return value of a call to apr_thread_mutex_trylock must be APR_SUCCESS
• The predicate for compare_client_certs includes
  • "return value of apr_thread_mutex_trylock(…) is APR_SUCCESS"

Predicate size distribution
• The majority of predicates have size less than 3
