Peter Popov Centre for Software Reliability City University, United Kingdom ptp@csr.city.ac.uk

Software fault-tolerance with off-the-shelf components:from conceptual models to empirical studies Peter Popov Centre for Software Reliability City University, United Kingdom ptp@csr.city.ac.uk www.csr.city.ac.uk IPP-BAS, Sofia, 10 May 2006

Talk outline • What is CSR? • Software design diversity - Why? • Conceptual models of software fault-tolerance • Early conceptual models ‘on average’ • New results • effect of testing • models ‘in particular’ • Experimental studies with diverse redundancy: off-the-shelf RDBMS • fault diversity • performance implications • Conclusions IPP-BAS, Sofia, 10 May 2006

What is CSR? • The Centre for Software Reliability (CSR) at City University, London, is an independent Research Centre in the School of Informatics founded in 1983 by Bev Littlewood. • Over the years, CSR has attracted over 8 million of international and UK research funding, and has built an international reputation for research achievements in the areas of: • Software dependability (particularly safety and reliability) modelling • Software fault tolerance • Software metrics and quality assurance • Fundamental issues for safety critical systems IPP-BAS, Sofia, 10 May 2006

CSR's research achievements • Developing novel techniques and tools for assessing software reliability (e.g. software reliability growth, Bayesian methods for dependability assessment, etc.) • Defining the limits for evaluating systems with ultra-high reliability requirements • Developing models of common-mode failures in diverse and redundant systems • Developing novel mechanisms for software fault-tolerance • Producing a comprehensive framework for software data collection • Developing a widely-used method for assessing efficacy of software standards and methods • Developing a new understanding of the fundamentals of software testing • Providing a formal foundation for the software metrics area IPP-BAS, Sofia, 10 May 2006

Recent Projects on Diversity • DISCS (Diversity In Safety-Critical Systems) - 1997-2000, funded by EPSRS, UK • DISPO (DISPO2-DISPO5) 1997-2007, funded by British Energy • most of modelling work presented here completed as part of DISPO projects • DOTS (Diversity with Off-the-Shelf Software), 2000-2004, funded by EPSRS, UK • the experimental work with Off-the-Shelf presented here has been completed in the DOTS project • DIRC (Dependability Interdisciplinary Research Collaboration) - 2000-2006, EPSRC • diversity in socio-technical systems has been the main concern IPP-BAS, Sofia, 10 May 2006

Software design diversity: Why • The idea of redundancy (i.e. multiple software channels) for increased reliability/availability is not new: • has been used in hardware design for a very long time. • simple redundancy does not work with software • software failures are deterministic: whenever a software fault is triggered a failure will result • software does not ware out • software channels work in parallel, but must be: • different by design (design diversity) • work in (slightly) conditions (data diversity) IPP-BAS, Sofia, 10 May 2006

Software design diversity (2) • Surprisingly, various homogeneous fail-over schemes dominate the market of FT ‘enterprise’ applications. These are ineffective! • U.S.-Canada Power System Outage Task Force, Final Report on the August 14th (2003) Blackout in the United States and Canada • https://reports.energy.gov/BlackoutFinal-Web.pdf EMS Server Failures. FE’s EMS system includes several server nodes that perform the higher functions of the EMS. Although any one of them can host all of the functions, FE’s normal system configuration is to have a number of host subsets of the applications, with one server remaining in a “hot-standby” mode as a backup to the others should any fail. At 14:41 EDT, the primary server hosting the EMS alarm processing application failed, due either to the stalling of the alarm application, “queuing” to the remote EMS terminals, or some combination of the two. Following preprogrammed instructions, the alarm system application and all other EMS software running on the first server automatically transferred (“failedover”) onto the back-up server. However, because the alarm application moved intact onto the backup while still stalled and ineffective, the backup server failed 13 minuteslater, at 14:54 EDT. Accordingly, all of the EMS applications on these two servers stopped running.(Part 2, p 32) IPP-BAS, Sofia, 10 May 2006

Software design diversity (3) • The main architectures for symmetric software fault-tolerance: • NVP (N-version programming), Avizienis, 1977 • Recovery Blocks, Randell, 1975 • NSCP (N-self-checking programming), Laprie et al., 1987 • Endless number of architectures with asymmetric fault-tolerance (i.e. a ‘primary’ channel and a ‘checker’) IPP-BAS, Sofia, 10 May 2006

Channel 1 Parallel (OR, 1-out-of-2) arrangements inputs Channel 2 Examples: diverse, modular redundancy • “natural” 1-out-of-2 scheme (e.g. communication, alarm, protection) • Voted system (e.g. control) System output Channel 1 Bespoke adjudicator Channel 2 inputs Channel 3 IPP-BAS, Sofia, 10 May 2006

Computation Input Primary software System output checker Approved/ rejected Examples: primary/checker systems • Checker will usually be bespoke (possibly on OTS platform) • If simpler than primary high quality is affordable • Safety kernel idea IPP-BAS, Sofia, 10 May 2006

Achievement vs. Assessment • Cost-benefit analysis is always needed: • design diversity is more expensive than non-diverse redundancy, or solutions without redundancy • especially in 80s, when the area was actively researched • what are the benefits of design diversity, how much one gains from diverse redundancy? • Assessing the benefits is a problem much harder for (diverse) software than for hardware • NVP ‘implicitly’ assumed independence of failures of the channels • huge controversy, at times very entertaining exchange in the IEEE Transaction on Software Engineering in mid 80s. IPP-BAS, Sofia, 10 May 2006

Is failure independence realistic? • Knight and Leveson experiment (FTCS-15, 1985 and TSE, 1986) • 27 software versions developed to the same specification by students in two US Universities • tested on 1,000,000 test cases and the versions’ reliability ‘measured’ • Coincident failures observed much more frequently than independence would suggest • i.e. refuted convincingly the hypothesis of statistical independence between the failures of the independently developed versions! • Eckhardt & Lee model (TSE, 1985) • probabilistic model demonstrates why independently developed versions will not fail independently IPP-BAS, Sofia, 10 May 2006

Eckhardt and Lee model • Model of software development • population of possible versions ={1, 2...} • probabilistic measure S(), i.e. S(i) is the probability that version i will be developed • Demand space modelled probabilistically • D={x1, x2...} - demand space, • Q() probabilistic measure: the likelihood of different demands being chosen in operation. IPP-BAS, Sofia, 10 May 2006

Eckhardt and Lee model (2) • The random variable(,X)represents the performance of a random program on a random demand: this is a model for the uncertainty both in software development and usage. is the probability that a randomly chosen program fails for a particular demand x (‘difficulty’ function). • (X) is a random variable • upper case X represents a random demand, i.e chosen in operation at random according to Q() IPP-BAS, Sofia, 10 May 2006

Eckhardt and Lee model (3) There is no reason to expect that independently developed software versions will fail independently on a randomly chosen demand, X, even though they fail conditionally independently on a given demand, x. IPP-BAS, Sofia, 10 May 2006

Littlewood and Miller model • A generalisation of the EL model for the case of ‘forced diversity’ • the development teams are kept apart but forced to use different methodologies, e.g. programming languages, etc. • Model of forced diversity • probabilistic measures SA() and SB() for development methodologies, A and B. • a version (with a specific set of scores,(,x)) may be very likely with methodology A and very unlikely with methodology B • The model in every other aspect is identical to the EL model. IPP-BAS, Sofia, 10 May 2006

Littlewood and Miller model (2) • Since covariance can be negative, then with forced diversity one may do even better than the unattainable independence under the EL model • Littlewood & Miller in their TSE paper 1989 applied their model to Knight & Leveson’s data and discovered negative covariance. • For them the two methodologies were represented by the programs developed by students from different universities. IPP-BAS, Sofia, 10 May 2006

Limitations of EL and LM models • Eckhardt and Lee (EL) and Littlewood and Miller (LM) models deal with a ‘snapshot’ of the population of versions • extended by allowing the versions to evolve through their being tested and fixing the detected faults • These are models ‘on average’ • extended by looking at models of a particular pair of versions (models ‘in particular’). IPP-BAS, Sofia, 10 May 2006

version i version i tested with j tested with k no testing version i A new model ‘on average’ Testing with suite j Testing with suite k Testing: - test suite (a given test generation procedure may be instantiated differently, i.e. different sets of test cases can be generated) - independently generated for each channel of the system; - the same test suite used; - adjudication (oracle: perfect/imperfect, back-to-back) - fault-removal (perfect/imperfect, new faults?) IPP-BAS, Sofia, 10 May 2006

Modelling the testing process • ={t1,t2,...} withM() , i.e. M(t) = P(T=t) • Extended score function: is the score ofonxbefore testing IPP-BAS, Sofia, 10 May 2006

Comparison of testing regimes • Testing with oracles: • Detailed analysis with perfect oracles: • testing with oracles on independently chosen testing suites; • testing with oracles on the same testing suite; • Speculative analysis of oracle imperfection • ‘back-to-back’ testing - lower and upper bounds identified under simplifying assumptions IPP-BAS, Sofia, 10 May 2006

Marginal probability of system failure Without forced diversity Independent test suites (EL result applies) The same suite (EL result does not apply). May be significantly worse than with independent suites IPP-BAS, Sofia, 10 May 2006

Marginal system pfd (2) With forced Diversity Independent suites (LM result applies). The same suite (LM result does not apply). May be better or worse than independent suites! Better than with independent suites is counterintuitive but possible (a contrived example constructed )! IPP-BAS, Sofia, 10 May 2006

Back-to-back testing • Back-to-back testing is a special case of testing with the same suite. It cannot do better (lower bound on system pfd) than the same suite with perfect oracles. • In the worst case (an upper bound on system pfd) the back-to-back testing only removes the faults which cause non-coincident failures, i.e. with no impact on system pfd (untested population). • Bounds on the performance of back-to-back testing obtained. • Detailed modelling of realistic back-to-back testing (partial overlapping of failure regions) does not seem feasible and worthy! IPP-BAS, Sofia, 10 May 2006

In summary • Performance of testing regimes (no account of the cost): • (best in terms of average system reliability achievable) independent testing with oracles; • (worse) testing with the same suite and oracles; • (worst) back-to-back testing. • Accounting for the cost may change this ordering! A trade-off can be struck, which depends on cost of test suite generation and cost of testing. • Counterintuitive observation: • forced diversity combined with testing with the same suite may lead to better system pfd than testing with independent suites (i.e. better result can be achieved more cheaply!) IPP-BAS, Sofia, 10 May 2006

Models ‘in particular’ IPP-BAS, Sofia, 10 May 2006

Model in particular: full knowledge of scores • Scores and operational profile as with models ‘on average’. • Similar to the LM result, except that PA and PB are pfd of the channels, not of an ‘average version’, i.e. are estimable. IPP-BAS, Sofia, 10 May 2006

S3 S1 S2 S4 With failure data about sub-domains in demand space... • "coarse-grain" info: P(failure |Si ) for each subdomain Si • upper and lower bounds: • should not trustpfd to be any lower than • S P(Si ) P(A fails | Si ) P(B fails | Si ) • can trustpfd to be lower than • P(S )  min ( P(A fails | S), P(B fails | S ) ) • can improve estimate with observation from actual • behaviour of A and B channels in new system. IPP-BAS, Sofia, 10 May 2006

Model in particular: summary • Useful bounds can be obtained without detailed knowledge of dependence between channel failures • if these are tight, then the details (covariance terms) become irrelevant • increasing the number of sub-domains allows one to seek tighter bounds • Models ‘in particular’ can be applied to concrete FT system, while the models ‘on average’ are only useful as guidelines as to what is believable and what is not, but useless for actual reliability prediction IPP-BAS, Sofia, 10 May 2006

Fault Diversity among Off-The-Shelf SQL Servers The work was completed together with Ilir Gashi, a PhD student of at CSR, working under my supervision IPP-BAS, Sofia, 10 May 2006

Motivation for the study • Fault-tolerance with off-the-shelf software becomes cheaper than with bespoke development but what is the dependability gain? • Empirical evidence is needed that the effort to build fault-tolerance with OTS is worthy • But what software to use? • Toy examples? Open to criticism that findings are not applicable to complex software: • the gains may be very different • difficulties of building FT solutions with diverse OTS s/w may be too high and the good idea is not practicable • We avoid the first criticism by having decided to study complex OTS software such as RDBMS (SQL servers) • we plan to address the second criticism in the near future IPP-BAS, Sofia, 10 May 2006

Motivation for the study (2) • Reliability gain depends on operational environment! • We tried statistical testing - created a prototype of a harness (with TU Plovdiv, Dr Valentin Mollov) • ‘discovered’ a fault of MSSQL fixed in SPs and actually could measure the gain for different profiles • What if profile is unknown - always true prior to deployment of general purpose software such as RDBMS • evidence of the effectiveness of fault tolerance can be gathered from the known faults, i.e. reported bugs, for the OTS software • some of faults may have been fixed in the new releases but there is no strong reason (unless we have empirical evidence on the contrary) to think that the ‘pattern’ of the faults, which lead to coincident failures will change. IPP-BAS, Sofia, 10 May 2006

Fault-Tolerant database replication with Diverse SQL servers • Effectiveness of the architecture in the end depends on the assumptions made about the failures • with database replication the assumption of ‘fail-silent’ failure is very common • the study allows us to validate this assumption on the reported bugs IPP-BAS, Sofia, 10 May 2006

Size of the study • Servers included in study: • Open source: • PostgreSQL v. 7.0.0 (PG) • Interbase v. 6.0 (IB) • Closed development: • Oracle v. 8.0.5 (Oracle) • MSSQL v. 7 (MSSQL) • 181 known bug reports for all the servers together were collected. IPP-BAS, Sofia, 10 May 2006

IB faults IPP-BAS, Sofia, 10 May 2006

PG, Oracle, MSSQL IPP-BAS, Sofia, 10 May 2006

2 - Version combinations IPP-BAS, Sofia, 10 May 2006

Detectability of failures • Percentage of bugs causing crash failure varies between servers from 13% (MS SQL) to 21% (Oracle and PostgreSQL). • A non-diverse scheme would only detect the self-evident failures: • crash failures, • failures reported by the server itself and poor performance failures. For all four servers, less than 50% of bugs cause such failures. • With diverse pairs detectability is greatly improved: • all the possible two-version fault-tolerant configurations detect the failures caused by at least 94% of the bugs used in the study. • None of the bugs caused a failure in more than two servers. • Other issues: • diagnosability (if different results received from replicas which, if any, is giving a correct answer). • recovery IPP-BAS, Sofia, 10 May 2006

Future work • Interpreting the results. What do the empirical studies like ours really tell us? Work in progress. • The later versions of these servers are installed and used in a new study with new bugs collected: • consistency established between the ‘snapshots’ (i.e. releases) • used FB 1.0, FB1.5 (Interbase’s successor) and PostgreSQL 7.2 • Work on diagnosing the failed server when a discrepancy between the two diverse replicas detected • using data diversity (functional redundancy built in SQL servers) • a paper submitted to EDCC’06 and another one in preparation for IEEE Transactions on Dependable and Secure Computing IPP-BAS, Sofia, 10 May 2006

Performance Implications of Diverse redundancy with SQL servers • The work was completed together with Vladimir Stankovic, a PhD student of at CSR, working under my supervision • Measurements using a ‘standard’ performance benchmark application for online-transaction processing, TPC-C. FT-node IPP-BAS, Sofia, 10 May 2006

Regimes of Operation of Middleware • Slowest response, • waits for responses to statements from all replicas • adjudicates them • returns the adjudicated response, if any, to the client • Fastest response • waits for the first response to a statement (the first response) • returns it to the client • waits for the rest of the responses, adjudicates them • Optimistic • waits for the first response • returns it to the client, ignores all later responses • a server can ‘skip’ READ statements if a response already received. IPP-BAS, Sofia, 10 May 2006

Experimental Harness • client machine: 1.5 GHz Intel Pentium 4 processor, 640 MB RAMBUS RAM and 20GB HDD (Maxtor DiamondM) • Operating system: Windows 2000 Professional, SP4 • Own implementation in Java of TPC-C with multithreading to allow for simulating concurrent clients • server machines: 1.5 GHz Intel Pentium 4 processor, 384 MB RAMBUS RAM and 20GB HDD (Seagate U Series). • Operating system: Linux RedHat 6.0 (Hedwig) • Interbase 6.0, PostgreSQL 7.4.0. IPP-BAS, Sofia, 10 May 2006

Experiment parameters FT-node configurations: • 1IB1PG: an FT-node with a copy of IB and PG. • 1IB, a single replica of IB; • 1PG, a single replica of PG; • 2IB, an FT-node with two replicas of IB, and • 2PG, an FT-node with two replicas of PG. Server loads: • a single TPC-C compliant client; • a single TPC-C compliant +10 additional clients, • a single TPC-C compliant + 50 additional clients. Size of experiments: • 10,000 transactions executed by the TPC-C client; IPP-BAS, Sofia, 10 May 2006

Results IPP-BAS, Sofia, 10 May 2006

Results (2) IPP-BAS, Sofia, 10 May 2006

Results (3) IPP-BAS, Sofia, 10 May 2006

Summary: Performance measurements • An intriguing possibility to use diversity for ‘performance boost’ • More interestingly, diversity offers a range of ‘quality of service’ options controlled by the client applications: • best dependability assurance • and performance • ‘in between’ • Using ‘learning algorithms’, e.g. Bayesian inference, to switch between high dependability assurance and high performance. IPP-BAS, Sofia, 10 May 2006

Conclusions • Diverse Redundancy is a very mature discipline with more than 30 years of active research • Its main ‘problem’ has been high cost, but this is changing with off-the-shelf software(especially in the context of Service-Oriented Architecture) • functionally similar/identical products (SQL servers, Application servers, BPEL engines, etc.) available as affordable commodities (even open-source) • Industry is reluctant to accept that the diverse redundancy does not have an alternative: • formal methods cannot cope with complex software currently deployed • diversity and formal methods can be combined (e.g. APIs of diverse off-the-shelf products can be ‘harmonised’ using formal methods) • fault-avoidance is ‘hopeless’ (i.e. too expensive to deliver an acceptable level of dependability) IPP-BAS, Sofia, 10 May 2006

Research referenced in this talk: 1. Littlewood, B., Popov, P., Strigini, L. and Shryane, N., "Modelling the effects of combining diverse software fault removal techniques", IEEE Transactions on Software Engineering, vol. SE-26, no. 12, pp.1157-1167, 2000. 2. Littlewood, B., Popov, P. and Strigini, L., "Modelling software design diversity - a review", ACM Computing Surveys, vol. 33, no. 2, pp. 177 - 208, 2001. 3. P. Popov, L. Strigini, J. May and S. Kuball, "Estimating Bounds on the Reliability of Diverse Systems", IEEE Transactions on Software Engineering, vol. 29, no. 4, 2003, pp.345-359, 2003. 4. Littlewood, B., Popov, P., Strigini, L., "Assessing the reliability of diverse fault-tolerant software-based systems", Safety Science, vol. 40, pp. 781-796, Pergamon, 2002. 5. Littlewood B, Popov P, Strigini L., "A note on reliability estimation of functionally diverse systems", Reliability Engineering and System Safety, vol.66, pp 93 - 95, 1999. 6. Popov P. and Strigini L., "Conceptual models for the reliability of diverse systems - new results", Proc. 28th International Symposium on Fault-Tolerant Computing (FTCS-28), Munich, Germany, IEEE Computer Society Press, pp. 80-89, 1998. 7. P. Popov, B. Littlewood, "The effect of testing on the reliability of fault-tolerant software", Proc International Conference on Dependable Systems and Networks (DSN2004), pp. 265-274, IEEE Computer Society, 2004. IPP-BAS, Sofia, 10 May 2006

Research referenced in this talk (2): 8. Gashi I., Popov P., Strigini L., "Fault diversity among off-the-shelf SQL database servers", Proc. DSN 2004, International Conference on Dependable Systems and Networks,Florence, Italy, IEEE Computer Society Press, pp:389-398, 2004. 9. Gashi I., Popov P., Stankovic V., Strigini L., "On Designing Dependable Services with Diverse Off-The-Shelf SQL Servers", in "Architecting Dependable Systems II", Lecture Notes in Computer Science, (R. de Lemos, C. Gacek and A. Romanovsky, Eds.), vol. 3069, pp. 191-214, Springer-Verlag, 2004. 10. P. Popov, L. Strigini, A. Kostov, V. Mollov and D. Selensky, "Software Fault-Tolerance with Off-the-Shelf SQL Servers", Proc. 3rd International Conference on Component-Based Software Systems (ICCBSS'04), 2-4 Feb. 2004, Redondo Beach, CA, U.S.A., pp. 117-126, Springer, 2004. 11. Littlewood, B., Popov, P. and Strigini, L., "Design Diversity: an Update from Research on Reliability Modelling", Proc. Safety-Critical Systems Symposium 2001, Bristol, U.K., Springer, 2001. 12. P. Popov and L. Strigini, "The Reliability of Diverse Systems: a Contribution using Modelling of the Fault Creation Process", DSN 2001 - The International Conference on Dependable Systems and Networks, Goteborg, Sweden, 2001, 2001. IPP-BAS, Sofia, 10 May 2006

Peter Popov Centre for Software Reliability City University, United Kingdom ptp@csr.city.ac.uk

Peter Popov Centre for Software Reliability City University, United Kingdom ptp@csr.city.ac.uk

Presentation Transcript

UNITED KINGDOM

UNITED KINGDOM

United Kingdom Peter Taylor

United Kingdom Peter Taylor

United Kingdom

United Kingdom

United kingdom

United Kingdom

United Kingdom

United Kingdom

United Kingdom

United Kingdom

United Kingdom

United Kingdom

United Kingdom

United Kingdom:

Bob Sugden Centre for Software Reliability University of Newcastle Bob.Sugden@ncl.ac.uk

United kingdom

UNITED KINGDOM

United Kingdom

Best Corsham Sports Centre In United Kingdom For Learning

United Kingdom