Developing Dependable Systems by Maximizing Component Diversity and Fault Tolerance
Developing Dependable Systems by Maximizing Component Diversity and Fault Tolerance
Jeff Tian, Suku Nair, LiGuo Huang, Nasser Alaeddine and Michael Siok
Southern Methodist University
US/UK Workshop on Network-Centric Operation and Network Enabled Capability, Washington, D.C., July 24-25, 2008
Outline
• Overall Framework
• External Environment Profiling
• Component Dependability:
  • Direct Measurement and Assessment
  • Indirect Assessment via Internal Contributor Mapping
• Value Perspective
• Experimental Evaluation:
  • Fault Injection for Reliability and Fault Tolerance
  • Security Threat Simulation
• Summary and Future Work
Overall Framework
• Systems made up of different components
• Many factors contribute to system dependability
• Our focus: diversity of individual components
• Component strength/weakness/diversity:
  • Target: different dependability attributes and sub-attributes
  • External reference: operational profile (OP)
  • Internal assessment: contributors to dependability
  • Value perspective: relative importance and trade-off
• Maximize diversity => maximize dependability
  • Combine strengths
  • Avoid/complement/tolerate flaws and weaknesses
Overall Framework (2)
• Diversity: four perspectives
  • Environmental perspective: operational profile (OP)
  • Target perspective: goal, requirement
  • Internal contributor perspective: internal characteristics
  • Value perspective: customer
• Achieving diversity and fault tolerance:
  • Component evaluation matrix per target per OP
  • Multidimensional evaluation/composition via DEA (Data Envelopment Analysis)
  • Internal contributor to dependability mapping
  • Value-based evaluation using a single objective function
Terminology
• Quality and dependability are typically defined in terms of conformance to customer's expectations and requirements
• Key concepts: defect, failure, fault, and error
• Dependability: the focus in this presentation
  • Key attributes: reliability, security, etc.
• Defect = some problem with the software
  • either with its external behavior
  • or with its internal characteristics
Failure, Fault, Error
• IEEE Std 610.12 terms related to defect:
  • Failure: the inability of a system or component to perform its required functions within specified requirements
  • Fault: an incorrect step, process, or data definition in a computer program
  • Error: a human action that produces an incorrect result
• Errors may cause faults to be injected into the software
• Faults may cause failures when the software is executed
Reliability and Other Dependability Attributes
• Software reliability = the probability of failure-free operation of a program for a specified time under a specified set of operating conditions (Lyu, 1995; Musa et al., 1987)
• Estimated according to various models based on defect and time/input measurements
• Standard definitions exist for other dependability attributes, such as security, fault tolerance, availability, etc.
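The definition above can be illustrated with the simplest of the reliability models it alludes to: a constant failure rate (exponential) model. This is a minimal sketch, not the specific models cited by Lyu or Musa; the failure counts and times are made up for illustration.

```python
import math

def estimate_failure_rate(failures: int, exposure_time: float) -> float:
    """Crude failure-rate estimate: observed failures per unit of operating time."""
    return failures / exposure_time

def reliability(failure_rate: float, mission_time: float) -> float:
    """Probability of failure-free operation for `mission_time`,
    assuming a constant failure rate (exponential model)."""
    return math.exp(-failure_rate * mission_time)

# Hypothetical data: 12 failures observed over 4,000 hours of operation
lam = estimate_failure_rate(12, 4000.0)    # 0.003 failures/hour
print(round(reliability(lam, 100.0), 4))   # P(no failure in a 100-hour mission)
```

More sophisticated models replace the constant rate with one estimated from inter-failure times or defect-discovery curves, but the shape of the computation is the same.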
Outline
• Overall Framework
• External Environment Profiling
• Component Dependability:
  • Direct Measurement and Assessment
  • Indirect Assessment via Internal Contributor Mapping
• Value Perspective
• Experimental Evaluation:
  • Fault Injection for Reliability and Fault Tolerance
  • Security Threat Simulation
• Summary and Future Work
Diversity: Environmental Perspective
• Dependability is defined for a specific environment
• Stationary vs. dynamic usage environments:
  • Static, uniform, or stationary (reached an equilibrium)
  • Dynamic, changing, evolving, with possible unanticipated changes or disturbances
• A single overall OP for the former category
  • Musa or Markov variation
  • A single evaluation result possible per component per dependability attribute: e.g., component reliability R(i)
• Environment profiling for individual components
  • Environmental snapshots captured in Musa or Markov OPs
  • Evaluation matrix (later)
Operational Profile (OP)
• An operational profile (OP) is a list of disjoint operations and their associated probabilities of occurrence (Musa, 1998)
• An OP describes how users use an application:
  • Helps guide the allocation of test cases in accordance with use
  • Ensures that the most frequent operations receive more testing
  • Serves as the context for realistic reliability evaluation
• Other usages, including diversity and internal-external mapping in this presentation
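The use-proportional test allocation described above can be sketched directly from an OP. The operation names and probabilities below are hypothetical, not from the study:

```python
# Hypothetical Musa-style OP: disjoint operations and occurrence probabilities.
op = {
    "browse_catalog": 0.55,
    "place_order":    0.25,
    "track_order":    0.15,
    "cancel_order":   0.05,
}
assert abs(sum(op.values()) - 1.0) < 1e-9  # probabilities must sum to 1

def allocate_tests(profile: dict, budget: int) -> dict:
    """Allocate a test-case budget in proportion to operation usage,
    so the most frequent operations receive the most testing."""
    return {name: round(p * budget) for name, p in profile.items()}

print(allocate_tests(op, 200))
```

With a 200-case budget, `browse_catalog` gets 110 cases and `cancel_order` only 10, mirroring how the field workload exercises the system.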
Markov Chain Usage Model
• A Markov chain usage model is a set of states, transitions, and transition probabilities
  • An alternative to the Musa (flat) OP
  • Each link has an associated probability of occurrence
  • Models complex and/or interactive systems better
• Unified Markov Models (Kallepalli and Tian, 2001; Tian et al., 2003):
  • A collection of Markov OPs in a hierarchy
  • Flexible application in testing and reliability improvement
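A Markov usage model can generate realistic test sessions by random walk over the transition probabilities. The states and probabilities below are a hypothetical illustration, not the Unified Markov Models tooling itself:

```python
import random

# Hypothetical Markov usage model: state -> list of (next_state, probability)
usage_model = {
    "Start":    [("Login", 1.0)],
    "Login":    [("Search", 0.7), ("Exit", 0.3)],
    "Search":   [("Search", 0.4), ("Checkout", 0.3), ("Exit", 0.3)],
    "Checkout": [("Exit", 1.0)],
}

def generate_session(model, start="Start", end="Exit", rng=random):
    """Walk the chain from `start` until `end`, producing one usage scenario."""
    path, state = [start], start
    while state != end:
        nxt, r, acc = None, rng.random(), 0.0
        for s, p in model[state]:
            acc += p
            if r <= acc:
                nxt = s
                break
        state = nxt or model[state][-1][0]  # guard against float rounding
        path.append(state)
    return path

random.seed(1)
print(generate_session(usage_model))
```

Generating many such sessions and weighting them by probability is exactly how a Markov OP drives both testing and reliability estimation for interactive systems.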
Operational Profile Development: Standard Procedure
• Musa's steps (1998) for OP construction:
  • Identify the initiators of operations
  • Choose a representation (tabular or graphical)
  • Create an operations list
  • Establish the occurrence rates of the individual operations
  • Establish the occurrence probabilities
• Other variations:
  • Original Musa (1993): 5 top-down refinement steps
  • Markov OP (Tian et al.): FSM, then probabilities based on log files
OPs for Composite Systems
• Use the standard procedure whenever possible:
  • For the overall stationary environment
  • For individual component usage => component OP
• For a dynamic environment:
  • Snapshot identification
  • Sets of OPs for each snapshot
  • System OP from individual component OPs
• Special considerations:
  • Existing test data or operational logs can be used to develop component OPs
  • Union of component OPs => system OP
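The "union of component OPs => system OP" step above can be sketched as a weighted merge, where each component's OP is scaled by that component's share of overall system usage. The component names, operations, and weights are hypothetical:

```python
def merge_ops(component_ops: dict, weights: dict) -> dict:
    """Combine per-component OPs into one system OP. Each component's
    operation probabilities are scaled by its share of overall system
    usage (weights must sum to 1)."""
    system = {}
    for name, op in component_ops.items():
        w = weights[name]
        for operation, p in op.items():
            system[operation] = system.get(operation, 0.0) + w * p
    return system

component_ops = {
    "billing":  {"create_invoice": 0.6, "query_status": 0.4},
    "ordering": {"place_order": 0.7, "query_status": 0.3},
}
system_op = merge_ops(component_ops, {"billing": 0.4, "ordering": 0.6})
print(system_op)  # probabilities still sum to 1
```

Operations shared between components (here `query_status`) accumulate probability mass from both, which is why a plain set union is not enough.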
OP and Dependability Evaluation
• Some dependability attributes are defined with respect to a specific OP: e.g., reliability
• For the overall stationary environment: direct measurement and assessment possible
• For a dynamic environment: OP-reliability pairs
• Consequences of improper reuse due to different OPs (Weyuker, 1998)
• From component to system dependability:
  • Customization/selection of the best-fit OP for estimation
  • Compositional approach (Hamlet et al., 2001)
Outline
• Overall Framework
• External Environment Profiling
• Component Dependability:
  • Direct Measurement and Assessment
  • Indirect Assessment via Internal Contributor Mapping
• Value Perspective
• Experimental Evaluation:
  • Fault Injection for Reliability and Fault Tolerance
  • Security Threat Simulation
• Summary and Future Work
Diversity: Target Perspective
• Component dependability:
  • Component reliability, security, etc. to be scored/evaluated
  • Direct measurement and assessment
  • Indirect assessment (later)
• Under a stationary environment:
  • Dependability vector for each component
  • Diversity maximization via DEA (data envelopment analysis)
• Under a dynamic environment:
  • Dependability matrix for each component
  • Diversity maximization via extended DEA by flattening out the matrix
Diversity Maximization via DEA
• DEA (data envelopment analysis):
  • Non-parametric analysis
  • Establishes a multivariate efficiency frontier in a dataset
  • Basis: linear programming
• Applying DEA:
  • Dependability attribute frontier
  • Illustrative two-dimensional example
  • N-dimensional: hyperplane
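The linear-programming core of DEA can be sketched in a few lines. This is a minimal CCR (constant returns to scale) multiplier model, not the BCC VRS model used in the study below; the input/output data are made up. It assumes `scipy` is available:

```python
import numpy as np
from scipy.optimize import linprog

def dea_efficiency(inputs, outputs, unit):
    """CCR efficiency of one decision-making unit (DMU) via the
    multiplier-form LP: maximize weighted outputs of `unit` with its
    weighted inputs fixed at 1, while no DMU may exceed efficiency 1."""
    X, Y = np.asarray(inputs, float), np.asarray(outputs, float)
    n, m = X.shape                  # number of DMUs, input dimensions
    s = Y.shape[1]                  # output dimensions
    # decision variables: [u (s output weights), v (m input weights)], all >= 0
    c = np.concatenate([-Y[unit], np.zeros(m)])       # maximize u . y_unit
    A_ub = np.hstack([Y, -X])                         # u . y_j - v . x_j <= 0
    A_eq = [np.concatenate([np.zeros(s), X[unit]])]   # v . x_unit = 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0])
    return -res.fun

# Hypothetical DMUs: inputs = (labor hours, change size), output = quality score
X = [[100, 20], [120, 30], [150, 25]]
Y = [[10], [12], [9]]
for i in range(3):
    print(f"DMU {i}: efficiency = {dea_efficiency(X, Y, i):.3f}")
```

Units on the frontier score 1.0; the third DMU here falls below it, and the gap to the frontier is exactly the "distance and direction" information the sensitivity analysis exploits.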
DEA Example
• Lockheed-Martin software project performance with regard to selected metrics and a production efficiency model (efficiency = output/input)
  • Inputs: labor hours, software change size
  • Outputs: software reliability at release, defect density after test, software productivity
• Measures efficiencies of decision-making units (DMUs) using weighted sums of inputs and weighted sums of outputs
• Compares DMUs to each other
• Sensitivity analysis affords study of non-efficient DMUs in comparison
• BCC VRS model used in the initial study
DEA Example (2)
• Using the production efficiency model for the compute-intensive dataset group
• Ranked set of projects
• Data showing distance and direction from the efficiency frontier
Diversity: Internal Perspective
• Component dependability:
  • Direct measurement and assessment: might not be available, feasible, or cost-effective
  • Indirect assessment via internal contributor mapping
• Internal contributors:
  • System design, architecture
  • Component internal characteristics: size, complexity, etc.
  • Process/people/other characteristics
  • Usually more readily available data/measurements
• Internal => external mapping
  • Procedure with OP as an input too (e.g., fault => reliability)
Example: Fault-Failure Mapping for Dynamic Web Applications
Web Example: Fault-Failure Mapping
• Input to the analysis (and fault-failure conversion):
  • Anomalies recorded in web server logs (failure view)
  • Faults recorded during development and maintenance
  • Defect impact scheme (weights)
  • Operational profile
• Product "A" is an ordering web application for telecom services:
  • Consists of hundreds of thousands of lines of code
  • Runs on IIS 6.0 (Microsoft Internet Information Server)
  • Processes a couple of million requests per day
Web Example: Fault-Failure Mapping (Step 1)
• Pareto chart for the defect classification of product "A"
• The top three categories represent 66.26% of the total defect data
Web Example: Fault-Failure Mapping (Steps 4 & 5)
• OP for product "A" and the corresponding numbers of transactions
Web Example: Fault-Failure Mapping (Step 6)
• Using the number of transactions calculated from the OP and the defined fault impact schema, we calculated the fault exposure, i.e., the corresponding potential failure frequencies
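The step-6 computation reduces to one product per fault: how often the faulty operation is exercised (from the OP) times the fault's impact weight. The transaction counts, fault IDs, and weights below are hypothetical stand-ins for product "A"'s data:

```python
# Transactions per day for each operation, derived from the OP (hypothetical).
transactions_per_day = {
    "submit_order":   50_000,
    "check_status":   30_000,
    "update_profile":  2_000,
}
# Fault impact schema: weight by severity class (hypothetical values).
impact_weight = {"critical": 1.0, "major": 0.5, "minor": 0.1}

faults = [  # (fault id, operation it affects, impact class)
    ("F1", "submit_order",   "minor"),
    ("F2", "check_status",   "critical"),
    ("F3", "update_profile", "major"),
]

# Fault exposure = usage frequency x impact weight = potential failure frequency.
exposure = {fid: transactions_per_day[op] * impact_weight[sev]
            for fid, op, sev in faults}

# Fix the highest-exposure faults first for the best reliability payoff.
ranking = sorted(exposure, key=exposure.get, reverse=True)
print(ranking)  # → ['F2', 'F1', 'F3']
```

Note how F2 outranks F1 despite fewer transactions per day on its operation: the impact weight dominates, which is exactly the "high usage frequency and high impact" prioritization in the result analysis.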
Web Example: Fault-Failure Mapping (Step 7)
Web Example: Fault-Failure Mapping (Result Analysis)
• A large number of failures were caused by a small number of faults in high-usage-frequency operations
• Fixing faults with a high usage frequency and a high impact achieves better efficiency in reliability improvement:
  • By fixing the top 6.8% of faults, total failures were reduced by about 57%
  • Similarly for top-fault-induced failure reduction: 10% -> 66%, 15% -> 71%, 20% -> 75%
• The defect data repository and the failures recorded in web server logs have insignificant overlap => both are needed for effective reliability improvement
Diversity: Value Perspective
• Component dependability attributes:
  • Direct measurement and assessment might not capture what customers truly care about
  • Different values attached to different dependability attributes
• Value-based software quality analysis:
  • Quantitative model for software dependability ROI analysis
  • Avoids one-size-fits-all
  • Value-based process: experience at NASA/USC (Huang and Boehm), extended to dependability
• Mapping to a value-based perspective is more meaningful to target customers
Value Maximization
• Single objective function:
  • Relative importance
  • Trade-offs possible
  • Quantification scheme
  • Gradient scale to select component(s)
• Compared to DEA:
  • General cases
  • Combination with DEA
  • Diversity as a separate dimension possible
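The single objective function described above can be sketched as a weighted sum over dependability attributes, with customer-assigned weights expressing relative importance. The weights, components, and scores below are hypothetical:

```python
# Customer-assigned relative importance of dependability attributes
# (hypothetical weights; must sum to 1).
weights = {"reliability": 0.5, "security": 0.3, "availability": 0.2}

# Per-attribute scores for two candidate components (hypothetical).
components = {
    "A": {"reliability": 0.95, "security": 0.80, "availability": 0.99},
    "B": {"reliability": 0.90, "security": 0.95, "availability": 0.97},
}

def value_score(attrs: dict, weights: dict) -> float:
    """Single objective function: weighted sum of dependability attributes."""
    return sum(weights[k] * attrs[k] for k in weights)

best = max(components, key=lambda c: value_score(components[c], weights))
print(best, round(value_score(components[best], weights), 3))  # → B 0.929
```

Unlike DEA, which keeps the attributes as separate dimensions and finds a frontier, this collapses everything to one number, so the trade-off between, say, reliability and security is decided up front by the weights.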
Outline
• Overall Framework
• External Environment Profiling
• Component Dependability:
  • Direct Measurement and Assessment
  • Indirect Assessment via Internal Contributor Mapping
• Value Perspective
• Experimental Evaluation:
  • Fault Injection for Reliability and Fault Tolerance
  • Security Threat Simulation
• Summary and Future Work
Experimental Evaluation
• Testbed:
  • Basis: OPs
  • Focus on problems and system behavior under injected or simulated problems
• Fault injection for reliability and fault tolerance:
  • Reliability mapping for injected faults
  • Use of fault seeding models
  • Direct fault tolerance evaluation
• Security threat simulation:
  • Focus 1: likely scenarios
  • Focus 2: coverage via diversity
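One classic fault seeding model mentioned above is the Mills-style estimator: if testing uncovers the same fraction of seeded faults as of indigenous faults, the recapture rate of the seeded faults estimates the total indigenous fault population. This sketch uses made-up numbers and is one of several seeding estimators, not necessarily the one used in the study:

```python
def seeding_estimate(seeded: int, seeded_found: int, real_found: int) -> float:
    """Mills-style fault seeding estimate: total indigenous faults is
    approximately real_found * seeded / seeded_found, assuming testing
    detects seeded and indigenous faults at the same rate."""
    return real_found * seeded / seeded_found

# Hypothetical run: 50 faults injected; testing recovered 40 of them
# plus 120 indigenous faults.
total = seeding_estimate(50, 40, 120)
print(total)          # estimated indigenous faults: 150.0
print(total - 120)    # estimated faults still latent: 30.0
```

The latent-fault estimate feeds directly back into the reliability mapping: it bounds how many failures the remaining faults could still cause under the testbed's OPs.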
Summary and Future Work
• Overall Framework
• External Environment Profiling
• Component Dependability:
  • Direct Measurement and Assessment
  • Indirect Assessment via Internal Contributor Mapping
• Value Perspective
• Experimental Evaluation:
  • Fault Injection for Reliability and Fault Tolerance
  • Security Threat Simulation