HUMBOLDT-UNIVERSITÄT ZU BERLIN INSTITUT FÜR INFORMATIK

DEPENDABLE SYSTEMS
Lecture 1: INTRODUCTION
Winter Semester 2000/2001
Instructor: Prof. Dr. Miroslaw Malek
www.informatik.hu-berlin.de/~rok/ftc

FAULT-TOLERANT COMPUTING SYSTEMS
Topical Outline:
• Introduction (Unit I)
  • Motivation
  • System views
  • Dependability rings
  • Dependable design methodology
• Dependability Concepts, Measures and Models (UNIT DCMM)
  • Basic definitions
  • Dependability measures
  • Dependability models
  • Examples
  • Dependability evaluation tools
• Testing Techniques (UNIT TT)
  • Testing techniques principles
  • Processor testing
  • Memory testing
  • Network testing

Topical Outline (continued):
• Fault Diagnosis Techniques (UNIT FST)
  • Fault detection techniques
  • Fault location (isolation) methods
• Fault Recovery and Tolerance Techniques (UNIT FRTT) (System Level)
  • Dynamic techniques
  • Static techniques
  • Hybrid techniques
• Fault-tolerant and Fault-secure Memories (UNIT FRTT)
  • Fault-tolerant techniques in manufacturing
  • Replication
  • Coding
  • Reconfiguration

Topical Outline (continued):
• Network Fault Tolerance (UNIT NFT)
  • Computer networks
  • Basic techniques
  • Example: multistage networks
• Case Studies (UNIT CS)
  • ESS and 3B20
  • FTMP: Fault-Tolerant Multiprocessor
  • SIFT: Software-Implemented Fault Tolerance
  • Communication controller
  • Fault-tolerant Building Block Architecture

COURSE ACTIVITIES
• PROJECT
• PRESENTATION
• INVITED SPEAKERS
• CONFERENCES AND WORKSHOPS
• Some Websites:
  • www.dependability.org
  • www.paradise.caltech.edu
  • www.milan.eas.asu.edu
  • www.crhc.uiuc.edu

INTRODUCTION
• OBJECTIVES:
  • MOTIVATION FOR FAULT-TOLERANT SYSTEMS
  • TO INTRODUCE VARIOUS VIEWS OF COMPUTER SYSTEMS AND THEIR RELATIONS TO COMPUTER SYSTEM DEPENDABILITY
  • TO PRESENT BASIC CONCEPTS AND APPROACHES
  • TO INTRODUCE DEPENDABLE DESIGN METHODOLOGY
• CONTENTS:
  • MOTIVATION
  • SYSTEM VIEWS
  • SYSTEM DEPENDABILITY CONCEPTS
  • APPROACHES TO DEPENDABLE DESIGN
  • DEPENDABILITY RINGS
  • DEPENDABLE DESIGN METHODOLOGY

TYPES OF SYSTEMS
• Dependable (Reliable) System
  • A system that delivers the required service throughout its lifetime
• Fault-Tolerant Computer System
  • A system that can continue the correct execution of its programs and input/output functions in the presence of faults
• Real-Time Computer System
  • A system that delivers service to a user within a specified deadline (physical time, duration, etc.)
• Responsive Computer System
  • A fault-tolerant real-time system that delivers satisfactory service in a timely manner

MOTIVATION FOR RELIABLE AND FAULT-TOLERANT COMPUTING
• ECONOMIC NECESSITY
• LIFE SAVING
• NOVICE USERS
• HARSH ENVIRONMENTS
• MORE COMPLEX SYSTEMS

DEVICE RELIABILITY AND SYSTEM RELIABILITY
[Figure: mean time between failures (MTBF) in years, on a logarithmic axis from 1 to 10^6, plotted against the years 1950 to 1990 and the technology generations relays, vacuum tubes, semiconductors, SSI, MSI, LSI and VLSI. Curves: equivalent device reliability, system reliability, minimum acceptable reliability.]

DEPENDABILITY-PERFORMANCE TRADE-OFF
[Figure: availability (0.9 to 0.99999) plotted against throughput (1 to 100000 MIPS). Regions: ultra-reliable systems, commercial fault-tolerant systems, massively parallel/distributed systems.]
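To give the availability levels on the figure's axis a concrete meaning, the following minimal sketch converts an availability value into expected downtime per year. The function name and the 365-day year are assumptions made for illustration; they do not come from the slides.

```python
# Convert an availability level into expected downtime per year.
# Assumption (not from the slides): a 365-day year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime per year for a given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must lie in [0, 1]")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # The availability levels shown on the figure's axis.
    for a in (0.9, 0.99, 0.999, 0.9999, 0.99999):
        print(f"A = {a}: {downtime_minutes_per_year(a):,.1f} min/year")
```

Under this assumption, an availability of 0.99999 corresponds to roughly 5.3 minutes of downtime per year, while 0.9 corresponds to about 36.5 days.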

EXAMPLES
• DEFENSE SYSTEMS
• FLIGHT SYSTEMS
• AIR TRAFFIC CONTROL
• COMMUNICATION SYSTEMS
• BANKING SYSTEMS
• AIRLINE SEAT RESERVATIONS
• TELEPHONE SYSTEMS
• HOUSEHOLD APPLIANCES
• VIDEO GAMES

VIEW 1: SYSTEM LIFE CYCLE
[Figure: life-cycle flow. Phases: NEEDS, CONCEPT FORMULATION, SYSTEM SPECIFICATION, DESIGN, PROTOTYPE, PRODUCTION, INSTALLATION, OPERATIONAL LIFE, MODIFICATION AND RETIREMENT. Other labels: SYSTEM CONSTRAINTS, NEW TECHNOLOGY, OBSOLESCENCE.]
• Notice that testing, verification or validation should occur after every phase of the life cycle
• Very few tools exist, and only for some steps of the cycle

VIEW 2: PACKAGING LEVELS OF INTEGRATION
• APPLICATIONS
• APPLICATIONS MODULES
• SPECIAL-PURPOSE LANGUAGES
• STANDARD LANGUAGES
• OPERATING SYSTEMS
• CABINETS/FRAMES
• BOXES/CAGES
• PRINTED CIRCUIT BOARDS/CARDS, WAFERS, TCMs
• INTEGRATED CIRCUITS (CHIPS)

• Dependability must be considered at every level
• System decomposition (partitioning) may have a significant impact on dependability

VIEW 3: WORKLOAD VIEW
[Figure: workload breakdown. Labels: LIVEWARE, HARDWARE/SOFTWARE, USEFUL WORK, PREPARATION, SEMI-USEFUL WORK, IDLING, FAULT SERVICING.]
• ELIMINATE IDLING: USE IDLE TIME FOR TESTING TO IMPROVE DEPENDABILITY

VIEW 4: LEVELS OF ABSTRACTION FOR DIGITAL COMPUTERS
LEVEL (SUBLEVEL): COMPONENTS
• PMS: Processors, Memories, Switches, Links (Networks), Controllers, ALUs, I/Os
• Program (HLL, ISP (Instruction Set Processor)): Software, Memory State, Processor State, Effective Address Calculation, Instruction Decode, Instruction Execution
• Logic (Register Transfer Level (RTL)): Data Paths, Registers, Data Operators, Control (Hardwired), Microprogramming (Microstore)
• Circuit: Resistors, Capacitors, Inductors, Power Sources, Diodes, Transistors
• Quantum & Electromagnetic: Disks, Tapes

• DEPENDABILITY AND TESTING MUST BE CONSIDERED AT EVERY LEVEL

VIEW 5: COMPUTER SYSTEM
• LIVEWARE: MAINTENANCE PERSONNEL, OPERATORS, SYSTEM DESIGNERS, SYSTEM ANALYSTS, PROGRAMMERS, USERS
• SOFTWARE: PACKAGES, ASSEMBLERS, COMPILERS, OPERATING SYSTEMS, UTILITY PROGRAMS, DEBUGGING PROGRAMS, FILE PROCESSING PROGRAMS
• FIRMWARE: MICROPROGRAM & MICROPROGRAMMING SYSTEMS
• HARDWARE: CPUs, I/O DEVICES, MEMORIES, INTERCONNECTION NETWORKS

FAULTS ARE ATTRIBUTED TO:
• HARDWARE: 20%-65%
• SOFTWARE: 20%-80%
• PEOPLE: 15%-40%
• AT&T's figures: 20%-40%-40% (2/3 applications + 1/3 OS)

VIEW 6 (WARNING!): IF YOU DO NOT FOLLOW A DEPENDABLE DESIGN METHODOLOGY, YOU MAY END UP WITH THE FOLLOWING:
SIX PHASES OF A PROJECT
• ENTHUSIASM
• DISILLUSIONMENT
• PANIC AND HYSTERIA
• SEARCH FOR THE GUILTY
• PUNISHMENT OF THE INNOCENT
• PRAISE AND AWARDS FOR THE NON-PARTICIPANTS
(Author unknown; found in one of the computer companies)

SYSTEM DEPENDABILITY CONCEPTS
• RELIABILITY
  • The conditional probability R(t) that the system performs its intended function without failure throughout the interval [0, t], given that it was fully operational at time t = 0
• AVAILABILITY
  • Instantaneous availability A(t) is the probability that a system is performing correctly at time t; for non-repairable systems it equals the reliability: A(t) = R(t)
  • Steady-state availability A_s is the probability that a system will be operational at any random point in time, expressed as the fraction of time the system is operational during its expected lifetime: A_s = (time operational) / (expected lifetime)
• SURVIVABILITY
  • The probability that a system will deliver the required service in the presence of an a priori defined set of faults or any subset of it
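As a rough illustration of these measures, the sketch below computes reliability and steady-state availability under two assumptions that the slide does not state: a constant failure rate (the exponential failure law, R(t) = exp(-λt)) and the widely used availability expression MTTF / (MTTF + MTTR). The function and parameter names are hypothetical.

```python
import math


def reliability(t: float, failure_rate: float) -> float:
    """R(t) = exp(-failure_rate * t).

    Assumes a constant failure rate (exponential failure law), which is
    a modeling assumption and not something the slide specifies.
    """
    if t < 0 or failure_rate < 0:
        raise ValueError("t and failure_rate must be non-negative")
    return math.exp(-failure_rate * t)


def steady_state_availability(mttf: float, mttr: float) -> float:
    """Fraction of time operational, taken as MTTF / (MTTF + MTTR).

    MTTF = mean time to failure, MTTR = mean time to repair (same units).
    """
    if mttf <= 0 or mttr < 0:
        raise ValueError("mttf must be positive and mttr non-negative")
    return mttf / (mttf + mttr)


if __name__ == "__main__":
    # Hypothetical numbers: 1e-4 failures per hour, 1000-hour mission.
    print(f"R(1000 h) = {reliability(1000.0, 1e-4):.4f}")
    # Hypothetical numbers: MTTF of 10,000 hours, MTTR of 10 hours.
    print(f"A_s       = {steady_state_availability(10_000.0, 10.0):.6f}")
```

With MTTR set to zero the availability is 1, and as repairs take longer the fraction of operational time falls, which matches the slide's description of steady-state availability as a fraction of time.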

APPROACHES
• FAULT INTOLERANCE
• FAULT TOLERANCE
• MAINTAINABILITY
• HARDWARE/SOFTWARE TRADE-OFFS

HARDWARE/SOFTWARE CONTINUUM AND VERTICAL MIGRATION
[Figure: continuum from HARDWARE to SOFTWARE. Hardware examples: M6800, MC68000, VAX-11/780, IBM-30XX, CRAY-XMP, C-205, systolic arrays, reconfigurable or experimental multicomputers. Instructions: integer arithmetic (ADD/SUB, MPY/DIV), floating-point arithmetic, vector processing, multiprocessing (e.g., submachine set-up).]
• VERTICAL MIGRATION is the transfer of the implementation of functions from software to firmware and/or hardware, or vice versa.
• Vertical migration can improve performance and dependability and reduce cost.

DEPENDABILITY (RELIABILITY) RINGS FOR FAULT TOLERANCE
[Figure: concentric dependability rings separated by acceptance tests. Ring labels: Logic Level; Register-Transfer Level; System Hardware; Operating System, Languages and Application.]
• Each dependability ring should provide measures and mechanisms for fault tolerance (detection, location, testability and recovery)

A BOOTSTRAP: TEST RINGS IN A MULTICOMPUTER SYSTEM
[Figure: concentric test rings. Labels: Diagnostic and Maintenance Processor(s) (Hardcore), Processor, Memories, Network.]
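One way to read the bootstrap idea is that testing starts at the hardcore and moves outward, so that each ring is tested only after the rings inside it have passed. The sketch below follows that reading, which is an interpretation of the figure and not a procedure given in the slides; the ring order and the test callables are hypothetical placeholders for real diagnostics.

```python
from typing import Callable, List, Optional, Tuple

# A ring is a (name, self_test) pair. The names follow the figure's labels;
# the callables are hypothetical stand-ins for real diagnostic routines.
Ring = Tuple[str, Callable[[], bool]]


def bootstrap_test(rings: List[Ring]) -> Optional[str]:
    """Test rings in order, from the innermost (hardcore) outward.

    Returns the name of the first ring whose test fails, or None if every
    ring passes. Testing stops at the first failure, because the outer
    rings would be tested by means of the ring that just failed.
    """
    for name, self_test in rings:
        if not self_test():
            return name
    return None


if __name__ == "__main__":
    rings: List[Ring] = [
        ("hardcore", lambda: True),
        ("processor", lambda: True),
        ("memories", lambda: False),  # simulated fault
        ("network", lambda: True),
    ]
    print("first failing ring:", bootstrap_test(rings))
```

Stopping at the first failure also gives a coarse fault location: the fault lies in the reported ring or in its interface to the rings already verified.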

DEPENDABLE DESIGN METHODOLOGY
• Identify fault classes, fault latency and fault impact
• Determine qualitative and quantitative specs for fault tolerance and evaluate your design in its specific environment
• Identify "weak spots" and assess potential damage
• Decompose the system
• Develop fault and error detection techniques and algorithms
• Develop fault isolation techniques and algorithms
• Develop recovery/reintegration/restart techniques
• Evaluate degree of fault tolerance
• Refine, iterate for improvement; try to eliminate "weak spots" and minimize potential damage

REAL-TIME SYSTEMS DESIGN
• Identify time-critical tasks and specify their timing (deadlines, durations, frequency, periodicity, if any). Characterize the system load and environment.
• Characterize the timing of the system (hardware and software).
• Map the timing specification onto the system timing (find the best resource allocation and scheduling methods), and incorporate concurrent monitoring.
• Verify and validate the design against quantitative and qualitative specifications.
• Refine, iterate and fine-tune the design.
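The step of mapping a timing specification onto system timing can be illustrated with a basic processor utilization check. This is a sketch under assumptions the slide does not make: independent periodic tasks on a single processor, with each deadline equal to the task's period. Under those assumptions, total utilization of at most 1 is a necessary condition for any scheduler and is also sufficient for preemptive earliest-deadline-first scheduling. The task names and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PeriodicTask:
    name: str
    wcet_ms: float    # worst-case execution time
    period_ms: float  # period; the deadline is assumed equal to the period


def utilization(tasks: Sequence[PeriodicTask]) -> float:
    """Total processor utilization: sum of wcet / period over all tasks."""
    for task in tasks:
        if task.period_ms <= 0 or task.wcet_ms < 0:
            raise ValueError(f"invalid timing for task {task.name!r}")
    return sum(task.wcet_ms / task.period_ms for task in tasks)


def passes_utilization_check(tasks: Sequence[PeriodicTask]) -> bool:
    """True if the task set does not demand more than one full processor."""
    return utilization(tasks) <= 1.0


if __name__ == "__main__":
    tasks = [
        PeriodicTask("sensor", wcet_ms=2, period_ms=10),
        PeriodicTask("control", wcet_ms=5, period_ms=20),
        PeriodicTask("logging", wcet_ms=10, period_ms=50),
    ]
    print(f"utilization = {utilization(tasks):.2f}")
    print("passes check:", passes_utilization_check(tasks))
```

A task set that fails this check cannot meet all deadlines on one processor, so the design has to be refined, for example by reallocating tasks or relaxing periods, before more detailed validation is worthwhile.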

RESPONSIVE SYSTEM DESIGN
• Determine qualitative and quantitative specifications for fault tolerance and task timeliness which meet user requirements.
• Determine system timing (hardware and software); assess damage, availability and responsiveness.
• Develop and time fault and error detection techniques and algorithms.
• Develop and time fault isolation techniques and algorithms.
• Develop and time recovery/reintegration/restart techniques.
• Map the timing specification onto system timing under appropriate assumptions and incorporate concurrent monitoring.
• Evaluate responsiveness.
• Refine and iterate for improvement.

RESPONSIVE SYSTEMS NEED ARCHITECTS OF SPACE AND ARCHITECTS OF TIME
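The emphasis on timing the detection, isolation and recovery steps suggests a simple budget check: a task remains responsive only if its execution time plus the time spent handling faults still fits within the deadline. The sketch below is one simplified model of that idea and is not taken from the slides. It assumes that each fault costs one detection, one isolation and one recovery interval, and that the recovery interval already includes any re-execution. All names and numbers are hypothetical.

```python
def worst_case_response_ms(exec_ms: float, detect_ms: float,
                           isolate_ms: float, recover_ms: float,
                           max_faults: int = 1) -> float:
    """Worst-case response time when up to max_faults faults occur.

    Simplifying assumption: every fault adds detection + isolation +
    recovery time, and recovery time includes any re-execution.
    """
    if max_faults < 0:
        raise ValueError("max_faults must be non-negative")
    per_fault_ms = detect_ms + isolate_ms + recover_ms
    return exec_ms + max_faults * per_fault_ms


def is_responsive(exec_ms: float, detect_ms: float, isolate_ms: float,
                  recover_ms: float, deadline_ms: float,
                  max_faults: int = 1) -> bool:
    """True if the deadline is met even with max_faults faults."""
    worst = worst_case_response_ms(exec_ms, detect_ms, isolate_ms,
                                   recover_ms, max_faults)
    return worst <= deadline_ms


if __name__ == "__main__":
    # Hypothetical task: 20 ms of work and a 40 ms deadline.
    print(is_responsive(20, 2, 3, 10, deadline_ms=40, max_faults=1))
    print(is_responsive(20, 2, 3, 10, deadline_ms=40, max_faults=2))
```

In this model the task tolerates one fault within its deadline but not two, which is the kind of quantitative statement the responsiveness evaluation step calls for.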
