
HUMBOLDT-UNIVERSITÄT ZU BERLIN INSTITUT FÜR INFORMATIK

DEPENDABLE SYSTEMS, Lecture 1: INTRODUCTION. Winter semester 2000/2001. Lecturer: Prof. Dr. Miroslaw Malek. www.informatik.hu-berlin.de/~rok/ftc


Presentation Transcript


  1. HUMBOLDT-UNIVERSITÄT ZU BERLIN, INSTITUT FÜR INFORMATIK. DEPENDABLE SYSTEMS, Lecture 1: INTRODUCTION. Winter semester 2000/2001. Lecturer: Prof. Dr. Miroslaw Malek. www.informatik.hu-berlin.de/~rok/ftc DS - IX - NFT - 1

  2. FAULT-TOLERANT COMPUTING SYSTEMS: Topical Outline
     • Introduction (Unit I)
       • Motivation
       • System views
       • Dependability rings
       • Dependable design methodology
     • Dependability Concepts, Measures and Models (UNIT DCMM)
       • Basic definitions
       • Dependability measures
       • Dependability models
       • Examples
       • Dependability evaluation tools
     • Testing Techniques (UNIT TT)
       • Testing techniques principles
       • Processor testing
       • Memory testing
       • Network testing

  3. FAULT-TOLERANT COMPUTING SYSTEMS: Topical Outline
     • Fault Diagnosis Techniques (UNIT FST)
       • Fault detection techniques
       • Fault location (isolation) methods
     • Fault Recovery and Tolerance Techniques (UNIT FRTT) (System Level)
       • Dynamic techniques
       • Static techniques
       • Hybrid techniques
     • Fault-tolerant and Fault-secure Memories (UNIT FRTT)
       • Fault-tolerant techniques in manufacturing
       • Replication
       • Coding
       • Reconfiguration

  4. FAULT-TOLERANT COMPUTING SYSTEMS: Topical Outline
     • Network Fault Tolerance (UNIT NFT)
       • Computer networks
       • Basic techniques
       • Example: multistage networks
     • Case Studies (UNIT CS)
       • ESS and 3B20
       • FTMP: Fault-Tolerant Multiprocessor
       • SIFT: Software-Implemented Fault Tolerance
       • Communication controller
       • Fault-tolerant Building Block Architecture

  5. COURSE ACTIVITIES
     • PROJECT
     • PRESENTATION
     • INVITED SPEAKERS
     • CONFERENCES AND WORKSHOPS
     • Some websites:
       • www.dependability.org
       • www.paradise.caltech.edu
       • www.milan.eas.asu.edu
       • www.crhc.uiuc.edu

  6. Major References on Fault-tolerant Computing (Books/General) 1
     • Chang, H. Y., E. G. Manning and G. Metze, Fault Diagnosis in Digital Systems, Wiley-Interscience, 1970.
     • Friedman, A. D. and P. R. Menon, Fault Detection in Digital Circuits, Prentice-Hall, 1971.
     • Breuer, M. A. and A. D. Friedman, Diagnosis and Reliable Design of Digital Systems, Computer Science Press, 1976.
     • Kraft, G. D. and W. N. Toy, Microprogrammed Control and Reliable Design of Small Computers, Prentice-Hall, 1981.
     • Anderson, T. and P. A. Lee, Fault Tolerance Principles and Practice, Prentice-Hall, 1982.
     • Siewiorek, D. P. and R. S. Swarz, The Theory and Practice of Reliable Systems Design, Digital Press, 1982 & 1995.
     • Lala, P. K., Fault Tolerant and Fault Testable Hardware Design, Prentice-Hall International, 1985.
     • Pradhan, D. K. (ed.), Fault Tolerant Computing: Theory and Techniques, Vols. I and II, Prentice-Hall, 1986.

  7. Major References on Fault-tolerant Computing (Books/General) 2
     • Avizienis, A., H. Kopetz and J. C. Laprie (eds.), The Evolution of Fault-Tolerant Computing, Springer-Verlag, 1987.
     • Johnson, B. W., Design and Analysis of Fault Tolerant Digital Systems, Addison-Wesley, 1989.
     • Negrini, R., M. G. Sami and R. Stefanelli, Fault Tolerance Through Reconfiguration in VLSI and WSI Arrays, MIT Press, 1989.
     • Laprie, J. C. (ed.), Dependable Computing and Fault-Tolerant Systems, Vol. 5: Dependability: Basic Concepts and Terminology, Springer-Verlag Wien New York, 1992.
     • Landwehr, C. E., B. Randell and L. Simoncini (eds.), Dependable Computing and Fault-Tolerant Systems, Vol. 8: Dependable Computing for Critical Applications 3, Springer-Verlag Wien New York, 1993.
     • Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing: System Implementation, Kluwer Academic Publishers, 1994.
     • Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing: Paradigms for Dependable Applications, Kluwer Academic Publishers, 1994.

  8. Major References on Fault-tolerant Computing (Books/General) 3
     • Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing: Models and Frameworks for Dependable Systems, Kluwer Academic Publishers, 1994.
     • Malek, M. (ed.), Responsive Computing, Kluwer Academic Publishers, 1994.
     • Fussell, D. S. and M. Malek (eds.), Responsive Computer Systems: Steps Toward Fault-Tolerant Real-Time Systems, Kluwer Academic Publishers, 1995.
     • Cristian, F., G. Le Lann and T. Lunt (eds.), Dependable Computing and Fault-Tolerant Systems, Vol. 9: Dependable Computing for Critical Applications 4, Springer-Verlag Wien New York, 1995.
     • Pradhan, D. K., Fault-Tolerant Computer System Design, Prentice-Hall, 1996.
     • Shvartsman, A. A., Fault-Tolerant Parallel Computation, Kluwer, 1997.
     • Schneeweiss, W., Die Fehlerbaum-Methode, LiLoLe-Verlag, 1999.
     • Montenegro, S., Sichere und fehlertolerante Steuerungen, Hanser, München, 1999.

  9. Major References on Fault-tolerant Computing (Books/Reliability Evaluation)
     • Myers, G. J., Software Reliability Principles and Practice, Wiley-Interscience, 1976.
     • Trivedi, K. S., Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Prentice-Hall, 1982.
     • Ascher, H. and H. Feingold, Repairable Systems Reliability, Marcel Dekker, 1984.
     • Musa, J. D., A. Iannino and K. Okumoto, Software Reliability: Measurement, Prediction, Application, McGraw-Hill, 1987.
     • Schneeweiss, W., Petri Nets for Reliability Modeling, LiLoLe, 1999.

  10. Major References on Fault-tolerant Computing (Books/Coding)
      • Sellers, E. F., M. Y. Hsiao and L. W. Bearnson, Error Detecting Logic for Digital Computers, McGraw-Hill, 1968.
      • Peterson, W. W. and E. J. Weldon, Error-Correcting Codes (2nd ed.), MIT Press, 1972.
      • Wakerly, J., Error Detecting Codes, Self-Checking Circuits and Applications, The Computer Science Library, 1978.
      • Lin, S. and D. J. Costello, Error Control Coding: Fundamentals and Applications, Prentice-Hall, 1983.
      • Nagle, H. T., J. D. Irwin and D. Hoffman, Error Detecting and Correcting Codes for Computer Scientist and Engineers, MacMillan Publishers, 1986.
      • Rao, T. R. N. and E. Fujiwara, Error-Control Coding for Computer Systems, Prentice-Hall, 1989.

  11. Major References on Fault-tolerant Computing (Books/Software)
      • Myers, G. J., The Art of Software Testing, Wiley-Interscience, 1979.
      • Deutsch, M. D., Software Verification and Validation, Prentice-Hall, 1982.
      • Shooman, M. L., Software Engineering, McGraw-Hill, 1983.
      • Beizer, B., Software Testing Techniques, Van Nostrand Reinhold, 1983.
      • Bernstein, P. A., V. Hadzilacos and N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987.
      • Neufelder, A. M., Ensuring Software Reliability, Marcel Dekker Inc., 1993.
      • Lyu, M. R. (ed.), Software Fault Tolerance, John Wiley and Sons, 1995.
      • Lyu, M. R. (ed.), Handbook of Software Reliability Engineering, Computer Science Press, 1995.

  12. Major References on Fault-tolerant Computing (Journals)
      • Special Issue of Proc. of the IEEE, October 1978
      • Special Issue of Computer, October 1979
      • Special Issue of Computer, March 1980
      • Special Issue of Computer, August 1984
      • Special Issue of IEEE Software, May 1995
      • IEEE Trans. on Reliability
      • IEEE Trans. on Software Engineering
      • Computer
      • Design and Test
      • Electronics
      • Proc. of the IEEE
      • Computer Design
      • Journal of Electronic Testing: Theory and Applications
      • Journal of Parallel and Distributed Computing
      • IEEE Trans. on Parallel and Distributed Computing
      • Real-Time Systems Journal

  13. Major References on Fault-tolerant Computing (Conference Proceedings)
      • Fault-Tolerant Computing Symposium
      • Reliability and Maintainability Symposium
      • Reliability in Distributed Software and Database Systems Symposium
      • Test Conference
      • Distributed Computing Systems Conference
      • Parallel Processing Conference
      • Real-Time Systems Symposium
      • Computer Architecture Symposium

  14. INTRODUCTION
      • OBJECTIVES:
        • TO PRESENT THE MOTIVATION FOR FAULT-TOLERANT SYSTEMS
        • TO INTRODUCE VARIOUS VIEWS OF COMPUTER SYSTEMS AND THEIR RELATIONS TO COMPUTER SYSTEM DEPENDABILITY
        • TO PRESENT BASIC CONCEPTS AND APPROACHES
        • TO INTRODUCE DEPENDABLE DESIGN METHODOLOGY
      • CONTENTS:
        • MOTIVATION
        • SYSTEM VIEWS
        • SYSTEM DEPENDABILITY CONCEPTS
        • APPROACHES TO DEPENDABLE DESIGN
        • DEPENDABILITY RINGS
        • DEPENDABLE DESIGN METHODOLOGY

  15. TYPES OF SYSTEMS
      • Dependable (Reliable) System: a system that delivers a required service during its lifetime.
      • Fault-Tolerant Computer System: a system that can continue the correct execution of its programs and input/output functions in the presence of faults.
      • Real-Time Computer System: a system that delivers service to a user within a specified deadline (physical time, duration, etc.).
      • Responsive Computer System: a fault-tolerant real-time system that delivers satisfactory service in a timely manner.

  16. MOTIVATION FOR RELIABLE AND FAULT-TOLERANT COMPUTING
      • ECONOMIC NECESSITY
      • LIFE SAVING
      • NOVICE USERS
      • HARSH ENVIRONMENTS
      • MORE COMPLEX SYSTEMS

  17. DEVICE RELIABILITY AND SYSTEM RELIABILITY [Figure: mean time between failures (MTBF) in years, on a logarithmic axis from 1 to 10^6, plotted against the years 1950 to 1990, with curves for equivalent device reliability, system reliability, and minimum acceptable reliability. Technology generations along the time axis: relays, vacuum tubes, semiconductors, SSI, MSI, LSI, VLSI.]

  18. DEPENDABILITY-PERFORMANCE TRADE-OFF [Figure: availability (0.9 to 0.99999) plotted against throughput (1 to 100,000 MIPS), with regions labeled ultra-reliable systems, commercial fault-tolerant systems, and massively parallel/distributed systems.]
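The availability values on the axis of the trade-off figure are easier to compare when converted to expected downtime. The following is a minimal sketch of that arithmetic, not part of the lecture; the function name and the 365-day year are assumptions made for illustration.

```python
# Assumption: a 365-day year; the slides do not state the reference period.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability: float) -> float:
    """Expected yearly downtime, in minutes, for a given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must lie in [0, 1]")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Availability levels taken from the axis of the trade-off figure.
    for a in (0.9, 0.99, 0.999, 0.9999, 0.99999):
        print(f"{a:<8} -> {downtime_minutes_per_year(a):10.1f} min/year")
```

Under these assumptions, each additional "nine" of availability cuts the expected downtime by a factor of ten, from roughly 36.5 days per year at 0.9 to a few minutes per year at 0.99999.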

  19. EXAMPLES
      • DEFENSE SYSTEMS
      • FLIGHT SYSTEMS
      • AIR TRAFFIC CONTROL
      • COMMUNICATION SYSTEMS
      • BANKING SYSTEMS
      • AIRLINE SEAT RESERVATIONS
      • TELEPHONE SYSTEMS
      • HOUSEHOLD APPLIANCES
      • VIDEO GAMES

  20. VIEW 1: SYSTEM LIFE CYCLE
      • Inputs: system constraints, new technology, obsolescence, needs
      • Phases: concept formulation, system specification, design, prototype, production, installation, operational life, modification and retirement
      • Notice that testing, verification or validation should occur after every phase of the life cycle
      • Very few tools exist, and only for some steps of the cycle

  21. VIEW 2: PACKAGING LEVELS OF INTEGRATION
      • APPLICATIONS
      • APPLICATION MODULES
      • SPECIAL-PURPOSE LANGUAGES
      • STANDARD LANGUAGES
      • OPERATING SYSTEMS
      • CABINETS/FRAMES
      • BOXES/CAGES
      • PRINTED CIRCUIT BOARDS/CARDS, WAFERS, TCMs
      • INTEGRATED CIRCUITS (CHIPS)
      Dependability must be considered at every level. System decomposition (partitioning) may have a significant impact on dependability.

  22. VIEW 3: WORKLOAD VIEW [Figure labels: liveware; hardware/software; useful work; preparation; semi-useful work; idling; fault servicing.]
      • ELIMINATE IDLING AND USE IT FOR TESTING TO IMPROVE DEPENDABILITY

  23. VIEW 4: LEVELS OF ABSTRACTION FOR DIGITAL COMPUTERS (level, sublevels, components)
      • PMS level: Processors, Memories, Switches, Links (Networks), Controllers, ALUs, I/Os
      • Program level (sublevels: HLL, ISP (Instruction Set Processor)): Software, Memory State, Processor State, Effective Address Calculation, Instruction Decode, Instruction Execution
      • Logic level (sublevel: Register Transfer Level (RTL)): Data Paths, Registers, Data Operators, Control (Hardwired), Microprogramming (Microstore)
      • Circuit level: Resistors, Capacitors, Inductors, Power Sources, Diodes, Transistors
      • Quantum & Electromagnetic level: Disks, Tapes
      DEPENDABILITY AND TESTING MUST BE CONSIDERED AT EVERY LEVEL

  24. VIEW 5: COMPUTER SYSTEM
      • LIVEWARE: maintenance personnel, operators, system designers, system analysts, programmers, users
      • SOFTWARE: packages, assemblers, compilers, operating systems, utility programs, debugging programs, file processing programs
      • FIRMWARE: microprogram and microprogramming systems
      • HARDWARE: CPUs, I/O devices, memories, interconnection networks
      FAULTS ARE ATTRIBUTED TO: HARDWARE: 20%-65%; SOFTWARE: 20%-80%; PEOPLE: 15%-40%; AT&T's figures: 20%-40%-40% (2/3 applications + 1/3 OS)

  25. (WARNING!!!) VIEW 6: IF YOU DO NOT FOLLOW DEPENDABLE DESIGN METHODOLOGY YOU MAY END UP WITH THE FOLLOWING SIX PHASES OF A PROJECT:
      • ENTHUSIASM
      • DISILLUSIONMENT
      • PANIC AND HYSTERIA
      • SEARCH FOR THE GUILTY
      • PUNISHMENT OF THE INNOCENT
      • PRAISE AND AWARDS FOR THE NON-PARTICIPANTS
      (Author unknown; found in one of the computer companies)

  26. SYSTEM DEPENDABILITY CONCEPTS
      • RELIABILITY R(t) is the conditional probability that the system performs its intended function without failure throughout the interval [0, t], given that it was fully operational at time t = 0.
      • AVAILABILITY
        • Instantaneous availability A(t) is the probability that a system is performing correctly at time t. For non-repairable systems it equals reliability: A(t) = R(t).
        • Steady-state availability A_s is the probability that a system will be operational at any random point in time. It is expressed as the fraction of time the system is operational during its expected lifetime: A_s = operational time / (operational time + down time).
      • SURVIVABILITY is the probability that a system will deliver the required service in the presence of a set of faults defined a priori, or any subset of it.
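The reliability and steady-state availability definitions above can be turned into a small numerical sketch. The slide gives no failure model, so the code assumes a constant failure rate (exponential failure law); the symbols MTTF (mean time to failure) and MTTR (mean time to repair) and both function names are illustrative assumptions, not taken from the lecture.

```python
import math


def reliability(t_hours: float, failure_rate_per_hour: float) -> float:
    """R(t) = exp(-lambda * t).

    Assumption: constant failure rate lambda; the system is fully
    operational at t = 0 and is not repaired.
    """
    return math.exp(-failure_rate_per_hour * t_hours)


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is operational: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical numbers chosen only for illustration.
    print(reliability(1000.0, 0.001))            # survival probability over one MTTF
    print(steady_state_availability(999.0, 1.0))
```

Under the constant-rate assumption, the probability of surviving for one full MTTF is only exp(-1), about 0.37, while a repairable system with the same MTTF and a short repair time can still reach a high steady-state availability. This is one reason the two measures are kept distinct.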

  27. APPROACHES
      • FAULT INTOLERANCE
      • FAULT TOLERANCE
      • MAINTAINABILITY
      • HARDWARE/SOFTWARE TRADE-OFFS

  28. HARDWARE/SOFTWARE CONTINUUM AND VERTICAL MIGRATION
      • HARDWARE EXAMPLES: M6800, MC68000, VAX-11/780, IBM-30XX, CRAY-XMP, C-205, systolic arrays, reconfigurable or experimental multicomputers
      • INSTRUCTIONS: integer arithmetic (ADD/SUB, MPY/DIV), floating-point arithmetic, vector processing, multiprocessing (e.g., submachine set-up)
      • SOFTWARE
      VERTICAL MIGRATION is the transfer of the implementation of functions from software to firmware and/or hardware, or vice versa. Vertical migration can improve performance and dependability and reduce cost.

  29. DEPENDABILITY (RELIABILITY) RINGS FOR FAULT TOLERANCE [Figure: dependability rings, each paired with an acceptance test: operating system, languages and application; system hardware; register-transfer level; logic level.]
      Each dependability ring should provide measures and mechanisms for fault tolerance (detection, location, testability and recovery).

  30. A BOOTSTRAP: TEST RINGS IN A MULTICOMPUTER SYSTEM [Figure: test rings comprising the diagnostic and maintenance processor(s) (hardcore), processor, memories, and network.]

  31. DEPENDABLE DESIGN METHODOLOGY
      • Identify fault classes, fault latency and fault impact
      • Determine qualitative and quantitative specifications for fault tolerance and evaluate your design in a specific environment
      • Identify "weak spots" and assess potential damage
      • Decompose the system
      • Develop fault and error detection techniques and algorithms
      • Develop fault isolation techniques and algorithms
      • Develop recovery/reintegration/restart techniques
      • Evaluate the degree of fault tolerance
      • Refine and iterate for improvement; try to eliminate "weak spots" and minimize potential damage
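The detection, isolation and recovery steps in the methodology above can be illustrated with a small sketch. The slide names no specific mechanism, so majority voting over replicated results (a static technique of the kind listed in the course outline) is swapped in here purely as an example; the function `vote` and the replicas are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def vote(outputs: Sequence[int]) -> Tuple[Optional[int], List[int]]:
    """Majority-vote over replica outputs.

    Detection: any disagreement among the outputs.
    Isolation: indices of the replicas that differ from the majority.
    Recovery:  the majority value masks the faulty outputs.
    Returns (None, all indices) when no strict majority exists.
    """
    if not outputs:
        raise ValueError("at least one replica output is required")
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        # A fault is detected, but it can be neither located nor masked.
        return None, list(range(len(outputs)))
    suspects = [i for i, out in enumerate(outputs) if out != value]
    return value, suspects


if __name__ == "__main__":
    # Hypothetical replicas of the same computation; the third one is faulty.
    replicas = [lambda x: x * x, lambda x: x * x, lambda x: x * x + 1]
    result, suspects = vote([replica(3) for replica in replicas])
    print(result, suspects)  # prints: 9 [2]
```

With three replicas this masks any single faulty output. Evaluating the degree of fault tolerance, the next step in the list, would then ask how the scheme behaves when two replicas fail or when the voter itself is faulty.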

  32. REAL-TIME SYSTEMS DESIGN
      • Identify time-critical tasks and specify their timing (deadlines, durations, frequency, periodicity, if any). Characterize the system load and environment.
      • Characterize the timing of the system (hardware and software).
      • Map the timing specification onto the system timing (find the best resource allocation and scheduling methods), and incorporate concurrent monitoring.
      • Verify and validate the design against the quantitative and qualitative specifications.
      • Refine, iterate and fine-tune the design.
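One way to picture mapping a timing specification onto system timing is a schedulability check for periodic tasks. The slides do not prescribe a scheduling method; the sketch below assumes preemptive rate-monotonic scheduling on a single processor and uses the classical Liu and Layland utilization bound, which is a sufficient but not necessary test. The `PeriodicTask` type and the task parameters are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PeriodicTask:
    name: str
    wcet_ms: float    # worst-case execution time (the task's duration)
    period_ms: float  # period; the deadline is assumed equal to the period


def utilization(tasks: Sequence[PeriodicTask]) -> float:
    """Total processor utilization: sum of wcet / period over all tasks."""
    return sum(t.wcet_ms / t.period_ms for t in tasks)


def rm_bound(n: int) -> float:
    """Liu and Layland bound n * (2^(1/n) - 1) for n periodic tasks."""
    return n * (2 ** (1 / n) - 1)


def passes_rm_test(tasks: Sequence[PeriodicTask]) -> bool:
    """True if utilization is within the bound (sufficient condition only)."""
    if not tasks:
        return True
    return utilization(tasks) <= rm_bound(len(tasks))


if __name__ == "__main__":
    tasks = [PeriodicTask("sensor", 1.0, 4.0), PeriodicTask("control", 2.0, 8.0)]
    print(utilization(tasks), rm_bound(len(tasks)), passes_rm_test(tasks))
```

A task set that fails this test is not necessarily unschedulable; it only means that a more precise analysis, or a different allocation of tasks to resources, is needed in the refinement step.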

  33. RESPONSIVE SYSTEM DESIGN
      • Determine qualitative and quantitative specifications for fault tolerance and task timeliness which meet user requirements.
      • Determine system timing (hardware and software); assess damage, availability and responsiveness.
      • Develop and time fault and error detection techniques and algorithms.
      • Develop and time fault isolation techniques and algorithms.
      • Develop and time recovery/reintegration/restart.
      • Map the timing specification onto the system timing under appropriate assumptions and incorporate concurrent monitoring.
      • Evaluate responsiveness.
      • Refine and iterate for improvement.
      RESPONSIVE SYSTEMS NEED ARCHITECTS OF SPACE AND ARCHITECTS OF TIME

  34. REFERENCES (TEXTBOOK)
      • C. G. Bell, J. C. Mudge and J. E. McNamara, "Seven Views of Computer Systems", Chapter 1 in the book by the same authors titled "Computer Engineering", Digital Press, 1978.
      • G. J. Lipovski and M. Malek, "Parallel Computing: Theory and Comparisons", Wiley-Interscience, New York, 1987.
      • M. Malek, "Parallel Computer Systems Testing and Integration", in the book titled "Testing and Diagnosis of VLSI and ULSI", M. G. Sami and F. Lombardi (eds.), Kluwer, 1988.
      • Pankaj Jalote, "Fault Tolerance in Distributed Systems", Prentice-Hall, 1994.
      • Dhiraj K. Pradhan, "Fault-Tolerant Computer System Design", Prentice-Hall, 1996.
