This analysis explores the critical role software defects play in system availability, emphasizing that software causes 60% of outages. It distinguishes field failures—occurrences in production—from development defects, highlighting their real-world implications. By examining data from the RETAIN database, the study categorizes errors and their triggers, detailing the types of failures that impact systems most significantly, particularly overlay errors. This foundational work aims to guide future research and improve understanding of software-related issues in operating systems.
Software Defects and their Impact on System Availability -- A study of field failures in operating systems IBM T.J. Watson 1991 Presenter: Shan Lu
Why software defects? • More severe than hardware defects • Software causes 60% of outages [Gray’90] • Not well understood or studied • Different characteristics from hardware • A bug cannot be compared with a faulty hardware component
Why ‘field’ failures? • Field failures • Failures that happen in production runs • Different from defects detected in development & testing • Reflect the real-world ‘impact’
Overview • Analyze field failures in operating systems • Get statistics on • Impact of errors • Error type breakdown • Error trigger breakdown • Failure symptom distribution • Others • Use these results to guide future research
Outline • Motivation • Overview • Data source • Design • Analysis results • These results indicate … • Related work
Data source: RETAIN • RETAIN database • Remote Technical Assistance Information Network • APAR (Authorized Program Analysis Report) • Manually extract • Error type • Error trigger • Symptom • Sample the APARs • APAR contents • Symptoms • Context & environment • How to fix • Standard attributes • Severity (1–4) • HIPER • IPL
Overlay errors and general errors • Overlay errors • Errors that cause a storage overlay (memory corruption) • Hard to find and fix • Big impact on availability • Sample set obtained by keyword search • General errors • All errors, including overlay errors • Sample set obtained by random sampling • The two sample sets are compared
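The two sampling strategies on this slide could be sketched roughly as follows. This is only an illustration, not the study's actual procedure: the `Apar` record, its fields, and the keyword list are hypothetical, and the sample texts are made up.

```python
import random
from dataclasses import dataclass


@dataclass
class Apar:
    """Hypothetical, simplified stand-in for one APAR record."""
    apar_id: str
    text: str  # free-text problem description


# Assumed keywords; the real search terms are not given in the slides.
OVERLAY_KEYWORDS = ("overlay", "overlaid")


def overlay_sample(apars, keywords=OVERLAY_KEYWORDS):
    """Keyword search: keep APARs whose text mentions any keyword."""
    return [a for a in apars
            if any(k in a.text.lower() for k in keywords)]


def general_sample(apars, size, seed=0):
    """Random sampling over all APARs, overlay errors included."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    pool = list(apars)
    return rng.sample(pool, min(size, len(pool)))


if __name__ == "__main__":
    # Made-up records for illustration only.
    apars = [
        Apar("A1", "Storage OVERLAY in control block after free"),
        Apar("A2", "Wrong message text issued"),
        Apar("A3", "Queue header overlaid by long copy"),
        Apar("A4", "Job hangs waiting on lock"),
    ]
    print([a.apar_id for a in overlay_sample(apars)])
    print(len(general_sample(apars, size=2)))
```

Keeping the two samplers separate mirrors the slide: the overlay set is selected by content, while the general set is drawn without regard to content and so may contain overlay errors too.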
Error Type • Classes are orthogonal and sufficiently large • 13 types in total • Overlay: 8 • Allocation management • Pointer management • Copy overrun • Regular: plus 6 • Semantic errors • Synchronization errors • Unclassified
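To make the copy overrun type concrete, here is a small simulation of how an unchecked copy can overlay a neighboring field. It is a sketch under stated assumptions: the storage layout (an 8-byte name field followed by a 4-byte counter) and all function names are hypothetical, and a Python `bytearray` stands in for raw storage.

```python
# Hypothetical layout: an 8-byte name field followed by a 4-byte counter.
NAME_OFF, NAME_LEN = 0, 8
COUNT_OFF, COUNT_LEN = 8, 4


def make_storage():
    """Return zeroed storage with the counter field set to 7."""
    storage = bytearray(NAME_LEN + COUNT_LEN)
    storage[COUNT_OFF:COUNT_OFF + COUNT_LEN] = (7).to_bytes(COUNT_LEN, "big")
    return storage


def read_count(storage):
    """Read the counter field back as an integer."""
    return int.from_bytes(storage[COUNT_OFF:COUNT_OFF + COUNT_LEN], "big")


def unchecked_copy(storage, offset, data):
    """Copy data into storage at offset without checking the field size."""
    storage[offset:offset + len(data)] = data


def checked_copy(storage, offset, field_len, data):
    """Same copy, but refuse data that does not fit in the field."""
    if len(data) > field_len:
        raise ValueError("data longer than destination field")
    storage[offset:offset + len(data)] = data


if __name__ == "__main__":
    storage = make_storage()
    # 10 bytes into an 8-byte field: the last 2 bytes land in the counter.
    unchecked_copy(storage, NAME_OFF, b"ABCDEFGHIJ")
    print(read_count(storage) == 7)  # the counter no longer holds 7
```

The overrun raises no error at the point of the copy. The damage only shows up later, when the counter is read, which is one reason overlay errors are hard to find.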
Error triggering events • Boundary conditions • Bug fixes • Client code • Recovery or error handling • Timing • Unknown
Symptom codes • ABEND • Addressing error (may restart) • Endless wait • Incorrect output • Incorrect output without detecting the failure • Loop • OS goes into an infinite loop; needs restart • Message • Error message printed; local recovery, no ABEND
Outline • Motivation • Overview • Data source • Design • Analysis results • These results indicate … • Related work
Impact • Do overlay errors have more impact?
Error Type of Overlay Errors • Which is most common? • Copying Overrun (20%) • Allocation Mgmt. (19%) • Which has the most impact? • Allocation Mgmt. (31% HIPERs, 17% IPLs) • Pointer Mgmt. (16% HIPERs, 27% IPLs) • More about copying overrun • Less impact (13% HIPERs, 5% IPLs) • Why?
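A breakdown like the one on this slide could be tallied along these lines. This is a sketch under one assumed reading of the percentages (each error type's share of all sampled errors, and its share of the errors carrying a given flag). The record shape and the function name are hypothetical, and the slide does not say which denominator the study used.

```python
from collections import Counter


def share_by_type(records, flag=None):
    """Percentage of records per error type.

    records: iterable of (error_type, flags) pairs, where flags is a
             set of strings such as {"HIPER", "IPL"}.
    flag:    if given, only records carrying that flag are counted.
    """
    selected = [t for t, flags in records if flag is None or flag in flags]
    total = len(selected)
    if total == 0:
        return {}
    counts = Counter(selected)
    return {t: 100.0 * n / total for t, n in counts.items()}


if __name__ == "__main__":
    # Made-up records for illustration only; not data from the study.
    records = [
        ("copy overrun", set()),
        ("copy overrun", set()),
        ("allocation mgmt", {"HIPER"}),
        ("pointer mgmt", {"HIPER", "IPL"}),
    ]
    print(share_by_type(records))           # share of all errors
    print(share_by_type(records, "HIPER"))  # share of HIPER errors
```

Comparing the two results side by side shows how a type can be common overall yet contribute little to the high-impact subset, which is the pattern the slide reports for copying overrun.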
Error Type of Regular Errors • Which will dominate? • Impact • HIPERs: Overlay 14%; Undefined State 49% • IPLs: Overlay 4%; Synchr. 70% • [Chart: regular error types. Labels: Others, Overlay Error, Administrative Err. (Semantic Err.), Synchr. Error (?), Copying Overrun, Type mismatch, Undefined State]
Error Triggering Events • What’s your guess? • Mostly timing-related problems? (Heisenbugs) • Breakdown • What does it tell us?
What else can we do? • Dig more information out of RETAIN • Do better classification • Ask more interesting questions • Similar analysis on different applications • Try similar things for open source code
What does the data tell us? • Test case design • Test boundary conditions • Test recovery code • Bug detection • Memory bug detectors • Synchronization bugs • Tools that help fix bugs
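The test-design suggestions above (boundary conditions and recovery code) might look like this in practice. It is a minimal sketch: `BoundedBuffer` is a hypothetical test subject invented for illustration, not something taken from the study.

```python
class BoundedBuffer:
    """Hypothetical fixed-capacity buffer used only as a test subject."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items = []

    def put(self, item):
        if len(self.items) >= self.capacity:
            raise OverflowError("buffer full")
        self.items.append(item)


def test_fill_to_exact_capacity():
    buf = BoundedBuffer(2)
    buf.put("a")
    buf.put("b")  # boundary: the last free slot
    assert buf.items == ["a", "b"]


def test_one_past_capacity_is_rejected():
    buf = BoundedBuffer(1)
    buf.put("a")
    try:
        buf.put("b")  # boundary: first put beyond capacity
    except OverflowError:
        pass
    else:
        raise AssertionError("expected OverflowError")
    # The error-handling path must leave the buffer unchanged.
    assert buf.items == ["a"]


if __name__ == "__main__":
    test_fill_to_exact_capacity()
    test_one_past_capacity_is_rejected()
    print("boundary tests passed")
```

The second test covers both suggestions at once: it probes the first value past the limit and then checks that the state is still consistent after the error path has run.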
Related resources • National Vulnerability Database • Bugzilla (Mozilla, 1998)