
SMU CSE 8314 / NTU SE 762-N Software Measurement and Quality Engineering


Presentation Transcript


  1. SMU CSE 8314 / NTU SE 762-N Software Measurement and Quality Engineering Module 08 Analyzing Failures

  2. Analyzing Failures • Most of what humans know has resulted from analyzing failures. • The key is to learn from failures and not repeat them. • One way to learn from failures is to look for patterns common to many failures and extrapolate from that.

  3. Radar Error Investigated in Guam Jet Crash “AGANA, Guam -- A software error crippled an airport radar system that might have prevented last week’s deadly crash of a Korean Air jet in Guam, federal investigators said Sunday. The FAA Radar Minimum Safe Altitude Warning System normally issues an alert if a jet is flying too low ... but the system ... was modified recently and an error apparently was inserted into the software.” • Later examinations showed the same flaw in several airport radars around the world.

  4. AOL Outage Leaves Computer Users in the Dark “America Online users found themselves unexpectedly off line Wednesday as a computer glitch crashed the network worldwide and caused the most extensive cyberspace blackout ever. ... more than 6 million users were affected by the outage, which began during an attempt to install new software at 3am ...” “The crash came at an awkward time for the company, which is reporting fourth-quarter earnings Thursday.” Dallas Morning News, August 8, 1996

  5. A Few More, as reported by ACM Software Engineering Notes: • 1979 - 50 false alerts from NORAD defense system • 1983 - Computer bug showed ghost train near Embarcadero station on San Francisco Muni • 1983 - United Airlines 767 iced up because fuel-saving computer was over-efficient, causing engines to cool down too much on approach to Denver • 1985 - “Compatible” teller machines of two British banks handled leap year differently, withholding cash and confiscating cards during New Year holiday • 1985 - Woman killed daughter and tried suicide after computer incorrectly diagnosed incurable disease

  6. And More • 1980 - Computer error caused nuclear reactor in Florida to overheat • 1983 - Vancouver Stock Exchange Index rose by 50% when 2 years of round-off errors were corrected • 1981 - Department store anti-theft microwave device reprogrammed the heart pacemaker of a customer, killing its user • 1984 - 180 degree heading error caused Soviet test missile to aim for Hamburg instead of the Arctic

  7. Unnamed Financial Institution “... The institution decided to mailshot 2000 of its richest customers, inviting them to buy extra services. One of its computer programmers wrote a program to search through its databases and select its customers automatically. He tested the program with an imaginary customer called Rich Bastard. “Unfortunately, an error resulted in all 2000 letters being addressed ‘Dear Rich Bastard’.” New Scientist, 28 August, 1993 (Feedback Column)

  8. Horror Stories from Weinberg • National Bank • Public Utility • State Lottery • Stock Broker • Buying Club Statement • Universal patterns & recommended approaches We will use these examples to analyze the common patterns. Weinberg, Chap. 10 (see references)

  9. Example 1 -- National Bank • COBOL program generated a message with the details of each loan -- this was a legal document • Each message had a unique serial number • Problem: running out of digits for serial number • Solution: top management said “expand the serial number field - pronto” • Implementation: changed, tested, shipped

  10. National Bank -- Result • Actual loan receipts were less than estimated loan receipts by a small amount per loan • But when all loans were added up, the total discrepancy was more than a billion dollars! • Diagnosis of error: serial number expansion overlaid two low-order digits of the interest rate field • 7.3845% becomes 7.3801%
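
The diagnosis above is a record-layout overlay. A minimal, hypothetical sketch of that mechanism follows (Python rather than the bank's COBOL; all field offsets, widths, and values are invented for illustration):

```python
# Hypothetical sketch of the overlay: a fixed-width record holds the interest
# rate (5 digits, implied decimal point) immediately before the loan serial
# number. Widening the serial field "in place" keeps its right edge fixed,
# so the field now begins two positions earlier -- on top of the rate's two
# low-order digits. Offsets, widths, and values are invented for illustration.

record = list("73845" + "999999")   # rate 7.3845%, 6-digit serial nearly exhausted

def write_widened_serial(rec, serial, old_start=5, old_width=6, new_width=8):
    # The widened field keeps its old right edge, so it starts
    # (new_width - old_width) characters earlier than before.
    start = old_start - (new_width - old_width)
    rec[start:start + new_width] = list(serial.zfill(new_width))

write_widened_serial(record, "1234567")      # the field is expanded "pronto"
rate = int("".join(record[0:5])) / 10_000    # downstream readers are unchanged
print(rate)                                  # 7.3801 instead of 7.3845
```

A per-loan error of a few hundredths of a percent is exactly the kind of "small effect" the universal pattern later in the module describes: invisible on any one loan, enormous once summed across the portfolio.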

  11. Example 2 -- Public Utility • Problem: new rates required new billing algorithm • Solution: change a few constants • Implementation: changed, tested, shipped

  12. Public Utility - Some Time Later ... • Receipts lower than expected • Diagnosis of error: two digits of one constant had been transposed: 75 became 57 • Loss was $42M to $1.1B • Weinberg discovered four similar cases!
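
Slide 12's diagnosis is a single transposed constant whose per-bill effect is easy to miss but enormous in aggregate. The sketch below illustrates that arithmetic; only the 75-to-57 transposition comes from the slide, while the constant's role as a per-kWh adjustment and all customer and usage figures are invented:

```python
# Back-of-the-envelope sketch of how one transposed constant bleeds revenue.
# Only the transposition ("75 became 57") is from the slide; every other
# figure, including the constant's meaning, is invented for illustration.

CORRECT_FACTOR = 0.0075      # intended $/kWh adjustment
SHIPPED_FACTOR = 0.0057      # digits transposed in the "trivial" change

customers = 2_000_000
avg_kwh_per_month = 1_000
months_until_noticed = 12

undercharge_per_bill = avg_kwh_per_month * (CORRECT_FACTOR - SHIPPED_FACTOR)
total_loss = customers * months_until_noticed * undercharge_per_bill

print(f"${undercharge_per_bill:.2f} per bill")   # $1.80 -- easy to overlook
print(f"${total_loss:,.0f} in total")            # about $43 million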

  13. Example 3 -- New York State Lottery • Goal: Print tickets for a special lottery (variant of regular lottery) • Unplanned but “trivial” change to program • Solution: change 1 digit in existing program • Implementation: change, test, ship

  14. New York State Lottery -- A Few Weeks Later ... • Two customers with identical lottery numbers • Public outcry • Confidence in lottery integrity plunged • Explanation: “trivial software error” • did not appease the public • All lotteries shut down pending report of “blue ribbon investigating committee”

  15. New York State Lottery -- 11 Months Later ... • Lotteries finally reestablished • Total loss of revenue: $44-55 Million

  16. Example 4 -- Stock Broker’s Statement • Problem: Spurious line of $100,000.00 showed up on quarterly report summary of 1,500,000 accounts • 20% of customers noticed and called • 50,000 account-rep hours spent dealing with phone calls ($1 Million) • Unknown amount of customer time & loss of customer confidence • Diagnosis: Failure to clear a line in the print area before generating output.
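
The diagnosis -- a line in the print area that was never cleared -- is a classic stale-buffer defect. A minimal, hypothetical sketch of that mechanism (field names and values are invented; the original was a batch report program):

```python
# Hypothetical sketch of the "uncleared print area" failure: one shared line
# buffer is reused for every statement line, and a section that omits the
# amount field inherits the previous line's value. Names and values invented.

line_buffer = {"description": "", "amount": ""}

def emit_line(description, amount=None):
    line_buffer["description"] = description
    if amount is not None:       # bug: the amount field is not cleared when omitted
        line_buffer["amount"] = f"{amount:,.2f}"
    print(f'{line_buffer["description"]:<30}{line_buffer["amount"]:>14}')

emit_line("Money market balance", 100_000.00)   # legitimate line
emit_line("Pending transfers")                  # stale $100,000.00 prints again
```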

  17. Example 5 - Buying Club Statement for Members • Situation: new phone number -- must be printed on bills for customer inquiries • Problem: one incorrect digit -- customers called a doctor instead • Doctor was flooded with erroneous calls • Patients could not get through to the doctor

  18. Buying Club Ramifications • Doctor sued buying club • Doctor received a large but undisclosed amount • Analysis: • Copied a constant wrong in a COBOL program while updating the phone number information • Inadequate inspection due to lack of time and trivial nature of change

  19. Universal Pattern • Existing system is considered reliable • Quick change to the system is desired, usually based on a request from high up in the organization (i.e., not planned, little time) • Change is deemed trivial • Change is made without the usual software engineering safeguards (under orders from top)

  20. Universal Pattern (continued) • Change is put into normal operation • Effect is small, so not noticed immediately • Cumulative effect has very large consequences The change is trivial to IMPLEMENT BUT ... The potential CONSEQUENCES are huge

  21. Universal Pattern Continues after Problem is Detected • Management reaction is to minimize the magnitude, so consequences continue longer than necessary • When magnitude is undeniable, the programmer is fired for doing exactly what he or she was ordered to do

  22. What Happens After That ... • The supervisor of the programmer is demoted to take the programmer’s place and “do it right” • The manager who assigned work to the supervisor is given a special assignment to improve software engineering practices • Higher managers are untouched - deemed blameless

  23. Two Rules of Failure Prevention 1) Nothing is too small to be worth observing 2) Loss of XX dollars is always the responsibility of an executive whose financial responsibility exceeds XX dollars

  24. Weinberg’s Suggestion for Failure Prevention “What is the earliest, cheapest, easiest & most practical way to detect failures?” Answer: look at other organizations • We are unwilling to learn from others’ mistakes until we make the same ones. • We assume it cannot happen to us because, if it did, it would be a disaster, and we don’t want to think about that

  25. Why do we Have So Many Software Failures? • Because people are not perfect! • Physicists: 2nd law of Thermodynamics: “Nothing can be perfect” • Psychologists: When copying numbers, people are highly likely to make mistakes • Imperfection is the expected situation • Perfection is the exception

  26. What to Observe • Merely observing that people make mistakes has no significance by itself • Observing HOW people make mistakes can help us devise processes to catch them and make them less likely

  27. What Do We Tend to Do in the Event of Failure?

  28. What Do We Tend to Do in the Event of Failure? Blame the People who Make Mistakes

  29. Consequences of Blaming People • An environment where people hide mistakes instead of seeking them out and airing them to find ways to correct them • We waste energy searching for culprits • We distract attention from the proper responsibility of managers -- defining procedures to catch failures and prevent dire consequences

  30. Ways People Fail - I Frailty: people tend to make mistakes Folly: people have good but wrong intentions, e.g. • “hard coding” constants into COBOL programs Fatuousness: unwilling to learn or incapable of learning from mistakes. • Consequently we repeat the same mistakes Fun: (hacker’s disease) trying to have a good time and losing one’s sense of responsibility

  31. Examples of “Fun” • Subroutine flashed all the lights on the mainframe and shut it down for 10 minutes • Virus displayed an incorrect screen at random points in the program • Macintosh finger pointer changed to use second finger rather than index finger (not detected by software testers)

  32. Overcoming these Tendencies • Make everything open and visible • Encourage detection and correction of defects • Make the work fun so people do not need to invent fun things to do • Encourage professionalism in the workplace

  33. Ways People Fail - II • Fraud: people are greedy • Managers should figure out what is worth stealing and take precautions about it • Fanaticism: revenge, imagined or real wrongs • Managers should keep this in mind when instituting discipline and precautions against fraud, etc. • Failure of Hardware: unlikely but possible • Fate: Bad luck • Global string replace “luck” with “management”

  34. Hardware Failure -- Unlikely, but ... • When hardware failure is blamed ... • This is a convenient alibi. Look for something being concealed. • Make sure there are suitable procedures, such as backing up source code, etc. • Review your hardware supplier - should you be managing the relationship differently? • Look for user actions disguised as hardware failure. • User actions often not correctly predicted

  35. The Costs of Failure

  36. Summary • Software and computer failures are responsible for many horror stories • A common pattern is to focus on the work required to make the change rather than the potential consequences of the change • The normal situation is for people to make mistakes, but systems are too often designed for people who behave perfectly

  37. Summary • Study the ways people fail • The later the detection of the error, the more costly the consequences tend to be

  38. References • Weinberg, Gerald M., Quality Software Management, Volume 2: First-Order Measurement, Dorset House, 1993, ISBN 0-932633-24-2

  39. END OF MODULE 08
