SOFTWARE ENGINEERING • Today - motivation: - Software Engineering: Why? - Software Engineering: What? • We will start by examining some example cases.
Agency sends 16,000 tax forms to one man / 1 Source: http://www.csc.calpoly.edu/~jdalbey/205/Resources/irs_bug.html SACRAMENTO (Scripps-McClatchy)--Somewhere in the San Diego area, there's a dentist who's probably still grinding his teeth over his latest brush with California’s tax collectors. During one week in September his office received an avalanche of tax forms in the mail -- 16,000 sets of forms in 16,000 individual envelopes. "We did it," admitted Suzanne Schroeder of the state Employment Development Department. "It was a computer problem." The glitch occurred in a mailing of 1.4 million pieces that is sent out each quarter to employers, Schroeder explained.
Agency sends 16,000 tax forms to one man / 2 The department was using new computer software for producing address labels which was provided by the U.S. Postal service, Schroeder said. The Postal Service software was designed to read the word "suite" abbreviated as "ste," she continued. But the addresses in the department's database abbreviate "suite" as "su". When the software couldn't read "su", it was supposed to jump to the previous line and read it again, Schroeder said. But for this particular address, there was a foreign spelling on the previous line and the software couldn't read that either. That set off a series of other jumps, she added, until the computer began spitting out the same address over and over again. "We alerted the postal authorities and they corrected the problem with what they call a 'software patch,'" she said.
Inappropriate Bank Letter Form Reference: http://catless.ncl.ac.uk/Risks/14.89.html#subj3.1 <Kenneth.Wood@prg.ox.ac.uk> Fri, 27 Aug 93 16:55:35 BST The Feedback section of the latest New Scientist relates the following Computer Weekly story about an unfortunate programmer at an unnamed bank. Apparently, the bank wanted to target its wealthiest customers with a mailshot promoting various new services and the programmer in question wrote a program to select the 2000 wealthiest customers from the bank's records and to generate an appropriate letter for each. In the process of testing the program, he made use of a fictitious customer named Rich Bastard. Unfortunately, as you may already have guessed, something went amiss and every single one of the bank's 2000 prize customers received a letter which began "Dear Rich Bastard, ..."
Mars Orbiter Failure / 1 Reference: http://www4.cnn.com/TECH/space/9911/10/orbiter.03/ WASHINGTON (CNN) -- Failure to convert English measures to metric values was the root cause of the loss of the Mars Climate Orbiter, a spacecraft that smashed into the planet instead of reaching a safe orbit, a NASA investigation concluded Wednesday. In a scathing report released Wednesday, an investigation board concluded that NASA engineers failed to convert English measures of rocket thrusts to newton, a metric system measuring rocket force. One English pound of force equals 4.45 newtons. A small difference between the two values caused the spacecraft to approach Mars at too low an altitude and the craft is thought to have smashed into the planet's atmosphere and was destroyed.
Mars Orbiter Failure / 2 The report cited other contributing causes to the September 23 failure, including: • Undetected mistakes in modeling of spacecraft velocity changes. • Insufficient familiarity with the spacecraft on the part of the navigation team. • Inadequate training. • Inadequate communications between project teams. • The report also said the mission navigation team was overworked and not closely supervised by independent experts. • The panel made 10 different recommendations to ensure that a similar mishap is avoided with the Mars Polar Lander, currently en route for a December 10 touchdown on the red planet.
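The unit mix-up described in the story can be illustrated with a minimal Python sketch. It is not the actual NASA or contractor code; the names `Force`, `from_pounds_force` and `total_impulse` are hypothetical, and the conversion factor is the rounded 4.45 quoted in the story. The idea is that a force value carries its unit in its type, so a bare number in the wrong unit cannot slip through an interface unnoticed.

```python
from dataclasses import dataclass

# Rounded factor quoted in the CNN story: one pound of force = 4.45 newtons.
LBF_TO_NEWTON = 4.45


@dataclass(frozen=True)
class Force:
    """A force stored internally in newtons (hypothetical helper type)."""
    newtons: float

    @classmethod
    def from_newtons(cls, value: float) -> "Force":
        return cls(value)

    @classmethod
    def from_pounds_force(cls, value: float) -> "Force":
        # The conversion happens in exactly one place.
        return cls(value * LBF_TO_NEWTON)


def total_impulse(force: Force, burn_seconds: float) -> float:
    """Return the impulse in newton-seconds; reject bare numbers."""
    if not isinstance(force, Force):
        raise TypeError("force must be a Force, not a bare number")
    return force.newtons * burn_seconds


if __name__ == "__main__":
    thrust = Force.from_pounds_force(10.0)  # illustrative value only
    print(total_impulse(thrust, 2.0))
```

With this design, the team producing the data must state the unit when constructing the value, and the team consuming it always reads newtons.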
Mars Polar Lander Failure Reference: http://www.space.com/businesstechnology/technology/mpl_software_crash_000331.html The most likely cause of the lander’s failure, investigators decided, was that a spurious sensor signal associated with the craft’s legs falsely indicated that the craft had touched down when in fact it was some 130-feet (40 meters) above the surface. This caused the descent engines to shut down prematurely and the lander to free fall out of the martian sky.
Ariane Crash / 1 Reference: http://www.around.com/ariane.html (Story by James Gleick) It took the European Space Agency 10 years and $7 billion to produce Ariane 5, a giant rocket capable of hurling a pair of three-ton satellites into orbit with each launch and intended to give Europe overwhelming supremacy in the commercial space business. All it took to explode that rocket less than a minute into its maiden voyage last June, scattering fiery rubble across the mangrove swamps of French Guiana, was a small computer program trying to stuff a 64-bit number into a 16-bit space. One bug, one crash. Of all the careless lines of code recorded in the annals of computer science, this one may stand as the most devastatingly efficient. From interviews with rocketry experts and an analysis prepared for the space agency, a clear path from an arithmetic error to total destruction emerges.
Ariane Crash / 2 To play the tape backward: At 39 seconds after launch, as the rocket reached an altitude of two and a half miles, a self-destruct mechanism finished off Ariane 5, along with its payload of four expensive and uninsured scientific satellites. Self-destruction was triggered automatically because aerodynamic forces were ripping the boosters from the rocket. This disintegration had begun an instant before, when the spacecraft swerved off course under the pressure of the three powerful nozzles in its boosters and main engine. The rocket was making an abrupt course correction that was not needed, compensating for a wrong turn that had not taken place. Steering was controlled by the on-board computer, which mistakenly thought the rocket needed a course change because of numbers coming from the inertial guidance system. That device uses gyroscopes and accelerometers to track motion. The numbers looked like flight data -- bizarre and impossible flight data -- but were actually a diagnostic error message. The guidance system had in fact shut down.
Ariane Crash / 3 This shutdown occurred 36.7 seconds after launch, when the guidance system's own computer tried to convert one piece of data -- the sideways velocity of the rocket -- from a 64-bit format to a 16-bit format. The number was too big, and an overflow error resulted. When the guidance system shut down, it passed control to an identical, redundant unit, which was there to provide backup in case of just such a failure. But the second unit had failed in the identical manner a few milliseconds before. And why not? It was running the same software. This bug belongs to a species that has existed since the first computer programmers realized they could store numbers as sequences of bits, atoms of data, ones and zeroes: 1001010001101001. . . . A bug like this might crash a spreadsheet or word processor on a bad day. Ordinarily, though, when a program converts data from one form to another, the conversions are protected by extra lines of code that watch for errors and recover gracefully. Indeed, many of the data conversions in the guidance system's programming included such protection.
Ariane Crash / 4 But in this case, the programmers had decided that this particular velocity figure would never be large enough to cause trouble. After all, it never had been before. Unluckily, Ariane 5 was a faster rocket than Ariane 4. One extra absurdity: the calculation containing the bug, which shut down the guidance system, which confused the on-board computer, which forced the rocket off course, actually served no purpose once the rocket was in the air. Its only function was to align the system before launch. So it should have been turned off. But engineers chose long ago, in an earlier version of the Ariane, to leave this function running for the first 40 seconds of flight -- a "special feature" meant to make it easy to restart the system in the event of a brief hold in the countdown.
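The "extra lines of code that watch for errors" mentioned in the story can be sketched as follows. This is a Python illustration under assumptions, not the Ariane flight software: the function names and the sample values are hypothetical. It shows two common ways to protect a conversion from a 64-bit floating-point number to a 16-bit signed integer: fail loudly, or clamp to the representable range.

```python
# Range of a 16-bit signed integer.
INT16_MIN = -32768
INT16_MAX = 32767


def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed integer, raising if it does not fit."""
    result = int(value)  # truncates toward zero
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in 16 bits")
    return result


def to_int16_saturating(value: float) -> int:
    """Convert to a 16-bit signed integer, clamping out-of-range values."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


if __name__ == "__main__":
    print(to_int16_checked(1000.7))      # fits
    print(to_int16_saturating(40000.0))  # too large, clamped
    try:
        to_int16_checked(40000.0)
    except OverflowError as err:
        print("caught:", err)
```

Which variant is appropriate depends on the system: a checked conversion needs a caller that handles the error sensibly, while a saturating conversion silently hides the fact that the input was out of range.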
THERAC-25 – A Killing Treatment / 1 Reference: http://www.uoguelph.ca/~meby/ (Story by Mark Eby) Therac 25 was engineered by Atomic Energy Canada Limited (AECL) in conjunction with a French company CGR. It was an advancement in the fight against cancer. The million dollar, dual-mode linear accelerator was first developed in 1976 and the commercial version was available in 1982. There were eleven installed altogether, 5 in the USA and 6 in Canada. The machine precisely aimed a beam of radiation at a patient to treat tumors caused by cancer. The x-rays produced were used to reach deeper tissue in the human body. The machine had two settings, a low energy, 200-rad mode, and an x-ray mode of 25 million electron volt capacity. The low setting could be directly aimed at the patient whereas the high-energy mode had to aim at the patient through a thick tungsten shield. It was controlled through a terminal hooked up to an old Vax mainframe so that a technician could run it from another room.
THERAC-25 – A Killing Treatment / 2 In almost every case treatment went fine with no complications and it provided the necessary radiation to cure the cancerous tumors. In six of the cases of people being treated something went wrong. Human error, along with a bug in the software, caused the treatment to malfunction. Normally a patient is treated with low-energy doses of electrons from Therac 25. It is the increased, high-energy x-rays that caused a problem. In each case that Therac 25 malfunctioned, the technician entered the wrong dosage and then corrected it. The two modes, electron mode "e" and x-ray mode "x", were controlled from the Vax terminal. In Texas, the technician entered mode "x" instead of the proper mode "e". Upon realization of the error the technician scrolled up to "Edit", corrected the mistake, hit "e" and then hit "Enter". The total time that it took for the sequence of events to occur was less than eight seconds.
THERAC-25 – A Killing Treatment / 3 The technician believed that everything was all right and pressed "D" when the "Beam Ready" prompt came up. At the completion of the inputs an error message showed, so the technician reset the computer and did the sequence again. This time an error message showed and the system stopped. Meanwhile, the man being treated was being burnt by the radiation so bad that he got off the table he was on and found the technician. He complained about the pain he was feeling in his shoulder but the technician had no reason for the cause of the pain. It was believed that the proper mode was in place and that the proper dosage was being given. It was not until three weeks later when the same events occurred that the problem was discovered. When the technician reset the computer the arm with the tungsten withdrew but the beam did not switch. The man was bombarded with 25 000-rads with 25 million electron volts, 125x the normal dose. The man died four months later.
Huge USA/Canada electricity blackout, August 2003 • NOVEMBER 20, 2003 (COMPUTERWORLD) - The task force responsible for investigating the cause of the Aug. 14 blackout that crippled most of the Northeast corridor of the U.S. and parts of Canada concluded that a software failure at FirstEnergy Corp. "may have contributed significantly" to the outage. • The Interim Report of the U.S.-Canada Power System Outage Task Force, released today, highlights the failure of various IT systems that thwarted utility workers' ability to contain the blackout before it cascaded out of control, and found no evidence that malicious insiders or external saboteurs were responsible for the cascading power outage. • According to the task force, FirstEnergy's Alarm and Event Processing Routine (AEPR), a key software program that gives operators visual and audible indications of events occurring on their portion of the grid, began to malfunction. As a result, "key personnel may not have been aware of the need to take preventive measures at critical times, because an alarm system was malfunctioning."
USA/Canada electricity blackout, flow of events • The task force also provided a "cyber timeline" listing significant electronic control events that contributed to the rolling blackout. • The first major event occurred at 12:40 p.m. EDT, when an engineer from the Midwest Independent Transmission System Operator disabled an automatic periodic trigger on software that allows the utility to determine the real-time status of the power system for its region. That action was needed to conduct a manual check of the network, the report states. However, the engineer later went to lunch and forgot to re-engage the automatic trigger. • By 2:40 p.m. EDT, the AEPR software began to malfunction, although FirstEnergy engineers weren't aware of the problem at the time. One minute later, FirstEnergy's AEPR server failed and switched over automatically to the backup server. Engineers, however, remained unaware of any other problems with the software. Then, at 2:54 p.m., the backup server failed. • At 3:05 p.m., when the first power-line failure occurred at FirstEnergy, system operators did not receive alarm notifications because of the malfunctioning AEPR software. • That software continued to malfunction until 3:42 p.m., when the lights at FirstEnergy's control facility flickered and alerted engineers to the larger problem. It was only then that an operator noticed the problem with the AEPR software.
USA/Canada electricity blackout, conclusions • In a statement released Wednesday, FirstEnergy President and Chief Operating Officer Anthony J. Alexander said the company remains "convinced" that the blackout cannot be traced to any one utility system. • "We recognize that our computer system experienced problems that day," said Alexander. "After an extensive analysis, we submitted a report to the Task Force that identified a previously undetected flaw in vendor software that resulted in the loss of an alarm function, affecting our operators' understanding of events on our system." • However, "by focusing its analysis on a few selected events, the conclusions the Task Force reached don't address the complexity and magnitude of operations on the interconnected grid," Alexander said. • (All of this story is from Computerworld)
Those are just a few examples of failing software – so what? • Not all of these are just “funny failures”. • Big software projects may be the most complicated engineering products ever produced by mankind. • Let’s look at statistics to see how well software projects do in general. • Typically, a software project is considered a success if satisfactory software is produced within the expected time and budget. • A total failure is a project whose software cannot be used at all or is never produced. • In between are the cases where the software is late or needs considerable changes.
Economic profit = software project success? • Consider a company that does the following: 1. It makes software for, say, office use. The software contains a number of bugs. 2. When customers complain about the bugs, the company makes a new version of the software to remove these bugs. 3. In addition to bug fixes, the company adds several new features to the software. However, the features contain new bugs. 4. The customers buy the new version and complain about the bugs. 5. Go to Step 2.
Success Studies - ONNI’88 (Finland) • Over 100 projects • Good success: 33% • Questionable: 42% • Failure: 25%
ONNI’88 - Reasons for Failure • Inabilities of the Software Engineering personnel • Insufficient resources • Management problems
Success Studies: USA’82 - Gibson & Singer • 18 projects • Good success: 17% • Partly usable/in use: 28% • Satisfactory (just about?): 11% • Failure: 22% • No evaluation: 11%
USA’82 – Reasons for Failure • Organisational problems • New work methods and salary policies • Unexpected changes in business
Forsman’s Studies (Finland) • From Forsman’s book ”ATK-projektin läpivienti”, Suomen ATK-kustannus Oy, 1995 • 17 projects • Good success: 18% • Partially usable/in use: 29% • Satisfactory (just about?): 29% • Failure: 24%
Forsman – Reasons for Failure • Problems in organisation and attitudes • The customer could not decide what it wants • Problems with customer and software producer communication • Inexperience of the software producer
General Success Factors • Good project management • Understanding the needs and freezing the requirements • Controlled implementation and delivery • Skilled project personnel • Sufficient resources • Good communication between groups
General Major Errors • Too optimistic design • Over-emphasizing technology • Management problems • No profitability pre-evaluation • Unrealistic resources • Communication problems
A Belief in Systematic Work Methods • There is no way to guarantee success • Sometimes even poor practices seem to bring some success • However, the studies suggest that the way software is made matters for success, at least statistically • Practical observations support this belief
Large Systems • Some problems grow with the size of the software. Example sizes of large systems: • Dutch KLM airline reservation system (1993), 2 000 000 assembler loc (lines of code) • Unix System V, release 4.0 with Xnews and X11, over 3 700 000 loc • Nokia NMS/2000 network management system, over 2 400 000 loc • NASA Space Shuttle software, 40 000 000 lines of object code • IBM OS360: 5000 man years of development
General Examples of Risks with Failing Software • Nuclear war • General technical device malfunction (e.g. failing life-support devices) • Financial transaction failures • Economic losses in business • Personal tragedies from faulty information • Y2K was considered to imply several different kinds of risks, but all went reasonably well
Software Engineering – What? • IEEE: Software engineering is the systematic approach to the development, operation, maintenance, and retirement of software • An earlier definition: Software engineering is the establishment and use of sound engineering principles in order to obtain economically software that is reliable and works efficiently on real machines.
Software Engineering – practical observations • Software engineering concerns the construction of large programs. • Mastering complexity is essential. • Regular co-operation between people is an integral part of the process. • Software evolves. • Software development efficiency is important. • Meeting all these challenges while producing good-quality software is hard.