
Death by Software

Death by Software. The Therac-25 Radio-Therapy Device Brian MacKay ESE6361 - Requirements Engineering – Fall 2013. The Atomic Age. World War II ushered in the atomic age The start of the nuclear arms race In many countries… The question was how to harness this power for peaceful purposes.


Presentation Transcript


  1. Death by Software The Therac-25 Radio-Therapy Device Brian MacKay ESE6361 - Requirements Engineering – Fall 2013

  2. The Atomic Age • World War II ushered in the atomic age • The start of the nuclear arms race • In many countries… • The question was how to harness this power for peaceful purposes

  3. In Canada: AECL • Atomic Energy of Canada Limited is a “Crown Corporation” • Designed and implemented a Heavy Water nuclear reactor • The CANDU system • It also included AECL-Medical • Harnessing the atom for medical reasons

  4. AECL & CGR – Medical Accelerator Technology • AECL-Medical and the French company: la Compagnie Générale de Radiologie (CGR) • Worked together during the 1970s on using linear accelerators for radio-therapy • High energy, low dose, Electron beams, or • A stream of photons in the X-Ray spectrum • The two companies’ partnership produced • The 6 MeV, X-Ray only “Therac-6” • The dual mode, 20 MeV “Therac-20”

  5. Therac-6 & Therac-20 • Stand-alone electro-mechanical units • Operator could • Set all settings manually • Position beam devices manually • Once everything was set, and system was “safe” – deliver the dose • The system had an optional computer that allowed a simpler UI • A Digital Equipment PDP-11 • 32 kilobytes of memory • All assembly code

  6. True Innovation: the Therac-25 • AECL only – CGR partnership had dissolved • Used a Double-Pass accelerator • Halved the space that the Therac-6 & Therac-20 had occupied • Made the computer the primary controller • No stand-alone manual mode • Shipped in 1983 • Still used a DEC PDP-11

  7. It was the best on the market… • Except… • It seriously injured 6 patients between 1985 and 1987 • Killing 3 of those patients • All because of software

  8. Hubris • When an engineer graduates in Canada, he/she attends The Ritual of the Calling of an Engineer • And gets an Iron Ring • Rudyard Kipling wrote the ceremony • Instills a sense of professionalism • And humility

  9. Supreme Faith in Software • It appears that this device had rigorous safety engineering on the hardware side • Complete hazard analysis – fault tree • On the software side, the likelihood of error was described in insanely low terms • Fault probabilities on the order of 10^-9 and 10^-11 • “Software does not degrade due to wear, fatigue or the reproduction process” • They had no expectation that a bug could cause a problem

  10. Malfunction 54 • When there was a problem, the UI displayed the word “Malfunction” followed by a number 1-64 • There was NO documentation of what these codes were in the user manual • An internal AECL service manual described #54 as “dose input 2” and pointed out that this error code was only there for internal diagnostic reasons • Under normal conditions, an operator might see as many as 40 malfunction codes in a day • But Malfunction 54 was very rare • Malfunction codes were easily dismissed by pressing [P] (for “Proceed”)

  11. Electron Mode vs. X-Ray Mode • In Electron Mode a low power beam is scanned across the patient • In X-Ray mode a high power beam is aimed at a target, producing X-Rays, which then irradiate the patient • The electron scanning mechanism and X-Ray target were mounted on a turntable • The position was controlled by the computer

  12. Usability • User interface was a VT-100 Green Screen • Contained the Prescription • Entered by the operator • Originally – on error, prescription had to be re-entered • Usability studies changed this, near the end of the dev cycle • Introduced a major error

      PATIENT NAME   : JOHN DOE
      TREATMENT MODE : FIX      BEAM TYPE: X      ENERGY (MeV): 25

                                  ACTUAL    PRESCRIBED
      UNIT RATE/MINUTE               0         200
      MONITOR UNITS                50 50       200
      TIME (MIN)                    0.27       1.00

      GANTRY ROTATION (DEG)          0.0         0      VERIFIED
      COLLIMATOR ROTATION (DEG)    359.2       359      VERIFIED
      COLLIMATOR X (CM)             14.2      14.3      VERIFIED
      COLLIMATOR Y (CM)             27.2      27.3      VERIFIED
      WEDGE NUMBER                   1           1      VERIFIED
      ACCESSORY NUMBER               0           0      VERIFIED

      DATE   : 84-OCT-26    SYSTEM : BEAM READY     OP.MODE: TREAT   AUTO
      TIME   : 12:55. 8     TREAT  : TREAT PAUSE             X-RAY   173777
      OPR ID : T25VO2-RO3   REASON : OPERATOR       COMMAND:

  13. A Race Condition – UI & Operations Threads • On the Therac-25, the operator entered the prescription information • Including the Electron/X-Ray mode • Then a command to execute • If the operator • Entered X-Ray mode in error • Re-edited the page and changed it to Electron • Then executed the dose, all within 8 seconds • Then the patient received the high-power X-Ray-mode beam directly through the Electron position of the turntable, with no X-Ray target in the path

      PATIENT NAME   : JOHN DOE
      TREATMENT MODE : FIX      BEAM TYPE: X      ENERGY (MeV): 25

                                  ACTUAL    PRESCRIBED
      UNIT RATE/MINUTE               0         200
      MONITOR UNITS                50 50       200
      TIME (MIN)                    0.27       1.00

      GANTRY ROTATION (DEG)          0.0         0      VERIFIED
      COLLIMATOR ROTATION (DEG)    359.2       359      VERIFIED
      COLLIMATOR X (CM)             14.2      14.3      VERIFIED
      COLLIMATOR Y (CM)             27.2      27.3      VERIFIED
      WEDGE NUMBER                   1           1      VERIFIED
      ACCESSORY NUMBER               0           0      VERIFIED

      DATE   : 84-OCT-26    SYSTEM : BEAM READY     OP.MODE: TREAT   AUTO
      TIME   : 12:55. 8     TREAT  : TREAT PAUSE             X-RAY   173777
      OPR ID : T25VO2-RO3   REASON : OPERATOR       COMMAND:

      Malfunction 54
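The sequence on this slide can be sketched as a small deterministic simulation. This is only a simplified model written in Python for readability (the original software was PDP-11 assembly); the names `simulate`, `is_hazardous`, `requested`, `beam` and `turntable` are hypothetical, and the only figure taken from the slide is the 8-second window. It assumes the setup task copies the mode once, at the start, and does not look at the shared variable again until setup has finished.

```python
SETUP_SECONDS = 8  # length of the setup window quoted on the slide


def simulate(edit_at, recheck_before_beam=False):
    """Return (beam_mode, turntable_mode) after one hypothetical session.

    The operator first enters X-ray mode ("X") and then corrects it to
    electron mode ("E") `edit_at` seconds after setup has started.
    """
    requested = "X"          # shared variable written by the operator task
    beam = requested         # setup task latches beam parameters at t = 0

    requested = "E"          # the operator's correction
    if edit_at >= SETUP_SECONDS:
        # Edit arrives after setup has finished: it is noticed and setup reruns.
        beam = requested
    elif recheck_before_beam:
        # Hypothetical fix: compare the latched value with the shared
        # variable once more before enabling the beam.
        beam = requested
    # Otherwise the edit landed inside the setup window and went unnoticed.

    turntable = requested    # the turntable follows the latest request
    return beam, turntable


def is_hazardous(beam, turntable):
    """High-power X-ray beam settings with the turntable in electron position."""
    return beam == "X" and turntable == "E"
```

Calling `simulate(3)` models a fast edit and yields the mismatched pair, while `simulate(10)` models a slow edit and yields a consistent one. That difference is consistent with the bug surfacing only for operators who had become quick at editing the screen.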

  14. Why Have One Deadly Bug? • A second deadly bug was eventually found in the Therac-25 • The system periodically tested whether everything was positioned properly, setting a variable with the result of the test • A zero indicated OK • Instead of simply setting the value to 1 or 0, the program incremented the value • And the variable was a single byte, so it wrapped around to zero on overflow • The result was that on every 256th test of the positioning, the system would falsely indicate that everything was ready to proceed.
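The wrap-around on this slide is easy to reproduce. The sketch below is an illustration in Python, not the original PDP-11 assembly; the function names and the variable `flag` are hypothetical, and the one-byte storage is emulated with `& 0xFF`. It assumes the machine is mispositioned on every pass, so the flag should never read zero.

```python
def false_ok_passes_buggy(passes):
    """Pass numbers on which a one-byte, incremented flag wrongly reads 0 (OK)."""
    flag = 0
    false_ok = []
    for n in range(1, passes + 1):
        flag = (flag + 1) & 0xFF   # 8-bit increment: 255 + 1 wraps to 0
        if flag == 0:              # zero means "OK to proceed"
            false_ok.append(n)
    return false_ok


def false_ok_passes_fixed(passes):
    """Same loop, but the flag is set to a constant instead of incremented."""
    flag = 0
    false_ok = []
    for n in range(1, passes + 1):
        flag = 1                   # assignment cannot overflow
        if flag == 0:
            false_ok.append(n)
    return false_ok
```

With 1000 passes, the incrementing version reports a false OK on passes 256, 512 and 768, and the assigning version reports none.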

  15. Noteworthy: The Users Found the Bugs • It’s worth noting that AECL’s reaction to the problems initially was denial • Eventually, they got to the stage where they did piecemeal fixes • Without the efforts of the staff at the East Texas Cancer Center in Tyler, AECL might never have acknowledged the first bug • After two accidents – with the same operator – the center’s staff spent time trying to recreate the race condition • After the Therac-25, the FDA changed the way it evaluated software (and software engineering) in medical devices.

  16. The Scorecard • One patient died of cancer, but would have died of radiation poisoning in a few weeks had the cancer not killed him

  17. Not the Bugs – The Software Engineering • All software systems have bugs • Even Knuth hands out the occasional $2.56 check • AECL coalesced their entire operator interface, control system and safety system into one program • They apparently had very little in the way of formal requirements gathering, design or development standards • All of the software was developed by one programmer • Their reaction to the problems was to fix them one at a time

  18. Software Reuse • The Therac-20 reused some of the software from the Therac-6 • The Therac-25 reused software from both of the previous models • But • The earlier models had hardware interlocks to prevent over-dosing • The desire to reuse previous software resulted in a • Home-made real-time operating system • On an expensive, 10 year old computer system • Running a program written entirely in assembly language • That relied on global variables for inter-task communication – without synchronization

  19. No Requirement to Separate Layers • AECL architected the Therac-25’s software into a single point of failure • This was far from accepted practice in the early 1980s • Safety systems were migrating from hardware to software • But… they were usually separate, simpler systems – e.g. PLCs • By the early 80s, there were usually three distinct layers • Safety and integrity • Control and positioning • Operator interface and supervisory
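The layering argument on this slide can be illustrated with a short sketch. It is written in Python only to show the separation of concerns; on real equipment of that era the safety layer would more likely have been hardware or a PLC, as the slide notes. All names here (`Turntable`, `Beam`, `interlock_permits`, `fire`) are hypothetical. The point is that the safety check sees only sensed state, knows nothing about the prescription or the operator interface, and sits between the control layer and the beam.

```python
from enum import Enum


class Turntable(Enum):
    ELECTRON = "electron"
    XRAY = "xray"
    UNKNOWN = "unknown"   # e.g. turntable between positions


class Beam(Enum):
    LOW = "low"           # electron-mode beam
    HIGH = "high"         # X-ray-mode beam


# Safety layer: the only combinations that are ever allowed to fire.
PERMITTED = {
    (Turntable.ELECTRON, Beam.LOW),
    (Turntable.XRAY, Beam.HIGH),
}


def interlock_permits(sensed_turntable, requested_beam):
    """Independent check based on sensed position, not on UI or control state."""
    return (sensed_turntable, requested_beam) in PERMITTED


def fire(requested_beam, sensed_turntable):
    """Control layer: every beam request has to pass through the interlock."""
    if not interlock_permits(sensed_turntable, requested_beam):
        return "BLOCKED"
    return "FIRED"
```

Under this arrangement a bug in the control or interface layer can still request a high-power beam with the turntable in the electron position, but `fire(Beam.HIGH, Turntable.ELECTRON)` returns `"BLOCKED"` instead of reaching the patient.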

  20. Testability – Auditing • AECL’s task architecture and real time OS made adequate testing nearly impossible • Look at the deadly errors – neither is discoverable through testing • No auditing of operations or failures was included in the system • After all the issues with the Therac-25, a check was done on the Therac-20 system and the same bugs were found • But, because that system had mechanical interlocks, no injuries resulted

  21. References • “Medical Devices – The Therac-25”, Leveson, Nancy. http://sunnyday.mit.edu/papers/therac.pdf • “An Investigation of the Therac-25 Accidents”, Leveson, Nancy and Turner, Clark S., IEEE Computer, Vol. 26, No. 7, July 1993, pp. 18-41. http://courses.cs.vt.edu/~cs3604/lib/Therac_25/Therac_1.html • “Fatal Dose - Radiation Deaths linked to AECL Computer Errors”, Rose, Barbara Wade, Saturday Night (magazine), June 1994. http://www.ccnr.org/fatal_dose.html • “Safety-Critical Computing: Hazards, Practices, Standards, and Regulation”, Jacky, Jonathan. http://staff.washington.edu/jon/pubs/safety-critical.html • “Therac-25”, Wikipedia. http://en.wikipedia.org/wiki/Therac-25 • “PDP-11”, Wikipedia. http://en.wikipedia.org/wiki/PDP-11 • “PDP-11 architecture”, Wikipedia. http://en.wikipedia.org/wiki/PDP-11_architecture
