
Safety and Software Engineering


Presentation Transcript


    1. Safety and Software Engineering. Jim Land: BSAE, WVU, 1962; MSAE, USC, 1965; Member, SAE; Co-Founder, High Integrity Solutions, Ltd.; President, Irvine Labs, Inc.; Associate, USC CSE.

    2. A View of Safety. "A life without adventure is likely to be unsatisfying, but a life in which adventure is allowed to take whatever form it will is likely to be short." (Bertrand Russell) Whatever we do, we are exposed to degrees of unsafe situations, and there are extremes: fear of leaving the house; unrestrained high-danger sports. In general, society sets levels of acceptable risk, and individually we operate within the levels set by society. In these next two classes we will look at aspects of safety and its impact on systems and software engineering.

    3. High Assurance. We are really interested in a class of systems called High Assurance Systems (HAS). Types of HAS: safe systems; secure systems; systems in which the financial impact of failure is large (banking systems); systems in which the environmental impact of failure is large (nuclear, waste water, etc.). For ease, we will limit our consideration to safe systems.

    4. Our Field of View. Where software is likely to be involved, and where software is involved in systems that demand high-integrity safety: Preventative -- software is used to assure an acceptable level of safety in the system architecture; or Causative -- software could be a contributing cause of a system's failure to deliver an acceptable level of safety.

    5. A Two Day Overview. Day One: Engineering for Safety, a Perspective; High Integrity Overview (definitions and concepts, examples of high integrity systems, popular examples of safety failures); Specifying High Integrity Systems (domain, requirements, and stakeholder needs); Analyzing High Integrity Systems (methods and techniques, industrial standards). Day Two: Developing Safer Systems (architectures, redundancy, fault tolerance, the Byzantine Generals Problem, development processes); Assuring Safer Systems (safety engineering, verification and validation concepts, certification); A Compendium of Tools.

    6. Some good references

    7. What is safe? (See IEC 61508.) "Safe" is measured in the eyes of the beholder: a condition of exposure under which there is a practical certainty that no harm will result to exposed individuals; freedom from danger or the risk of harm; "freedom from unacceptable risk of harm" (IEC 61508). Harm (in the context of a system): physical injury, or damage to health, property, or the environment, that may be caused by the system; also economic harm. Risk is the probable rate of occurrence of a hazard causing harm and the degree of severity of the harm. Risk has two elements: the frequency with which a hazard occurs, and the consequences of the hazardous event.
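
    This definition has a natural quantitative reading: expected harm per unit time is the hazard's rate of occurrence multiplied by the severity of its consequences. A minimal sketch in Python; the hazard names, rates, and severity weights are hypothetical, for illustration only:

```python
# Risk in the IEC 61508 sense: rate of occurrence of harm x severity of harm.
# All hazards, rates, and severity weights below are hypothetical examples.
hazards = {
    # hazard: (occurrences per year, severity weight)
    "relief valve sticks open": (1e-3, 100.0),
    "level sensor reads low":   (1e-2, 10.0),
}

for name, (frequency, severity) in hazards.items():
    risk = frequency * severity  # expected harm per year
    print(f"{name}: risk = {risk:.3f} harm-units/year")
```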

    8. Safety Integrity Level. Level of safety: how far safety should be pursued in a given context, assessed against an acceptable level of risk, based on the values of society. In order to achieve an acceptable level of risk, we need to determine whether: (a) the risk is so great that it must be shunned; or (b) the risk is, or has been made, so small as to be insignificant; or (c) the risk falls between (a) and (b) and has been reduced to the lowest level practicable (bearing in mind the benefits flowing from its acceptance and taking into account the costs of any further reduction). The result is expressed as a Safety Integrity Level (SIL).
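
    IEC 61508 makes the SIL concrete by banding the average probability of failure on demand (PFDavg) of a low-demand safety function: SIL 4 is the most demanding band (10^-5 to 10^-4) and SIL 1 the least (10^-2 to 10^-1). A small sketch mapping a required PFDavg to the band it falls in:

```python
# IEC 61508 low-demand-mode SIL bands, expressed as PFDavg ranges.
SIL_BANDS = [  # (SIL, lower bound inclusive, upper bound exclusive)
    (4, 1e-5, 1e-4),
    (3, 1e-4, 1e-3),
    (2, 1e-3, 1e-2),
    (1, 1e-2, 1e-1),
]

def sil_for_pfd(pfd):
    """Return the SIL whose PFDavg band contains pfd, or None if out of range."""
    for sil, lo, hi in SIL_BANDS:
        if lo <= pfd < hi:
            return sil
    return None

print(sil_for_pfd(5e-4))  # -> 3
```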

    9. FAA AC 25.1309-1A

    10. Safety Integrity Level (SIL)

    11. More Definitions. Error: a design flaw or deviation from a desired or intended state; may or may not lead to a hazard; a state. Fault: a higher-order safety-related event; all failures are faults, but not all faults are failures. (The 9/11 airplanes worked perfectly as they were flown into the twin towers.) Hazard: a state or condition of a system that leads to an accident. Accident: an event that results in at least a specified level of loss. Failure: the inability of the system to perform its intended function; a behavior. Reliability: the probability that the system will perform its intended function satisfactorily for a prescribed time under stipulated environmental conditions; the aggregate probability of failure of the system; determined bottom-up by failure modes and effects analysis; a numerical approach. Dependable (not usually used for safety analysis): the trustworthiness of a system which allows reliance to be justifiably placed on the service it delivers (IFIP definition).
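
    The numerical flavor of reliability is worth a worked example. Under the common constant-failure-rate (exponential) assumption, reliability over a mission of length t is R(t) = exp(-lambda * t), and independent items in series multiply. The failure rate below is a hypothetical value, not taken from the slides:

```python
import math

# Reliability under the standard constant-failure-rate (exponential) model:
# R(t) = exp(-lambda * t).
failure_rate = 1e-5   # failures per hour (hypothetical)
mission_time = 10.0   # hours

reliability = math.exp(-failure_rate * mission_time)
print(f"P(no failure in {mission_time} h) = {reliability:.6f}")  # ~0.9999

# For a series system of independent items, reliabilities multiply
# (equivalently, failure rates add):
system_reliability = reliability ** 3  # three identical items in series
print(f"series-of-three reliability = {system_reliability:.6f}")
```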

    12. The Concept of System Safety. System Safety uses systems theory and systems engineering to prevent foreseeable accidents and to minimize the impact of unforeseen events. The emphasis is on loss: life and injury, but also environmental, economic, and mission loss. Note that in HAZOP or FHA the emphasis is on a team of experts performing the core functions; in System Safety, it is more the practice of systems theory and general systems engineering performing an analysis using the underlying theories. The experienced domain expert may be relied on to provide the anticipated events, however.

    13. Tenets of System Safety. Build it in, rather than build it on or test it out. Consider the whole system, not just its elements. Take a large scope on hazards: not all failures lead to a hazard, and not all hazards are caused by failures. Emphasize analysis rather than experience: anticipate how hazards can occur. Emphasize the qualitative rather than the quantitative.

    14. Validation and Verification. Validation answers the question: have we got the requirements right? Major errors in system development are made because we don't get the requirements or the domain description right. Issues include completeness and consistency, conformance to standards, conflicting requirements, errors, ambiguity, and whether it can be built. Verification answers the question: have we implemented a system that satisfies the requirements? The verification effort can take as much as 70% to 80% of the development effort. Note that it is in the area of getting the requirements right (including the domain description) where many costly mistakes are made. "Right" means clear, precise, consistent, unambiguous, complete, and able to be built and maintained.

    15. Modern Examples of High Integrity Systems. Aviation: airplane fly-by-wire systems; air traffic control systems; terminal navigation systems; TCAS (Traffic Alert and Collision Avoidance System). Automotive: automatic car-following cruise control; automatic braking systems; passenger protection systems (new Lexus). Other transportation modes: rail and transit; waterway. Industrial: nuclear power plant control systems; hydro-electric or fossil power plant control systems; water distribution systems; chemical processing facilities; waste processing facilities. Medical systems. Banking transaction systems.

    16. Well Known Examples of Failure to Keep Safe. Nuclear: Three Mile Island; Chernobyl. Aviation: domestic; international. Rail: the recent Los Angeles commuter rail accident; subway systems; railway accidents. Highway systems: bridges (Tacoma Narrows). Industrial: the Bhopal, India chemical plant accident. Questions for the software engineer: Did software contribute to the accident? Could better use of systems and software help to avoid the accident? These are the two ways in which software is involved in safety systems engineering.

    17. North America's Worst Aviation Accidents

    18. The Worst International Air Accidents

    19. AA 191: Takeoff from O'Hare. Under normal circumstances an aircraft losing an engine would be able to fly on the remaining power plants still functioning. When the engine separated, it took a 3-foot section of the wing with it, ripping out vital hydraulic and electric lines. The starboard slats stayed extended, but the port slats retracted because of the leaking fluid, causing a stall. The crew was unaware of the retraction because the No. 1 generator powered the Captain's instrument panel, and thus the slat disagreement system; the stick-shaker had also been disabled. There was a 10-inch fracture on the rear bulkhead of the pylon. Eight weeks before the accident, the aircraft went through a major check and the self-aligning bearings on the bulkhead-to-wing attachment joints were changed. Normal procedures would involve removing the engine and pylon from the wing separately, using a special cradle to lower the engine, but to save time a new idea was adopted: using a forklift truck to take the whole assembly off as one unit. A combination of issues: design, maintenance, operations, and a lack of operator, manufacturer, and FAA coordination and communication. As is usually the case, a number of cascading events resulted in disaster.

    20. The last thirty seconds at Tenerife. A lot of contributing factors: an overly busy airport due to diversions, a bomb in the terminal, fog, and language confusion. The worst air disaster in history: 583 deaths.

    21. Causes of Air Accidents. Airplane design; ATC and navigation; cargo; collisions; external factors; flight crew; fire; landing and takeoff; maintenance; result (CFIT, emergency, etc.); security; weather; unknown. Usually, a combination of factors.

    22. Some Observations. Accidents in which large numbers of people die are the ones that get our attention (the DREAD factor). But there are typically hundreds of accidents in air travel each year in which no deaths, or only a few, occur; many of these could have ended with a different result. A lot of accidents occur over water and out of contact with land. Most are attributable to mechanical failure; human error is the root cause in most. New-generation aircraft use more computers and offer more opportunity for software failure, yet they are safer. Systems and software can impact safety in a number of ways: design, support to maintenance and operations, etc.

    23. Keeping things in perspective. Commercial air transport is, by far, safer per minute than driving to the local market. The only safer mode of transportation is the elevator! Not the bicycle; not even walking. It keeps getting safer by the year, in spite of increased air travel. Third-generation aircraft (777, 737NG, A330, etc.) are three times safer than earlier generations, due to aircraft design and to air operations with an emphasis on safety. But we have the dread factor: we aren't in control; a lot of people go at once; the long wait to die; fear of the unknown.

    24. Safety is improving with time. The number of years to (expected) death at one cross-country round trip per week is 33,000. From 1984 to 2003 the mileage flown doubled (6B miles in 2003), a challenge to keep up technology insertion. How to view the trend: as a straight line, or, throwing out the gross outliers, as an increase from 1984 to the early 90s followed by a decrease. New technology insertion in ATC, terminal operations, airplane design?
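
    The headline figure is simple arithmetic: with one round trip (two flights) per week and a per-flight fatality risk on the order of one in a few million, the expected wait for a fatal flight runs to tens of thousands of years. A sketch; the per-flight risk is inferred to make the slide's figure come out, not a cited statistic:

```python
# Back-of-envelope behind the "33,000 years" figure.
flights_per_year = 2 * 52          # one round trip per week
per_flight_risk = 1 / 3.4e6        # assumed: ~1 fatal outcome per 3.4M flights

years_to_expected_death = 1 / (flights_per_year * per_flight_risk)
print(f"{years_to_expected_death:,.0f} years")  # ~33,000
```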

    25. Causes of Failure in Safety

    26. Example of Failure in Domain Description: Tacoma Narrows

    27. Bhopal, India Disaster. 1984, the worst industrial disaster in the world: 15,000 deaths attributed, and hundreds of thousands affected long after. Caused by the introduction of water into MIC holding tanks. The resulting reaction generated many large surges of toxic gas, forcing the emergency release of pressure. The gas escaped while the chemical 'scrubbers' which should have treated it were off-line for repairs. It is claimed that several other safety procedures were bypassed. [Wikipedia] A modern computer-based control system would have built-in safety monitoring features and controls and could have prevented the accident conditions from occurring. Human error was the primary factor, but there were other factors: the switch from US to Indian operators; experienced staff quitting because of poor conditions; the plant being run by an electrical rather than a chemical engineer; etc. A well-designed computer management system could have avoided this accident, even with the old physical plant.

    28. Three Mile Island. The Three Mile Island Unit 2 (TMI-2) nuclear power plant near Middletown, Pennsylvania, on March 28, 1979: the most serious nuclear power plant accident in US history. It brought about major changes in: emergency response planning; the Emergency Response Facility Data System (ERFDS); plant operations training; human factors engineering; government regulatory oversight; and the deployment of nuclear power in the USA. "The sequence of certain events -- equipment malfunctions, design related problems and worker errors -- led to a partial meltdown of the TMI-2 reactor core but only very small off-site releases of radioactivity." (NRC)

    29. TMI-2 Plant Diagram

    30. Sequence of Events at TMI-2. 4 AM, March 28, 1979: the main feedwater pump stops working. The turbine shuts down; the reactor shuts down. Pressure builds up in the primary section, leading to the pilot relief valve opening at the top of the pressurizer. It should have closed after the pressure relief, but it didn't, and indicators failed to show operators that the valve was still open. Excessive loss of water in the system followed; uninformed operators reduced the flow of water to the core, resulting in a core meltdown. Aftereffects: release of radiation from the secondary building to relieve pressure on the core; hydrogen bubble buildup in the containment facility; fortunately, no rupture of the containment building. In 1993, 14 years later, the cleanup was completed; the site is monitored. A cascading of errors led to the hazard and the accident. Among the aftereffects were: (1) the imposition of the ERFDS, an independent monitoring and response system; and (2) more attention to hazards assessment and the use of Probabilistic Risk Assessment (PRA), like that being used by NASA for manned space flight.

    31. Modern Rail Accidents. Amagasaki, Japan: 107 died, 562 injured, April 25, 2005. The worst rail accident in Japan in 40 years.

    32. Ways of Assessing Risk System Safety Assessment (SSA) Failure Modes, Effects and Criticality Analysis (FMEA/FMECA) Probabilistic Risk Assessment (PRA)

    33. Probabilistic Risk Assessment. PRA as an analytical tool includes consideration of the following: identification and delineation of the combinations of events that, if they occur, could lead to an accident (or other undesired event); estimation of the chance of occurrence for each combination; and estimation of the consequences associated with each combination. In the nuclear industry it focuses on damage to the reactor core and containment facility, and is applied to the total fuel cycle. Four questions answered by PRA (NASA study results): 1. What can go wrong; i.e., what are the initiators or initiating events (undesirable starting events) that lead to adverse consequences? 2. What and how severe are the potential adverse consequences that the technological entity, the extended environment, and the crew may eventually be subjected to as a result of the occurrence of the initiator? 3. How likely to occur are these undesirable consequences; i.e., what are their probabilities or frequencies? 4. How confident are we about our answers to the above questions?
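
    The first three questions are often formalized as a set of triplets (scenario, likelihood, consequence), with total risk as the likelihood-weighted sum. A minimal sketch; all scenarios, probabilities, and consequence values are hypothetical:

```python
# PRA "set of triplets": each scenario pairs a likelihood with a consequence;
# total risk is the likelihood-weighted sum over all delineated scenarios.
scenarios = [
    # (initiating-event combination, probability per year, consequence measure)
    ("loss of feedwater + relief valve sticks open", 1e-5, 1000.0),
    ("loss of feedwater + valve recloses normally",  1e-3, 1.0),
]

total_risk = sum(p * c for _, p, c in scenarios)
for name, p, c in scenarios:
    print(f"{name}: contribution = {p * c:.3g}")
print(f"total expected consequence per year = {total_risk:.3g}")
```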

    34. Issues for the Systems and Software Professional. Ethical issues: our obligations to exhibit our concern for safety when engineering a system; our personal roles in raising safety-related issues during project reviews; our obligations when we believe a system presents an unacceptable level of risk; the basis of our perceptions; our role. Professional issues: our professional obligations; our professional society obligations; our legal obligations. The role of independent assessment of system safety: it removes conflicts in the development organization; independent assessment must be funded, staffed, and empowered; internal and external independence; in the FAA, the role of the DER (Designated Engineering Representative); for civil aviation, the FAA conducts independent, in-process audits.

    35. Specifying High Integrity Systems. Domain description: a system perspective; a software perspective. Requirements: understanding stakeholder needs; who the stakeholders are; getting stakeholder requirements; bridging the gap between needs and the specification.

    36. Domain Description. Specifications should describe the domain explicitly; they should distinguish domain properties that are independent of the system from those that the system is required to enforce. An ordinary domain description is in the indicative mood: it asserts certain truths about the domain. A requirement, on the other hand, while describing the domain, is in the optative mood: it describes the desired state of affairs that the machine produces.

    37. Software Requirements and Specifications. To develop software is to build a machine, simply by describing it; i.e., software development is engineering. The application domain is the parts of the world that will affect the machine and will be affected by it. The problem is in the application domain; the machine is the solution. The application domain must be explicitly and precisely described.

    38. The Domain is separate from the Machine

    39. Simple Example of Domain Error: application of thrust reversers upon landing. Requirement: REVERSE_ENABLED if and only if MOVING_ON_RUNWAY. (Wrong) software specification: REVERSE_ENABLED if and only if WHEEL_PULSES_ON. Domain property: WHEEL_PULSES_ON if and only if WHEELS_TURNING. NOT a domain property: WHEEL_PULSES_ON if and only if MOVING_ON_RUNWAY. The problem was one of domain error, i.e., water on the runway and aquaplaning. Other examples: door locking mechanisms during a crash with engines running; automatic braking systems on automobiles.
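
    The flaw is easy to make concrete: the software satisfies its specification exactly, but the specification leans on a domain property that fails when the aircraft aquaplanes. A minimal sketch; the function and signal names are illustrative, not from any real avionics system:

```python
# The software meets its spec; the spec is wrong, because WHEEL_PULSES_ON does
# not imply MOVING_ON_RUNWAY when the aircraft aquaplanes. Names are illustrative.

def reverse_enabled_spec(wheel_pulses_on: bool) -> bool:
    """Implemented specification: REVERSE_ENABLED iff WHEEL_PULSES_ON."""
    return wheel_pulses_on

def reverse_enabled_required(moving_on_runway: bool) -> bool:
    """Actual requirement: REVERSE_ENABLED iff MOVING_ON_RUNWAY."""
    return moving_on_runway

# Aquaplaning: the aircraft is moving on the runway, but the wheels are not
# turning, so no wheel pulses are generated.
moving_on_runway, wheel_pulses_on = True, False
assert reverse_enabled_required(moving_on_runway)   # reversers are needed...
assert not reverse_enabled_spec(wheel_pulses_on)    # ...but the spec denies them
```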

    40. Safety Stakeholders. Need to separate each stakeholder's safety requirements and level of acceptable risk. Understand safety requirements in relationship to other system needs; for example, a weapon system such as a helicopter has acceptable levels of risk a lot higher than a commercial airliner's. Resolve safety requirements among stakeholders early on. Reach agreement among stakeholders, and between stakeholders and system implementers.

    41. The Satisfaction Argument

    42. General Methods for Safety Analysis. Safety Case; HAZOP; FMECA; Fault Tree Analysis; numerical methods; industrial standards of the IEEE, IEC, ANSI, SAE, and others; government standards; Probabilistic Risk Assessment (PRA).

    43. Industry Specific Methods. Commercial aviation: SAE ARP 4754; SAE ARP 4761; RTCA DO-178B and DO-254; FAA Software Mega Order. Military: MIL-STD-882; UK MOD DEF STAN 00-55 and 00-56. NASA and space. Nuclear. European safety standard IEC 61508. Medical.

    44. Safety Case. UK MOD: "A structured argument, supported by a body of evidence, that provides a compelling, comprehensible and valid case that a system is safe for a given application in a given operating environment." UK rail: the safety case documents must set out how the rail operators will manage and control the health and safety of staff and the public, and their contingency plans for dealing with emergencies and other abnormal situations. This includes: safety policy and objectives; a risk assessment; safety management systems; risk control measures. Industrial source: a safety case is a comprehensive, written justification that a system or operation will be safe throughout its lifecycle, from inception to eventual decommissioning. The safety case is the integration of arguments and evidence that describe, quantify, and substantiate the safety, and the level of confidence in the safety, of a facility or activity.

    45. Goal Structured Notation Components

    46. Using the GSN Notation

    47. An Example of GSN
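
    The GSN diagrams from these three slides are not reproduced in the transcript. As a stand-in, here is a minimal goal structure in textual form (the claims and evidence are invented for illustration): goals are decomposed through a strategy into sub-goals, each ultimately supported by a solution, i.e., an item of evidence:

```python
# A toy GSN fragment as nested data. G = goal (claim), S = strategy,
# Sn = solution (evidence). All content is invented for illustration.
gsn = {
    "G1: The system is acceptably safe to operate": {
        "S1: Argue over each identified hazard": {
            "G2: Hazard H1 is adequately mitigated": {
                "Sn1: Fault tree analysis report": {},
            },
            "G3: Hazard H2 is adequately mitigated": {
                "Sn2: Interlock test results": {},
            },
        },
    },
}

def show(node: dict, depth: int = 0) -> None:
    """Print the goal structure as an indented tree."""
    for label, children in node.items():
        print("  " * depth + label)
        show(children, depth + 1)

show(gsn)
```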

    48. UK Ministry of Defence and Risk Management. Hazard identification; hazard analysis; risk estimation; risk and ALARP evaluation; risk reduction; risk acceptance.

    49. HAZOP. The Hazard and Operability Study, known as HAZOP, is a standard hazard analysis technique used in the preliminary safety assessment of new systems or of modifications to existing ones. The HAZOP study is a detailed examination, by a group of specialists, of components within a system to determine what would happen if a component were to operate outside its normal design mode. Each component has one or more parameters associated with its operation, such as pressure, flow rate, or electrical power. The HAZOP study looks at each parameter in turn and uses guide words to list the possible off-normal behaviors, such as 'more', 'less', 'high', 'low', or 'no'. The effects of such behavior are then assessed and noted down on study forms. The categories of information entered on these forms can vary from industry to industry and from company to company.
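
    The mechanics reduce to a systematic cross-product of parameters and guide words, with the team assessing each resulting deviation. A skeleton; the component and parameter names are hypothetical:

```python
# HAZOP worksheet skeleton: apply each guide word to each parameter of a
# component; the study team then records causes, consequences, and safeguards
# for every deviation. Component and parameters are hypothetical.
guide_words = ["no", "more", "less", "high", "low"]
component = "coolant pump"
parameters = ["flow rate", "pressure"]

for parameter in parameters:
    for guide_word in guide_words:
        deviation = f"{guide_word.upper()} {parameter}"
        print(f"{component}: {deviation}  -> assess causes/consequences")
```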

    50. SAE ARP 4761 Overview. Safety assessment process: Functional Hazard Assessment (FHA); Preliminary System Safety Assessment (PSSA); System Safety Assessment (SSA). Safety assessment analysis methods: Fault Tree Analysis / Dependency Diagrams / Markov Analysis; Failure Modes and Effects Analysis (FMEA); Failure Modes and Effects Summary; Common Cause Analysis (CCA), comprising Zonal Safety Analysis (ZSA), Particular Risk Analysis, and Common Mode Analysis.

    51. FHA. Conducted at the beginning of the development cycle. Identifies and classifies failure conditions associated with aircraft functions and combinations of functions. Classification ranges from Minor (Level 1, or D) through Major and Severe to Catastrophic (Level 4, or A). These lead to the establishment of safety objectives.
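
    In commercial aviation these classes are conventionally paired with DO-178B software levels and AC 25.1309-style probability objectives per flight hour; exact wording and thresholds vary between standards and editions, so the mapping below is indicative rather than normative:

```python
# Common pairing of FHA failure-condition classes with DO-178B software levels
# and AC 25.1309-style quantitative objectives (per flight hour). Indicative only.
FHA_CLASSES = {
    "Catastrophic":     ("Level A", 1e-9),
    "Hazardous/Severe": ("Level B", 1e-7),
    "Major":            ("Level C", 1e-5),
    "Minor":            ("Level D", None),  # typically no quantitative target
}

for condition, (sw_level, objective) in FHA_CLASSES.items():
    target = f"<= {objective:g} per flight hour" if objective else "qualitative"
    print(f"{condition}: software {sw_level}, objective {target}")
```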

    52. PSSA. Preliminary System Safety Assessment: a systematic examination of the proposed system architecture to determine how failures can cause the functional hazards identified in the FHA. The objective is to establish the safety requirements of the system and verify that the proposed architecture can reasonably be expected to meet the objectives identified in the FHA. It usually takes the form of a Fault Tree Analysis (FTA) and includes Common Cause Analysis (Dependency Diagrams or Markov Analysis may also be used). It includes hardware and software failures.
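
    The fault-tree arithmetic used in a PSSA is compact: AND gates multiply the probabilities of independent basic events, and OR gates are well approximated by the sum when events are rare. A sketch with hypothetical event probabilities:

```python
# Minimal fault-tree evaluation. Basic-event probabilities are hypothetical.
p_primary_sensor_fails = 1e-4
p_backup_sensor_fails = 1e-4
p_software_fault = 1e-5

# Top event: loss of function = (both sensors fail) OR (software fault).
p_both_sensors = p_primary_sensor_fails * p_backup_sensor_fails  # AND gate
p_top = p_both_sensors + p_software_fault  # OR gate, rare-event approximation
print(f"P(top event) ~ {p_top:.3g} per demand")  # ~1e-5
```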

    53. SSA. A systematic, comprehensive evaluation of the implemented system to show that the safety objectives of the FHA and the derived safety requirements of the PSSA are met. Usually based on the FTA of the PSSA (Dependency Diagrams or Markov Analysis may be used). Uses the quantitative results of the FMES.

    54. FMEA. A systematic, bottom-up method of identifying the failure modes of a system, item, or function and determining the effects on the next higher level. Software can be analyzed qualitatively as part of an FMEA. Typically used to analyze failure effects from single-point failures.
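
    A worksheet row pairs each failure mode with its next-higher-level effect; the criticality extension (FMECA) commonly scores severity, occurrence, and detection on 1-10 scales and ranks by their product, the Risk Priority Number. The items, modes, and scores below are hypothetical:

```python
# FMEA rows with the common FMECA criticality extension:
# RPN = severity x occurrence x detection, each scored 1-10. Data is hypothetical.
rows = [
    # (item, failure mode, effect at next higher level, sev, occ, det)
    ("wheel-speed sensor", "no output", "braking logic loses ground-speed input", 8, 3, 4),
    ("relief valve", "fails open", "uncontrolled loss of primary coolant", 9, 2, 6),
]

# Rank failure modes by RPN, highest first.
for item, mode, effect, sev, occ, det in sorted(
        rows, key=lambda r: r[3] * r[4] * r[5], reverse=True):
    rpn = sev * occ * det
    print(f"{item} / {mode}: {effect} (RPN = {rpn})")
```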

    55. CCA. Common Cause Analysis includes: Zonal Safety Analysis (installation, interference, maintenance); Particular Risk Analysis (fire, high-energy devices, leaking fluids, hail, ice, snow, etc.); and Common Mode Analysis (hardware error, software error in multiple identical copies, etc.).

    56. Commercial Tools Available. A large and growing number of software-based tools are available to assist in safety analysis: Isograph Reliability Workbench; SAIC CAFTA; Item Software (particularly for PRA); etc. Government tools are available from NASA and the NRC. The key ingredients of safety analysis are domain expertise, practical experience, and knowledge of the processes and tools.
