Say Goodbye to Post Mortems, Say Hello to Effective Problem Management • Charles T. Foy • Siemens Medical Solutions USA, Inc. • Health Services Division • charles.foy@siemens.com
Company: Siemens, AG • Our division: healthcare software • Our department: application hosting • Mainframe, mid-range, open systems, distributed systems • All operating systems (except Tandem) • My role
Caution! This company founded by former employees of International Business Machines (IBM) Proclivity for acronyms is part of the culture. Proclivity: “a natural or habitual inclination or tendency; propensity; predisposition” You have been warned… Acronym Alert!
Agenda • What drove creation of a Problem Management System? • First steps • Give it a name? • Got Lucky! • Build versus Buy • It’s a Defect! • What to track? Classifications? • Database Structure • The Process • Trending • Benefits
What drove creation of a Problem Management System? • Disparate, inconsistent ‘post-mortems’ • Usually driven by customer demand for an explanation • Needed a defined process • Consistent across the company • Communicates to the customer – internal and external
First Steps: Launch • Assigned to a small group • Two service delivery managers • One consultant (employee #26) • Quality Assurance and Process Definition expert • No detailed marching orders other than “standard post-mortem process”
First Steps: Started with… [Diagram labels: Standardized Text Document; Standardized Text Document with Root Cause and Follow-up; Database with Root Cause Field, Follow-up Field, and Document]
First Steps: Defined our own goal. Redefined project outcomes: • reduces unscheduled outages • increases availability • communicates the root cause and preventive measures implemented to internal and external audiences Has to: • Drive to the root cause • In a searchable manner, track: • outage details • root causes, corrective actions • customer communications • preventive measures implementation status • etc.
First Steps: Give it a Name? Needed a new name • no longer a “Post Mortem” process • “Post Mortem” didn’t sit well • Before fully ITIL-aware How about a working title for our project? • Perhaps the Post Event Analysis Process, a.k.a. PEAP? • Can always change it later on Acronym Alert! Never Happened!
And if the Post Event Analysis Process produces a Report, it of course would be called….
First Steps: Post Mortem Report new name: The Post Event Analysis Report, or PEAR. Acronym Alert!
Define the database and process Database needs: • Description, short term resolution, root cause • Customers impacted, length of outage • Corrective actions implemented & their status • Etc. Process: • Capture the root cause • Ensure the corrective action was implemented • Communicate all the above Seemed straightforward, linear, one to one…
Next Steps – define the database requirements: We Got Lucky! Ran into a friend… • Provided us with an excellent service outage to use as our model • Decided to use it as proof of concept Slowdown affecting almost all his applications. Response time dropped to zero within 5 minutes… Started looking for commonalities – network was suspect • A Configuration Management Database (CMDB) would have helped! Started looking like it was the Storage Area Network (SAN) Acronym Alert! Problem cleared up 45 minutes into the event
The Outage Incident • Look up - Jake, SAN Technician • Fixes the problem! • Not! • Battery Swap! • 45 minutes ago, looks good! • Here’s what happened…
Root cause: Battery was going to go bad and was swapped out. • So Hardware is the root cause But wait…is it really a Hardware issue? • Battery didn’t actually die… it was Jake, SAN Technician! Human Error! But wait…is it really a “Human Error” issue? • Jake was doing his job OK, a… “Rules” issue – “always swap batteries off peak”
Root cause? Aren’t these ‘contributing’ root causes? • They didn’t know the battery was alerting • SAN vendor knew • SAN technician walked in and worked without their knowledge • SAN technician education • Data center employees education • No battery swap rule/process
Root cause? What would we put as our root cause? Do we need to track all these ‘root’ causes? Do we need to track the corrective actions for each? Don’t most outages have multiple root causes?
Conclusion: MULTIPLE root causes Multiple root causes, multiple follow-ups. This would be complex.
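The conclusion above implies a one-to-many shape: one outage, several root causes, and one or more follow-ups per root cause. A minimal Python sketch of that shape follows. The class and field names (`Outage`, `RootCause`, `FollowUp`, `done`) are assumptions for illustration, not the structure of the original system, and the follow-up actions are hypothetical examples built on the battery-swap outage.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FollowUp:
    action: str
    done: bool = False  # hypothetical status flag


@dataclass
class RootCause:
    description: str
    follow_ups: list[FollowUp] = field(default_factory=list)


@dataclass
class Outage:
    summary: str
    root_causes: list[RootCause] = field(default_factory=list)

    def open_follow_ups(self) -> list[FollowUp]:
        """Collect unfinished follow-ups across all root causes."""
        return [f for rc in self.root_causes for f in rc.follow_ups if not f.done]


# Illustrative data only; the follow-up wording is invented for the sketch.
outage = Outage(
    summary="SAN slowdown during battery swap",
    root_causes=[
        RootCause(
            "No battery swap rule/process",
            [FollowUp("Define an off-peak battery swap rule")],
        ),
        RootCause(
            "Data center staff did not know the battery was alerting",
            [FollowUp("Route SAN vendor alerts to the data center", done=True)],
        ),
    ],
)

for follow_up in outage.open_follow_ups():
    print(follow_up.action)
```

Keeping follow-ups under each root cause, rather than under the outage, is what lets each corrective action be tracked against the cause it addresses.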
Build a database? Designed requirements, got a resource time estimate • Presented to upper management • Anything on the shelf? • Tools and Methodology Manager: • Hardware that breaks • Software that breaks • Humans that make errors… Essentially, you’re tracking defects!
Defect Tracking Company standard defect tracking application • Fully implemented and operational Subject Matter Expert (SME) • Does 90% of what you need • Easy to implement • What are your major defect categories? Acronym Alert!
Defect Tracking To build this, you need Classifications…. What are your major defect areas? How granular?
The Classifications Asked our peers • Specific type of hardware • Specific type of software • Human error
The Classifications How much detail? • Major category (hardware) • The thing that broke (server) • Thing that caused it to break (bad power supply) • Model that broke (Fleetwood XL340)
Human Error Does that work for Human Error? Example: Jeff mistyped a static route in a backup router. Primary router fails. Backup router kicks in but does not recover all the interfaces… • Major category (human error) • The thing that broke (typing) • Thing that caused it to break (not enough sleep) • Model that broke (Jeff)
Human Error? • Do we really want to say “human error”? • What does it mean to make a human error? • Failure To Follow A Process? …FTFAP Eureka! A five letter acronym! Acronym Alert!
Classifications Euphemism at first, then… The “Process” category was born! • Process Not Followed (a.k.a. Human Error) • Process Incomplete • Process Incorrect (covers the “need to change the Rule” root cause) • Documentation wrong
More items to track • Version and vendor of the software/hardware? • Name of the Human? • Impacted application(s)? Impacted customer(s)? • O/S level?, 3rd party software, something we wrote? • Was this tested before it was put into production? • Did it happen before? • What is the air-speed velocity of an unladen swallow?
Database Structure: Supports Multiple Levels of Classification Global Keyword: allows for over-all groupings • Hardware • Software • Process Keyword 1 answers “What broke?” • Answer: Server Keyword 2 answers “What thing within KW1 broke?” • Answer: Power Supply
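The three classification levels can be pictured as one small record per root cause, with the Global Keyword used for over-all grouping. The sketch below is an assumption about how that might look in Python; the `Classification` type and its field names are hypothetical, and the sample rows are illustrative rather than real outage data.

```python
from collections import Counter
from typing import NamedTuple


class Classification(NamedTuple):
    global_keyword: str  # over-all grouping: Hardware, Software, Process
    keyword1: str        # answers "What broke?"
    keyword2: str        # answers "What thing within KW1 broke?"


# Illustrative rows only.
root_causes = [
    Classification("Hardware", "Server", "Power Supply"),
    Classification("Hardware", "Router", "Memory"),
    Classification("Software", "Server", "DHCP"),
]

# Group by the Global Keyword to get the over-all picture.
by_global = Counter(c.global_keyword for c in root_causes)
print(by_global.most_common())
```

The same counting works at the Keyword 1 or Keyword 2 level by swapping the attribute used as the key.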
Keyword Grouping Samples: Hardware • Keyword 1: Server • Keyword 2: Cable, CPU, Hard Drive, HBA, Memory, MthrBoard, Pwr Supply • Keyword 1: Router • Keyword 2: Chassis, Memory, Nic Card, NPE, Pwr Supply
Keyword Grouping Samples: Software • Keyword 1: Server • Keyword 2: BIOS, Term Svcs, DHCP, Firewall, IIS, LDAP, Virus-Wm • Keyword 1: Application A • Keyword 2: Print Subsys, GSM, RSA, Service Pack, CICS, Configuration, Dayend Flow, MODS, PTF
Keyword Grouping Samples: Process (Keyword 1 / Keyword 2) • Process Incomplete • Process Incorrect • Process Not Followed • Documentation Incorrect
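Keyword groupings like these behave as a controlled vocabulary: a Keyword 2 value is only meaningful under the Keyword 1 it belongs to. A minimal sketch of that check follows, using only the hardware samples. The `CATALOG` layout and the `is_valid` helper are assumptions for illustration, not part of the defect tracking application described in the slides.

```python
# Hypothetical catalog: Global Keyword -> Keyword 1 -> allowed Keyword 2 values.
CATALOG = {
    "Hardware": {
        "Server": {"Cable", "CPU", "Hard Drive", "HBA", "Memory",
                   "MthrBoard", "Pwr Supply"},
        "Router": {"Chassis", "Memory", "Nic Card", "NPE", "Pwr Supply"},
    },
}


def is_valid(global_keyword: str, keyword1: str, keyword2: str) -> bool:
    """Return True only if the full three-level combination is in the catalog."""
    return keyword2 in CATALOG.get(global_keyword, {}).get(keyword1, set())


print(is_valid("Hardware", "Server", "HBA"))
print(is_valid("Hardware", "Router", "HBA"))
```

Rejecting unknown combinations at entry time keeps later searches and groupings from fragmenting across near-duplicate spellings.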
Database Structure All root causes and keywords
Database Structure All root causes and follow-ups
The Process: Who will own the process? Owner? • PEAP Owner role? (PO?) We need action in the title… • PEAP Driver (PD?) How about a PEAP Owner/Driver? A POD! Acronym Alert!
The Process: POD role • ID all root causes • Describe Preventive Action
The Process: Document and Communicate • Document all in the database • Communicate: • Internally • Externally • Drive the process to completion
Surprisingly, nobody wants to be a POD! Actually a good thing… If your area contributed or caused an outage, you get to be POD. Incentive not to have outages
The Process - details to work out • How to define an outage? • When is the outage over? • Who is best to drive this process? • How does the process get initiated?
The Process - details Existing Outage Management Process • Existing outage definition • Knowledge of incident • Communicates incident status to customers Eureka! • Outage Manager can launch PEAP • Assign POD = manager of group that fixed the outage
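The launch rule above (the Outage Manager launches the PEAP, and the POD is the manager of the group that fixed the outage) can be sketched as a simple lookup. Everything named here is hypothetical: the `GROUP_MANAGERS` table, the `Peap` record, the `launch_peap` function, and the sample identifiers are assumptions for illustration only.

```python
from dataclasses import dataclass

# Hypothetical lookup: fixing group -> manager of that group.
GROUP_MANAGERS = {
    "Storage": "storage.manager",
    "Network": "network.manager",
}


@dataclass
class Peap:
    outage_id: str
    pod: str  # PEAP Owner/Driver


def launch_peap(outage_id: str, fixing_group: str) -> Peap:
    """Launch a PEAP and assign the POD from the group that fixed the outage."""
    try:
        pod = GROUP_MANAGERS[fixing_group]
    except KeyError:
        raise ValueError(f"No manager on record for group {fixing_group!r}") from None
    return Peap(outage_id=outage_id, pod=pod)


peap = launch_peap("OUT-0001", "Storage")
print(peap.pod)
```

Failing loudly on an unknown group, rather than launching a PEAP with no POD, matches the intent that every PEAP has someone driving it to completion.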
The Process ITIL Terminology: • Incident: Any event that is not part of the standard operation of a service and that causes, or may cause, an interruption to, or a reduction in, the quality of that service. • Problem: unknown underlying cause of one or more incidents. -from ITIL Foundations by ITpreneurs B.V. 2006 At the end of the Incident Management process, the item is moved to the Problem Management Process