
Say Goodbye to Post Mortems, Say Hello to Effective Problem Management


Presentation Transcript


  1. Say Goodbye to Post Mortems, Say Hello to Effective Problem Management • Charles T. Foy • Siemens Medical Solutions USA, Inc. • Health Services Division • charles.foy@siemens.com

  2. Company: Siemens, AG • Our division: healthcare software • Our department: application hosting • Mainframe, mid-range, open systems, distributed systems • All operating systems (except Tandem) • My role

  3. Caution! This company was founded by former employees of International Business Machines (IBM). A proclivity for acronyms is part of the culture. Proclivity: “a natural or habitual inclination or tendency; propensity; predisposition.” You have been warned… Acronym Alert!

  4. Agenda • What drove creation of a Problem Management System? • First steps • Give it a name? • Got Lucky! • Build versus Buy • It’s a Defect! • What to track? Classifications? • Database Structure • The Process • Trending • Benefits

  5. First Steps

  6. What drove creation of a Problem Management System? • Disparate, inconsistent ‘post-mortems’ • Usually driven by customer demand for an explanation • Needed a defined process • Consistent across the company • Communicates to the customer – internal and external

  7. First Steps – Launch: • Assigned to a small group • Two service delivery managers • One consultant (employee #26) • Quality Assurance and Process Definition expert • No detailed marching orders other than “standard post-mortem process”

  8. First Steps – Started with… a standardized text document. [Diagram: a standardized text document with Root Cause and Follow-up sections, and a database with Root Cause and Follow-up fields.]

  9. First Steps – Defined our own goal. Redefined project outcomes: • reduces unscheduled outages • increases availability • communicates the root cause and preventive measures implemented to internal and external audiences. Has to: • drive to the root cause • in a searchable manner, track: outage details, root causes, corrective actions, customer communications, preventive measures implementation status, etc.

  10. First Steps – Give it a Name? Needed a new name • no longer a “Post Mortem” process • “Post Mortem” didn’t sit well • before we were fully ITIL-aware. How about a working title for our project? • Perhaps the Post Event Analysis Process, a.k.a. PEAP? • We can always change it later on (never happened!) Acronym Alert!

  11. Thus, PEAP was born!

  12. And if the Post Event Analysis Process produces a Report, it of course would be called….

  13. First Steps – Post Mortem Report, new name: the Post Event Analysis Report, or PEAR. Acronym Alert!

  14. Define the database and process Database needs: • Description, short-term resolution, root cause • Customers impacted, length of outage • Corrective actions implemented & their status • Etc. Process: • Capture the root cause • Ensure the corrective action was implemented • Communicate all the above Seemed straightforward, linear, one-to-one…
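
As a rough sketch of that initial, linear design (these field names are illustrative only, not the actual Siemens schema), a single record might have looked something like this:

```python
from dataclasses import dataclass, field
from datetime import datetime

# A minimal sketch of the initial, one-to-one record design.
# All field names are illustrative, not the real database schema.
@dataclass
class OutageRecord:
    description: str                        # what happened
    short_term_resolution: str              # how service was restored
    root_cause: str                         # a single root cause (the original assumption)
    customers_impacted: list[str] = field(default_factory=list)
    outage_start: datetime | None = None
    outage_end: datetime | None = None
    corrective_action: str = ""             # preventive measure to implement
    corrective_action_status: str = "open"  # open / in progress / implemented

    @property
    def outage_minutes(self) -> float:
        """Length of the outage, if start and end times are known."""
        if self.outage_start and self.outage_end:
            return (self.outage_end - self.outage_start).total_seconds() / 60
        return 0.0
```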

  15. We Got Lucky!

  16. Next Steps – Define the database requirements. We Got Lucky! Ran into a friend… • Provided us with an excellent service outage to use as our model • Decided to use it as a proof of concept. A slowdown was affecting almost all his applications; response time dropped to zero within 5 minutes… Started looking for commonalities – the network was suspect • A Configuration Management Database (CMDB) would have helped! It started looking like it was the Storage Area Network (SAN). Acronym Alert! The problem cleared up 45 minutes into the event.

  17. The Outage Incident • Look up - Jake San Technician • Fixes the problem! • Not! • Battery Swap! • 45 minutes ago, looks good! • Here’s what happened…

  18. Root cause: the battery was going bad and was swapped out. • So Hardware is the root cause. But wait… is it really a Hardware issue? • The battery didn’t actually die… it was Jake San Technician! Human Error! But wait… is it really a “Human Error” issue? • Jake was doing his job. OK, so it’s a… “Rules” issue – “always swap batteries off peak”

  19. Root cause? Aren’t these ‘contributing’ root causes? • They didn’t know the battery was alerting • SAN vendor knew • SAN technician walked in and worked without their knowledge • SAN technician education • Data center employees education • No battery swap rule/process

  20. Root cause? What would we put as our root cause? Do we need to track all these ‘root’ causes? Do we need to track the corrective actions for each? Don’t most outages have multiple root causes?

  21. Conclusion: MULTIPLE root causes Multiple root causes, multiple follow-ups. This would be complex.
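
A sketch of how that one-to-one record could be reshaped once an outage is allowed to have several root causes, each with its own follow-ups (again, names are illustrative only):

```python
from dataclasses import dataclass, field

# Sketch only: one outage, many root causes, each with its own follow-ups.
# Class and field names are illustrative, not the actual defect-tracking schema.
@dataclass
class FollowUp:
    description: str      # e.g. "define an off-peak battery-swap rule"
    owner: str            # who must implement it
    status: str = "open"  # open / in progress / implemented

@dataclass
class RootCause:
    summary: str                                              # e.g. "no battery-swap rule/process"
    follow_ups: list[FollowUp] = field(default_factory=list)

@dataclass
class OutageRecord:
    description: str
    root_causes: list[RootCause] = field(default_factory=list)  # was a single root_cause field

    def open_follow_ups(self) -> list[FollowUp]:
        """Every follow-up not yet implemented, across all root causes."""
        return [f for rc in self.root_causes for f in rc.follow_ups
                if f.status != "implemented"]
```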

  22. Build it? Buy it?

  23. Build a database? Designed requirements, got a resource time estimate • Presented to upper management • Anything on the shelf? • Tools and Methodology Manager: • Hardware that breaks • Software that breaks • Humans that make errors… Essentially, you’re tracking defects!

  24. Defect Tracking Company standard defect tracking application • Fully implemented and operational Subject Matter Expert (SME) • Does 90% of what you need • Easy to implement • What are your major defect categories? Acronym Alert!

  25. Defect Tracking To build this, you need Classifications…. What are your major defect areas? How granular?

  26. What to track?

  27. The Classifications Asked our peers • Specific type of hardware • Specific type of software • Human error

  28. The Classifications How much detail? • Major category (hardware) • The thing that broke (server) • Thing that caused it to break (bad power supply) • Model that broke (Fleetwood XL340)

  29. Human Error Does that work for Human Error? Example: Jeff mistyped a static route in a backup router. Primary router fails. Backup router kicks in but does not recover all the interfaces… • Major category (human error) • The thing that broke (typing) • Thing that caused it to break (not enough sleep) • Model that broke (Jeff)

  30. Human Error? • Do we really want to say “human error”? • What does it mean to make a human error? • Failure To Follow A Process? …FTFAP Eureka! A five letter acronym! Acronym Alert!

  31. Classifications Euphemism at first, then… The “Process” category was born! • Process Not Followed (a.k.a. Human Error) • Process Incomplete • Process Incorrect (covers the “need to change the Rule” root cause) • Documentation wrong

  32. More items to track • Version and vendor of the software/hardware? • Name of the Human? • Impacted application(s)? Impacted customer(s)? • O/S level?, 3rd party software, something we wrote? • Was this tested before it was put into production? • Did it happen before? • What is the air-speed velocity of an unladen swallow?

  33. Database Structure

  34. Database Structure – Supports Multiple Levels of Classification Global Keyword: allows for overall groupings • Hardware • Software • Process Keyword 1 answers “What broke?” • Answer: Server Keyword 2 answers “What thing within KW1 broke?” • Answer: Power Supply

  35. Keyword Grouping Samples – Hardware • Keyword 1: Server → Keyword 2: Cable, CPU, Hard Drive, HBA, Memory, MthrBoard, Pwr Supply • Keyword 1: Router → Keyword 2: Chassis, Memory, Nic Card, NPE, Pwr Supply

  36. Keyword Grouping Samples – Software • Keyword 1: Server → Keyword 2: BIOS, Term Svcs, DHCP, Firewall, IIS, LDAP, Virus-Wm • Keyword 1: Application A → Keyword 2: Print Subsys, GSM, RSA, Service Pack, CICS, Configuration, Dayend Flow, MODS, PTF

  37. Keyword Grouping Samples – Process • Keyword 1: Process Incomplete, Process Incorrect, Process Not Follow, Documentation Incorrect
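
One way to picture the three-level hierarchy, seeded with a few of the sample groupings above (a sketch only, not the real database):

```python
# Sketch of the Global Keyword -> Keyword 1 -> Keyword 2 hierarchy,
# seeded with a subset of the sample groupings from the slides.
CLASSIFICATIONS: dict[str, dict[str, list[str]]] = {
    "Hardware": {
        "Server": ["Cable", "CPU", "Hard Drive", "HBA", "Memory", "MthrBoard", "Pwr Supply"],
        "Router": ["Chassis", "Memory", "Nic Card", "NPE", "Pwr Supply"],
    },
    "Software": {
        "Server": ["BIOS", "Term Svcs", "DHCP", "Firewall", "IIS", "LDAP", "Virus-Wm"],
        "Application A": ["Print Subsys", "GSM", "RSA", "Service Pack"],
    },
    "Process": {
        "Process Not Followed": [],    # a.k.a. Human Error
        "Process Incomplete": [],
        "Process Incorrect": [],       # covers the "need to change the Rule" root cause
        "Documentation Incorrect": [],
    },
}

def is_valid_classification(global_kw: str, kw1: str, kw2: str | None = None) -> bool:
    """Check that a Global Keyword / Keyword 1 / Keyword 2 triple exists in the hierarchy."""
    kw1_map = CLASSIFICATIONS.get(global_kw, {})
    if kw1 not in kw1_map:
        return False
    return kw2 is None or kw2 in kw1_map[kw1]
```

For example, is_valid_classification("Hardware", "Server", "Pwr Supply") would accept the power-supply failure from the earlier example, while the Process entries are recorded at the Keyword 1 level.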

  38. Database Structure

  39. Database Structure – All root causes and keywords

  40. Database Structure – All root causes and follow-ups

  41. Process Definition

  42. The Process – Who will own the process? Owner? • PEAP Owner role? (PO?) We need action in the title… • PEAP Driver (PD?) How about a PEAP Owner/Driver? A POD! Acronym Alert!

  43. The Process – POD role • ID all root causes • Describe preventive action

  44. The Process – Assign follow-ups…

  45. The Process – Document and Communicate • Document all in the database • Communicate: • Internally • Externally • Drive the process to completion

  46. Surprisingly, nobody wants to be a POD! Actually a good thing… If your area contributed to or caused an outage, you get to be the POD. An incentive not to have outages.

  47. The Process - details to work out • How to define an outage? • When is the outage over? • Who is best to drive this process? • How does the process get initiated?

  48. The Process - details Existing Outage Management Process • Existing outage definition • Knowledge of incident • Communicates incident status to customers Eureka! • Outage Manager can launch PEAP • Assign POD = manager of group that fixed the outage

  49. The Process ITIL Terminology: • Incident: Any event that is not part of the standard operation of a service and that causes, or may cause, an interruption to, or a reduction in, the quality of that service. • Problem: the unknown underlying cause of one or more incidents. – from ITIL Foundations, ITpreneurs B.V., 2006. At the end of the Incident Management process, the item is moved to the Problem Management process.
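
A sketch of that hand-off, with the Outage Manager closing the incident and launching a PEAP whose POD is the manager of the group that fixed the outage (all class, field, and function names here are hypothetical):

```python
from dataclasses import dataclass

# Sketch of the Incident Management -> Problem Management (PEAP) hand-off.
# All names are hypothetical; the real tooling is the company's defect tracker.
@dataclass
class Incident:
    summary: str
    fixing_group_manager: str   # manager of the group that resolved the outage
    status: str = "open"

@dataclass
class PeapRecord:
    incident_summary: str
    pod: str                    # PEAP Owner/Driver, drives root causes and follow-ups
    status: str = "open"

def close_incident_and_launch_peap(incident: Incident) -> PeapRecord:
    """At the end of the Incident Management process, open a Problem (PEAP)
    record and assign the POD to the manager of the group that fixed the outage."""
    incident.status = "closed"
    return PeapRecord(incident_summary=incident.summary,
                      pod=incident.fixing_group_manager)
```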
