On the road to Self-Healing Datacenters

On the road to Self-Healing Datacenters. Moises Goldszmidt, Microsoft Research, Dagstuhl Meeting. Two approaches, many challenges. Fingerprinting: performance. Autopilot: availability. A set of “sensors” to signal events. A reactive mapping from events to repair actions.

Presentation Transcript


  1. On the road to Self-Healing Datacenters. Moises Goldszmidt, Microsoft Research. Dagstuhl Meeting.

  2. Two approaches, many challenges. Fingerprinting: performance. Autopilot: availability. Set of “sensors” to signal events. A reactive mapping from events to repair actions. Optimize parameters. Identify *real* problem states. Identify optimal action. • Identify problem states • Identify optimal actions • Build a mapping from signals to actions. How?

  3. Fingerprinting: from raw data to 0-time diagnosis and repair

  4. Performance crisis diagnosis. Key Performance Indicator alerts! Crisis: 20% of the machines above threshold. What? Where? How? Manual search through hundreds of machines with hundreds of metrics each.

  5. Fingerprinting. IF positive ID THEN time to repair action = 0, ELSE reduce time to diagnosis and amortize $$. Fingerprint: look for a match in the fingerprints database.

  6. Remarks on the value of a fingerprint • Why do crises repeat often (enough)? • Fixing the root cause of problems is often not the optimal option – prediction and recovery as a feasible alternative • External dependencies • Software bugs will be fixed in next release (maybe?) • Federation • Hardware providers • Other services dependencies • Diagnosis (?) • Human interpretable summary of what is “abnormal” during a crisis • Creating a searchable database of problems • Representing state for self-healing…

  7. What’s in a fingerprint? (Figure: actual fingerprint.) • Summarization across machines: quantiles • Summarization across time: hot/cold thresholds - Time series, fixed thresholds… • Metric selection: which are the relevant metrics? - Model the crises using logistic regression with L1 regularization
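The summarization steps on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the quantile levels (25th, 50th, 95th), the metric names, the threshold values, and the helper names `nearest_rank_quantile`, `discretize`, and `build_fingerprint` are all assumptions made for the example. Each metric is summarized across machines by a few quantiles, and each quantile is then mapped to cold (-1), normal (0), or hot (+1) against per-quantile thresholds.

```python
def nearest_rank_quantile(values, prob):
    """Return the sample value closest to the requested quantile."""
    ordered = sorted(values)
    index = int(prob * (len(ordered) - 1) + 0.5)  # round to the nearest rank
    return ordered[index]


def discretize(value, cold, hot):
    """Map a quantile value to -1 (cold), 0 (normal), or +1 (hot)."""
    if value < cold:
        return -1
    if value > hot:
        return 1
    return 0


def build_fingerprint(snapshot, thresholds, probs=(0.25, 0.50, 0.95)):
    """Build one fingerprint vector from a per-metric list of machine readings.

    snapshot:   {metric_name: [one reading per machine]}
    thresholds: {(metric_name, prob): (cold, hot)}  (hypothetical values)
    """
    vector = []
    for metric in sorted(snapshot):  # fixed metric order keeps vectors comparable
        for prob in probs:
            q = nearest_rank_quantile(snapshot[metric], prob)
            cold, hot = thresholds[(metric, prob)]
            vector.append(discretize(q, cold, hot))
    return vector


# Hypothetical readings from five machines for two metrics.
snapshot = {
    "cpu_util": [10, 20, 30, 40, 100],
    "queue_len": [1, 1, 2, 2, 3],
}
thresholds = {
    ("cpu_util", 0.25): (5, 50),
    ("cpu_util", 0.50): (10, 60),
    ("cpu_util", 0.95): (20, 90),
    ("queue_len", 0.25): (2, 10),
    ("queue_len", 0.50): (1, 5),
    ("queue_len", 0.95): (1, 5),
}
fingerprint = build_fingerprint(snapshot, thresholds)
```

In practice the cold and hot thresholds would be derived from historical crisis-free data rather than hard-coded; that step is left out of the sketch.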

  8. Pattern matching. Use the L2 (Euclidean) distance to compare fingerprint vectors. (Figure: example distances 4.75 and 1.53.)
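As an illustration of the matching step, the sketch below computes the L2 distance between two fingerprint vectors and looks up the closest entry in a small database. The database contents, the `max_distance` cutoff, and the function names are hypothetical; the slides do not say how the identification threshold is chosen.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_match(query, database, max_distance):
    """Return (label, distance) for the closest stored fingerprint.

    The label is None when even the closest entry is farther away than
    max_distance, i.e. the crisis is treated as previously unseen.
    """
    label, distance = min(
        ((name, l2_distance(query, vec)) for name, vec in database.items()),
        key=lambda pair: pair[1],
    )
    if distance > max_distance:
        return None, distance
    return label, distance


# Hypothetical database of past crises.
database = {"crisis_A": [1, 0, -1], "crisis_B": [0, 0, 1]}
known = best_match([0, 0, 1], database, max_distance=0.5)
unknown = best_match([1, 1, 1], database, max_distance=0.5)
```

A positive identification (`known`) is the case where the time to a repair action can drop to zero; an unknown result falls back to manual diagnosis.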

  9. Results on 19 labeled crises. Discriminative power: fingerprints 0.994, all metrics 0.873, just KPIs 0.854, SOSP05 0.876. Operational setting: results on 20 random sequences.

  10. From matching the fingerprint to self-healing. We are effectively and accurately retrieving past incidents -- why not complete the cycle and get self-healing in? Record actions taken. Model fingerprint evolution. Optimization problem.

  11. Challenges • What are the appropriate models for • Crises evolution • Actions and their effects • How can we perform experiments without disrupting operations? • How do we evaluate the effectiveness of the approach?

  12. Autopilot: How good is my reboot? Should I call a human to repair or get a new machine?

  13. Recovery Oriented Computing • Computers are going to fail • By a simple statistical argument the frequency will increase in datacenters • In order to maintain availability efficiently, make sure that you implement cheap ways of recovery
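One plausible reading of the "simple statistical argument" is sketched below; the slide does not spell it out, so the independence assumption and the per-machine failure probability are assumptions of this example. If each of n machines fails independently with probability p in some time window, the chance of seeing at least one failure is 1 - (1 - p)^n, which approaches 1 as n grows.

```python
def prob_at_least_one_failure(p_single, n_machines):
    """P(at least one failure) among n independent machines."""
    return 1.0 - (1.0 - p_single) ** n_machines


def expected_failures(p_single, n_machines):
    """Expected number of failed machines in the same window."""
    return p_single * n_machines


# Hypothetical per-machine failure probability of 0.1% per day.
for n in (1, 100, 1000, 10000):
    print(n, prob_at_least_one_failure(0.001, n), expected_failures(0.001, n))
```

With these illustrative numbers, a failure on any given day is unlikely for a single machine but very likely for a fleet of thousands, which is why cheap recovery matters more than preventing individual failures.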

  14. Autopilot. (Figure: diagram components are “watchdogs” = active sensors, events, an automaton with states labeled H, P, F, D and RB, and actuators: reboot, reimage.)

  15. Challenges • What is the effectiveness of a reboot? • How can we use data to determine whether certain watchdogs indicate that certain actions are meaningless? • How can I incorporate optimal decision making into the supply chain?

  16. Apply survival analysis… P(T>t) after the third reboot. Is it worth rebooting a machine for a third time if P(T>24h) < 30%?
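To make the P(T>t) estimate concrete, here is a minimal sketch using the Kaplan-Meier estimator, a standard survival-analysis tool; the slides do not say which estimator was used, so this choice is an assumption. The data are invented: each entry is the number of hours a machine stayed up after its third reboot, and `observed` is False when the machine was still alive at the end of the observation period (a censored record).

```python
def kaplan_meier(durations, observed):
    """Return [(time, S(time))] at each distinct failure time."""
    events = sorted(zip(durations, observed))
    at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        failures = 0
        leaving = 0
        # Group all records that share the same time.
        while i < len(events) and events[i][0] == t:
            failures += 1 if events[i][1] else 0
            leaving += 1
            i += 1
        if failures:
            survival *= 1.0 - failures / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # failed and censored machines both leave the risk set
    return curve


def survival_at(curve, t):
    """Estimated P(T > t): value of the step function at time t."""
    s = 1.0
    for time, value in curve:
        if time > t:
            break
        s = value
    return s


# Hypothetical hours of uptime after the third reboot.
durations = [5, 10, 30, 40]
observed = [True, False, True, True]  # second machine was still up when last seen
p_24h = survival_at(kaplan_meier(durations, observed), 24)
reboot_again = p_24h >= 0.30  # threshold taken from the question on the slide
```

Handling censored records matters here: simply dropping machines that have not failed yet would bias the estimate toward shorter lifetimes.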

  17. You need a better analysis in some… Is it worth calling a human if P(T>24h) < 50%?

  18. Finding a model. Logistic regression: “easier” to find a high-accuracy model. Add L1 regularization.

  19. Models as a basis for decision making
      # selected metrics: 3
      lambda: 0.052734375
      CV BA: 1.000
      CV confusion matrix:
                      below   above
        pred below       18       1
        pred above        0      12
                     coeffs   indBA    threshold   norm.thr.
        intercept    -0.065
        duration0     1.497   0.972   444,266.00       0.147
        e8060         0.871   0.840         0.00      -0.762
        e8383        -0.469   0.669         0.00      -0.416
      (log)odds of staying alive for another 24hrs = 1.5*duration + 0.9*e8060 - 0.5*e8383 - 0.07
      Policy: optimize based on combining the odds with the cost of sending a repair person
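The fitted model on this slide can be turned into a decision rule along the following lines. The coefficients and intercept are the ones shown on the slide; everything else is an assumption of this sketch: the feature values are treated as already normalized inputs (the slide does not describe the encoding), and the cost comparison in `should_dispatch_repair` is a hypothetical policy, not the one used in Autopilot.

```python
import math

# Coefficients and intercept as reported on the slide.
COEFFS = {"duration0": 1.497, "e8060": 0.871, "e8383": -0.469}
INTERCEPT = -0.065


def log_odds(features):
    """Log-odds of the machine staying alive for another 24 hours."""
    return INTERCEPT + sum(COEFFS[name] * features[name] for name in COEFFS)


def survival_probability(features):
    """Convert log-odds to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds(features)))


def should_dispatch_repair(p_survive, repair_cost, downtime_cost):
    """Hypothetical policy: send a person when expected downtime cost is higher."""
    return (1.0 - p_survive) * downtime_cost > repair_cost


# Hypothetical normalized feature values for one machine.
machine = {"duration0": 1.0, "e8060": 0.0, "e8383": 1.0}
p = survival_probability(machine)
dispatch = should_dispatch_repair(p, repair_cost=100.0, downtime_cost=500.0)
```

The L1 penalty is what reduces the model to three selected metrics, which keeps a rule like this small enough for an operator to inspect.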

  20. Summary/conclusions • Statistics and probability are your friends • Quantiles (for summarization) • Time series (for extreme values) • Regression (modeling) • Estimators (P(T>t)) • Models, models, models • For finding out relevant metrics • For estimating the odds of events based on past history • Discussion: • Automated experiment generation and causality

  21. Thank you… Questions? Acks: Peter Bodik, Armando Fox, Hans Andersen, Mihai Budiu, Yue Zhan
