On the road to Self-Healing Datacenters

On the road to Self-Healing Datacenters. Moises Goldszmidt, Microsoft Research. Dagstuhl Meeting

Two approaches – many challenges. Fingerprinting -- Performance. Autopilot -- Availability. Set of “sensors” to signal events. A reactive mapping from events to repair action. Optimize parameters. Identify *real* problem states. Identify optimal action. • Identify problem states • Identify optimal actions • Build a mapping from signals to actions. How?

Fingerprinting: from raw data to 0-time diagnosis and repair

Performance crisis diagnosis. Key Performance Indicator alerts!! Crisis -> 20% of the machines above threshold. What? Where? How? -> Manual search through hundreds of machines with hundreds of metrics each.

Fingerprinting. Fingerprint -> look for a match in the fingerprints database. IF positive ID THEN time to repair action = 0, ELSE reduce time to diagnosis and amortize $$.

Remarks on the value of a fingerprint • Why do crises repeat often (enough)? • Fixing the root cause of problems is often not the optimal option – prediction and recovery as a feasible alternative • External dependencies • Software bugs will be fixed in next release (maybe?) • Federation • Hardware providers • Dependencies on other services • Diagnosis (?) • Human-interpretable summary of what is “abnormal” during a crisis • Creating a searchable database of problems • Representing state for self-healing…

What’s in a fingerprint? Actual fingerprint • Summarization across machines – quantiles • Summarization across time – hot/cold thresholds (time series, fixed thresholds…) • Metric selection – which are the relevant metrics? (model the crises -> logistic regression with L1 regularization)
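The two summarization steps above can be sketched roughly as follows. This is only an illustration: the quantile levels (25th, 50th, 95th), the function names, and the way the hot/cold thresholds are supplied are all assumptions, not taken from the slides.

```python
from statistics import quantiles


def summarize_across_machines(values, probs=(25, 50, 95)):
    """Collapse one metric's per-machine values into a few quantiles.

    The quantile levels are an assumption for illustration.
    """
    cuts = quantiles(values, n=100, method="inclusive")  # 99 percentile cut points
    return [cuts[p - 1] for p in probs]


def discretize(value, cold, hot):
    """Map a quantile to -1 (cold), 0 (normal) or +1 (hot)."""
    if value > hot:
        return 1
    if value < cold:
        return -1
    return 0


def epoch_fingerprint(metric_values, thresholds):
    """Build the fingerprint vector for one time epoch.

    metric_values: {metric name: [value on each machine]}
    thresholds:    {metric name: [(cold, hot) for each quantile]}
                   (hypothetical; in practice derived from past data)
    """
    fingerprint = []
    for name in sorted(metric_values):
        qs = summarize_across_machines(metric_values[name])
        for q, (cold, hot) in zip(qs, thresholds[name]):
            fingerprint.append(discretize(q, cold, hot))
    return fingerprint
```

With this shape, a fingerprint is a short vector of -1/0/+1 entries per selected metric, which stays the same size no matter how many machines report.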

Pattern matching. Use the L2 (Euclidean) distance to compare the fingerprint vectors. Example distances shown in the figure: 4.75 and 1.53.
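A minimal sketch of the matching step, assuming the fingerprints database is a plain mapping from a crisis label to a stored vector. The `max_distance` cutoff and the function name `best_match` are hypothetical; the slides do not say how a "positive ID" is decided.

```python
import math


def best_match(fingerprint, database, max_distance):
    """Return (label, distance) of the closest stored fingerprint.

    If nothing is within max_distance, the label is None, i.e. no
    positive ID and the crisis is treated as previously unseen.
    """
    best_label, best_dist = None, math.inf
    for label, stored in database.items():
        dist = math.dist(fingerprint, stored)  # L2 distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    if best_dist > max_distance:
        return None, best_dist
    return best_label, best_dist
```

A positive ID returns the label of a past crisis (and, by extension, its recorded repair action); otherwise the distance alone still helps rank past incidents for a human.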

Results on 19 labeled crises. Discriminative power: fingerprints 0.994; all metrics 0.873; just KPIs 0.854; SOSP05 0.876. Operational setting: results on 20 random sequences.

From matching the fingerprint to self-healing. We are effectively and accurately retrieving past incidents -- why not complete the cycle and get self-healing in? Record actions taken. Model fingerprint evolution. Optimization problem.

Challenges • What are the appropriate models for crisis evolution, and for actions and their effects? • How can we perform experiments without disrupting operations? • How do we evaluate the effectiveness of the approach?

Autopilot: How good is my reboot? Should I call a human to repair or get a new machine?

Recovery Oriented Computing • Computers are going to fail • By a simple statistical argument the frequency will increase in datacenters • In order to maintain availability efficiently, make sure that you implement cheap ways of recovery

Autopilot. “Watchdogs” = active sensors. Events -> automaton -> actuators (reboot, reimage). Figure: automaton with states labeled H, P, F, D and RB.

Challenges • What is the effectiveness of a reboot? • How can we use data to determine whether certain watchdogs indicate that certain actions are meaningless? • How can I incorporate optimal decision making into the supply chain?

Apply survival analysis… P(T>t) after the third reboot. Is it worth rebooting a machine for a third time if P(T>24h) < 30%?
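One common way to estimate P(T>t) from observed machine lifetimes is the Kaplan-Meier estimator. The slides do not name the estimator used, so the sketch below is an assumption. `durations` is hours survived after a reboot, and `observed` is False when the machine was still alive at the end of the observation window (a censored record).

```python
def kaplan_meier(durations, observed):
    """Return [(time, S(time))] at each distinct failure time,
    where S(t) estimates P(T > t)."""
    events = sorted(zip(durations, observed))
    at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        failures = 0
        removed = 0
        # Group all records that share the same time.
        while i < len(events) and events[i][0] == t:
            failures += 1 if events[i][1] else 0
            removed += 1
            i += 1
        if failures:
            survival *= 1.0 - failures / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


def survival_at(curve, t):
    """Read P(T > t) off the step curve."""
    s = 1.0
    for time, value in curve:
        if time > t:
            break
        s = value
    return s
```

For example, `survival_at(kaplan_meier(hours, failed), 24)` gives the estimated chance that a machine stays up more than 24 hours, which is the quantity the question on this slide compares against 30%.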

You need a better analysis in some… Is it worth calling a human if P(T>24h) < 50%?

Finding a model. Logistic regression: “easier” to find a high-accuracy model. Add L1 regularization.
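To illustrate why L1 regularization helps pick relevant metrics, here is a small pure-Python sketch of L1-regularized logistic regression fitted by proximal gradient descent. The optimizer, step size, and parameter names are assumptions for illustration; the slides do not say how the model was fitted.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value, amount):
    """Shrink toward zero; values within +/- amount become exactly 0."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def fit_l1_logistic(rows, labels, lam=0.1, lr=0.5, steps=500):
    """Minimize mean log-loss + lam * sum(|w_j|); the bias is unpenalized."""
    n, d = len(rows), len(rows[0])
    weights = [0.0] * d
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * d
        grad_b = 0.0
        for x, y in zip(rows, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            err = sigmoid(z) - y
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * x[j] / n
        bias -= lr * grad_b
        # Gradient step followed by the L1 proximal (shrinkage) step.
        weights = [soft_threshold(w - lr * g, lr * lam)
                   for w, g in zip(weights, grad_w)]
    return weights, bias
```

Metrics that carry no information about the label end up with a weight of exactly zero, so the non-zero weights double as the list of selected metrics.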

Models as a basis for decision making

# selected metrics: 3
lambda: 0.052734375
CV BA: 1.000
CV confusion matrix:
              below  above
  pred below     18      1
  pred above      0     12

             coeffs  indBA   threshold  norm.thr.
  intercept  -0.065
  duration0   1.497  0.972  444,266.00      0.147
  e8060       0.871  0.840        0.00     -0.762
  e8383      -0.469  0.669        0.00     -0.416

(log)odds of staying alive for another 24hrs = 1.5*duration + 0.9*e8060 – 0.5*e8383 – 0.07

Policy -> Optimize based on combining the odds with cost of sending a repair person
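A rough sketch of how the log-odds formula on this slide could feed a decision. The rounded coefficients come from the slide. The decision rule, the cost figures, and the assumption that the inputs are already scaled or thresholded the way the model expects are all hypothetical.

```python
import math

# Rounded coefficients as written on the slide.
COEFFS = {"duration": 1.5, "e8060": 0.9, "e8383": -0.5}
INTERCEPT = -0.07


def log_odds(features):
    """Log-odds of staying alive for another 24 hours."""
    return INTERCEPT + sum(COEFFS[name] * features[name] for name in COEFFS)


def prob_alive_24h(features):
    """Convert log-odds to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds(features)))


def choose_action(features, cost_failure, cost_repair):
    """Hypothetical policy: send a repair person only when the expected
    cost of a failure within 24h exceeds the cost of the visit."""
    p_fail = 1.0 - prob_alive_24h(features)
    if p_fail * cost_failure > cost_repair:
        return "call_repair"
    return "reboot"
```

With illustrative costs of 100 for a failure and 40 for a repair visit, a machine with all inputs at zero has roughly a 52% chance of failing and gets a repair visit, while one with `duration` and `e8060` at 1 has roughly a 9% chance and is simply rebooted.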

Summary/conclusions • Statistics and probability are your friends • Quantiles (for summarization) • Time series (for extreme values) • Regression (modeling) • Estimators (P(T>t)) • Models, models, models • For finding out relevant metrics • For estimating the odds of events based on past history • Discussion: • Automated experiment generation and causality

Thank you… Questions? Acks: Peter Bodik, Armando Fox, Hans Andersen, Mihai Budiu, Yue Zhan
