Automating Misconfiguration Troubleshooting with PeerPressure: Enhancing PC Tech Support

Automatic Misconfiguration Troubleshooting with PeerPressure Helen J. Wang, John C. Platt, Yu Chen, Ruyun Zhang, Yi-Min Wang Microsoft Research Presenter: Sara Salahi Northwestern University

Agenda • Importance of this work • Key ideas • PeerPressure: Architecture & Algorithm • Prototype • Performance • Future Work

Importance • Tech support = 17% of the total cost of ownership of today’s desktop PCs • A large share of tech support is spent on troubleshooting • Many troubleshooting cases are due to misconfiguration • Misconfiguration is often caused by data in shared persistent stores (e.g. the Windows registry); the authors focus on this

Key Ideas: Misconfigurations • Can have many different “root causes” • Seemingly innocuous changes to shared system configurations • System bugs • Security patches may introduce incompatible registry settings • Failed uninstallation of applications • Manual intervention using Registry editor

Key Ideas: The Golden State • “Golden State” – a perfect configuration • Assume that the golden state is in the mass • Combine statistical golden state with Bayesian statistics to identify anomalous misconfigurations on “sick” machines

Key Ideas: Goals of Troubleshooting • Effectiveness • System should identify a small set of sick configuration candidates in a short amount of time • Automation • Minimize number of manual steps and number of users involved

PeerPressure: Architecture (figure annotations) • 1) Sick computer • 2) “I found you” • 3) Turns user- or machine-specific entries into canonicalized form • 4) Database containing a number of machine configuration snapshots • 5) Bayesian estimation used to calculate probability of a suspect being sick
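Step 3 of the architecture (canonicalization) can be illustrated with a minimal sketch. This is not the authors' canonicalizer: the placeholder tokens, the SID pattern, and the function name `canonicalize` are hypothetical, chosen only to show how user- or machine-specific strings might be mapped to a common form before entries are compared against stored snapshots.

```python
import re

# Assumption: per-user Windows security identifiers look like
# S-1-5-21-<a>-<b>-<c>-<rid>. The slides do not specify the rules used.
SID_PATTERN = re.compile(r"S-1-5-21(?:-\d+){3}-\d+")


def canonicalize(text, user_name, machine_name):
    """Replace user- and machine-specific substrings with placeholders.

    Hypothetical helper: user_name and machine_name must be non-empty.
    """
    text = SID_PATTERN.sub("<USER_SID>", text)
    text = re.sub(re.escape(user_name), "<USER_NAME>", text,
                  flags=re.IGNORECASE)
    text = re.sub(re.escape(machine_name), "<MACHINE_NAME>", text,
                  flags=re.IGNORECASE)
    return text


if __name__ == "__main__":
    key = r"HKEY_USERS\S-1-5-21-1111-2222-3333-1001\Software\App"
    print(canonicalize(key, "alice", "PC42"))
```

After this step, the same entry recorded on two machines with different user names compares as equal, which is what lets sample values be counted together.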

PeerPressure: Architecture • Manual Steps • User runs faulty application to record suspects • User determines if sickness is cured • Manual steps involve only the troubleshooting user and no second-party

PeerPressure: Algorithm • Intuition and Objectives • e1: Probably healthy • e2: Most probably sick • e3: “Natural biological diversity” • Type I: application configuration states • e1 and e2 • Type II: operational states (timestamps, caches etc) • e3 • Want to weed out; most likely false positives

PeerPressure: Algorithm Formulation: • (3) + (1) → when m = 0, P(S|V) = 1 • Bayesian estimation is used to overcome this. • Vector pj: probability of event happening and its outcome being Vj; pj follows a Dirichlet distribution. • mj: count of the number of values matching the suspect value
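The m = 0 problem and its Bayesian fix can be sketched numerically. The equations themselves did not survive on this slide, so the following is a minimal sketch under stated assumptions rather than the authors' exact formulation: a prior P(S) = 1/t over t suspects, P(V|S) = 1/c for an entry with cardinality c, and a smoothed estimate P(V|H) = (m + 1)/(N + c) over N samples (a flat Dirichlet prior). The symbols N, c, t and all function names are assumptions introduced here.

```python
def sick_probability_ml(m, n_samples, cardinality, n_suspects):
    """P(S|V) by Bayes' rule with the maximum-likelihood P(V|H) = m / N."""
    p_sick = 1.0 / n_suspects             # assumed prior P(S) = 1/t
    p_value_given_sick = 1.0 / cardinality  # assumed P(V|S) = 1/c
    p_value_given_healthy = m / n_samples
    numerator = p_value_given_sick * p_sick
    return numerator / (numerator + p_value_given_healthy * (1.0 - p_sick))


def sick_probability_smoothed(m, n_samples, cardinality, n_suspects):
    """Same rule, but with the smoothed P(V|H) = (m + 1) / (N + c)."""
    p_sick = 1.0 / n_suspects
    p_value_given_sick = 1.0 / cardinality
    p_value_given_healthy = (m + 1) / (n_samples + cardinality)
    numerator = p_value_given_sick * p_sick
    return numerator / (numerator + p_value_given_healthy * (1.0 - p_sick))


def rank_suspects(suspects, n_samples):
    """Sort (name, m, cardinality) tuples, most probably sick first."""
    t = len(suspects)
    scored = [(name, sick_probability_smoothed(m, n_samples, c, t))
              for name, m, c in suspects]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical suspects: (entry name, matching count m, cardinality c).
    suspects = [("e1", 80, 2), ("e2", 0, 2), ("e3", 1, 60)]
    for name, probability in rank_suspects(suspects, n_samples=87):
        print(name, round(probability, 4))
```

With the maximum-likelihood estimate, any suspect value never seen in the samples (m = 0) gets probability exactly 1, so all such suspects tie. The smoothed estimate keeps the probability below 1 and lets cardinality separate a rare value of a low-cardinality entry from a rare value of a highly variable one.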

PeerPressure: Algorithm Asymptotic Analysis:

Prototype • GeneBank Database: Microsoft SQL Server 2000 containing snapshots from 87 Windows XP PCs • PeerPressure troubleshooter implemented in C# • “Data Sanitization” • Unification of different representations of the same value • Dual Intel Xeon 2.4 GHz CPU workstation with 1 GB RAM hosts SQL Server

Performance: Response Time vs. Number of Suspects • 20 real-world troubleshooting cases used • Database queries dominate troubleshooting response time (one query per suspect entry)

Prototype: GeneBank • Registry characteristics in GeneBank • Unseen: a value that is unknown to the GeneBank; it increments the observed cardinality by 1 • Any entry from GeneBank has cardinality of at least 2 • Entries that do not exist on some sample machines have the value “no entry” • When cardinality is low, conformity among samples is strong
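The cardinality rules above can be sketched as follows. This is an illustrative sketch, not the GeneBank implementation: the `NO_ENTRY` marker and both function names are hypothetical, and the input is assumed to be one value per sample machine for a single entry. It does not enforce the "at least 2" property stated on the slide.

```python
# Hypothetical marker for machines on which the entry does not exist.
NO_ENTRY = "<no entry>"


def observed_cardinality(sample_values, suspect_value):
    """Count distinct sample values; an unseen suspect value adds 1."""
    distinct = set(sample_values)  # NO_ENTRY counts as a value of its own
    cardinality = len(distinct)
    if suspect_value not in distinct:
        cardinality += 1
    return cardinality


def matching_count(sample_values, suspect_value):
    """Number of sample machines whose value equals the suspect's value."""
    return sum(1 for value in sample_values if value == suspect_value)


if __name__ == "__main__":
    samples = ["1", "1", "1", NO_ENTRY]
    print(observed_cardinality(samples, "0"), matching_count(samples, "0"))
```

Both numbers come from the same pass over an entry's sample values, which fits the slide's note that the prototype issues one database query per suspect entry.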

Performance: Root-Cause Ranking Results • 87% have cardinality of 2, 94% no more than 3, 97% no more than 4

Performance: False Positives • Large cardinality of root-cause entry • Relation between root-cause entry and other entries in the suspect set • GeneBank is not pristine

Performance: Impact of Sample Set Size

Performance: Sick Machine Sensitivity • Format: RootCauseRanking (NumberOfTies) / NumberOfSuspects

Future Work • Multi-gene troubleshooting • Multiple sick entries among suspects • Cross-application misconfiguration • Heavy customization of apps can break assumption of strong conformance in most configuration entries • GeneBank maintenance – privacy issue
