
A Statistical Learning Approach to Diagnosing eBay’s Site



  1. A Statistical Learning Approach to Diagnosing eBay’s Site
  Mike Chen, Alice Zheng, Jim Lloyd, Michael Jordan, Eric Brewer
  mikechen@cs.berkeley.edu

  2. Motivation
  • Fast failure detection and diagnosis are critical to high availability
  • But the exact root cause may not be required for many recovery techniques
  • Many potential causes of failures: software bugs, hardware, configuration, network, database, etc.
  • Manual diagnosis is slow and inconsistent
  • Statistical approaches are well suited: they examine many possible causes of failure simultaneously and are robust to noise

  3. Challenges
  • Lots of (noisy) data
  • Near real-time detection and diagnosis
  • Multiple independent failures
  • Root cause might not be captured in logs

  4. Talk Outline
  • Introduction
  • eBay’s infrastructure
  • 3 statistical approaches
  • Early results

  5. eBay’s Infrastructure
  • 2 physical tiers: web server/app server + DB
  • Migrating from C++ to Java (WebSphere)
  • SuperCAL (Centralized Application Logging)
    • API for application developers to log anything to CAL
    • Runtime platform provides application-generic logging: cookie, host, URL, DB table(s), status, duration, etc.
    • Supports nested txns
    • A path can be identified via thread ID + host ID (see the sketch below)
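To make the path idea concrete, here is a minimal sketch of grouping nested transaction log events into per-request paths keyed by host ID + thread ID. This is not SuperCAL's actual schema; the event field names and values are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical SuperCAL-style log events; the field names (host, thread,
# txn_name, status) and values are illustrative assumptions only.
events = [
    {"host": "app01", "thread": 7, "txn_name": "URL:ViewItem", "status": "OK"},
    {"host": "app01", "thread": 7, "txn_name": "SQL:items",    "status": "OK"},
    {"host": "app02", "thread": 3, "txn_name": "URL:Bid",      "status": "Failed"},
]

# A path is identified by (host ID, thread ID), as the slide describes.
paths = defaultdict(list)
for e in events:
    paths[(e["host"], e["thread"])].append(e)

for key, path in paths.items():
    print(key, "->", [e["txn_name"] for e in path])
```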

  6. SuperCAL Architecture
  • Stats: 2K app servers, 40 SuperCAL machines
  • 1B URLs/day
  • 1TB raw logs/day (150GB gzipped), 200Mbps peak
  [Architecture diagram: app servers behind an LB switch publish onto a real-time message bus that feeds the detection and diagnosis components]

  7. Failure Analysis
  • Summarize each transaction into a set of features plus a class label (success/failure); a sketch of such a record follows this slide
  • What features are causing requests to fail?
    • Txn type, txn name, pool, host, version, DB, or a combination of these?
  • Different causes require different recovery techniques
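A minimal sketch of what a summarized transaction might look like: one feature vector and one success/failure class label per transaction. The feature names follow the slide; the concrete values are made up.

```python
# One record per transaction: a feature vector plus a success/failure label.
# Feature names mirror the slide; the values are illustrative only.
transactions = [
    {"txn_type": "URL", "txn_name": "ViewFeedback", "pool": "icgi2",
     "host": "app14", "version": "E293", "db": "fdbk_db", "failed": True},
    {"txn_type": "URL", "txn_name": "ViewItem", "pool": "icgi1",
     "host": "app02", "version": "E291", "db": "items_db", "failed": False},
]

features = ["txn_type", "txn_name", "pool", "host", "version", "db"]
X = [[t[f] for f in features] for t in transactions]  # feature matrix
y = [t["failed"] for t in transactions]               # class labels
print(X, y)
```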

  8. 3 Approaches
  • Machine learning
    • Decision trees
    • MinEntropy – eBay’s greedy variant of decision trees
  • Data mining
    • Association rules

  9. Decision Trees
  • Classifiers developed in the statistical machine learning field
  • Example: go skiing tomorrow?
    [Diagram: a small decision tree that splits on new snow vs. no new snow, then sunny vs. cloudy, with yes/no leaves]
  • “Learning” => inferring the decision tree rules from data

  10. Decision Trees
  • Feature selection
    • Look for the feature that best separates the classes
    • Different algorithms use different metrics to measure “skewness” (e.g. C4.5 uses information gain; a sketch follows this slide)
  • The goal of the decision tree algorithm
    • Split nodes until leaves are “pure” enough or until no further split is possible
    • i.e. pure => all data points in the leaf have the same class label
  • Use pruning heuristics to control over-fitting
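A small sketch of entropy-based feature selection in the spirit of C4.5's information gain. This is the textbook formula, not the production implementation; the dict-per-transaction data layout is an assumption.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy reduction from splitting on one categorical feature."""
    n = len(labels)
    remainder = 0.0
    for value, count in Counter(r[feature] for r in rows).items():
        subset = [l for r, l in zip(rows, labels) if r[feature] == value]
        remainder += (count / n) * entropy(subset)
    return entropy(labels) - remainder

# Toy data: failures concentrate in pool icgi2, so "pool" has the higher gain.
rows = [{"pool": "icgi2", "host": "a"}] * 4 + [{"pool": "icgi1", "host": "a"}] * 4
labels = ["failed"] * 4 + ["ok"] * 4
print(information_gain(rows, labels, "pool"), information_gain(rows, labels, "host"))
```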

  11. Decision Trees – Sample Output
  • Sample tree (leaf counts shown as (correct, incorrect)):
    Pool = icgi1
    | TxnName = LeaveFeedback: failed (8,1)
    | TxnName = MyFeedback: failed (205,3)
    Pool = icgi2
    | TxnName = Respond: failed (1)
    | TxnName = ViewFeedback: failed (3554,52)
  • Naïve diagnosis:
    • Pool=icgi1 and TxnName=LeaveFeedback
    • Pool=icgi1 and TxnName=MyFeedback
    • Pool=icgi2 and TxnName=Respond
    • Pool=icgi2 and TxnName=ViewFeedback
  [Diagram: the same tree drawn with pools icgi1/icgi2 at the root and failing leaves LeaveFdbk (8), MyFdbk (205), Respond (1), ViewFdbk (3554)]
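For comparison, a minimal sketch of fitting a tree on toy transaction data with scikit-learn's CART implementation, standing in for the C4.5 classifier referenced in the slides. The data and the ordinal encoding are assumptions.

```python
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy (pool, txn_name, failed) transactions; counts are illustrative only.
rows = [("icgi1", "MyFeedback", 1), ("icgi2", "ViewFeedback", 1),
        ("icgi1", "ViewItem", 0), ("icgi2", "Search", 0)] * 50

X_raw = [[pool, name] for pool, name, _ in rows]
y = [failed for _, _, failed in rows]

# scikit-learn trees need numeric inputs, so encode the categorical features.
enc = OrdinalEncoder()
X = enc.fit_transform(X_raw)

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["pool", "txn_name"]))
```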

  12. Feature Selection Heuristics
  • Ignore leaf nodes with no failed transactions
    • Problem: noisy leaves => keep the top N leaves, or ignore nodes with < M% failures
    • Problem: features may not be independent => drop ancestor nodes that are “subsumed” by the leaves
  • Rank by impact: sort the predicted causes by failure count (see the sketch after this slide)
  [Diagram: the example tree from the previous slide before and after applying the heuristics]
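A sketch of the noise-filtering and impact-ranking heuristics applied to candidate leaf diagnoses. The (failed, total) counts and the list structure are assumptions for illustration, not the paper's data.

```python
# Candidate leaves as (feature constraints, failed count, total count in leaf).
leaves = [
    ({"pool": "icgi1", "txn_name": "LeaveFeedback"}, 8, 9),
    ({"pool": "icgi1", "txn_name": "MyFeedback"}, 205, 208),
    ({"pool": "icgi2", "txn_name": "ViewFeedback"}, 3554, 3606),
    ({"pool": "icgi3", "txn_name": "Search"}, 2, 500),  # noisy leaf: 0.4% failures
]

MIN_FAILURE_RATE = 0.10  # the "M = 10" threshold from the slides

# Noise filtering: drop leaves below the failure-rate threshold,
# then rank the surviving diagnoses by impact (failure count).
kept = [(feats, failed) for feats, failed, total in leaves
        if failed / total >= MIN_FAILURE_RATE]
for feats, failed in sorted(kept, key=lambda kf: kf[1], reverse=True):
    print(failed, feats)
```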

  13. MinEntropy
  • Entropy measures the randomness of data
    • E.g. if failures are evenly distributed (very random), entropy is high
  • Rank features by normalized entropy
  • Greedy approach searches for the leaf node with the most failures (see the sketch below)
  • Always produces exactly one diagnosis
  • Deployed on the entire eBay site
    • Sends real-time alerts to ops
  • Pros: fast (<1s for 100K txns, and scales linearly)
  • Cons: optimized for single faults
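The following is one way to read the MinEntropy idea, sketched from the slide's description rather than eBay's code: score each feature by the normalized entropy of its failure counts (low entropy means failures concentrate on a few values), then greedily report the value with the most failures.

```python
import math
from collections import Counter

def normalized_entropy(counts):
    """Entropy of a failure-count distribution, scaled to [0, 1]."""
    total = sum(counts)
    if total == 0:
        return 1.0
    if len(counts) == 1:
        return 0.0  # all failures on a single value: maximally concentrated
    h = -sum((c / total) * math.log2(c / total) for c in counts if c)
    return h / math.log2(len(counts))

# Failed transactions as feature dicts; the data is illustrative only.
failures = ([{"pool": "CGI1", "version": "E293"}] * 60 +
            [{"pool": "CGI2", "version": "E293"}] * 30 +
            [{"pool": "CGI1", "version": "E291"}] * 10)

scores = {f: normalized_entropy(list(Counter(x[f] for x in failures).values()))
          for f in ["pool", "version"]}

best = min(scores, key=scores.get)  # the most "skewed" feature
value, count = Counter(x[best] for x in failures).most_common(1)[0]
print(f"Suspect {best} = {value} ({count} of {len(failures)} failures)")
```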

  14. MinEntropy example
  • Alert: Version E293 causing URL failures (not specific to any URL) in pool CGI1

  15. Association Rules
  • Data mining technique to compute item sets
    • e.g. shoppers who bought this item also shopped for …
  • Metrics (a sketch follows this slide)
    • Confidence: (# of A & B) / (# of A), the conditional probability of B given A
    • Support: (# of A & B) / (total # of txns)
  • Generates rules for all possible item sets
    • e.g. machine=abc, txn=login => status=NullPointer (conf: 0.1, support: 0.02)
  • Applied to failure diagnosis
    • Find all rules that have failed status on the right-hand side, then rank by confidence
  • Pros: considers combinations of features
  • Cons: generates many rules
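A sketch of the two metrics for simple single-antecedent rules predicting failure. A full association-rule miner such as Apriori would enumerate larger item sets; the transactions here are made up.

```python
from collections import Counter

# Each transaction is a set of attribute=value items; the data is illustrative.
txns = [
    {"pool=icgi2", "txn=LeaveFeedback", "status=Failed"},
    {"pool=icgi2", "txn=LeaveFeedback", "status=OK"},
    {"pool=icgi1", "txn=ViewItem", "status=OK"},
    {"pool=icgi2", "txn=ViewItem", "status=OK"},
] * 25

n = len(txns)
seen, failed_with = Counter(), Counter()
for t in txns:
    for item in t - {"status=Failed", "status=OK"}:
        seen[item] += 1
        if "status=Failed" in t:
            failed_with[item] += 1

# Rule "A => status=Failed": confidence = P(Failed | A), support = P(A and Failed).
for item, count in seen.items():
    if failed_with[item]:
        print(f"{item} => status=Failed  "
              f"conf={failed_with[item] / count:.2f}  "
              f"support={failed_with[item] / n:.2f}")
```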

  16. Association Rules – Sample Output
  • Sample output (rules containing failures):
    TxnType=URL Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
    Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
    TxnType=URL TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
    TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
  • Problem: features may not be independent
    • e.g. all LeaveFeedback txns are of type URL
    • Drop rules whose extra conditions are subsumed by a simpler rule with the same confidence (a filtering sketch follows this slide)
  • Diagnosis: TxnName=LeaveFeedback
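A sketch of that redundancy filter, using the four rules from this slide: a rule is dropped when a simpler rule (a subset of its antecedent) already achieves at least the same confidence, leaving only TxnName=LeaveFeedback. The data structure is an assumption.

```python
# The four candidate rules from the slide, as (antecedent, confidence) pairs.
rules = [
    (frozenset({"TxnType=URL", "Pool=icgi2", "TxnName=LeaveFeedback"}), 0.28),
    (frozenset({"Pool=icgi2", "TxnName=LeaveFeedback"}), 0.28),
    (frozenset({"TxnType=URL", "TxnName=LeaveFeedback"}), 0.28),
    (frozenset({"TxnName=LeaveFeedback"}), 0.28),
]

# Keep a rule only if no strictly simpler rule has at least the same confidence;
# the extra conditions in the larger rules add no information, so they are dropped.
kept = [(ante, conf) for ante, conf in rules
        if not any(other < ante and oconf >= conf for other, oconf in rules)]

for ante, conf in kept:
    print(sorted(ante), "=> Status=Failed", f"conf={conf}")
```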

  17. Experimental Setup
  • Dataset
    • About 1/8 of the whole site
    • 10 one-minute traces, 4 with 2 concurrent faults; a total of 14 independent faults
    • True faults identified through post-mortems, ops chat logs, application logs, etc.
  • Metrics (a sketch follows this slide)
    • Precision: (# of correctly identified faults) / (# of predicted faults)
    • Recall: (# of correctly identified faults) / (# of true faults)
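A sketch of the two metrics computed over sets of predicted and true faults, using the standard definitions. The example fault labels are made up, not taken from the eBay traces.

```python
def precision_recall(predicted, true):
    """Precision = correct / predicted faults; recall = correct / true faults."""
    correct = len(predicted & true)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(true) if true else 0.0
    return precision, recall

# Illustrative fault labels only.
true_faults = {"pool=icgi2", "db=fdbk_db"}
predicted = {"pool=icgi2", "host=app14"}
print(precision_recall(predicted, true_faults))  # (0.5, 0.5)
```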

  18. Results: DBs in Dataset
  • True causes for DB-related failures are captured in the dataset
  • Variable number of DBs used by each txn
  • Feature selection heuristics
    • Ignore leaf nodes with no failed transactions
    • Noise filtering: ignore nodes with < M% failures (in this case, M = 10)
    • Path trimming: drop ancestor nodes subsumed by the leaf nodes

  19. Results: DBs not in Dataset
  • True cause not captured for DB-related failures
  • C4.5 suffers from the unbalanced dataset
    • i.e. it produces a single rule that predicts every txn to be successful

  20. What’s next?
  • ROC curves
    • Show the tradeoff between precision and recall
  • Transient failures
  • Up-sample to balance the dataset, or use a cost matrix (an up-sampling sketch follows this slide)
  • Some measure of the “confidence” of the prediction
  • More data points
    • Have 20 hrs of logs that contain failures
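A minimal sketch of the up-sampling idea on toy data: resample the rare failed class with replacement until the classes are balanced. (A cost matrix would instead weight misclassified failures more heavily.)

```python
import random

random.seed(0)

# Toy unbalanced data: label 1 = failed txn (rare), label 0 = successful txn.
data = [(i, 1) for i in range(5)] + [(i, 0) for i in range(5, 1000)]
failed = [d for d in data if d[1] == 1]
ok = [d for d in data if d[1] == 0]

# Up-sample the minority class with replacement so both classes are equal size.
balanced = ok + random.choices(failed, k=len(ok))
print(len(balanced), sum(label for _, label in balanced))
```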

  21. Open Questions
  • How to deal with multiple symptoms?
    • E.g. a DB outage causing multiple types of requests to fail
    • Treat it as multiple failures?
  • Failure importance (count vs. rate)
    • Two failures may have a similar failure count
    • Low volume with a higher failure rate vs. high volume with a lower failure rate
