
A Statistical Learning Approach to Diagnosing eBay’s Site



  1. A Statistical Learning Approach to Diagnosing eBay’s Site
  Mike Chen, Alice Zheng, Jim Lloyd, Michael Jordan, Eric Brewer
  mikechen@cs.berkeley.edu

  2. Motivation
  • Fast failure detection and diagnosis are critical to high availability
  • But the exact root cause may not be required for many recovery techniques
  • Many potential causes of failures: software bugs, hardware, configuration, network, database, etc.
  • Manual diagnosis is slow and inconsistent
  • Statistical approaches are well suited: they examine many possible causes of failure simultaneously and are robust to noise

  3. Challenges
  • Lots of (noisy) data
  • Near real-time detection and diagnosis
  • Multiple independent failures
  • Root cause might not be captured in logs

  4. Talk Outline
  • Introduction
  • eBay’s infrastructure
  • 3 statistical approaches
  • Early results

  5. eBay’s Infrastructure
  • 2 physical tiers: web server/app server + DB
  • Migrating from C++ to Java (WebSphere)
  • SuperCAL (Centralized Application Logging)
    • API for application developers to log anything to CAL
    • Runtime platform provides application-generic logging: cookie, host, URL, DB table(s), status, duration, etc.
    • Supports nested txns
    • A path can be identified via thread ID + host ID (see the sketch below)
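To make the path idea concrete, here is a minimal sketch of grouping nested transaction log events into per-request paths keyed by host ID + thread ID. This is not SuperCAL's actual schema; the event field names and values are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical SuperCAL-style log events; the field names (host, thread,
# txn_name, status) and values are illustrative assumptions only.
events = [
    {"host": "app01", "thread": 7, "txn_name": "URL:ViewItem", "status": "OK"},
    {"host": "app01", "thread": 7, "txn_name": "SQL:items",    "status": "OK"},
    {"host": "app02", "thread": 3, "txn_name": "URL:Bid",      "status": "Failed"},
]

# A path is identified by (host ID, thread ID), as the slide describes.
paths = defaultdict(list)
for e in events:
    paths[(e["host"], e["thread"])].append(e)

for key, path in paths.items():
    print(key, "->", [e["txn_name"] for e in path])
```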

  6. SuperCAL Architecture
  • Stats: 2K app servers, 40 SuperCAL machines
  • 1B URLs/day
  • 1TB raw logs/day (150GB gzipped), 200Mbps peak
  [Architecture diagram: app servers behind an LB switch publish onto a real-time message bus that feeds the detection and diagnosis components]

  7. Failure Analysis
  • Summarize each transaction into a set of features plus a class label (success/failure); a sketch of such a record follows this slide
  • What features are causing requests to fail?
    • Txn type, txn name, pool, host, version, DB, or a combination of these?
  • Different causes require different recovery techniques
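A minimal sketch of what a summarized transaction might look like: one feature vector and one success/failure class label per transaction. The feature names follow the slide; the concrete values are made up.

```python
# One record per transaction: a feature vector plus a success/failure label.
# Feature names mirror the slide; the values are illustrative only.
transactions = [
    {"txn_type": "URL", "txn_name": "ViewFeedback", "pool": "icgi2",
     "host": "app14", "version": "E293", "db": "fdbk_db", "failed": True},
    {"txn_type": "URL", "txn_name": "ViewItem", "pool": "icgi1",
     "host": "app02", "version": "E291", "db": "items_db", "failed": False},
]

features = ["txn_type", "txn_name", "pool", "host", "version", "db"]
X = [[t[f] for f in features] for t in transactions]  # feature matrix
y = [t["failed"] for t in transactions]               # class labels
print(X, y)
```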

  8. 3 Approaches
  • Machine learning
    • Decision trees
    • MinEntropy – eBay’s greedy variant of decision trees
  • Data mining
    • Association rules

  9. Decision Trees
  • Classifiers developed in the statistical machine learning field
  • Example: go skiing tomorrow?
    [Diagram: a small decision tree that splits on new snow vs. no new snow, then sunny vs. cloudy, with yes/no leaves]
  • “Learning” => inferring the decision tree rules from data

  10. Decision Trees
  • Feature selection
    • Look for the feature that best separates the classes
    • Different algorithms use different metrics to measure “skewness” (e.g. C4.5 uses information gain; a sketch follows this slide)
  • The goal of the decision tree algorithm
    • Split nodes until leaves are “pure” enough or until no further split is possible
    • i.e. pure => all data points in the leaf have the same class label
  • Use pruning heuristics to control over-fitting
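A small sketch of entropy-based feature selection in the spirit of C4.5's information gain. This is the textbook formula, not the production implementation; the dict-per-transaction data layout is an assumption.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy reduction from splitting on one categorical feature."""
    n = len(labels)
    remainder = 0.0
    for value, count in Counter(r[feature] for r in rows).items():
        subset = [l for r, l in zip(rows, labels) if r[feature] == value]
        remainder += (count / n) * entropy(subset)
    return entropy(labels) - remainder

# Toy data: failures concentrate in pool icgi2, so "pool" has the higher gain.
rows = [{"pool": "icgi2", "host": "a"}] * 4 + [{"pool": "icgi1", "host": "a"}] * 4
labels = ["failed"] * 4 + ["ok"] * 4
print(information_gain(rows, labels, "pool"), information_gain(rows, labels, "host"))
```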

  11. Decision Trees – Sample Output
  • Sample tree (leaf counts shown as (correct, incorrect)):
    Pool = icgi1
    | TxnName = LeaveFeedback: failed (8,1)
    | TxnName = MyFeedback: failed (205,3)
    Pool = icgi2
    | TxnName = Respond: failed (1)
    | TxnName = ViewFeedback: failed (3554,52)
  • Naïve diagnosis:
    • Pool=icgi1 and TxnName=LeaveFeedback
    • Pool=icgi1 and TxnName=MyFeedback
    • Pool=icgi2 and TxnName=Respond
    • Pool=icgi2 and TxnName=ViewFeedback
  [Diagram: the same tree drawn with pools icgi1/icgi2 at the root and failing leaves LeaveFdbk (8), MyFdbk (205), Respond (1), ViewFdbk (3554)]
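For comparison, a minimal sketch of fitting a tree on toy transaction data with scikit-learn's CART implementation, standing in for the C4.5 classifier referenced in the slides. The data and the ordinal encoding are assumptions.

```python
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy (pool, txn_name, failed) transactions; counts are illustrative only.
rows = [("icgi1", "MyFeedback", 1), ("icgi2", "ViewFeedback", 1),
        ("icgi1", "ViewItem", 0), ("icgi2", "Search", 0)] * 50

X_raw = [[pool, name] for pool, name, _ in rows]
y = [failed for _, _, failed in rows]

# scikit-learn trees need numeric inputs, so encode the categorical features.
enc = OrdinalEncoder()
X = enc.fit_transform(X_raw)

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["pool", "txn_name"]))
```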

  12. Feature Selection Heuristics
  • Ignore leaf nodes with no failed transactions
    • Problem: noisy leaves => keep the top N leaves, or ignore nodes with < M% failures
    • Problem: features may not be independent => drop ancestor nodes that are “subsumed” by the leaves
  • Rank by impact: sort the predicted causes by failure count (see the sketch after this slide)
  [Diagram: the example tree from the previous slide before and after applying the heuristics]
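A sketch of the noise-filtering and impact-ranking heuristics applied to candidate leaf diagnoses. The (failed, total) counts and the list structure are assumptions for illustration, not the paper's data.

```python
# Candidate leaves as (feature constraints, failed count, total count in leaf).
leaves = [
    ({"pool": "icgi1", "txn_name": "LeaveFeedback"}, 8, 9),
    ({"pool": "icgi1", "txn_name": "MyFeedback"}, 205, 208),
    ({"pool": "icgi2", "txn_name": "ViewFeedback"}, 3554, 3606),
    ({"pool": "icgi3", "txn_name": "Search"}, 2, 500),  # noisy leaf: 0.4% failures
]

MIN_FAILURE_RATE = 0.10  # the "M = 10" threshold from the slides

# Noise filtering: drop leaves below the failure-rate threshold,
# then rank the surviving diagnoses by impact (failure count).
kept = [(feats, failed) for feats, failed, total in leaves
        if failed / total >= MIN_FAILURE_RATE]
for feats, failed in sorted(kept, key=lambda kf: kf[1], reverse=True):
    print(failed, feats)
```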

  13. MinEntropy
  • Entropy measures the randomness of data
    • E.g. if failures are evenly distributed (very random), entropy is high
  • Rank features by normalized entropy
  • Greedy approach searches for the leaf node with the most failures (see the sketch below)
  • Always produces exactly one diagnosis
  • Deployed on the entire eBay site
    • Sends real-time alerts to ops
  • Pros: fast (<1s for 100K txns, and scales linearly)
  • Cons: optimized for single faults
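The following is one way to read the MinEntropy idea, sketched from the slide's description rather than eBay's code: score each feature by the normalized entropy of its failure counts (low entropy means failures concentrate on a few values), then greedily report the value with the most failures.

```python
import math
from collections import Counter

def normalized_entropy(counts):
    """Entropy of a failure-count distribution, scaled to [0, 1]."""
    total = sum(counts)
    if total == 0:
        return 1.0
    if len(counts) == 1:
        return 0.0  # all failures on a single value: maximally concentrated
    h = -sum((c / total) * math.log2(c / total) for c in counts if c)
    return h / math.log2(len(counts))

# Failed transactions as feature dicts; the data is illustrative only.
failures = ([{"pool": "CGI1", "version": "E293"}] * 60 +
            [{"pool": "CGI2", "version": "E293"}] * 30 +
            [{"pool": "CGI1", "version": "E291"}] * 10)

scores = {f: normalized_entropy(list(Counter(x[f] for x in failures).values()))
          for f in ["pool", "version"]}

best = min(scores, key=scores.get)  # the most "skewed" feature
value, count = Counter(x[best] for x in failures).most_common(1)[0]
print(f"Suspect {best} = {value} ({count} of {len(failures)} failures)")
```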

  14. MinEntropy example
  • Alert: Version E293 causing URL failures (not specific to any URL) in pool CGI1

  15. Association Rules
  • Data mining technique to compute item sets
    • e.g. shoppers who bought this item also shopped for …
  • Metrics (a sketch follows this slide)
    • Confidence: (# of A & B) / (# of A), the conditional probability of B given A
    • Support: (# of A & B) / (total # of txns)
  • Generates rules for all possible item sets
    • e.g. machine=abc, txn=login => status=NullPointer (conf: 0.1, support: 0.02)
  • Applied to failure diagnosis
    • Find all rules that have failed status on the right-hand side, then rank by confidence
  • Pros: considers combinations of features
  • Cons: generates many rules
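A sketch of the two metrics for simple single-antecedent rules predicting failure. A full association-rule miner such as Apriori would enumerate larger item sets; the transactions here are made up.

```python
from collections import Counter

# Each transaction is a set of attribute=value items; the data is illustrative.
txns = [
    {"pool=icgi2", "txn=LeaveFeedback", "status=Failed"},
    {"pool=icgi2", "txn=LeaveFeedback", "status=OK"},
    {"pool=icgi1", "txn=ViewItem", "status=OK"},
    {"pool=icgi2", "txn=ViewItem", "status=OK"},
] * 25

n = len(txns)
seen, failed_with = Counter(), Counter()
for t in txns:
    for item in t - {"status=Failed", "status=OK"}:
        seen[item] += 1
        if "status=Failed" in t:
            failed_with[item] += 1

# Rule "A => status=Failed": confidence = P(Failed | A), support = P(A and Failed).
for item, count in seen.items():
    if failed_with[item]:
        print(f"{item} => status=Failed  "
              f"conf={failed_with[item] / count:.2f}  "
              f"support={failed_with[item] / n:.2f}")
```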

  16. Association Rules – Sample Output
  • Sample output (rules containing failures):
    TxnType=URL Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
    Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
    TxnType=URL TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
    TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
  • Problem: features may not be independent
    • e.g. all LeaveFeedback txns are of type URL
    • Drop rules whose extra conditions are subsumed by a simpler rule with the same confidence (a filtering sketch follows this slide)
  • Diagnosis: TxnName=LeaveFeedback
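A sketch of that redundancy filter, using the four rules from this slide: a rule is dropped when a simpler rule (a subset of its antecedent) already achieves at least the same confidence, leaving only TxnName=LeaveFeedback. The data structure is an assumption.

```python
# The four candidate rules from the slide, as (antecedent, confidence) pairs.
rules = [
    (frozenset({"TxnType=URL", "Pool=icgi2", "TxnName=LeaveFeedback"}), 0.28),
    (frozenset({"Pool=icgi2", "TxnName=LeaveFeedback"}), 0.28),
    (frozenset({"TxnType=URL", "TxnName=LeaveFeedback"}), 0.28),
    (frozenset({"TxnName=LeaveFeedback"}), 0.28),
]

# Keep a rule only if no strictly simpler rule has at least the same confidence;
# the extra conditions in the larger rules add no information, so they are dropped.
kept = [(ante, conf) for ante, conf in rules
        if not any(other < ante and oconf >= conf for other, oconf in rules)]

for ante, conf in kept:
    print(sorted(ante), "=> Status=Failed", f"conf={conf}")
```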

  17. Experimental Setup
  • Dataset
    • About 1/8 of the whole site
    • 10 one-minute traces, 4 with 2 concurrent faults; a total of 14 independent faults
    • True faults identified through post-mortems, ops chat logs, application logs, etc.
  • Metrics (a sketch follows this slide)
    • Precision: (# of correctly identified faults) / (# of predicted faults)
    • Recall: (# of correctly identified faults) / (# of true faults)
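A sketch of the two metrics computed over sets of predicted and true faults, using the standard definitions. The example fault labels are made up, not taken from the eBay traces.

```python
def precision_recall(predicted, true):
    """Precision = correct / predicted faults; recall = correct / true faults."""
    correct = len(predicted & true)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(true) if true else 0.0
    return precision, recall

# Illustrative fault labels only.
true_faults = {"pool=icgi2", "db=fdbk_db"}
predicted = {"pool=icgi2", "host=app14"}
print(precision_recall(predicted, true_faults))  # (0.5, 0.5)
```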

  18. Results: DBs in Dataset
  • True causes for DB-related failures are captured in the dataset
  • Variable number of DBs used by each txn
  • Feature selection heuristics
    • Ignore leaf nodes with no failed transactions
    • Noise filtering: ignore nodes with < M% failures (in this case, M = 10)
    • Path trimming: drop ancestor nodes subsumed by the leaf nodes

  19. Results: DBs not in Dataset
  • True cause not captured for DB-related failures
  • C4.5 suffers from the unbalanced dataset
    • i.e. it produces a single rule that predicts every txn to be successful

  20. What’s next?
  • ROC curves
    • Show the tradeoff between precision and recall
  • Transient failures
  • Up-sample to balance the dataset, or use a cost matrix (an up-sampling sketch follows this slide)
  • Some measure of the “confidence” of the prediction
  • More data points
    • Have 20 hrs of logs that contain failures
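A minimal sketch of the up-sampling idea on toy data: resample the rare failed class with replacement until the classes are balanced. (A cost matrix would instead weight misclassified failures more heavily.)

```python
import random

random.seed(0)

# Toy unbalanced data: label 1 = failed txn (rare), label 0 = successful txn.
data = [(i, 1) for i in range(5)] + [(i, 0) for i in range(5, 1000)]
failed = [d for d in data if d[1] == 1]
ok = [d for d in data if d[1] == 0]

# Up-sample the minority class with replacement so both classes are equal size.
balanced = ok + random.choices(failed, k=len(ok))
print(len(balanced), sum(label for _, label in balanced))
```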

  21. Open Questions
  • How to deal with multiple symptoms?
    • E.g. a DB outage causing multiple types of requests to fail
    • Treat it as multiple failures?
  • Failure importance (count vs. rate)
    • Two failures may have a similar failure count
    • Low volume with a higher failure rate vs. high volume with a lower failure rate
