
Detecting Large-Scale System Problems by Mining Console Logs

Authors: Wei Xu*, Ling Huang†, Armando Fox*, David Patterson*, Michael Jordan*. Conference: ICML 2010, ACM SOSP 2009. Advisor: Yuh-Jye Lee. Reporter: Yi-Hsiang Yang. Email: M9915016@mail.ntust.edu.tw.


Presentation Transcript


  1. Detecting Large-Scale System Problems by Mining Console Logs • Authors: Wei Xu*, Ling Huang†, Armando Fox*, David Patterson*, Michael Jordan* • Conference: ICML 2010, ACM SOSP 2009 • Advisor: Yuh-Jye Lee • Reporter: Yi-Hsiang Yang • Email: M9915016@mail.ntust.edu.tw

  2. Outline • Introduction • Methodology • Evaluation and Visualization • Conclusion

  3. Introduction • What information do console logs hold? • Console logs rarely help in large-scale datacenter services • Operational problems depend on the deployment and runtime environment • A typical console log is much more structured than it appears • Anomaly detection • Unusual log messages often indicate the source of the problem

  4. Workflow • Log parsing • Convert a log message from unstructured text to a data structure • Feature creation • Construct the state ratio vector and the message count vector features • Anomaly detection • Principal Component Analysis (PCA)-based anomaly detection method • Visualization • Decision tree

  5. Workflow

  6. Log Parsing with Source Code • Difficulty: extracting message templates automatically • C language • fprintf(LOG, "starting: xact %d is %s") • Java • CLog.info("starting: " + txn) • Not easy to distinguish variables from states
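To make the template idea concrete, here is a minimal sketch (not the authors' parser) that turns the C format string from this slide into a regular expression and uses it to pull the variable fields out of a log line. The helper name `template_to_regex`, the sample log line, and the choice to support only `%d` and `%s` are assumptions made for illustration.

```python
import re

def template_to_regex(template: str) -> re.Pattern:
    """Build a regex with one capture group per printf-style placeholder.

    Hypothetical helper: only %d and %s are handled in this sketch.
    """
    # Splitting on a capturing group keeps the placeholders in the result,
    # so the literal text in between can be escaped safely.
    parts = re.split(r"(%d|%s)", template)
    pattern = ""
    for part in parts:
        if part == "%d":
            pattern += r"(-?\d+)"   # integer field
        elif part == "%s":
            pattern += r"(\S+)"     # whitespace-free string field
        else:
            pattern += re.escape(part)
    return re.compile("^" + pattern + "$")

regex = template_to_regex("starting: xact %d is %s")

# Assumed sample log line, made up for this example.
match = regex.match("starting: xact 325 is COMMITTING")
fields = match.groups() if match else None
print(fields)
```

With the C format string the placeholders are visible in the template itself. The Java call on the same slide concatenates an object, which is why the next slide turns to type information from the source code.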

  7. Parsing Approach-Source Code • Generate the source code’s abstract syntax tree (AST) • Use the AST to identify all method calls on objects of the logger classes (or their subclasses) • Deduce the types of variables in message templates

  8. Parsing Approach-Source Code

  9. Parsing Approach-Log • Build an Apache Lucene reverse index of the message templates • Implemented as a Hadoop map-reduce job • Replicate the index to every node and partition the logs • The map stage performs the reverse-index search • The reduce stage's processing depends on the features to be constructed

  10. Parsing Approach

  11. Feature Creation • The state ratio vector • Each state ratio vector: a group of state variables in a time window • The message count vector • Each vector dimension: a different message type • Value of the dimension: the number of messages of that type that appear in the message group
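The message count vector described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name `message_count_vectors`, the group identifiers, and the message type names are all hypothetical.

```python
from collections import Counter

def message_count_vectors(events, message_types):
    """Build one count vector per message group.

    events: iterable of (group_id, message_type) pairs (assumed input shape).
    message_types: fixed ordering of message types, one per vector dimension.
    """
    counts = {}
    for group_id, msg_type in events:
        counts.setdefault(group_id, Counter())[msg_type] += 1
    # A Counter returns 0 for message types a group never produced.
    return {g: [c[t] for t in message_types] for g, c in counts.items()}

# Made-up events: two message groups and three message types.
events = [
    ("blk_1", "receiving"), ("blk_1", "received"), ("blk_1", "receiving"),
    ("blk_2", "receiving"), ("blk_2", "deleting"),
]
types = ["receiving", "received", "deleting"]
vectors = message_count_vectors(events, types)
print(vectors)
```

Stacking these vectors row by row gives the matrix that the anomaly detection step operates on.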

  12. Feature Creation-The message count vector

  13. Anomaly Detection-Principal Component Analysis (PCA)

  14. Anomaly Detection-Principal Component Analysis (PCA) • Applied Term Frequency / Inverse Document Frequency (TF-IDF) • Replace each entry $y_{i,j}$ with a weighted entry $w_{i,j} \equiv y_{i,j} \log(n / df_j)$, where $df_j$ is the total number of message groups that contain the $j$-th message type
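A minimal NumPy sketch of this step, under stated assumptions: rows of the matrix are message groups, columns are message types, and n is taken to be the total number of message groups. The slides name PCA-based detection but give no scoring rule, so the squared residual outside the top-k principal subspace is used here as an assumed anomaly score. The function names, the value of k, and the toy data are illustrative, and no detection threshold is computed.

```python
import numpy as np

def tfidf_weight(Y):
    """Apply w[i, j] = y[i, j] * log(n / df[j]) to a count matrix Y."""
    n = Y.shape[0]                        # assumed: number of message groups
    df = np.count_nonzero(Y, axis=0)      # groups containing each message type
    idf = np.log(n / np.maximum(df, 1))   # guard against unused message types
    return Y * idf

def residual_scores(W, k):
    """Squared distance of each row from the top-k principal subspace."""
    centered = W - W.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    P = Vt[:k].T                          # columns are the top-k directions
    residual = centered - centered @ P @ P.T
    return np.sum(residual ** 2, axis=1)

# TF-IDF on a tiny made-up count matrix: column 0 appears in every group,
# so its weight log(3/3) is zero.
Y = np.array([[1.0, 0.0], [2.0, 0.0], [1.0, 3.0]])
W_small = tfidf_weight(Y)

# Residual scores on made-up weighted data: ten rows follow one pattern,
# the last row departs from it in the third dimension.
W = np.array([[t, t, 0.0] for t in range(1, 11)] + [[5.0, 5.0, 4.0]])
scores = residual_scores(W, k=1)
print(int(np.argmax(scores)))
```

In this toy data the last row gets the largest score because it is the only one that leaves the dominant direction. A real deployment would also need a principled threshold on the score, which this sketch leaves out.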

  15. Evaluation and Visualization • Data collected on Elastic Compute Cloud (EC2) • 203 nodes of HDFS and 1 node of Darkstar

  16. Evaluation and Visualization • Parsing fails when no message template can be found that matches the message and extracts its message variables.

  17. Evaluation and Visualization • With 50 nodes, the job takes less than 3 minutes; with 10 nodes, less than 10 minutes

  18. Evaluation and Visualization-Darkstar • DarkMud • Provided by the Darkstar team • Emulated 60 user clients in the DarkMud virtual world performing random operations • Ran the experiment for 4800 seconds • Injected a performance disturbance by capping the CPU from 1400 to 1800 seconds

  19. Disturbance by capping the CPU

  20. Evaluation and Visualization-Darkstar • The ratio of ABORTING to COMMITTING increases from about 1:2000 to about 1:2 • Darkstar does not adjust the transaction timeout accordingly
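The ratio tracked on this slide can be computed per time window along the following lines. This is a sketch under assumptions: the input shape (timestamp in seconds, state name), the 100-second window, and the function name are all hypothetical, and the sample events are made up rather than taken from the Darkstar experiment.

```python
from collections import Counter, defaultdict

def abort_to_commit_ratio(events, window_seconds):
    """Return {window_index: ABORTING count / COMMITTING count}.

    events: iterable of (timestamp_seconds, state) pairs (assumed input shape).
    """
    counts = defaultdict(Counter)
    for ts, state in events:
        counts[int(ts // window_seconds)][state] += 1
    ratios = {}
    for window in sorted(counts):
        commits = counts[window]["COMMITTING"]
        aborts = counts[window]["ABORTING"]
        # A window with aborts but no commits is reported as infinity.
        ratios[window] = aborts / commits if commits else float("inf")
    return ratios

# Made-up events: a quiet first window, then one abort per two commits.
events = [
    (10, "COMMITTING"), (20, "COMMITTING"), (30, "COMMITTING"),
    (110, "COMMITTING"), (120, "ABORTING"), (130, "COMMITTING"),
]
ratios = abort_to_commit_ratio(events, window_seconds=100)
print(ratios)
```

A sudden jump in this ratio between consecutive windows is the kind of change the state ratio feature is meant to expose.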

  21. Evaluation and Visualization-Darkstar • Augmented each feature vector with the timestamp of the last message in that group

  22. Evaluation and Visualization-Hadoop

  23. Evaluation and Visualization-Hadoop

  24. Evaluation and Visualization-Hadoop

  25. Conclusion • Using source code as a reference to understand the structure of console logs makes it possible to parse logs accurately • This opens new opportunities for turning built-in console logs into a powerful monitoring system for problem detection

  26. Thanks for your attention Q&A
