Online System Problem Detection by Mining Console Logs
Learn about using console logs for large-scale Internet service problem detection, with insights on event tracing, online detection strategies, and handling noise using PCA detection. Explore the stages and techniques involved in efficient anomaly detection.
Presentation Transcript
Online System Problem Detection by Mining Console Logs
Wei Xu*, Ling Huang†, Armando Fox*, David Patterson*, Michael Jordan*
*UC Berkeley  †Intel Labs Berkeley
Why console logs?
• Detecting problems in large-scale Internet services often requires detailed instrumentation
• Instrumentation can be costly to insert and maintain
  • High code churn
  • Services often combine open-source building blocks that are not all instrumented
• Can we use console logs in lieu of instrumentation?
  + Easy for developers, so nearly all software has them
  - Imperfect: not originally intended as instrumentation
Problems we are looking for
[Figure: a normal trace "receiving blk_1 → received blk_1" next to a trace where "receiving blk_2" never completes - what is wrong with blk_2?]
• The easy case: rare messages (explicit ERROR lines)
• Harder but useful: abnormal sequences of otherwise-normal messages
Overview and Contributions
[Pipeline: free-text logs from 200 nodes → Parsing* → frequent-pattern-based filtering → dominant cases flagged OK; non-pattern events → PCA detection → normal cases flagged OK, real anomalies flagged ERROR]
• Accurate online detection with small latency
* Large-scale system problem detection by mining console logs (SOSP '09)
Constructing event traces from console logs
[Figure: interleaved messages such as "receiving blk_1", "received blk_1", "reading blk_1", "receiving blk_2", "received blk_2" are separated into one trace per block]
• Parse: message type + variables
• Group messages by identifiers (automatically discovered)
• Group ~= event trace
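A minimal sketch of the parse-and-group step (not the authors' parser - the message templates, helper names, and log lines below are illustrative): each line is abstracted to an event type, and events are grouped by the block identifier they mention.

```python
import re
from collections import defaultdict

# Hypothetical message templates: each maps a raw line to an event type,
# abstracting away the variable part (the block identifier).
TEMPLATES = [
    (re.compile(r"receiving (blk_\w+)"), "RECEIVING"),
    (re.compile(r"received (blk_\w+)"), "RECEIVED"),
    (re.compile(r"reading (blk_\w+)"), "READING"),
    (re.compile(r"deleting (blk_\w+)"), "DELETING"),
    (re.compile(r"deleted (blk_\w+)"), "DELETED"),
]

def parse(line):
    """Return (event_type, identifier) or None for unmatched lines."""
    for pattern, event_type in TEMPLATES:
        m = pattern.search(line)
        if m:
            return event_type, m.group(1)
    return None

def build_traces(lines):
    """Group parsed events by identifier; each group ~= one event trace."""
    traces = defaultdict(list)
    for line in lines:
        parsed = parse(line)
        if parsed:
            event_type, block_id = parsed
            traces[block_id].append(event_type)
    return traces

logs = ["receiving blk_1", "received blk_1", "reading blk_1",
        "receiving blk_2"]
print(dict(build_traces(logs)))
# {'blk_1': ['RECEIVING', 'RECEIVED', 'READING'], 'blk_2': ['RECEIVING']}
```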
Online detection: when do we make the detection decision?
• We cannot wait for the entire trace - it can last an arbitrarily long time
• How long do we have to wait? Long enough to keep correlations; a wrong cut = a false positive
• Difficulties:
  • No obvious session boundaries
  • Inaccurate message ordering
  • Variations in session duration
Frequent patterns help determine session boundaries
• Key insight: most messages/traces are normal, so strong patterns emerge
• "Make common paths fast"
• Tolerate noise
Two-stage detection overview
[Pipeline: free-text logs from 200 nodes → Parsing → frequent-pattern-based filtering → dominant cases flagged OK; non-pattern events → PCA detection → normal cases flagged OK, real anomalies flagged ERROR]
Stage 1 - Frequent patterns (1): frequent event sets
Repeat until all patterns are found:
• Coarse cut by time
• Find the frequent item set
• Refine the time estimate
[Figure: events such as "receiving blk_1", "received blk_1", "reading blk_1" fall inside the cut and match a pattern; leftover events such as "error blk_1", "deleting blk_1", "deleted blk_1" are handed to PCA detection]
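The loop above is only outlined on the slide; the sketch below is one plausible reading of it, with illustrative parameter names (`support`, `rounds`): cut every session at a coarse time window, count which event sets occur often enough to count as patterns, then tighten the window estimate from the observed durations and repeat.

```python
from collections import Counter

def mine_frequent_sets(sessions, support=0.8, rounds=3):
    """sessions: list of [(timestamp, event_type), ...] per identifier."""
    # Start with the coarsest possible cut: the longest observed time.
    window = max(t for s in sessions for t, _ in s)
    patterns = set()
    for _ in range(rounds):
        counts = Counter()
        durations = []
        for s in sessions:
            inside = [e for t, e in s if t <= window]  # coarse cut by time
            counts[frozenset(inside)] += 1
            if inside:
                durations.append(max(t for t, _ in s if t <= window))
        # Keep event sets that appear in enough sessions.
        for event_set, n in counts.items():
            if n / len(sessions) >= support:
                patterns.add(event_set)
        # Refine the window from a high percentile of matched durations
        # (a crude stand-in for the talk's duration model below).
        if durations:
            durations.sort()
            window = durations[int(0.9995 * (len(durations) - 1))]
    return patterns, window

sessions = [[(0, "receiving"), (2, "received"), (5, "reading")],
            [(0, "receiving"), (3, "received"), (6, "reading")],
            [(0, "receiving"), (90, "deleted")]]
print(mine_frequent_sets(sessions, support=0.6))
```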
Stage 1 - Frequent patterns (2): modeling session duration
[Figure: histogram of session durations and the tail probability Pr(X >= x)]
• Assuming a Gaussian? The 99.95th-percentile estimate is off by half, causing 45% more false alarms
• Instead: a mixture distribution - power-law tail + histogram head
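A hedged sketch of such a duration model, assuming "power-law tail + histogram head" means fitting a Pareto tail above a cutoff (via the standard Hill estimator) and reading lower quantiles straight off the empirical data; `tail_frac` and the cutoff rule are illustrative, not from the talk.

```python
import numpy as np

def tail_percentile(durations, q=0.9995, tail_frac=0.1):
    """Estimate a high duration percentile with a power-law tail fit."""
    x = np.sort(np.asarray(durations, dtype=float))
    xmin = x[int((1 - tail_frac) * len(x))]   # assumed tail cutoff
    tail = x[x >= xmin]
    # Hill estimator for the power-law exponent alpha.
    alpha = 1.0 + len(tail) / np.sum(np.log(tail / xmin))
    p_exceed = 1.0 - q
    if p_exceed >= tail_frac:                 # quantile falls in the head:
        return x[int(q * (len(x) - 1))]       # use the empirical histogram
    # Invert the tail CCDF  P(X >= x) = tail_frac * (x / xmin)^(1 - alpha).
    return xmin * (p_exceed / tail_frac) ** (1.0 / (1.0 - alpha))

rng = np.random.default_rng(0)
sample = rng.pareto(2.5, 100_000) + 1.0       # heavy-tailed toy durations
print(tail_percentile(sample))                # far beyond a Gaussian estimate
```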
Stage 2 - Handling noise with PCA detection
[Pipeline: non-pattern events left over from stage 1 feed the PCA detector, which separates normal cases (OK) from real anomalies (ERROR)]
• Principal Component Analysis (PCA) based detection
• More tolerant to noise
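A sketch of PCA-based detection in the spirit of the SOSP '09 paper: each trace becomes a message-count vector, the top-k principal components of normal data define the "normal" subspace, and traces whose residual (squared prediction error) is too large are flagged. The percentile threshold here is a simplification; the paper uses a Q-statistic.

```python
import numpy as np

def fit_normal_subspace(X_train, k=2):
    """X_train: (n_traces, n_event_types) message-count matrix."""
    mu = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    return mu, Vt[:k].T                      # mean + top-k principal components

def spe_scores(X, mu, P):
    Xc = X - mu
    residual = Xc - Xc @ P @ P.T             # part outside the normal subspace
    return np.sum(residual ** 2, axis=1)     # squared prediction error

rng = np.random.default_rng(1)
train = rng.poisson(5, size=(1000, 6)).astype(float)   # ordinary count vectors
mu, P = fit_normal_subspace(train)
threshold = np.percentile(spe_scores(train, mu, P), 99.5)  # simplified Q-statistic
anomaly = np.array([[50.0, 0, 0, 0, 0, 0]])             # one abnormal trace
print(spe_scores(anomaly, mu, P)[0] > threshold)        # True: flagged
```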
Frequent pattern matching filters most of the normal events
[Pipeline with volumes: 100% of parsed events enter; 86% match frequent patterns and are filtered as dominant cases; the remaining ~14% (13.97% non-pattern events) go to PCA detection; 0.03% are flagged as real anomalies]
Evaluation setup
• Hadoop file system (HDFS)
• Experiments on Amazon's EC2 cloud
  • 203 nodes x 48 hours
  • Running standard MapReduce jobs
  • ~24 million lines of console logs
• 575,000 traces, ~680 distinct ones
• Manual labels from previous work
  • Normal/abnormal + why it is abnormal
  • "Eventually normal" - the labels did not consider time
  • Used for evaluation only
Frequent patterns in HDFS (total events ~20 million)
• The patterns cover most messages
• Short durations
Detection latency
[Figure: latency distribution per event category - frequent pattern (matched), single-event pattern, frequent pattern (timed out), non-pattern events]
• Detection latency is dominated by the wait time
Detection accuracy (total traces = 575,319)
• Ambiguity about what counts as "abnormal"
• Manual labels mark traces "eventually normal"
• >600 false positives in online detection are due to very long latency
  • E.g., a write session that takes >500 sec to complete (the 99.99th percentile is 20 sec)
Future work
• Distributed log-stream processing
  • Handle large-scale clusters + partial failures
• Clustering alarms
• Allowing feedback from operators
• Correlating logs from multiple applications/layers
Summary
[Pipeline recap: 24 million lines of free-text logs from 200 nodes → parsing → frequent-pattern-based filtering (dominant cases OK) → PCA detection (normal cases OK, real anomalies ERROR)]
http://www.cs.berkeley.edu/~xuw/
Wei Xu <xuw@cs.berkeley.edu>