Statistical Approaches for Finding Bugs in Large-Scale Parallel Systems

Statistical Approaches for Finding Bugs in Large-Scale Parallel Systems Leonardo R. Bachega

Papers • Problem Diagnosis in Large-Scale Computing Environments, A. Mirgorodskiy, N. Maruyama, Barton Miller, SC 2006; • DMTracker:Finding Bugs in Large-Scale Parallel Programs by Detecting Anomaly in Data Movements, Q. Gao, F. Qin, D. Panda, SC 2007.

Motivation for the Papers • Debugging is a very hard task • ½ of the development time in sequential applications • Problem gets magnified in systems with hundreds of processes • Massively parallel systems becoming popular • How do we make parallel debugging easier by leveraging statistical bug detection techniques?

Background Statistical Techniques • Explore properties likely to hold at certain program points • Run-time information collected in traces • Empirical Execution models (profiles): Built from trace information • Find similarities (and dissimilarities) between profiles • Classification into groups • Outliers as suspects for buggy behavior • Assumption:Correct behavior is the common case, faulty behavior is unusual - a deviation from the common case

Anomalous behavior Paper 1: Miller’s Proc 1 Proc 2 Proc 3 Proc N-1 Proc N … Processes performing similar tasks

Paper’s Main Ideas • Unusual process behavior detection by comparison with other processes • “Control flow” trace collection • Function call information • Per process trace analysis • Fail-stop: Processes that stop generating traces • Distance-based outlier detection: isolate processes that behave differently (non-fail-stop)

Fault Model • Non-deterministic fail-stop failures • failing process stop collecting traces earlier • Infinite loops • process spends unusual amount of time in a particular function • Deadlock, livelock, starvation • deadlocked procs stop generating traces • Starving procs spend time in different parts than procs with resources granted • Load imbalance • Unusual little time spent on certain parts • Analyst identifies

Limitations of Fault Model A problem that… • Happens in all nodes is considered normal behavior • Doesn’t change the ctrl flow is not detected • Happens too early can’t be tracked since the trace collection is limited (can’t go too far back in history)

Finding Misbehaving Host • Earliest Last Timestamp • Identifies host that stopped generating the trace • Fail-stop problems: crashes, infinite blocking • Assume global clock synchronization: |Tmin – Tavg| > threshold • Behavioral Outliers • Identify traces different from the rest • Distance-based outlier detection • Pair-wise distance between traces • Suspect score for each process

Profile’s distance metrics Time spent at f1 in host h Manhattan distance If h and g are similar: each function will consume similar amounts of time on both hosts and d(g,h) will be low

Behavioral Outliers K-nearest neighbor algorithm: • Consider all common behaviors as normal • Parameter k adjusts the common behavior • Score: high for outliers, low for common behavior

Finding Anomalies’ Causes • Last Trace Entry: function that failed • Can be misleading • Solution: look at sequences of calls • Max of Delta Vector: Function that differs most from the normal behavior (largest contribution to suspect score) • Anomalous time interval: • partition traces from all hosts in short intervals • Apply outlier detection: identify earliest fragment with outlier

Results • Network stability problem • Fail-stop behavior • One node stops 500 seconds earlier than others • Earliest timestamp approach • Broadcast service • No fail-stop behavior • Suspect score from failed run traces

Summary and Conclusions • Trace analysis to explain failures in large-scale distributed systems • Detect anomalies rather than massive failures • Identify both fail-stop and non-fail-stop anomalous behavior

Anomalous behavior Paper 2: DMTracker Proc 1 Proc 2 Proc 3 Proc N-1 Proc N … Spatial Dissimilarity Processes performing similar tasks Proc 1 Proc 2 Proc 3 Temporal Dissimilarity Proc N-1 Proc N … Processes performing similar tasks

Paper’s Main Ideas • Tracks abnormal behaviors in data movements (DM) • Works on Data movement chains: memory allocation, copies, sends/receives • Extract DM-invariants and check for violation of these invariants • Violations indicate potential bugs • Two types of invariants: • Temporal: frequently occurring data movements (Frequent chain or FC) • Spatial: clusters data movements across processes (Chain distribution or CD)

Data Movement Chains Multi-processor DMs Single processor DMs Concatenation of memory operations of a trace file Match Sends/Receives from processes’ traces

Key: Data Movement Chain Buggy Execution Normal Execution

Data Movement-Based Invariants • FC-invariant based: temporal similarity • Similar DM-chains occur many times during execution • Large groups (frequently happening) of DM-chains • CD-invariant based: spatial similarity • Processes perform similar or identical tasks • Chain distribution clusters as CD-invariants

DMTracker: Design Overview Function calls Memory mgmt: allocation/deallocation Data Movement: copies/network operations Records Key arguments / return values Call sites Thread IDs Local timestamps Correlates each operation to its source and destination

Invariants generation • Groups formed by chains of same type • Chains of same type have the same • call sites for individual DM operations • allocation call sites for source and destination buffers

FC-Invariants • Two criteria for invariants • Chains in the group must happens frequently • Chain type of each group must be “unique” • Uniqueness of chain: aggregation of uniqueness values of memory operations # of segments of data Tunable parameters

FC-Invariant Anomaly Detection • Abnormality of P compared to C based in • Combined using harmonic mean: • Threshold for abnormality is an adjustable parameter

CD-Invariants • Clusters of chain distributions across processes – one profile per trace (process) • DM chains in a particular trace • DM chains originated in a particular trace • Profile: frequency of chains in a trace profile: • K-nearest neighbor used to build invariants (clusters) Total # of Chains in trace T Total # of chains of group C2 in trace T Total # of distinct chain groups

CD-Invariant Anomaly Detection • Abnormal trace: distance to k-nearest neighbor exceeds threshold • Exactly the same procedure as in paper1!

DMTracker Results • FC-Invariant (15,075 times) violated by similar chains: 154 times • All processes triggered the bug • CD-Invariant: catches non-deterministic bug

DMTracker Summary • Data Movement chains derived from traces • Frequency Chain and Chain Distribution invariants to capture temporal and spatial correlations in parallel system • Study cases show bug detection

General Observations • Use of spatial and temporal invariants • Detection of deviant behavior as opposed to common behavior • Simple Machine Learning techniques applied for data classification • Bug detection in large systems using outlier detection • Very few results to support broad conclusions about the effectiveness of the techniques

Statistical Approaches for Finding Bugs in Large-Scale Parallel Systems

Statistical Approaches for Finding Bugs in Large-Scale Parallel Systems

Presentation Transcript

Efficient Parallel Software for Large-Scale Semidefinite Programs

Large-scale adaptive systems

Large-Scale Distributed Systems

Large-scale adaptive systems

Large-scale adaptive systems

Large-scale adaptive systems

Large-scale adaptive systems

Large-scale Hybrid Parallel SAT Solving

Large Scale Parallel Print Service

Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Large-Scale Distributed Systems

Large Scale Computing Systems

Large Scale File Systems

Large Scale Crawling the Web for Parallel Texts

Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Large-scale adaptive systems

Large-Scale Systems

Parallel Subgraph Listing in a Large-Scale Graph

Large Scale Parallel Print Service

UML - Mechanisms for Large Scale Systems