
Statistical Approaches for Finding Bugs in Large-Scale Parallel Systems

Statistical Approaches for Finding Bugs in Large-Scale Parallel Systems. Leonardo R. Bachega. Papers. Problem Diagnosis in Large-Scale Computing Environments , A. Mirgorodskiy, N. Maruyama, Barton Miller , SC 2006;


Presentation Transcript


  1. Statistical Approaches for Finding Bugs in Large-Scale Parallel Systems Leonardo R. Bachega

  2. Papers • Problem Diagnosis in Large-Scale Computing Environments, A. Mirgorodskiy, N. Maruyama, Barton Miller, SC 2006; • DMTracker: Finding Bugs in Large-Scale Parallel Programs by Detecting Anomaly in Data Movements, Q. Gao, F. Qin, D. Panda, SC 2007.

  3. Motivation for the Papers • Debugging is a very hard task • It takes about ½ of the development time in sequential applications • The problem is magnified in systems with hundreds of processes • Massively parallel systems are becoming popular • How do we make parallel debugging easier by leveraging statistical bug detection techniques?

  4. Background: Statistical Techniques • Explore properties likely to hold at certain program points • Run-time information collected in traces • Empirical execution models (profiles): built from trace information • Find similarities (and dissimilarities) between profiles • Classification into groups • Outliers as suspects for buggy behavior • Assumption: correct behavior is the common case; faulty behavior is unusual, a deviation from the common case

  5. Paper 1: Miller’s [Figure: processes Proc 1, Proc 2, Proc 3, …, Proc N-1, Proc N performing similar tasks, with anomalous behavior marked]

  6. Paper’s Main Ideas • Unusual process behavior detection by comparison with other processes • “Control flow” trace collection • Function call information • Per process trace analysis • Fail-stop: Processes that stop generating traces • Distance-based outlier detection: isolate processes that behave differently (non-fail-stop)

  7. Fault Model • Non-deterministic fail-stop failures • A failing process stops collecting traces earlier • Infinite loops • The process spends an unusual amount of time in a particular function • Deadlock, livelock, starvation • Deadlocked procs stop generating traces • Starving procs spend time in different parts than procs with resources granted • Load imbalance • Unusually little time spent on certain parts • Analyst identifies

  8. Limitations of Fault Model A problem that… • Happens on all nodes is considered normal behavior • Doesn’t change the control flow is not detected • Happens too early can’t be tracked, since the trace collection is limited (can’t go too far back in history)

  9. Finding Misbehaving Host • Earliest Last Timestamp • Identifies host that stopped generating the trace • Fail-stop problems: crashes, infinite blocking • Assume global clock synchronization: |Tmin – Tavg| > threshold • Behavioral Outliers • Identify traces different from the rest • Distance-based outlier detection • Pair-wise distance between traces • Suspect score for each process
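
The earliest-last-timestamp check on this slide can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the dictionary input, and the sample timestamps are all assumptions, and the timestamps are taken to come from a globally synchronized clock, as the slide states.

```python
from statistics import mean


def find_fail_stop_host(last_timestamps, threshold):
    """Return the host whose trace ended earliest, or None if none stands out.

    last_timestamps: dict mapping host name -> timestamp of its last trace
    entry (hypothetical input format; a synchronized global clock is assumed).
    """
    t_avg = mean(last_timestamps.values())
    # Host with the earliest last timestamp (Tmin on the slide).
    host_min = min(last_timestamps, key=last_timestamps.get)
    t_min = last_timestamps[host_min]
    # Flag the host only if |Tmin - Tavg| exceeds the threshold.
    if abs(t_min - t_avg) > threshold:
        return host_min
    return None


# Made-up example: one host stops roughly 500 seconds before the others.
example = {"h1": 1000.0, "h2": 1002.0, "h3": 500.0, "h4": 1001.0}
suspect = find_fail_stop_host(example, threshold=100.0)
```

The threshold is a tuning knob: too small and ordinary skew in finishing times is flagged, too large and genuine early stops are missed.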

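  10. Profile Distance Metric • Profile of host h: vector $(t_{h,1}, \dots, t_{h,F})$, where $t_{h,i}$ is the time spent at function $f_i$ in host h • Manhattan distance: $d(g,h) = \sum_{i=1}^{F} |t_{g,i} - t_{h,i}|$ • If h and g are similar, each function consumes similar amounts of time on both hosts and d(g,h) is low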
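
As a small illustration of the Manhattan distance between two per-host profiles, the sketch below assumes each profile is a dictionary from function name to time spent there; that representation and the sample values are assumptions made for the example.

```python
def manhattan_distance(profile_g, profile_h):
    """Sum of absolute per-function time differences between two profiles.

    Each profile maps function name -> time spent in that function
    (hypothetical format). A function missing from a profile counts as 0.
    """
    functions = set(profile_g) | set(profile_h)
    return sum(
        abs(profile_g.get(f, 0.0) - profile_h.get(f, 0.0)) for f in functions
    )


# Made-up profiles: the hosts differ only slightly, so the distance is small.
g = {"main": 1.0, "send": 4.0}
h = {"main": 1.0, "send": 3.5, "recv": 0.5}
d = manhattan_distance(g, h)
```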

  11. Behavioral Outliers K-nearest neighbor algorithm: • Consider all common behaviors as normal • Parameter k adjusts the common behavior • Score: high for outliers, low for common behavior
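
One simple way to turn the k-nearest-neighbor idea into a suspect score is to use each process's distance to its k-th nearest neighbor. The sketch below assumes that scoring rule and fixed-length tuple profiles; the paper's exact scoring may differ.

```python
def manhattan(p, q):
    """Manhattan distance between two equal-length profile vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def suspect_scores(profiles, k):
    """Score each host by the distance to its k-th nearest neighbor.

    profiles: dict host -> tuple of per-function times (hypothetical format).
    A host in a group of at least k+1 similar hosts gets a low score;
    an isolated host gets a high one.
    """
    if not 1 <= k < len(profiles):
        raise ValueError("k must be between 1 and len(profiles) - 1")
    scores = {}
    for host, p in profiles.items():
        dists = sorted(
            manhattan(p, q) for other, q in profiles.items() if other != host
        )
        scores[host] = dists[k - 1]
    return scores


# Made-up data: three hosts behave identically, one does not.
profiles = {
    "a": (1.0, 0.0),
    "b": (1.0, 0.0),
    "c": (1.0, 0.0),
    "d": (0.0, 1.0),
}
scores = suspect_scores(profiles, k=2)
top_suspect = max(scores, key=scores.get)
```

Raising k widens what counts as common behavior: a small cluster of k or fewer similar hosts is then itself treated as anomalous.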

  12. Finding Anomalies’ Causes • Last Trace Entry: function that failed • Can be misleading • Solution: look at sequences of calls • Max of Delta Vector: function that differs most from the normal behavior (largest contribution to suspect score) • Anomalous time interval: • Partition traces from all hosts into short intervals • Apply outlier detection: identify the earliest fragment with an outlier
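
The max-of-delta-vector step can be sketched as below, assuming the suspect's profile is compared against a single reference profile taken to be normal. The choice of reference, the dictionary format, and the function names in the example data are assumptions.

```python
def max_delta_function(suspect, reference):
    """Return the function contributing most to the distance, plus all deltas.

    suspect, reference: dicts mapping function name -> time spent
    (hypothetical format). The delta vector holds the absolute per-function
    differences; its largest component points at the likely cause.
    """
    functions = set(suspect) | set(reference)
    delta = {
        f: abs(suspect.get(f, 0.0) - reference.get(f, 0.0)) for f in functions
    }
    worst = max(delta, key=delta.get)
    return worst, delta


# Made-up profiles: the suspect spends far longer waiting than the reference.
suspect = {"compute": 1.0, "mpi_wait": 9.0, "io": 1.0}
reference = {"compute": 5.0, "mpi_wait": 2.0, "io": 1.0}
worst, delta = max_delta_function(suspect, reference)
```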

  13. Results • Network stability problem • Fail-stop behavior • One node stops 500 seconds earlier than others • Earliest timestamp approach • Broadcast service • No fail-stop behavior • Suspect score from failed run traces

  14. Summary and Conclusions • Trace analysis to explain failures in large-scale distributed systems • Detect anomalies rather than massive failures • Identify both fail-stop and non-fail-stop anomalous behavior

  15. Paper 2: DMTracker [Figure: two panels, each showing processes Proc 1, Proc 2, Proc 3, …, Proc N-1, Proc N performing similar tasks with anomalous behavior marked; one panel is labeled Spatial Dissimilarity, the other Temporal Dissimilarity]

  16. Paper’s Main Ideas • Tracks abnormal behaviors in data movements (DM) • Works on Data movement chains: memory allocation, copies, sends/receives • Extract DM-invariants and check for violation of these invariants • Violations indicate potential bugs • Two types of invariants: • Temporal: frequently occurring data movements (Frequent chain or FC) • Spatial: clusters data movements across processes (Chain distribution or CD)

  17. Data Movement Chains • Single-processor DMs: concatenation of memory operations of a trace file • Multi-processor DMs: match sends/receives from processes’ traces

  18. Key: Data Movement Chain [Figure: data movement chains in a normal execution and in a buggy execution]

  19. Data Movement-Based Invariants • FC-invariant based: temporal similarity • Similar DM-chains occur many times during execution • Large groups (frequently happening) of DM-chains • CD-invariant based: spatial similarity • Processes perform similar or identical tasks • Chain distribution clusters as CD-invariants

  20. DMTracker: Design Overview • Function calls • Memory management: allocation/deallocation • Data movement: copies/network operations • Records • Key arguments / return values • Call sites • Thread IDs • Local timestamps • Correlates each operation to its source and destination

  21. Invariants generation • Groups formed by chains of same type • Chains of same type have the same • call sites for individual DM operations • allocation call sites for source and destination buffers
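
Grouping chains by type, as described on this slide, can be sketched as follows. The `Chain` record and its field names are hypothetical; the slide only states that chains of the same type share the call sites of their DM operations and the allocation call sites of their source and destination buffers.

```python
from collections import defaultdict
from typing import NamedTuple, Tuple


class Chain(NamedTuple):
    """Hypothetical record for one data movement chain."""
    chain_id: int
    op_call_sites: Tuple[str, ...]  # call site of each DM operation, in order
    src_alloc_site: str             # allocation call site of the source buffer
    dst_alloc_site: str             # allocation call site of the destination


def group_by_type(chains):
    """Group chains whose call sites and allocation sites all match."""
    groups = defaultdict(list)
    for chain in chains:
        chain_type = (
            chain.op_call_sites, chain.src_alloc_site, chain.dst_alloc_site
        )
        groups[chain_type].append(chain)
    return dict(groups)


# Made-up chains: the first two share a type, the third copies from elsewhere.
chains = [
    Chain(1, ("pack.c:10", "send.c:42"), "init.c:5", "init.c:9"),
    Chain(2, ("pack.c:10", "send.c:42"), "init.c:5", "init.c:9"),
    Chain(3, ("pack.c:77", "send.c:42"), "init.c:5", "init.c:9"),
]
groups = group_by_type(chains)
group_sizes = sorted(len(members) for members in groups.values())
```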

  22. FC-Invariants • Two criteria for invariants • Chains in the group must happen frequently • Chain type of each group must be “unique” • Uniqueness of a chain: aggregation of the uniqueness values of its memory operations [Formula not preserved in the transcript; its annotations: # of segments of data, tunable parameters]

  23. FC-Invariant Anomaly Detection • Abnormality of P compared to C is based on: [formulas not preserved in the transcript] • Combined using the harmonic mean: [formula not preserved in the transcript] • Threshold for abnormality is an adjustable parameter

  24. CD-Invariants • Clusters of chain distributions across processes: one profile per trace (process) • DM chains in a particular trace • DM chains originated in a particular trace • Profile: frequency of chains in a trace: $p(T) = (n_1/N, n_2/N, \dots, n_m/N)$, where $N$ is the total # of chains in trace T, $n_i$ is the total # of chains of group $C_i$ in trace T (e.g. $n_2$ for group C2), and $m$ is the total # of distinct chain groups • K-nearest neighbor used to build invariants (clusters)
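
A chain-distribution profile for one trace can be sketched as a vector of per-group frequencies. The sketch assumes each chain in the trace has already been labeled with its group, and the group labels used here are made up.

```python
from collections import Counter


def chain_distribution_profile(chain_groups_in_trace, all_groups):
    """Fraction of the trace's chains that fall in each chain group.

    chain_groups_in_trace: list with one group label per chain in the trace
    (hypothetical format). all_groups: fixed ordering of all distinct groups,
    so that profiles from different traces are comparable vectors.
    """
    total = len(chain_groups_in_trace)
    if total == 0:
        return [0.0] * len(all_groups)
    counts = Counter(chain_groups_in_trace)
    return [counts[g] / total for g in all_groups]


# Made-up trace with four chains spread over three groups.
profile = chain_distribution_profile(
    ["C1", "C2", "C2", "C3"], all_groups=["C1", "C2", "C3"]
)
```

Because every trace is mapped onto the same group ordering, the resulting vectors can be compared with a distance metric and fed to a nearest-neighbor outlier test.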

  25. CD-Invariant Anomaly Detection • Abnormal trace: distance to its k-nearest neighbor exceeds a threshold • Exactly the same procedure as in Paper 1!

  26. DMTracker Results • FC-Invariant (15,075 times) violated by similar chains: 154 times • All processes triggered the bug • CD-Invariant: catches non-deterministic bug

  27. DMTracker Summary • Data movement chains derived from traces • Frequent Chain and Chain Distribution invariants capture temporal and spatial correlations in a parallel system • Case studies show bug detection

  28. General Observations • Use of spatial and temporal invariants • Detection of deviant behavior as opposed to common behavior • Simple Machine Learning techniques applied for data classification • Bug detection in large systems using outlier detection • Very few results to support broad conclusions about the effectiveness of the techniques
