
Failure Data Collection and Analysis


Presentation Transcript


  1. Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu

  2. Motivation (1) My machine crashes… Since 3/1/04… • 3 system crashes • 18 application errors • 96 application hangs Who cares? • I do! • People who share similar experiences • In general, customer uproar

  3. Motivation (2) An Internet service has failures… Who cares? • Internet service users • Internet service system administrators • Anyone affected by the IS’s loss of revenue Total: 61 user-visible failures in 12 months at Online Service

  4. Motivation (3) • ROC/RADS needs real failure/attack information • to drive benchmarks • to evaluate our prototypes • to help us select what to work on / attack

  5. Data Sources • 1000s of individual machines • Cory/Soda Hall, BOINC • Large clusters at real Internet services • Internet services • Distributed applications on 100s of machines • PlanetLab

  6. Individual Machines

  7. Data Collection • Collect minidumps that contain… • the Stop message/parameters/data • loaded drivers • processor context for the processor that stopped • process info/kernel context for the process/thread that stopped • the kernel-mode call stack for the thread that stopped • Frequency of collection • synchronized with application and system crashes on the monitored computers

  8. Analysis results • What was immediately responsible for the crash • exact error code • brief description, primarily for debugging • Bucketing info, e.g. "driver fault" • Details for debugging, e.g. stack contents • Use Microsoft’s publicly available analysis tools (a sketch of driving them in batch follows below) • Caveat: significant variability in results between the internal and public versions of the tool!
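
The slides do not show how the public tool is invoked; the following is only a minimal sketch, assuming the Debugging Tools for Windows (cdb.exe) are installed and that the collected minidumps sit in a local directory. Both paths below are placeholders, and the scraped field names are just the summary fields that `!analyze -v` typically prints.

```python
# Sketch: batch-run Microsoft's command-line debugger over collected minidumps
# and scrape a few summary fields from the "!analyze -v" output.
# Assumptions: cdb.exe install path and dump directory are placeholders.
import re
import subprocess
from pathlib import Path

CDB = r"C:\Program Files\Debugging Tools for Windows\cdb.exe"  # assumed install path
DUMP_DIR = Path(r"C:\cer_share\minidumps")                     # assumed dump location

def analyze(dump: Path) -> dict:
    """Run `!analyze -v` on one minidump and scrape a few summary fields."""
    out = subprocess.run(
        [CDB, "-z", str(dump), "-c", "!analyze -v; q"],
        capture_output=True, text=True, timeout=300,
    ).stdout
    fields = {}
    for key in ("BUGCHECK_STR", "DEFAULT_BUCKET_ID", "MODULE_NAME", "IMAGE_NAME"):
        m = re.search(rf"{key}:\s*(\S+)", out)
        if m:
            fields[key] = m.group(1)
    return fields

if __name__ == "__main__":
    for dump in sorted(DUMP_DIR.glob("*.dmp")):
        print(dump.name, analyze(dump))
```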

  9. How we collect minidumps (1) Corporate Error Reporting http://www.microsoft.com/resources/satech/cer/ • Manage error reports/messages generated by WER and other programs • Configure clients to redirect reports to the CER shared directory

  10. Sample Statistics (25 nodes, 5 days)

  11. Sample Statistics (25 nodes, 5 days)

  12. How we collect minidumps (2) BOINC • For SETI@home-esque apps that pool resources • Provides a client API to send/receive data to/from the BOINC server • Write tools to read info in the minidump directory and send it to us (a sketch of such a tool follows below)
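
A minimal sketch of such a forwarding tool, assuming the default Windows minidump directory and a hypothetical staging directory that the BOINC client uploads from; both paths are placeholders, and the BOINC client API itself is not shown.

```python
# Sketch: copy any minidump we have not forwarded yet into an upload staging
# area, remembering what was already sent in a simple text log.
import shutil
from pathlib import Path

MINIDUMP_DIR = Path(r"C:\Windows\Minidump")     # assumed default minidump location
OUTBOX = Path(r"C:\boinc_project\upload")       # hypothetical BOINC upload staging dir
SENT_LOG = OUTBOX / "sent.txt"

def forward_new_dumps() -> None:
    """Copy new minidumps into the staging area and record their names."""
    OUTBOX.mkdir(parents=True, exist_ok=True)
    sent = set(SENT_LOG.read_text().splitlines()) if SENT_LOG.exists() else set()
    for dump in MINIDUMP_DIR.glob("*.dmp"):
        if dump.name not in sent:
            shutil.copy2(dump, OUTBOX / dump.name)
            sent.add(dump.name)
    SENT_LOG.write_text("\n".join(sorted(sent)))

if __name__ == "__main__":
    forward_new_dumps()
```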

  13. Sample Statistics (50 system crashes)

  14. Sample Statistics (50 system crashes) Faulting module (number of crashes): • watchdog.sys 7 • ar5211.sys 6 • ibmpmdrv.sys 6 • ati3duag.dll 5 • SYMEVENT.SYS 3 • ipsecw2k.sys 3 • memory_corruption 3 • CLASSPNP.SYS 2 • ialmdev5.DLL 2 • ntoskrnl.exe 2 • PSCRIPT4.DLL 2 • win32k.sys 2 • drvnddm.sys 1 • ino_fltr.sys 1 • ks.sys 1 • ntkrnlmp.exe 1 • Pool_Corruption 1 • SynTP.sys 1 • TDI.SYS 1
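
As a small illustration, a per-module tally like the one above could be produced from the per-dump analysis results; the IMAGE_NAME field and the sample values here are hypothetical, matching the earlier analysis sketch.

```python
# Sketch: count how many dumps blame each faulting module/image.
from collections import Counter

def tally_modules(analyses: list[dict]) -> Counter:
    """Return crash counts keyed by the module each analysis blamed."""
    return Counter(a.get("IMAGE_NAME", "unknown") for a in analyses)

if __name__ == "__main__":
    # Made-up analysis results, only to show the output format.
    sample = [{"IMAGE_NAME": "watchdog.sys"}, {"IMAGE_NAME": "ar5211.sys"},
              {"IMAGE_NAME": "watchdog.sys"}]
    for module, n in tally_modules(sample).most_common():
        print(f"{module:20s} {n}")
```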

  15. Metrics (Windows & Linux) • Availability • system uptime, % time BOINC running • CPU(s) • # processes, processor queue length, % non-idle • Memory • available physical memory, free swap space • Disk(s) • free space • Network(s) • IP address, packets & bytes sent & received per second, bandwidth to/from the SETI@home server, first-hop bandwidth*, network coordinates* • Static • CPU type, #, and benchmarks; total memory; OS type
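
A minimal sketch of sampling a few of these metrics on a Linux node via /proc and statvfs; the Windows counterparts and the starred network metrics are omitted.

```python
# Sketch: sample uptime, load, process counts, memory, swap, and disk free space.
import os
import time

def sample_metrics() -> dict:
    """Read a handful of the listed metrics from /proc and statvfs."""
    with open("/proc/uptime") as f:
        uptime_s = float(f.read().split()[0])
    with open("/proc/loadavg") as f:
        load1, _, _, procs, _ = f.read().split()   # e.g. "0.20 0.18 0.12 1/80 11206"
    meminfo = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            meminfo[key] = int(value.split()[0])    # values reported in kB
    disk = os.statvfs("/")
    return {
        "timestamp": time.time(),
        "uptime_s": uptime_s,
        "load_1min": float(load1),
        "running/total_procs": procs,
        "mem_free_kb": meminfo.get("MemFree", 0),
        "swap_free_kb": meminfo.get("SwapFree", 0),
        "disk_free_mb": disk.f_bavail * disk.f_frsize // 2**20,
    }

if __name__ == "__main__":
    print(sample_metrics())
```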

  16. Questions • Other metrics? • Frequency with which to measure them? • What research questions can we answer with this data set? • original goal: workload to evaluate our node discovery service • evaluate effectiveness of network coordinates • evaluate potential to run more than just “embarrassingly parallel” apps on this type of infrastructure depending on • machines’ uptime • network connectivity • available disk space • distributed analysis? • security uses?

  17. Internet Services

  18. Data characteristics • Real companies • Multitude of users • Voluminous data (several terabytes) • Systems are complex • Treat as a black box • Use statistical learning (SLT) algorithms for analysis • More data => better models

  19. Analysis Results • Study event logs • Not necessarily failures • Can derive models of good & bad behavior • Models with varying granularity • Use different algorithms • Vary boundary parameters • For more details see poster: “Towards a General Approach for Event Log Analysis”
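
The poster, not these slides, covers the actual algorithms; purely as an illustration, one possible model of "good" behavior is a profile of event-type frequencies per time window, with new windows scored by how far their distribution drifts from that profile. The event names below are made up.

```python
# Sketch of a toy event-log model: learn average event-type frequencies from
# known-good windows, then score a new window by its distance from that profile.
from collections import Counter
from math import sqrt

def profile(windows: list[list[str]]) -> dict[str, float]:
    """Average relative frequency of each event type over known-good windows."""
    totals = Counter()
    for events in windows:
        counts = Counter(events)
        n = sum(counts.values()) or 1
        for etype, c in counts.items():
            totals[etype] += c / n
    return {etype: s / len(windows) for etype, s in totals.items()}

def anomaly_score(events: list[str], good: dict[str, float]) -> float:
    """Euclidean distance between a window's event distribution and the good profile."""
    counts = Counter(events)
    n = sum(counts.values()) or 1
    keys = set(good) | set(counts)
    return sqrt(sum((counts[k] / n - good.get(k, 0.0)) ** 2 for k in keys))

if __name__ == "__main__":
    good_windows = [["login", "read", "read"], ["login", "read", "write"]]
    print(anomaly_score(["error", "error", "read"], profile(good_windows)))
```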

  20. Distributed Apps

  21. PlanetLab • “An open platform for developing, deploying, and accessing planetary-scale services” • 392 nodes at 164 sites around the world • Per-site system administration • Applications: OceanStore, PIER

  22. Why? • Platform for injecting faults and testing our algorithms • Applications in a RADS-like environment • Research platform • More accessible • University-developed apps are most likely to be tested on PlanetLab

  23. Applications 1) OceanStore • Global persistent data store. • In the process of running prototype on PlanetLab • Good source of failure data 2) PIER • Distributed query processor • Currently running on PlanetLab • Good source of failure data + analysis engine

  24. What do we do with these apps? • Instrument applications to collect any type of information (a sketch of simple instrumentation follows below) • Choice of granularity • Open source - no longer a black box • Can modify them as much as necessary
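
A minimal sketch of such instrumentation, here a Python decorator that logs call latency and exceptions to a placeholder local file; a real deployment would ship the data off-node, and the wrapped function is only a stand-in for application logic.

```python
# Sketch: wrap application functions to record duration and any raised exception.
import functools
import logging
import time

logging.basicConfig(filename="app_failures.log", level=logging.INFO)  # placeholder sink

def instrumented(func):
    """Log duration of each call and any exception it raises."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return func(*args, **kwargs)
        except Exception:
            logging.exception("failure in %s after %.3fs", func.__name__, time.time() - start)
            raise
        finally:
            logging.info("%s took %.3fs", func.__name__, time.time() - start)
    return wrapper

@instrumented
def lookup_block(block_id: str) -> bytes:
    ...  # stand-in for application logic (e.g. an OceanStore/PIER-style request)
    return b""
```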

  25. Questions • What other applications can we use? • What should we measure and model? • What information is useful for industry? • Do you have any failure/attack data you are willing to share with us?
