Simulation
Remzi Arpaci-Dusseau, Peter Druschel, Vivek Pai, Karsten Schwan, David Clark, John Jannotti, Liuba Shrira, Mike Dahlin, Miguel Castro, Barbara Liskov, Jeff Mogul
What was the session about?
• How do you know whether your system design or implementation meets your goals?
  • At scales larger than you can actually test
  • Over longer time frames
  • With loads/faults/changes that aren’t normally seen
• “Simulation” as a way around this
  • Stretch reality beyond what you can test directly (e.g., on a testbed)
Problems with simulation
• We don’t trust our simulators
• Can we get “real scale” rather than oversimplified simulation of scale?
• Focus has been on mathematical properties of workloads rather than on:
  • Errors
  • Unanticipated uses (attacks, semi-attacks)
  • Secondary behaviors
Two main issues
• Engines to run simulations
• Workloads/faultloads/changeloads/topologies
Engines to run simulations
• Scale issues
• Expressibility (e.g., delays, misbehavior, heterogeneity)
• Performance
• Repeatability/controllability
• Pluggability of components
• Current engines are not a good match for clusters
  • Need to fix this, because big single CPUs are history
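Repeatability is largely a matter of discipline inside the engine itself. As a minimal sketch (all names here are illustrative, not taken from any particular simulator), a discrete-event loop that draws every random quantity from one seeded generator and breaks timestamp ties deterministically will replay identically run after run:

```python
import heapq
import random

class Sim:
    """Minimal discrete-event engine; seeding the RNG makes runs repeatable."""
    def __init__(self, seed):
        self.rng = random.Random(seed)  # one per-run seed -> repeatable randomness
        self.now = 0.0
        self._queue = []                # (fire_time, seq, callback)
        self._seq = 0                   # tie-breaker keeps event order deterministic

    def schedule(self, delay, callback):
        heapq.heappush(self._queue, (self.now + delay, self._seq, callback))
        self._seq += 1

    def run(self, until):
        while self._queue and self._queue[0][0] <= until:
            self.now, _, callback = heapq.heappop(self._queue)
            callback(self)

def run_trial(seed):
    """Ten messages with random link delay; returns the delivery trace."""
    sim, trace = Sim(seed), []
    for i in range(10):
        delay = sim.rng.expovariate(1.0)  # hypothetical delay model
        sim.schedule(delay, lambda s, i=i: trace.append((round(s.now, 6), i)))
    sim.run(until=100.0)
    return trace

trace_a, trace_b = run_trial(42), run_trial(42)  # same seed -> identical traces
```

Two runs with the same seed reproduce the exact event trace; varying the seed across trials supplies the diversity the workload discussion below calls for.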
Workloads/faultloads/changeloads/topologies
• What range of things do we have to cover?
• How do we find out what happens in real life?
• Anticipating things that haven’t happened before
  • Simulating security threats (unknown worms, botnets, etc.)
• How do you manage a system that has failed?
• Metaphor: preparing the surfaces before painting the house
Approaches for engines
• Use a SETI@home-style approach
  • To get scale and some exposure to errors
  • PlanetLab is too small, too well-connected
  • “honeypots-at-home”?
  • Ask for access to Windows Update
• Fault injection
  • Trace-driven or model-based
Simulation tools require community consensus
• Otherwise reviewers don’t trust results
• Provides a shared “shorthand” for what a published simulation result means
• Need some sort of consensus-building process
  • Requires lots of effort, testing, bug-fixing
  • Tends to draw the community into one standard
*loads
• Need “repeatable reality”
• Need some diversity
• Need enough detail
  • E.g., link bandwidths, error rates
• Need well-documented assumptions
  • And a way to describe the range of these
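One way to make assumptions well-documented, and to describe their ranges, is to carry them inside the *load description itself. A hypothetical sketch (field names are invented for illustration, and the sample numbers are placeholders, not measurements):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class TopologySpec:
    """One 'repeatable reality': parameters plus documented assumption ranges."""
    name: str
    link_bandwidth_mbps: float          # assumed per-link capacity
    link_error_rate: float              # assumed packet-error probability
    assumptions: dict = field(default_factory=dict)  # ranges, provenance, caveats

    def validate(self):
        # Reject a spec whose parameters stray outside their own stated range.
        lo, hi = self.assumptions.get("link_error_rate_range", (0.0, 1.0))
        if not (lo <= self.link_error_rate <= hi):
            raise ValueError("link_error_rate outside documented range")

spec = TopologySpec(
    name="edge-dsl",
    link_bandwidth_mbps=1.5,
    link_error_rate=1e-4,
    assumptions={"link_error_rate_range": (1e-6, 1e-2),
                 "source": "anecdotal; needs real measurement"},
)
spec.validate()
```

Keeping provenance ("anecdotal" vs. "measured") next to each number makes it obvious to reviewers which parts of a published result rest on guesswork.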
So we can almost do “networks”; can we do “distributed systems”?
• What do you need beyond network details?
  • Content-sensitive behavior
  • Fault models
• How does user behavior change?
  • Especially in response to system-behavior changes
• I/O, memory, CPU, other resource constraints
• Changes in configuration
Fault injection
• Problem: disk vendors don’t tell you what they know
  • Bigger users share anecdotes, not data
  • Look at what they do, infer what problem is being solved
• “disk-fault-at-home”
  • Microsoft has Watson data – would they share?
  • Linux ought to be gathering similar data!
• Also need “behavior injection” for pre-fault sequences
• Need more methodology
  • Crazy subsystem behavior after unexpected requests
  • Better-than-random fault injection
• How do we model (collect data on) correlated faults?
  • Does scale help or hinder independence?
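A better-than-random injector can at least model correlation through shared failure domains. A hedged sketch, assuming a made-up rack-level correlation model (the probabilities and names are illustrative, not drawn from any real fault data):

```python
import random

def inject_faults(disks, racks, rng, p_disk=0.01, p_rack=0.001):
    """Draw one fault round: independent per-disk faults plus a shared
    per-rack term, so failures within one rack are correlated."""
    failed_racks = {r for r in racks if rng.random() < p_rack}
    faults = set()
    for disk, rack in disks.items():
        if rack in failed_racks or rng.random() < p_disk:
            faults.add(disk)
    return faults

rng = random.Random(1)
disks = {f"d{i}": f"rack{i % 4}" for i in range(40)}  # 40 disks, 4 racks
racks = {f"rack{r}" for r in range(4)}
rounds = [inject_faults(disks, racks, rng) for _ in range(10000)]
worst = max(len(f) for f in rounds)  # rack events produce bursts of >= 10 losses
```

Under a purely independent model, losing ten disks at once is astronomically unlikely; the shared rack term makes such bursts show up routinely, which is exactly the behavior independence assumptions hide. Replacing the made-up probabilities with real shared data (Watson-style reports, vendor statistics) is the part money alone cannot buy.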
Things to spend money on
• Obtain topologies/faultloads/changeloads
  • Increase realism (more detail)
  • Maintain/evolve these as the world changes
• Pay the maintenance costs of a community-consensus simulator
  • Or more than one, for different purposes
• Enough resources for repeatable results
  • Don’t ignore storage and storage bandwidth
Things that cannot be solved with “more money” alone
• Scale beyond what we can plausibly afford or manage
• Time scales
• Dynamic behaviors
• Access to real-world fault/behavior data