Network Automation: Empowering Innovations in Operations
Presentation Transcript
Network Automation • Albert Greenberg, Nick Feamster, Richard Mortier, Mark Poepping, Lun Li, Sharad Agarwal, Changhoon Kim, Ranveer Chandra, et al.
What is network automation? • The performance of the following network tasks with minimal human involvement: • Provisioning • Detection • Diagnosis • Remediation • Corollary: Humans become involved with network operation at higher levels (i.e., not repeatedly doing the same painful tasks)
Some Questions • Why automate? • What to automate? (desired end states) • How do we get there? • Robotize current methodology, or rethink? • Self-correction (like biological systems, e.g., DNA) • What are the roadblocks? • Are our network element building blocks and their behavior fit for automation? • Big guard rails?
Why Automate? • Human cost • Are we talking about making operators redundant? • No…it’s more about automating folklore? • Care costs >> Ops costs, so self-help >> self-managing? • Reliability!!! • Continuous high quality service – very high availability • Faster detection, remediation, etc. • Scale!!! • How else to keep up with feature creep? • “Every case is a special case” (we don’t really believe this)
What to Automate? • Proactive Piece • Is-ness spec driving automation? • Reactive Piece • Detection (See) • Possible to monitor and detect network problems? • What data sets are needed? • How to do correlation of those data sets? (metadata) • The role of detection vs. statistical analysis • Diagnosis (Know) • Again, what data needs to be collected to make this possible? • Stat based vs model based? • Remediation (Restore) • Do we want automated scripts? • How far along this spectrum to go? (Many answers.)
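A minimal sketch of the reactive piece, chaining Detection (See), Diagnosis (Know), and Remediation (Restore). Everything named here is an assumption made for illustration: the `Event` record, the threshold table, the rule table, and the playbook are hypothetical stand-ins, not part of any real tool. Detection is a plain threshold check; the slides leave open whether a statistical or model-based method would be better.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Event:
    """Hypothetical measurement record: one metric sample from one device."""
    device: str
    metric: str
    value: float


def detect(event: Event, thresholds: Dict[str, float]) -> bool:
    """See: flag an event whose value exceeds the threshold for its metric."""
    limit = thresholds.get(event.metric)
    return limit is not None and event.value > limit


def diagnose(event: Event, rules: Dict[str, str]) -> Optional[str]:
    """Know: map the offending metric to a suspected cause, if a rule exists."""
    return rules.get(event.metric)


def remediate(cause: Optional[str],
              playbook: Dict[str, Callable[[str], str]],
              device: str) -> str:
    """Restore: run the scripted action for a known cause, else escalate."""
    action = playbook.get(cause) if cause is not None else None
    if action is None:
        return f"escalate {device} to operator"
    return action(device)


def handle(event: Event,
           thresholds: Dict[str, float],
           rules: Dict[str, str],
           playbook: Dict[str, Callable[[str], str]]) -> Optional[str]:
    """Run one event through detect, diagnose, remediate; None means no problem seen."""
    if not detect(event, thresholds):
        return None
    cause = diagnose(event, rules)
    return remediate(cause, playbook, event.device)
```

The escalation branch is one way to answer "how far along this spectrum to go": anything without a known cause and a scripted action falls back to a human.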
Vision • Network operators plug in boxes, and walk away…sort of • A small set of policies triggers programs which write programs which write programs which … realize the network • A small set of probes provides all measurements and event collection/correlation needed to support internal metrics and external SLAs • Knowledge database • Operators become specialists: forensics, software development, etc. (operation at a higher level, less fire-fighting) • Caveat: there will always be a need for amazing people, but doing more introspective work (design, test, certification … and … automation override when needed)
Roadblocks • Cost • Complexity • Data • Knowledge • Human factors
Obstacle 1: Cost • Automation costs money and time • Worth detecting if there’s nothing to do about it? • Worth automating if the operation only happens once? • Alternate solution 1: Monkeys • At what point is it time to automate the corner case? • Alternate solution 2: Overprovision • Perhaps we can ride out the storm… (or expect failures and design low cost systems so that they don’t really matter) • Server community has seen that repeatable simple components + software can provide a whole that is both very low cost and resilient (e.g., Google switching and computing platform)
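The "worth automating if it only happens once?" question can be framed as a break-even calculation. The sketch below assumes a deliberately simple cost model that is not from the slides: a one-time build cost plus a fixed per-run cost for both the manual and the automated path, all in the same unit (say, operator hours). The function name and parameters are hypothetical.

```python
import math
from typing import Optional


def break_even_runs(build_cost: float,
                    manual_cost_per_run: float,
                    automated_cost_per_run: float) -> Optional[int]:
    """Smallest number of runs at which automating is strictly cheaper.

    Automating wins once n * (manual - automated) > build_cost.
    Returns None when automation never saves anything per run.
    """
    saving_per_run = manual_cost_per_run - automated_cost_per_run
    if saving_per_run <= 0:
        return None
    return math.floor(build_cost / saving_per_run) + 1
```

For example, `break_even_runs(40, 2, 0.5)` gives 27: a 40-hour automation effort that saves 1.5 hours per run pays off on the 27th run. An operation that truly happens once never reaches that point under this model.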
Obstacle 2: Complexity • How to manage it? • Dummy boxes and lots of wires/stitching • Monolithic box with complexity in configuration • Fewer types of boxes, templates, ways to do essentially the same thing? • Coke’s network vs Pepsi’s network?
Obstacle 3: Data • Lots of inputs • Topology • Configuration • Fault events (measured and logged) • Performance events (e.g., active measurements) • Version numbers • Fiber mappings • Metadata • Crucial! Version numbers, gaps in data collection, collection method, staleness… • If this data becomes inconsistent, big surprises! • Challenges • Correlation • What to do when data isn’t correlated? • Privacy and sharing issues
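Two of the metadata items above, gaps in data collection and staleness, are easy to check mechanically. A minimal sketch, assuming each feed reports numeric collection timestamps (in seconds) and has a known nominal polling interval. The function names and the `slack` tolerance are hypothetical choices for illustration.

```python
from typing import List, Tuple


def find_gaps(timestamps: List[float],
              expected_interval: float,
              slack: float = 0.5) -> List[Tuple[float, float]]:
    """Return (start, end) pairs where consecutive samples are too far apart.

    A gap is any spacing larger than expected_interval * (1 + slack).
    """
    ordered = sorted(timestamps)
    limit = expected_interval * (1 + slack)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


def is_stale(last_collected: float, now: float, max_age: float) -> bool:
    """True when the newest sample is older than max_age."""
    return now - last_collected > max_age
```

With a 60-second poll, `find_gaps([0, 60, 120, 300, 360], 60)` returns `[(120, 300)]`. Recording such gaps alongside the data lets a later correlation step tell "no fault events" apart from "nothing was collected".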
Obstacle 4: Human Nature/Corner Cases • Operators are used to touching routers • Automation effectively adds a “shim” • Humans will likely want a way to bypass the configuration database • How to maintain consistency between human tweakage and the database? • How to evolve the automation database? (when does a corner case become “normal”?)
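One way to approach the consistency question is to compare what the database intends with what is actually running and report the differences. The sketch below assumes both sides can be flattened into key/value settings. That representation, the function name, and the `ABSENT` marker are assumptions for illustration, not a description of any real configuration system.

```python
from typing import Dict, Tuple

ABSENT = "<absent>"  # hypothetical marker for a setting missing on one side


def config_drift(intended: Dict[str, str],
                 running: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Map each differing setting to (intended value, running value)."""
    drift = {}
    for key in sorted(set(intended) | set(running)):
        want = intended.get(key, ABSENT)
        have = running.get(key, ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift
```

An empty result means the router matches the database. A non-empty one marks a hand edit that someone must either revert or fold back into the database, which is where a recurring corner case can be promoted to "normal".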