
MyOps


cyrah

Presentation Transcript


  1. MyOps: An Operational Framework for PlanetLab Deployments

  2. Outline • Objective of MyOps • Current status • Future ideas • Questions at any time

  3. Example of Feedback

  4. Objective: Close the Operational Cycle • System - Provides service (slice) • Monitoring - Feedback from running system • Operator - Interprets feedback into tasks • Management - Controls running system

  5. Challenges: Break-down • System may not deliver service • Monitoring may not observe useful metrics • Operator may not know how to interpret observations, how to control the system, or what the service goals are • Management may not control system

  6. Requirements for Operational Systems • Satisfy Minimal Conditions • Physical Integrity • Interconnectivity • Controllable • Provide a Service • Two requirements: (1) reliably reach the final condition; (2) when failures occur, repair or report automatically • Two approaches in MyOps: (1) precise bootstrap stages (not discussed); (2) operational monitoring & management in platform

  7. System: PlanetLab Slices

  8. Monitoring Types • Open-loop monitoring: identify the unknown; more information, fine-grained • Operational monitoring (closed-loop): correctness; less information, coarse-grained; actionable

  9. Management Types • Open-loop management: bootstrap/deploy from the ground up; inefficient, coarse-grained; no feedback • Operational management (closed-loop): tweak the system to correct behavior; more efficient, fine-grained

  10. Example • Observe: Node is off-line • Control: Attempt to power on • Observe: Node is on-line but failed to boot • Observe: Failed-to-boot error • Control: Create ticket & send email to local contact • Time passes • Control: Disable slice creation • Observe: Local contact responds • Observe: Node is powered on and running • Control: Re-enable slice creation • Control: Close ticket
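The observe/control sequence in slide 10 can be read as a small closed-loop controller. The following is a minimal sketch of that reading, not the MyOps implementation: the observation names, the action strings, and the `ticket_open` flag are all hypothetical, and "time passes" is modeled as a `NO_RESPONSE` observation.

```python
from enum import Enum


class Obs(Enum):
    # Hypothetical observation types, named after the slide's "Observe" steps.
    OFFLINE = "offline"
    BOOT_FAILED = "boot_failed"
    NO_RESPONSE = "no_response"   # time passed without a reply from the contact
    RUNNING = "running"


def next_actions(obs, ticket_open):
    """Map one observation to a list of control actions (hypothetical names)."""
    if obs is Obs.OFFLINE:
        return ["power_on"]
    if obs is Obs.BOOT_FAILED:
        # Only open a ticket once per incident.
        return [] if ticket_open else ["create_ticket", "email_local_contact"]
    if obs is Obs.NO_RESPONSE:
        return ["disable_slice_creation"]
    if obs is Obs.RUNNING:
        # Undo the incentive and close the incident if one was open.
        return ["reenable_slice_creation", "close_ticket"] if ticket_open else []
    return []


def run(observations):
    """Feed observations through the loop and return the actions taken, in order."""
    ticket_open = False
    log = []
    for obs in observations:
        actions = next_actions(obs, ticket_open)
        if "create_ticket" in actions:
            ticket_open = True
        if "close_ticket" in actions:
            ticket_open = False
        log.extend(actions)
    return log
```

Calling `run([Obs.OFFLINE, Obs.BOOT_FAILED, Obs.NO_RESPONSE, Obs.RUNNING])` replays the slide's sequence: each observation is answered with a directed, problem-specific action rather than a blanket reinstall.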

  11. History of PlanetLab Operations • Open-loop monitoring with open-loop management: collect fine-grained statistics using CoMon; act with coarse-grained operations (e.g. reinstall); manual bridge between the two • Moving towards closed-loop operations: collect targeted metrics; take directed, problem-specific actions; automate actions based on policy

  12. PlanetLab Operations • Close the monitor/management cycle • Direct automation of common operations • Indirect through remote contacts and incentives

  13. MyOps Architecture • Collection from Node • Translated by policy to Automated action

  14. MyOps Architecture • Collection from Node • Send notice to Local contact to take action

  15. MyOps Architecture • When there is no response • Indirect influence with incentives

  16. Collection • Operational monitoring specific targets, such as: • Boot status, Filesystem status • DNS - internal and external • RPMs • System services, etc • Periodic collection • Coarse-grained collection at a human-timescale • Time-series of events and status
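Slide 16 describes periodic, coarse-grained collection stored as a time-series of events and status. One way to picture that data is the sketch below. It is an illustration under assumptions: the `StatusHistory` class, its method names, and the use of plain numeric timestamps in seconds are hypothetical and not taken from MyOps.

```python
from collections import defaultdict


class StatusHistory:
    """Time-series of (timestamp, status) samples per (node, check) pair."""

    def __init__(self):
        # (node, check) -> list of (timestamp, status), appended in time order
        self._series = defaultdict(list)

    def record(self, node, check, timestamp, status):
        """Store one periodic sample, e.g. check='boot' or check='dns'."""
        self._series[(node, check)].append((timestamp, status))

    def latest(self, node, check):
        """Return the most recent (timestamp, status), or None if never sampled."""
        series = self._series.get((node, check))
        return series[-1] if series else None

    def time_in_current_status(self, node, check, now):
        """Return how long the node has continuously held its latest status."""
        series = self._series.get((node, check))
        if not series:
            return None
        since, current = series[-1]
        # Walk backwards until the status differs from the current one.
        for timestamp, status in reversed(series):
            if status != current:
                break
            since = timestamp
        return now - since
```

A query such as `time_in_current_status("node1", "boot", now)` gives the duration of the current condition, which is the kind of quantity a policy over a time-series needs.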

  17. Policy • Constraints over a time-series of events • To satisfy a constraint • Automated action • Send notice • Apply incentive • Policy defines • Preferred status of system • Frequency of actions • Magnitude of incentives
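The policy in slide 17 chooses among an automated action, a notice, and an incentive. A minimal sketch of such a selection follows, assuming the escalation order suggested by slides 13 to 15 (automate first, then notify, then apply incentives). The `Rule` type, the action names, and the hour thresholds are hypothetical placeholders, not values from MyOps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    min_hours_down: float  # rule applies once the node has been down this long
    action: str


# Hypothetical thresholds; a real policy would set its own frequency of
# actions and magnitude of incentives.
POLICY = [
    Rule(0, "automated_action"),
    Rule(24, "send_notice"),
    Rule(24 * 7, "apply_incentive"),
]


def select_action(hours_down, policy=POLICY):
    """Return the strongest action whose threshold has been reached.

    hours_down is None when the node already satisfies the preferred status,
    in which case no action is needed.
    """
    if hours_down is None:
        return None
    chosen = None
    for rule in sorted(policy, key=lambda r: r.min_hours_down):
        if hours_down >= rule.min_hours_down:
            chosen = rule.action
    return chosen
```

Keeping the thresholds in a data table rather than in code reflects the slide's point that the policy, not the mechanism, defines how often actions occur and how strong the incentives are.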

  18. Automation • Automatic correction of common bootstrap problems • Communication errors with MyPLC • Corrupt filesystem repair • Retry when state is unknown • PCU Reboot • Reinstall • Automation Notices • Bad disk • Minimal hardware • Bad DNS • Bad node configuration

  19. Notices & Incentives • Notices are indirect paths to node management • Node down / online / specific problem (e.g. DNS, disk) • Site down / online • Privilege reduced / restored • PCU errors • The incentives on MyPLC • Sites 10 slices • Disable slice creation • Disable running slices

  20. Validation of Notices & Incentives [Figure: events marked A, B, C, D, E; labels: Kernel Bug, Fix, Fix2, Notice Bug, Fix]

  21. Time to Restore Down Node (all issues)

  22. Future Ideas • Generalize Configuration • Collect from multiple sources • Expose policy • Act on multiple targets • Self-monitoring • Positive Incentives • Special access to services • Additional resources (Slices, Bandwidth, CPU, etc)

  23. Time to Reply (when there is a reply)
