In "Ten Minutes on Five Nines," Terry Gray, Associate VP of IT Infrastructure at the University of Washington, explores the delicate balance between reliability, responsiveness, and complexity in IT systems. As dependency on IT rises and tolerance for outages decreases, organizations face challenges in maintaining high performance, low maintenance costs, and security. With insights on failure management, organizational culture, and security threats, Gray emphasizes the need for effective strategies to overcome inevitable complexities and improve overall system reliability.
Ten Minutes on Five Nines • Terry Gray • Associate VP, IT Infrastructure • University of Washington • <a last minute recruit for…> Common PROBLEMS Group • 6 January 2005
Vision • Systems/Services (and Staff!) characterized as Reliable and Responsive • Reliability = job one • But: I.T. = Inevitable Tensions • We all want: • High MTTF, Performance and Function • Low MTTR and support cost • The art is to balance those conflicting goals • we are jugglers and technology actuaries
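The MTTF and MTTR goals on this slide combine into the standard steady-state availability ratio, MTTF / (MTTF + MTTR). The sketch below is an illustration only; the hour figures are hypothetical and do not come from the talk.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time a service is up."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical figures, chosen only to show the two levers.
baseline = availability(mttf_hours=1000.0, mttr_hours=4.0)
longer_mttf = availability(mttf_hours=2000.0, mttr_hours=4.0)    # prevention
faster_repair = availability(mttf_hours=1000.0, mttr_hours=2.0)  # remediation

print(f"baseline:      {baseline:.6f}")
print(f"double MTTF:   {longer_mttf:.6f}")
print(f"halve MTTR:    {faster_repair:.6f}")
```

Doubling MTTF and halving MTTR give the same availability, so the balancing act is about which lever costs less to pull in a given system.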
Success Metrics • Tom’s: nobody gets hurt; nobody goes to jail • Terry’s: “Works fine, lasts a long time”; low ROI (Risk Of Interruption)
Design Tradeoffs • Fault Zone size vs. Economy/Simplicity • Reliability vs. Complexity • Prevention vs. (Fast) Remediation • Security vs. Supportability vs. Functionality • Networks = Connectivity; Security = Isolation • Balancing priorities (security vs. ops vs. function)
Context: A Perfect Storm • Increased dependency on I.T. • Decreased tolerance for outages • Deferred maintenance • Inadequate infrastructure investment • Some extraordinarily fragile applications • Fragmented host management • Increasingly hostile network environment • esp. spam, spyware, social engr attacks • Increasing legal/regulatory liability • Highly de-centralized culture • Growth of portable devices
System Elements • Environmentals (Power, A/C, Physical Security) • Network • Client Workstations (incl. portable devices) • Servers • Applications • Personnel, Procedures, Policy, and Architecture • Failures at one level can trigger problems at another level; need Total System perspective
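One way to see why a Total System perspective matters: if a user needs every layer to be up, the end-to-end availability is at best the product of the per-layer figures. The sketch below assumes independent failures, which is optimistic given that the slide says failures cascade between levels, and every number in it is hypothetical.

```python
from math import prod
from typing import Mapping


def serial_availability(layers: Mapping[str, float]) -> float:
    """End-to-end availability when every layer is required.

    Assumes layers fail independently; cascading failures make
    the real figure worse than this product.
    """
    return prod(layers.values())


# Hypothetical per-layer availabilities, not measurements from the talk.
layers = {
    "environmentals": 0.9999,
    "network": 0.9999,
    "workstation": 0.999,
    "servers": 0.9999,
    "applications": 0.999,
}

total = serial_availability(layers)
print(f"weakest layer: {min(layers.values()):.4f}")
print(f"end to end:    {total:.4f}")
```

The end-to-end figure is lower than the weakest single layer, so a per-layer target does not translate directly into what the user experiences.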
Dimensions • How often is there a user-visible failure? • How many people are affected? • For how long? • How severely?
Basics • How many nines? • Problem one: what to measure? • How do you reduce behavior of a complex net to a single number? • Difficult for either uptime or utilization metrics • Problem two: data networks are not like phone or power services… • Imagine if phones could assume anyone’s number • Or place a million calls per second!
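For reference, the arithmetic behind "how many nines" is simple; the hard part, as the slide says, is deciding what to measure. A minimal sketch, assuming availability is defined as the fraction of a 365.25-day year in which the service is up:

```python
import math

MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_minutes_per_year(nines: int) -> float:
    """Yearly downtime budget for an availability of N nines (e.g. 5 -> 99.999%)."""
    unavailability = 10.0 ** (-nines)
    return unavailability * MINUTES_PER_YEAR


def nines_from_availability(availability: float) -> float:
    """Inverse view: how many nines a given availability fraction represents."""
    return -math.log10(1.0 - availability)


for n in range(2, 6):
    print(f"{n} nines: {downtime_minutes_per_year(n):10.2f} minutes/year")
```

Five nines leaves a budget of roughly 5.26 minutes of downtime per year, which is why the single number says little unless the measured scope (which users, which services, which failure modes) is pinned down first.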
Security vs. Reliability • Obviously lack of security is bad… but: • Defense in depth is not free • Each additional defensive perimeter increases MTTR • Defense-in-depth conjecture (for N layers): • Security: MTTE (exploit) N**2 • Functionality: MTTI (innovation) N**2 • Supportability: MTTR (repair) N**2 • Next-gen threats: firewalls won’t help
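The supportability line of the conjecture can be made concrete under one reading: that MTTR grows in proportion to N**2 as defensive layers are added. That reading, the base repair time, and the fixed MTTF below are all assumptions for illustration; the slide gives no figures.

```python
def mttr_with_layers(base_mttr_hours: float, layers: int) -> float:
    """Repair time under the assumed reading MTTR ~ N**2 for N defensive layers."""
    if layers < 1:
        raise ValueError("need at least one layer")
    return base_mttr_hours * layers ** 2


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical inputs: MTTF held fixed so only the repair cost varies.
MTTF_HOURS = 1000.0
BASE_MTTR_HOURS = 1.0

results = {
    n: availability(MTTF_HOURS, mttr_with_layers(BASE_MTTR_HOURS, n))
    for n in range(1, 5)
}
for n, a in results.items():
    print(f"{n} layer(s): MTTR {mttr_with_layers(BASE_MTTR_HOURS, n):5.1f} h, availability {a:.5f}")
```

Holding MTTF fixed isolates the repair penalty; in practice extra layers are meant to lengthen the time to exploit as well, which is the tension the slide describes.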
Complexity vs. Reliability • How do you measure availability in complex systems? • Death of the Network Utility Model • Organizational vs. geographic networking • SAN virtualization • Web load-leveler appliances • Organizational boundary conditions • Networks: from stochastic to non-deterministic • Subnets with clients and critical servers • Documentation deficiencies
Complex System Failures: Inevitable? • Jan 2004 (?) IEEE Spectrum on Power Grid failures • Point: it will happen, so plan for mitigation
Work in Progress • New trouble-ticket system • New network management system • Next-generation network architecture • Next-generation security architecture • Improving change control process • Improving DRBR process • Lots of work on improving monitoring/diagnostic tools
In Short… • Expectations are growing (unrealistically?) • Complexity is growing • Few are prepared to pay for true HA • Cultural barriers to change control • Hospitals are a whole other world • Biggest SPoF: power/HVAC • Organizational complexity undermines HA • Both security and lack of it undermine HA • Redundancy can mask failures too well! • With redundancy, must have better tools • Need Ops-centric design, better DRBR • Need application procurement standards
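The point that redundancy can mask failures can be illustrated with the usual parallel-availability formula. This is a sketch under two assumptions: redundant units fail independently, and a unit availability of 0.99 is picked purely as a hypothetical.

```python
def parallel_availability(unit_availability: float, units: int) -> float:
    """Availability of redundant units where any single working unit suffices.

    Assumes independent failures and that failed units are noticed and repaired.
    """
    if units < 1:
        raise ValueError("need at least one unit")
    return 1.0 - (1.0 - unit_availability) ** units


UNIT = 0.99  # hypothetical availability of one unit

# Healthy redundant pair: both units must be down for an outage.
pair = parallel_availability(UNIT, 2)

# Masked failure: one unit has died unnoticed, so only one is really in service.
degraded = parallel_availability(UNIT, 1)

print(f"redundant pair:          {pair:.4f}")
print(f"pair with masked fault:  {degraded:.4f}")
```

Users see no difference between the two states until the second unit fails, which is why redundancy without monitoring tools quietly gives back the availability it was bought to provide.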