
Closing the WTF – NTF Gap


Presentation Transcript


  1. Closing the WTF – NTF Gap A whimsical yet highly practical guide to getting to ‘yes’ in the Distributed Performance Arena David Halbig Thursday, Oct 1, 2009

  2. Agenda • Non-Distributed Model • Distributed Model (traditional) • Case Studies • Distributed Model (proposed) • Summary

  3. Non-Distributed Diagnostic Model • One Riot / One Ranger (Some multi-disciplinary team investigations) • Response time or Deadline-centric Objectives, Instrumentation, and Tools

  4. Midrange Diagnostic Model (traditional) • Assemble large team with (at least) • One representative from each environment component (applications, network, DB, server (one per class)) • Project Manager(s) • Misc HLEs, depending on the severity of the slowdown/outage • Requesting status from each environmental component • Begin (in no particular order) • Recycling servers • Removing components from the environment • Blamestorming / RoT / Request diagnostics from components • (rinse, repeat)

  5. Key Characteristics of This Model • Fragmented Instrumentation • Little primary information sharing • At ‘jurisdictional boundaries’ • No agreed-on SLAs • No agreed-on metrics • No common tool use • Focus on Utilization, not Delivered Service

  6. Case Study #1 - VDI environment • 450+ Virtual Desktop Instances (WinXP on VMware ESX 3.5) • Geographically dispersed user community • Varied workload characteristics, from CSR support to code development • Intermittent severe response time problems across a random selection of VDIs • All major components reported back ‘NTF’

  7. Case Study #1 – VDI environment [chart: C:\ drive Seconds/Read]

  8. Case Study #1 - VDI environment • Perfmon analysis indicated intermittent severe I/O response time problems • Utilization-centric reporting from the SAN layer reported no severe problems • VMware layer reported utilization, but no response time data, for the SAN layer • esxtop data, with a 30-second reporting interval, showed the intermittent severe I/O response time problem at the HBA
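
A rough Python sketch of the kind of check that surfaces this: scan a perfmon/esxtop-style CSV export and flag intervals whose read latency crosses a threshold. The file name, column header, time column, and 30 ms threshold are assumptions for illustration, not the tooling actually used in the case study.

```python
# Minimal sketch: flag high read-latency intervals in a perfmon/esxtop CSV export.
# CSV_PATH, LATENCY_COL, the "Time" column, and THRESHOLD_MS are all assumptions;
# adjust them to match the headers in your actual export.
import csv

CSV_PATH = "esxtop_batch.csv"                  # hypothetical export (e.g. esxtop -b)
LATENCY_COL = "vmhba1 Average MilliSec/Read"   # hypothetical column header
THRESHOLD_MS = 30.0                            # flag reads slower than ~30 ms

with open(CSV_PATH, newline="") as f:
    for row in csv.DictReader(f):
        try:
            latency = float(row[LATENCY_COL])
        except (KeyError, ValueError):
            continue  # skip rows without a usable latency sample
        if latency > THRESHOLD_MS:
            print(f"{row.get('Time', '?')}: {latency:.1f} ms/read exceeds threshold")
```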

  9. Case Study #1 – VDI environment • SAN reporting did not include all layers • SAN reporting granularity was too coarse (15-minute intervals) • SAN reporting upgraded to report response time and to regularly report at finer granularity (30 seconds) • (soooo… what was the problem?)
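
The granularity point is worth making concrete: a single 30-second latency spike all but vanishes inside a 15-minute average. The baseline and spike values below are hypothetical, chosen only to illustrate the arithmetic.

```python
# Why 15-minute averages hid the problem: one 30-second burst of 200 ms reads,
# averaged with 29 healthy 5 ms samples, looks unremarkable. Numbers are illustrative.
samples_per_window = 15 * 60 // 30        # thirty 30-second samples per 15-minute window
baseline_ms, spike_ms = 5.0, 200.0
latencies = [spike_ms] + [baseline_ms] * (samples_per_window - 1)

coarse_avg = sum(latencies) / len(latencies)
print(f"15-minute average: {coarse_avg:6.1f} ms")      # ~11.5 ms -- looks healthy
print(f"30-second peak:    {max(latencies):6.1f} ms")  # 200 ms -- the actual user pain
```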

  10. Case Study #1 – VDI environment [chart: RTVSCAN - DIO]

  11. Case Study #2 – eCustomerService • Moderate-volume web-based application • Facing retail card holder population • Intermittent response time delays

  12. Case Study #2 - eCustomerService • Network trace shows ‘declining TCP/IP window size’ • OS team reports ‘NTF’ • Response time decomposition tool reports delay between the web and app layers
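
For illustration, a declining or zero advertised window can be spotted directly in a packet capture. The sketch below uses scapy with a hypothetical capture file and server address; it is not the trace tool used in this case study, and it ignores TCP window scaling for simplicity.

```python
# Minimal sketch: watch the receive window a server advertises across a capture
# and report shrinking or zero-window events. PCAP_PATH and SERVER_IP are
# placeholders; window scaling is ignored, so values are the raw header field.
from scapy.all import IP, TCP, rdpcap

PCAP_PATH = "web_to_app.pcap"    # hypothetical capture between web and app tiers
SERVER_IP = "10.0.0.5"           # hypothetical address of the slow receiver

last_window = None
for pkt in rdpcap(PCAP_PATH):
    if not (pkt.haslayer(IP) and pkt.haslayer(TCP)):
        continue
    if pkt[IP].src != SERVER_IP:
        continue                  # only windows advertised by the server of interest
    window = pkt[TCP].window
    if window == 0:
        print(f"t={pkt.time:.3f}s ZERO WINDOW: receiver cannot accept more data")
    elif last_window is not None and window < last_window:
        print(f"t={pkt.time:.3f}s window shrank {last_window} -> {window}")
    last_window = window
```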

  13. Case Study #2 - eCustomerService

  14. Case Study #2 – eCustomerService

  15. Case Study #2 - eCustomerService • Conflict between ‘I know’ and ‘I believe’ resolved by management intervention • OS vendor engaged for deep dive into the TCP/IP stack and web application

  16. Midrange Diagnostic Model (proposed) • End-to-end transaction monitoring • Explicit response time decomposition • For crucial subsystems (example: I/O), full chain-of-custody instrumentation • At ‘Jurisdictional Boundaries’: • Agreed-on metrics (response time) • Agreed-on instrumentation (tool/interval)
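
To make ‘explicit response time decomposition’ concrete, here is a minimal sketch of the idea: each tier reports its share of a transaction’s elapsed time in one agreed-on format, so the outlier tier is visible to every team at the jurisdictional boundary. Tier names and timings are invented for illustration.

```python
# Minimal sketch of response time decomposition: per-tier time for one transaction,
# reported in a single shared format. Tier names and durations are hypothetical.
tier_times = [
    ("browser/network", 0.12),
    ("web server",      0.05),
    ("app server",      1.40),   # the outlier every team can now see
    ("database",        0.08),
    ("SAN I/O",         0.03),
]

total = sum(seconds for _, seconds in tier_times)
print(f"end-to-end response time: {total:.2f} s")
for tier, seconds in tier_times:
    print(f"  {tier:<16} {seconds:5.2f} s  ({seconds / total:6.1%})")
```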

  17. Midrange Diagnostic Model (proposed) • Train to a common end-to-end tool • Train to common component tools • START with response time/Delivered Service metrics/tools, END with utilization-centric metrics/tools • Approach other teams with probable cause only • Staffing/authority model(s): • Trained performance analysts with ‘hot pursuit’ authority • Trained performance analysts with advisory authority (only)
