Closing the WTF – NTF Gap

Presentation Transcript

  1. Closing the WTF – NTF Gap: A whimsical yet highly practical guide to getting to ‘yes’ in the Distributed Performance Arena. David Halbig, Thursday, Oct 1, 2009

  2. Agenda • Non-Distributed Model • Distributed Model (traditional) • Case Studies • Distributed Model (proposed) • Summary

  3. Non-Distributed Diagnostic Model • One Riot / One Ranger (Some multi-disciplinary team investigations) • Response time or Deadline-centric Objectives, Instrumentation, and Tools

  4. Midrange Diagnostic Model (traditional) • Assemble large team with (at least) • One representative from each environment component (applications, network, DB, server (one per class)) • Project Manager(s) • Misc HLEs, depending on the severity of the slowdown/outage • Requesting status from each environmental component • Begin (in no particular order) • Recycling servers • Removing components from the environment • Blamestorming / RoT / Request diagnostics from components • (rinse, repeat)

  5. Key Characteristics of This Model • Fragmented Instrumentation • Little primary information sharing • At ‘jurisdictional boundaries’ • No agreed-on SLAs • No agreed-on metrics • No common tool use • Focus on Utilization, not Delivered Service

  6. Case Study #1 – VDI environment • 450+ Virtual Desktop Instances (WinXP on VMware ESX 3.5) • Geographically dispersed user community • Varied workload characteristics, from CSR support to code development • Intermittent severe response time problems across a random selection of VDIs • All major components reported back ‘NTF’

  7. Case Study #1 – VDI environment (chart: C:\ Drive Seconds/Read)

  8. Case Study #1 – VDI environment • Perfmon analysis indicated intermittent severe I/O response time problems • Utilization-centric reporting from the SAN layer reported no severe problems • VMware layer reported utilization, but no response time data for the SAN layer • esxtop data, with a 30-second reporting interval, showed an intermittent severe I/O response time problem at the HBA

  9. Case Study #1 – VDI environment • SAN reporting did not include all layers • SAN reporting granularity was too coarse (15 minutes) • SAN reporting upgraded to report response time, and to report regularly at a finer granularity (30 seconds) • (soooo… what was the problem?)
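
The granularity point above can be illustrated with a minimal sketch. It assumes a made-up latency series (the 5 ms baseline, the 800 ms stalls, and the 100 ms alert threshold are all hypothetical, not data from the case study) and shows how a 15-minute average can hide stalls that 30-second samples expose.

```python
from statistics import mean

def bucket_means(samples, bucket_size):
    """Average consecutive samples into buckets of `bucket_size` samples each."""
    return [mean(samples[i:i + bucket_size])
            for i in range(0, len(samples), bucket_size)]

# Hypothetical I/O latency: one sample every 30 seconds for one hour (120 samples).
# Baseline of 5 ms per read, with two 30-second stalls of 800 ms.
latency_ms = [5.0] * 120
latency_ms[17] = 800.0
latency_ms[71] = 800.0

fine = latency_ms                      # 30-second reporting interval
coarse = bucket_means(latency_ms, 30)  # 15-minute interval (30 samples x 30 s)

THRESHOLD_MS = 100.0  # hypothetical "severe" response time threshold
fine_alerts = [i for i, v in enumerate(fine) if v > THRESHOLD_MS]
coarse_alerts = [i for i, v in enumerate(coarse) if v > THRESHOLD_MS]

print("30-second samples over threshold:", fine_alerts)
print("15-minute buckets over threshold:", coarse_alerts)
print("worst 15-minute average (ms):", max(coarse))
```

With these assumed numbers the 30-second view flags both stalls, while every 15-minute average stays far below the threshold, which is consistent with a coarse report coming back ‘NTF’.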

  10. Case Study #1 – VDI Environment RTVSCAN - DIO

  11. Case Study #2 – eCustomerService • Moderate-volume web-based application • Facing retail card holder population • Intermittent response time delays

  12. Case Study #2 – eCustomerService • Network trace shows ‘declining TCP/IP window size’ • OS team reports ‘NTF’ • Response time decomposition tool reports delay between web and app layer

  13. Case Study #2 - eCustomerService

  14. Case Study #2 – eCustomerService

  15. Case Study #2 – eCustomerService • Conflict between ‘I know’ and ‘I believe’ resolved by management intervention • OS vendor engaged for deep dive into TCP/IP stack and web application

  16. Midrange Diagnostic Model (proposed) • End-to-end transaction monitoring • Explicit response time decomposition • For crucial subsystems (example: I/O), full chain-of-custody instrumentation • At ‘Jurisdictional Boundaries’: • Agreed-on metrics (response time) • Agreed-on instrumentation (tool/interval)
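
As a rough illustration of explicit response time decomposition, the sketch below assumes each boundary reports the elapsed time it observed for one transaction, ordered from the outermost tier to the innermost. The tier names and millisecond values are hypothetical; the function simply attributes to each tier the time not accounted for by the next tier inward.

```python
def decompose(boundaries):
    """Attribute elapsed time to each tier.

    `boundaries` is a list of (name, elapsed_ms) pairs ordered from the
    outermost measurement point to the innermost. Each tier's own time is
    its elapsed time minus the elapsed time seen at the next boundary in.
    """
    inner = boundaries[1:] + [(None, 0.0)]
    return {name: elapsed - inner_elapsed
            for (name, elapsed), (_, inner_elapsed) in zip(boundaries, inner)}

# Hypothetical measurements for a single slow transaction (milliseconds).
boundaries = [
    ("browser", 4200.0),
    ("web", 4100.0),
    ("web->app call", 3900.0),
    ("app", 400.0),
    ("db", 150.0),
]

breakdown = decompose(boundaries)
worst = max(breakdown, key=breakdown.get)

for name, own_ms in breakdown.items():
    print(f"{name:>14}: {own_ms:7.1f} ms")
print("largest contributor:", worst)
```

The per-tier times sum back to the end-to-end figure, so every millisecond is assigned to some owner. In this made-up example the time sits between the web and app tiers rather than inside either one, which is the kind of finding that gives a team probable cause before approaching another group.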

  17. Midrange Diagnostic Model (proposed) • Train to common end-to-end tool • Train to common component tools • START with response time/Delivered Service metrics/tools, END with utilization-centric metrics/tools • Approach other teams with probable cause only • Staffing/authority model(s): • Trained performance analysts with ‘hot pursuit’ authority • Trained performance analysts with advisory authority (only)