Systems Support for End-to-End Performance Management

Sandip Agarwala
PhD Advisor: Karsten Schwan
College of Computing, Georgia Tech

Complexity, complexity, complexity… Source: Gartner (December 2005)

Reasons for Complexity
• Application diversity
• Interdependencies
• Heterogeneous components
• Too many different technologies and platforms
• Too few hints from the system to the administrators
• Legacy issues; application-specific solutions
• Insufficient information about the system to drive self-management
→ Lack of automation

Online System Management
[Diagram labels: Analyze, Monitor, Control, Execute, Workload]
• Scheduling
• Capacity and SLA management
• Design evaluation and tuning
• Bottleneck detection
• Resource provisioning, accounting, etc.
Proposed approach: Service Path

Service Path
[Diagram labels: Internet; Proxy Server; Web Servers; Front-end; Servlet Server; Middle-tier; Application Logic (EJBs, etc.); Back-end; Database]
• System abstractions that describe the dynamic dependencies between the different distributed application components
• Service class: application-level request class, e.g. SLA class

Service Path Characteristics
• End-to-end analysis
• Online
• Non-intrusive
• Application-generic

Outline
• Background
• Motivation
• Service path
• Discovery with E2EProf
• Refinement with SysProf
• Automated SLA Enforcement
• Related Work
• Future Plans

E2EProf
[Diagram labels: nodes A, B, C, D, X; delays D1, D2; per-edge signals (A→B) and (B→C) plotted against time]
• Black-box approach
• Correlate per-edge time series signals
• Monitor network packet traces (source, destination, timestamps)
• Model traces as per-edge time series signals or density functions
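The trace-to-signal step above can be illustrated with a minimal sketch. It assumes a trace is a list of (source, destination, timestamp) tuples and that a signal is a per-bin packet count; the function name `edge_time_series` and the parameters `bin_width` and `num_bins` are hypothetical and not taken from E2EProf.

```python
from collections import defaultdict

def edge_time_series(trace, bin_width, num_bins):
    """Count packets per time bin for every (source, destination) edge.

    Assumption: timestamps are seconds relative to the start of the trace.
    """
    series = defaultdict(lambda: [0] * num_bins)
    for src, dst, ts in trace:
        index = int(ts // bin_width)
        if 0 <= index < num_bins:      # ignore packets outside the observed window
            series[(src, dst)][index] += 1
    return dict(series)

# Hypothetical trace: three packets on edge A->B, one on edge B->C.
trace = [("A", "B", 0.2), ("A", "B", 0.3), ("B", "C", 1.4), ("A", "B", 2.1)]
signals = edge_time_series(trace, bin_width=1.0, num_bins=3)
```

Each edge then has a fixed-length signal that can be compared with the signals of neighboring edges.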

Basic Approach (AB) (BD) (AB) (BC) Delay at B • Compute cross-correlation (D1 D2) A C B X D SpikeCausality Spike’s position Delay No spike

Evaluation with 4-tier RUBiS (http://rubis.objectweb.org/)
[Diagram labels: Clients; Apache Web Server; Tomcat Server 1; EJB Server 1; Tomcat Server 2; EJB Server 2; MySQL Server; request types: comment, bidding; annotations: I/O bound, CPU bound]

Service Path Detection in RUBiS
[Figure labels: Round-robin load balancer; Static server assignment; Highest delay node(s)]

Change Detection in RUBiS
[Figure label: Injected delay]

Delta Air Lines’ Application
Revenue Pipeline
• Total traffic: 1.34 million / day (56k / hour)
[Diagram labels: TACSIN & TACSOUT; APEXIN & APEXOUT; XIN & XOUT; Error/Warning (Tivoli) Logs]

Delta Air Lines’ Application
[Figure: TACS latency (sec) versus time of the day; series S1, S2, S3, S7, S8; labels: Client requests, TACS, Huge request burst]

Outline
• Background
• Motivation
• Service path
• Discovery with E2EProf
• Refinement with SysProf
• Automated SLA Enforcement
• Related Work
• Future Plans

Beyond dependency and latency…
[Diagram labels: C1, C2, S1, S2, S3, S4, S5, S6]
• Solution: zoom into the service path with SysProf
• No application hints or instrumentation
• Monitor resource usage on a per-class basis

SysProf Methodology
• Track request context
• Work done for processing a request class
• May span user level or kernel level
• Executes in more than one context (e.g. processes, threads, softirqs)
• Happens in system-visible events (e.g. system calls)
[Diagram labels: system call parameters, PID, App functions; A1, A2, AN; User; Kernel; Scheduler; System Call; Net softirq; Network Stack; FS/VM/etc.; Context Switches; Disk I/O; Init CID; eth driver; BDD; From client; To client; Instrumentation points]

Class ID Propagation
[Diagram labels: Front-Tier, Middle-Tier, End-Tier; User, Kernel; From client, To client; Init CID; Process → CID; Msg → CID; Packet → CID; Inherits CID]
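The propagation idea can be illustrated with a small user-level simulation. This is only a sketch under stated assumptions: SysProf tracks class IDs at system level, whereas the `Process` class, the `charge` helper, the CID value "gold", and the CPU figures below are all hypothetical and exist only to show how a CID assigned on arrival can follow a request across tiers.

```python
from collections import defaultdict

class Process:
    """Toy stand-in for a server process that carries a class ID (CID)."""

    def __init__(self, name):
        self.name = name
        self.cid = None                  # CID currently associated with the process

    def receive(self, message):
        # The process inherits the CID carried by the incoming message.
        self.cid = message["cid"]

    def send(self, payload):
        # Outgoing messages are tagged with the sender's current CID.
        return {"cid": self.cid, "payload": payload}

usage = defaultdict(float)               # CPU seconds charged per CID (made-up numbers)

def charge(process, cpu_seconds):
    usage[process.cid] += cpu_seconds

front, middle, back = Process("front"), Process("middle"), Process("back")

# Assumption: the CID is initialized once, when the request arrives from the client.
request = {"cid": "gold", "payload": "bid request"}
front.receive(request)
charge(front, 0.002)
middle.receive(front.send("lookup"))
charge(middle, 0.010)
back.receive(middle.send("query"))
charge(back, 0.005)
```

After the request has crossed all three tiers, `usage["gold"]` holds the total work charged to that class, which is the kind of per-class accounting the slides describe.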

Applications of SysProf
• Resource accounting
• Utility billing
• Bottleneck detection
• Capacity estimation
• Root-cause analysis
• Black-box SLA management

Resource-Aware Adaptive Control
• Cluster workloads contending for the same resources
• Separate queue/controller for each cluster
[Diagram labels: Class 1, Class 2, Class 3; Front-end; Controller + Scheduler; Tomcat Server 1; EJB Server 1; Tomcat Server 2; EJB Server 2; MySQL Server]

Resource-Aware Adaptive Control
• Capacity = 80 req/s per server
[Figure panels: No SysProf; With SysProf]

Summary
• Service Path: system abstractions to represent dependencies and request path
• E2EProf and Pathmap: dependency and latency analysis
• SysProf: service-based resource analysis
• Aid human operator and automate end-to-end performance management

Thank You! Questions? Email: sandip@cc.gatech.edu

Extra Slides

Pathmap Optimizations
[Diagram labels: Packet timestamp trace; Bursty traffic; Sliding window (W); Run-length compression; Time-series signal or density function; Upper bound on latency; Cross-correlation series; axes: time]
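Run-length compression pays off when traffic is bursty, because the signal then contains long runs of identical values (typically zeros between bursts). The following is a generic run-length encoder, offered as a sketch; it is not Pathmap's actual representation, and the function names are hypothetical.

```python
def run_length_encode(signal):
    """Collapse consecutive equal samples into (value, count) pairs."""
    runs = []
    for value in signal:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1             # extend the current run
        else:
            runs.append([value, 1])      # start a new run
    return [(value, count) for value, count in runs]

def run_length_decode(runs):
    """Expand (value, count) pairs back into the original signal."""
    signal = []
    for value, count in runs:
        signal.extend([value] * count)
    return signal

# Hypothetical bursty signal: mostly idle bins with two short bursts.
signal = [0, 0, 0, 4, 4, 0, 0, 0, 0, 1]
encoded = run_length_encode(signal)
restored = run_length_decode(encoded)
```

The encoded form stores one pair per run instead of one value per bin, so the saving grows with the length of the idle periods.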
