Managing the Performance Impact of Administrative Utilities

Managing the Performance Impact of Administrative Utilities • Paper by S. Parekh, K. Rose, J. Hellerstein, S. Lightstone, M. Huras, and V. Chang • Presentation and discussion led by N. Tchervenski • CS 848, University of Waterloo • November 1, 2006

Outline • Introduction – performance impact of administrative utilities • Proposed solution • Architecture and Control Theory • Tests performed • Conclusion • Discussion

Performance Impact of Administrative Utilities • Administrative utilities • Essential to the system • Have a performance impact • With 24/7 operation, there is never a good time to suffer performance degradation • Solution: find a way to slow the utilities down

Example of DB Running a Backup * Throughput and response time averaged over 60s intervals

How to Slow Down a Utility • Performance impact is dynamic – both for utilities and regular workloads (WLs) • Low-level approach • per-resource quotas / priorities • difficult to manage • Admin Utility Performance Policy – at most x% degradation of production work • How to throttle utilities → SIS – self-imposed sleep • How to translate the policy requirement into throttling units?

SIS – Self-imposed Sleep

Action Interval and Sleep Fraction • Action interval = workTime + sleepTime • With the action interval held constant, only the sleep fraction needs to be set: • Sleep fraction = sleepTime / action interval • Sleep fraction = 0 → unthrottled, 1 → stopped • Suggested value for the action interval is at least a few iterations of the “main loop” of the utility
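The action-interval bookkeeping above can be sketched in a few lines of Python. This is only an illustration, not DB2's implementation: the function name `run_throttled`, the default action interval, and the injectable `sleep` and `clock` parameters are all assumptions made for the example.

```python
import time

def run_throttled(work_units, sleep_fraction, action_interval=0.05,
                  sleep=time.sleep, clock=time.monotonic):
    """Run each callable in work_units, idling for roughly
    sleep_fraction of every action interval (self-imposed sleep)."""
    if not 0.0 <= sleep_fraction < 1.0:
        raise ValueError("sleep_fraction must be in [0, 1); 1 would stop the utility")
    # action interval = workTime + sleepTime
    work_time = action_interval * (1.0 - sleep_fraction)
    sleep_time = action_interval * sleep_fraction
    slept = 0.0
    interval_start = clock()
    for unit in work_units:
        unit()  # one iteration of the utility's main loop (hypothetical)
        if clock() - interval_start >= work_time:
            if sleep_time > 0:
                sleep(sleep_time)
                slept += sleep_time
            interval_start = clock()
    return slept  # total time spent in self-imposed sleep
```

The check happens only between work units, which is why the slide suggests an action interval spanning at least a few main-loop iterations: a shorter interval would let a single unit overrun the work budget.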

Throttle Manager Architecture [Figure: architecture diagram; labels: X% policy, PI controller, sleepTime, action interval = const, linear model based on <sleepTime, performance>]

Degradation Estimator • Baseline estimator – system performance w/o utilities • Degradation = 1 – performance / baseline • How to determine the baseline? • Stop all utilities → drawbacks: WL surges, only short-term performance is observed, resources are underutilized • Linear fitting of <sleepTime, performance> • Performance = f(sleepTime) = Q1*sleepTime + Q0 • Fitted online with recursive least squares and exponential forgetting
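A minimal sketch of such an estimator follows, in plain Python. The class name, the default forgetting factor, and the initial covariance are assumptions. The sketch also uses the sleep fraction as the regressor (equivalent to sleepTime when the action interval is constant) and assumes the baseline is read off the fitted line at full throttling (sleep fraction 1).

```python
class BaselineEstimator:
    """Recursive least squares with exponential forgetting for
    performance = Q1 * sleep_fraction + Q0."""

    def __init__(self, forgetting=0.95, initial_variance=1000.0):
        self.lam = forgetting          # closer to 1 = slower forgetting
        self.q1 = 0.0
        self.q0 = 0.0
        # 2x2 covariance matrix; a large start value means "no prior knowledge"
        self.P = [[initial_variance, 0.0], [0.0, initial_variance]]

    def update(self, sleep_fraction, performance):
        x = (sleep_fraction, 1.0)      # regressor vector
        P = self.P
        Px = (P[0][0] * x[0] + P[0][1] * x[1],
              P[1][0] * x[0] + P[1][1] * x[1])
        denom = self.lam + x[0] * Px[0] + x[1] * Px[1]
        gain = (Px[0] / denom, Px[1] / denom)
        residual = performance - (self.q1 * x[0] + self.q0)
        self.q1 += gain[0] * residual
        self.q0 += gain[1] * residual
        # P <- (P - gain * x^T P) / lambda; x^T P equals Px because P is symmetric
        self.P = [[(P[i][j] - gain[i] * Px[j]) / self.lam for j in range(2)]
                  for i in range(2)]

    def baseline(self):
        # Assumption: baseline = predicted performance with the utility fully throttled
        return self.q1 * 1.0 + self.q0

    def degradation(self, performance):
        # Degradation = 1 - performance / baseline, as defined on the slide
        return 1.0 - performance / self.baseline()
```

Each control interval would call `update` with the sleep fraction in force and the measured throughput, then `degradation` to get the value fed to the controller. A smaller forgetting factor adapts faster to workload changes but makes the estimate noisier.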

Linear Fit Example of Sleep/Throughput [Figure: actual baseline vs. estimated baseline] • Steady workload, backup throttling kept constant for 20-minute intervals

Controller • Goal: current degradation = degradation limit • Error = degradation limit – current degradation • PI controller used • Throttling(k+1) = Kp * error(k) + Ki * Sum(error(i), i=0..k) • Kp – proportional gain – used to increase the speed of response • Ki – integral gain – eliminates steady-state error • Kp, Ki and the control interval can be hard-coded or determined at runtime • Kp and Ki can be estimated by pole placement from control theory, but experiments are necessary to confirm the results [2] • Experiments in this paper: • control interval = 20 seconds • Kp and Ki the same across all experiments
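The control law above can be sketched as follows. The class name is hypothetical, and clamping the output to [0, 1] is an assumption that treats the throttling value as a sleep fraction. Note the sign convention: with error defined as limit minus current degradation, the gains must be negative for the sleep fraction to rise when degradation exceeds the limit. The gain values in the usage line are illustrative only, not the paper's.

```python
class PIThrottleController:
    """Throttling(k+1) = Kp * error(k) + Ki * sum(error(i), i=0..k)."""

    def __init__(self, degradation_limit, kp, ki):
        self.limit = degradation_limit
        self.kp = kp
        self.ki = ki
        self.error_sum = 0.0   # running sum for the integral term

    def next_throttle(self, current_degradation):
        # Error as defined on the slide: limit minus current degradation
        error = self.limit - current_degradation
        self.error_sum += error
        raw = self.kp * error + self.ki * self.error_sum
        # Assumption: throttling is a sleep fraction, so keep it within [0, 1]
        return min(1.0, max(0.0, raw))


# Illustrative gains (assumed, not taken from the paper)
controller = PIThrottleController(degradation_limit=0.30, kp=-0.5, ki=-0.2)
```

Calling `controller.next_throttle(d)` once per control interval (20 seconds in the experiments) yields the sleep fraction for the next interval. The sketch has no anti-windup logic, so the error sum keeps growing while the output is saturated.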

Tests Performed • Testbed description • DB2 v8.1, 4-CPU RS/6000, 2 GB RAM, AIX 4.3, 8 physical disks • Workload similar to TPC-C • Initial “warm-up” period of 10 minutes to stabilize the system, bufferpools, etc. • Utility used – parallelized BACKUP – multiple processes reading from multiple tablespaces, and multiple other processes writing to separate disks

OS Priorities vs. SIS (Sleep Fraction) [Figure labels: WL alone; 100% throttling] • Linear effect when throttling using sleep • No performance gain from changing the OS priority of the backup process • OS priority works for CPU-intensive WLs; here we have an I/O-intensive WL, and the CPU is idle 80% of the time

Dynamic Effect of SIS. Does “Turning the Knob” Actually Do Something? [Figure labels: 15 tps avg; backup started] • As in the previous slide, we don’t get back to 100% throughput when fully throttled, but we’re close.

Feedback Control X=30% degradation policy

Feedback Control Effectiveness • Without BACKUP – 15 tps • With x=30%, steady workload – 25 users • → 9.4 tps → 38% degradation • Why the throttling slump? • Does the throttling system compensate for decreasing resource demands of the backup? • With x=30%, workload surge at 1500 s – from 10 users to 25 users • Pre-surge degradation of 36% • Post-surge degradation of 19% • Still good results, close to the 30% policy

Causes for Deviation • Baseline estimator – actual throughput is 15.1 tps vs. a projected value of 13.2 tps • Because of system stochastics, degradation is not always estimated correctly • For example, the drop in throttling at t=1800 s • Quick to self-correct → correct results in the long term • Short-term violations could be reduced by trading off adaptation speed, i.e. by adjusting the forgetting factor in the online estimator
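To see how much the baseline error matters, the degradation formula can be applied to both baselines. This sketch assumes that the 9.4 tps figure from the steady-workload experiment and the two baseline values refer to the same run; the variable names are illustrative.

```python
def degradation(performance, baseline):
    # Degradation = 1 - performance / baseline
    return 1.0 - performance / baseline

observed_tps = 9.4             # throughput under the x=30% policy
actual_baseline_tps = 15.1     # measured without the backup
estimated_baseline_tps = 13.2  # projected by the baseline estimator

actual = degradation(observed_tps, actual_baseline_tps)
perceived = degradation(observed_tps, estimated_baseline_tps)
print(f"actual: {actual:.1%}, as seen by the controller: {perceived:.1%}")
```

Against the estimated baseline, the degradation comes out at roughly 29%, close to the 30% target, while against the actual baseline it is roughly 38%. Under the stated assumption, the underestimated baseline alone would account for most of the policy violation.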

Conclusion • Administrative utilities must be run, but there is no good timeslot for them • Proposed an application-based throttling mechanism – requires changing only the application code, and is OS/system independent • Easy for administrators: they just specify a degradation policy • Applicable to various systems • Main requirements • Utility work must be identifiable – put the sleep there • Performance can be measured without much overhead

Limitations and Future Work • Test on multiple utilities • Throttle each utility separately? • Propose and analyze different approaches for the controller • PI algorithm, recursive least squares estimator, etc.: how to specify their parameters? • Automate the determination of controller parameters, as they are system dependent

Discussion • Why the throttling slump in the feedback control experiment? • Even when the backup is fully throttled, the system may not reach its earlier peak performance, since it needs more time to stabilize (e.g. to warm up the bufferpools again). This may be a better explanation for the difference between the projected baseline and the actual baseline. • Even if the tasks were CPU intensive, lowering their OS priority is not guaranteed to work, since they may interact with other parts of the engine (issue queries, etc.), and the engine can’t be slowed down for that. • The approach evidently works, since it has been implemented in DB2 v8 and v9 for backup / rebalance / auto-runstats – all I/O-intensive tasks. • Other ways to limit or control the impact of backup on the DB system: controlling bufferpools / memory. Automatic tuning of memory is introduced in DB2 v9. • How to handle peak loads? How do we guarantee QoS? • Can we monitor not only the TPS output, but also try to “expect” what the WL performance would be, based on # of clients, # of queries compiled/executed, bufferpool activity/misses?

References
[1] S. Parekh, K. Rose, J. L. Hellerstein, S. Lightstone, M. Huras, and V. Chang. Managing the performance impact of administrative utilities. In Self-Managing Distributed Systems – 14th IFIP/IEEE International Workshop on Distributed Systems: Operations and Management (DSOM 2003), number 2867 in Lecture Notes in Computer Science. Springer-Verlag, 2003.
[2] Y. Diao, N. Gandhi, J. L. Hellerstein, S. Parekh, and D. M. Tilbury. Using MIMO feedback control to enforce policies for interrelated metrics with application to the Apache web server. In Proceedings of Network Operations and Management, 2002.
