Control Theory in Log Processing Systems. Wei Xu ( xuw@cs.berkeley.edu ) UC Berkeley Joseph L. Hellerstein IBM T.J. Watson Research Center. Outline. Data streams and log processing Applying control theory Controlling queue length Load balancing Lessons learned. Introduction.
Control Theory in Log Processing Systems Wei Xu (xuw@cs.berkeley.edu) UC Berkeley Joseph L. Hellerstein IBM T.J. Watson Research Center
Outline • Data streams and log processing • Applying control theory • Controlling queue length • Load balancing • Lessons learned
Introduction • Goals of our project • A tool • A testbed • Problem: data rates of up to 1 TB a day • Distributed infrastructure • How do we make the infrastructure itself reliable?
Examples of system log data • Request data • Apache logs, etc. • Performance data • CPU, memory, etc. • Failure data • Detected problems / error messages • Reports from operators
The big picture [Diagram labels: Production System, Data Collection, raw log data, Preprocessing (?), Sanitized Data, Repository, Automatic Analysis, Failure Detection]
Preprocessing • Sanitize the data • Put logs into common format • Merge information from various sources • Sampling • Needs to be fast
Stream processing • Log data are data streams • Preprocessing tasks are continuous queries • Telegraph Continuous Query (TCQ) • SQL queries • Adaptive: execution optimized on the fly • Performance doesn’t depend on the number of queries [Diagram labels: SLT, query Q]
Data preprocessing architecture [Diagram labels: load splitter, TCQ query Q (several nodes), combiner, TCQ query R, SLT 1, SLT 2, Intra-Event Processing, Inter-Event Processing; a numbered tuple stream 1 to 6 is split across the nodes and recombined]
Problem: performance disturbances • CPU contention • Maintenance tasks • Packet drops • Other failures • Selectivity changes
The result of disturbance [Plot: end-to-end response time (ms) vs. time (s)]
Solution – Control Theory • Treat this as a failure? • Not necessary, and too expensive • Feedback control as a first-tier defense mechanism • Try to keep the system stable, at least for some time • If that doesn’t work, try recovery
Outline • Data streams and log processing • Applying control theory • Controlling queue length • Load balancing • Lessons learned
The problem [Plot series: Source, Buffer, TCQ, Result Q]
Why does this happen? • TCQ has a complex internal structure • TCQ drops tuples silently if the result queue is full • Back pressure is not possible [Diagram labels: Controlled Data Source, Input Buffer, TCQ]
Control Problems • Goal? • No dropping tuples • What to control? • The result queue length • The Knob? • Input data rate to the TCQ node
Control block diagram • Target system (system identification) • Controller: $u(k) = u(k-1) + (K_P + K_I)\,e(k) - K_P\,e(k-1)$ • $u(k)$: data rate in next interval • $u(k-1)$: data rate in last interval • $e(k)$: error • $e(k-1)$: last error
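The control law on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `IncrementalPI`, the gains, and the toy queue model (queue length grows by the admitted rate minus a fixed service rate) are all assumptions made for the example.

```python
class IncrementalPI:
    """Incremental PI law: u(k) = u(k-1) + (Kp + Ki) e(k) - Kp e(k-1)."""

    def __init__(self, kp, ki, u0=0.0, u_min=0.0, u_max=float("inf")):
        self.kp = kp
        self.ki = ki
        self.u_prev = u0        # data rate in the last interval
        self.e_prev = 0.0       # last error
        self.u_min = u_min
        self.u_max = u_max

    def step(self, target, measured):
        """Return the data rate for the next interval."""
        e = target - measured
        u = self.u_prev + (self.kp + self.ki) * e - self.kp * self.e_prev
        u = min(max(u, self.u_min), self.u_max)   # a rate cannot be negative
        self.u_prev = u
        self.e_prev = e
        return u


def simulate(steps=200, target=50.0, kp=0.5, ki=0.1):
    """Toy plant (assumed): queue(k+1) = max(0, queue(k) + rate(k) - service).

    The service rate falls from 100 to 60 tuples per interval at k = 100,
    standing in for a disturbance such as CPU contention.
    """
    controller = IncrementalPI(kp, ki)
    queue_len = 0.0
    history = []
    for k in range(steps):
        service = 100.0 if k < 100 else 60.0
        rate = controller.step(target, queue_len)
        queue_len = max(0.0, queue_len + rate - service)
        history.append((queue_len, rate))
    return history


if __name__ == "__main__":
    for k, (q, r) in enumerate(simulate()):
        if k % 20 == 19:
            print(f"k={k:3d} queue={q:7.2f} rate={r:7.2f}")
```

In this toy model the controller pulls the queue length back to the target after the service rate drops, by lowering the admitted data rate to match the new capacity. The transient still overshoots the target, so a real deployment would need the target set well below the queue capacity.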
Result – Under CPU Contention [Plot series: Source, Buffer, TCQ, Result Q]
Why is this useful? • Original system • Input data rate => tuples dropped vs. not dropped • New system • Input data rate => response time • Makes the system ready for load balancing
Outline • System log as data streams • Applying control theory • Controlling queue length • Load balancing • Lessons learned
The problem • Barrier in system • Different response times • End to end response time matches the slower node
The control problem • Goal? • Make the response times equal • What to control? • Response time on each node • The knob? • Tuples assigned to each node • What to monitor? • Queue length vs. response time
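One way to picture this control problem is an integral controller that shifts the share of tuples between two nodes until their response times match. The sketch below is hypothetical: the function names, the gain, and the response-time model (response time proportional to assigned load divided by node capacity) are assumptions for illustration, not the scheme used in the talk.

```python
def response_time(load, capacity):
    """Assumed toy model: response time grows linearly with load/capacity."""
    return load / capacity


def balance(total_load=90.0, cap1=100.0, cap2=50.0, gain=0.1, steps=100):
    """Adjust the share of tuples sent to node 1 until response times match.

    share(k+1) = share(k) + gain * (rt2(k) - rt1(k)), clamped to [0, 1].
    If node 2 is slower, node 1 receives a larger share, and vice versa.
    """
    share = 0.5                      # start with an even split
    for _ in range(steps):
        rt1 = response_time(total_load * share, cap1)
        rt2 = response_time(total_load * (1.0 - share), cap2)
        share += gain * (rt2 - rt1)
        share = min(max(share, 0.0), 1.0)
    return share


if __name__ == "__main__":
    s = balance()
    print(f"share of tuples sent to node 1: {s:.4f}")
```

With the assumed capacities of 100 and 50, the share settles near two thirds, the split at which both toy response times are equal.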
System with control [Diagram: load-balancing control loop driven by response time]
Result [Plot: end-to-end response time (ms) vs. time (s)]
Outline • System log as data streams • Applying control theory • Controlling queue length • Load balancing • Lessons learned
Advantages of control theory • Performance can be analyzed • Stability • Accuracy • Settling time • Overshoot
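As an illustration of this kind of analysis, the sketch below computes the closed-loop poles for an assumed first-order model y(k+1) = a·y(k) + b·u(k) under the PI law u(k) = u(k-1) + (K_P + K_I) e(k) - K_P e(k-1). The model form, the parameter values, and the function names are assumptions. The settling-time estimate -4/ln(r), with r the largest pole magnitude, is a common rule of thumb rather than an exact result.

```python
import cmath
import math


def closed_loop_poles(a, b, kp, ki):
    """Roots of z^2 + (b(Kp+Ki) - a - 1) z + (a - b Kp) = 0.

    This is the characteristic polynomial of the assumed plant b/(z - a)
    in a unity-feedback loop with the PI controller ((Kp+Ki) z - Kp)/(z - 1).
    """
    c1 = b * (kp + ki) - a - 1.0
    c0 = a - b * kp
    disc = cmath.sqrt(c1 * c1 - 4.0 * c0)
    return ((-c1 + disc) / 2.0, (-c1 - disc) / 2.0)


def analyze(a, b, kp, ki):
    """Report stability and an approximate settling time in sample intervals."""
    poles = closed_loop_poles(a, b, kp, ki)
    r = max(abs(p) for p in poles)
    stable = r < 1.0                      # all poles inside the unit circle
    if r == 0.0:
        settling = 0.0
    elif stable:
        settling = -4.0 / math.log(r)     # rule-of-thumb estimate
    else:
        settling = math.inf
    return {"poles": poles, "radius": r, "stable": stable,
            "settling_steps": settling}


if __name__ == "__main__":
    print(analyze(a=1.0, b=1.0, kp=0.5, ki=0.1))   # stable choice of gains
    print(analyze(a=1.0, b=1.0, kp=3.0, ki=0.0))   # gain too high: unstable
```

Complex poles, as in the first call, indicate an oscillatory response and therefore some overshoot before the output settles.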
Other advantages • Simple implementation • Encourages good system design • Modeling the system • Treats the system as a black box • First defense mechanism against disturbances in the system
Limitations • Not all software systems are designed to be controlled • Finite input can produce unbounded output • E.g. join in TCQ • Useful state is not measurable • Queuing theory helps, but other good theory is lacking • Many binary variables • Failed vs. working correctly
Other Limitations • The model for the target system is complex • Lack of a reliable knob • E.g. changing the result queue length of TCQ sometimes makes it crash • What is the range you can turn it? • How often can you turn it? • How long does the system take to respond? • Cannot find the cause of the problem
Solution? • More advanced modeling and controller? • Adaptive control • Design controller-friendly systems? • A simple model • User configurable parameter -> knobs?
Future Work • As a tool, real users? • Scheduling multiple streams • Dynamically scale up/down • Other control theory applications
Future Work • Load balancer • Load control across multiple tiers • Scheduling of multiple streams
System with control [Diagram labels: Controlled Data Source, Output Rate Controller, Queue Length Monitor]
Result [Plot series: Source, Buffer, TCQ, Result Q]
Conclusion • Advantages of feedback control • Makes the system more robust under disturbance • Allows more time for failure detection • Treats complex systems as black boxes • Copes with the system's characteristics instead of having to change them • Theoretical analysis • Implementation is easy • System statistics can also be used for SLT
What is going on? [Diagram labels: Desired Queue Length, Queue Length Controller, Data Rate to TCQ, Controlled Output Thread (code reuse), Actual Queue Length]
Theory meets reality [Plot: queue length vs. time, including output Y from simulation]
Tricky part of parameter estimation • Model evaluation • Making the system operate in the desired range • Easy for data source, but queue length .. [Plot: data rate vs. free space, with a non-linear range marked]
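For readers unfamiliar with parameter estimation, the sketch below fits an assumed first-order model y(k+1) = a·y(k) + b·u(k) to logged input and output samples by ordinary least squares. The model form, the function name `fit_first_order`, and the sample data are assumptions. The talk does not state which model or fitting method was used, and a fit like this is only meaningful while the system stays in its linear operating range.

```python
def fit_first_order(u, y):
    """Least-squares estimate of (a, b) in y(k+1) = a*y(k) + b*u(k).

    u holds the inputs u(0..n-1); y holds the outputs y(0..n), so
    len(y) must equal len(u) + 1.
    """
    if len(y) != len(u) + 1:
        raise ValueError("need len(y) == len(u) + 1")
    y_now, y_next = y[:-1], y[1:]
    # Entries of the 2x2 normal equations.
    syy = sum(v * v for v in y_now)
    suu = sum(v * v for v in u)
    syu = sum(p * q for p, q in zip(y_now, u))
    sny = sum(p * q for p, q in zip(y_next, y_now))
    snu = sum(p * q for p, q in zip(y_next, u))
    det = syy * suu - syu * syu
    if det == 0.0:
        raise ValueError("input does not vary enough to identify the model")
    a = (sny * suu - snu * syu) / det
    b = (snu * syy - sny * syu) / det
    return a, b


def generate(a, b, u, y0=0.0):
    """Produce noise-free outputs from the model, for checking the fit."""
    y = [y0]
    for uk in u:
        y.append(a * y[-1] + b * uk)
    return y


if __name__ == "__main__":
    inputs = [100.0, 200.0, 150.0, 300.0, 250.0, 120.0, 400.0, 50.0]
    outputs = generate(0.5, 0.3, inputs)
    print(fit_first_order(inputs, outputs))
```

The input sequence must vary from sample to sample. A constant input in steady state makes the two regressors proportional, and the parameters can no longer be separated.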
Why do we need control? • Data source does not provide accurate data rate
Control Problems • Not accurate for various reasons • Scheduling • Time spent on I/O • Etc. • Providing an accurate data source using feedback control • By controlling the input of “desired rate”
The Control Architecture • P Controller (with precompensation): $u(k) = K_P\,e(k)$ • PI Controller: $u(k) = u(k-1) + (K_P + K_I)\,e(k) - K_P\,e(k-1)$ [Diagram values shown: 1500, 1900, 1600]
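The difference between the two controllers can be illustrated with a small simulation. Everything here is an assumption for the sake of the example: the first-order model of the data source y(k+1) = a·y(k) + b·u(k), the parameter values, and the function names. The precompensation factor is chosen so that the P controller reaches the desired rate exactly when the model is correct.

```python
def precompensation(a, b, kp):
    """Reference scaling that removes the P controller's steady-state error
    for the assumed model y(k+1) = a*y(k) + b*u(k)."""
    return (1.0 - a + b * kp) / (b * kp)


def run_p(desired, a_model, b_model, a_plant, b_plant, kp, steps=100):
    """P control with precompensation: u(k) = Kp * (pre * desired - y(k))."""
    pre = precompensation(a_model, b_model, kp)
    y = 0.0
    for _ in range(steps):
        u = kp * (pre * desired - y)
        y = a_plant * y + b_plant * u
    return y


def run_pi(desired, a_plant, b_plant, kp, ki, steps=100):
    """PI control: u(k) = u(k-1) + (Kp + Ki) e(k) - Kp e(k-1)."""
    y = u_prev = e_prev = 0.0
    for _ in range(steps):
        e = desired - y
        u = u_prev + (kp + ki) * e - kp * e_prev
        y = a_plant * y + b_plant * u
        u_prev, e_prev = u, e
    return y


if __name__ == "__main__":
    target = 1600.0
    # Model matches the plant: precompensated P control reaches the target.
    print(run_p(target, 0.5, 0.4, 0.5, 0.4, kp=1.0))
    # Plant gain is lower than modelled: P control settles below the target.
    print(run_p(target, 0.5, 0.4, 0.5, 0.3, kp=1.0))
    # PI control reaches the target despite the mismatch.
    print(run_pi(target, 0.5, 0.3, kp=0.5, ki=0.5))
```

The design trade-off in this sketch is that precompensation depends on an accurate model, whereas the integral term of the PI controller removes the steady-state error without one.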
Result – An accurate data source [Plot panels: P Controller with Pre-compensation, PI Controller]
Zoom In • Many small disturbances in a Java program • Incremental garbage collection [Plot panels: P Controller, PI Controller]