Supporting Time-Critical Event Processing in Grids and Clouds

Qian Zhu • Advisor: Professor Gagan Agrawal

Adaptive Applications • Earthquake modeling • Coastline forecasting • Medical systems • Time-Critical Event Processing • Compute-intensive • Time constraints • Application-specific flexibility • Application Quality of Service (QoS)

Adaptive Applications (Cont’d) • HPC Applications (compute-intensive) • Aim to maximize performance • Do not consider adaptation • Deadline-driven Scheduling • Not very compute-intensive • Adaptive applications that perform time-critical event processing • Application-specific flexibility: parameter adaptation • Trade-off between application QoS and execution time

Motivating Application -- Real-time Volume Rendering • Interactively create a 2D projection of a large time-varying 3D data set • Application Flexibility • Error tolerance (image quality) • Image size • Benefit definition (QoS metric) • To view the rendered images from as many angles as possible • For each view angle, display the image with the best resolution at the desired image size

Motivating Application -- Real-time Volume Rendering • Example: panels (a) and (b) • Note: RMI data set from Lawrence Livermore National Laboratory • How well can we do given 1 minute as the time constraint?

Motivating Application -- Great Lake Nowcasting and Forecasting • Monitor meteorological conditions of Lake Erie for nowcasting and forecasting

Motivating Application -- Great Lake Nowcasting and Forecasting • Application flexibility • Resolution of grids • Internal time step • External time step • Benefit definition (QoS metric) • To predict the water level first • To predict as much other meteorological information as possible • How much meteorological information can we predict given 1 hour?

Time-Critical Event Processing • Grid Computing Environment • Geographically distributed • Heterogeneous • Unreliable • Cloud Computing Environment • On-demand resource availability • Pay-as-you-go pricing model • Goal: Maximize the application benefit (QoS) while satisfying the pre-specified time constraints

Dissertation Overview • Environments: Grid, Cloud • Research components: Parameter Adaptation, Resource Allocation, Fault Tolerance, Resource Provisioning, Power Management • Application classes shown: Scientific computing, Mobile applications, Parallel computing, Adaptive applications that perform time-critical event processing

Challenges -- Parameter Adaptation • A Large Number of Parameters to be Adapted • Discrete and continuous • Correlations between parameters • No Knowledge about the Impact of Such Parameters on Execution Time or Benefit • Pre-specified Time Constraints • Low adaptation overhead

Challenges -- Resource Allocation • Grid/Cloud: Heterogeneous and Dynamic Resources • Resource Allocation Impacts Application Benefit • A 20-min event from the Volume Rendering application (figure: benefit value vs. resource configuration) • Different CPU, Memory and/or Bandwidth Usage • Different application components • Different values of adaptive parameters

Challenges -- Fault Tolerance • Grid Resources • Heterogeneous and Unreliable • Time Constraints • Trade-off between Resource Efficiency and Reliability • Effective, Low-overhead Failure Recovery

Challenges -- Resource Budget Constraints • Elastic Cloud Computing • Pay-as-you-go model • Satisfy the Application QoS with the Minimum Resource Cost • Dynamic Resource Provisioning • Dynamically varying application workloads • Resource budget

Contributions • Parameter adaptation • Q. Zhu and G. Agrawal (ICAC 2008) • Resource allocation • Q. Zhu and G. Agrawal (IPDPS 2009) • Fault tolerance • Q. Zhu and G. Agrawal (SC 2009) • Budget constrained resource provisioning • Q. Zhu and G. Agrawal (HPDC 2010) • Power-aware consolidation of workflows • Q. Zhu, J. Zhu and G. Agrawal (submitted to SC 2010)

Roadmap • Motivation and Introduction • Parameter Adaptation in the Grid Environment • Application model • Autonomic adaptation algorithm • Resource allocation in time-critical event processing • Budget Constrained Resource Provisioning • Power-aware Consolidation of Workflows • Future Work • Conclusion

Contributions • Develop an Autonomic Adaptation Algorithm • Effectively adjust the parameters • Low overhead • Design an Adaptive Middleware that Supports Easy Deployment of Applications in Grid Environments • Consider Heterogeneous Resources • Efficiency value definition • Efficiency value estimation • Greedy-based scheduling algorithm

Application and Environment Model • Volume Rendering application • Services shown in the diagram: WSTP Tree Construction Service, Compression Service, Temporal Tree Construction Service, Unit Image Rendering Service, Decompression Service, Image Composition Service

Algorithm Overview • Normal Processing Phase: train system model • Learn the relationship between the values of adaptive parameters and execution time, application benefit • Collect data on the input data at each checkpoint (checkpoint 1, checkpoint 2, ...) • Event Handling Phase: apply the trained system model for parameter adaptation • Adjust parameters at each checkpoint (checkpoint 1, checkpoint 2, ...) to meet the time constraint
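
The two phases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lookup-table `SystemModel`, the even split of the remaining time across checkpoints, and all names are hypothetical and are not taken from the slides, which do not specify the learning method at this point.

```python
from collections import defaultdict
from statistics import mean


class SystemModel:
    """Hypothetical lookup-table model: parameter value -> observed (time, benefit)."""

    def __init__(self):
        self._samples = defaultdict(list)

    def observe(self, param, exec_time, benefit):
        # Normal processing phase: collect one data point at a checkpoint.
        self._samples[param].append((exec_time, benefit))

    def predict(self, param):
        # Predicted (execution time, benefit) is the mean of the observations.
        samples = self._samples[param]
        return mean(t for t, _ in samples), mean(b for _, b in samples)

    def known_params(self):
        return list(self._samples)


def choose_param(model, remaining_time, remaining_checkpoints):
    """Event handling phase: pick the parameter value for the next interval.

    Assumption: the remaining time is split evenly across the remaining
    checkpoints; among values predicted to fit, take the highest benefit.
    """
    budget = remaining_time / remaining_checkpoints
    candidates = model.known_params()
    feasible = [p for p in candidates if model.predict(p)[0] <= budget]
    if not feasible:
        # Nothing fits: fall back to the fastest known setting.
        return min(candidates, key=lambda p: model.predict(p)[0])
    return max(feasible, key=lambda p: model.predict(p)[1])


model = SystemModel()
model.observe("low", 1.0, 2.0)
model.observe("mid", 2.0, 5.0)
model.observe("high", 4.0, 9.0)
print(choose_param(model, remaining_time=10.0, remaining_checkpoints=4))
```

With 10 time units left and 4 checkpoints to go, each interval gets 2.5 units, so the sketch selects the highest-benefit setting whose predicted time fits that budget.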

Parameter Adaptation to Optimal Control Model • Adaptation Process (diagram: Controller, Application, Performance Measure; signals u(k), D(k), w(k)) • Control Policy • Policy with learning -- Reinforcement learning

Resource Allocation • Heterogeneous and Dynamic Resources • Different CPU, Memory and/or Bandwidth Usage • Different service components • Different values of adaptive service parameters • Schedule the Service Components to Maximize the Benefit Function Within the Time Constraint

Efficiency Value • Definition • Represents how efficiently a service executes on a node • Considers application benefit and execution time • Estimation • Based on fuzzy logic • Assign to and to yields the maximum benefit • Our definition of efficiency value captures the suitability of different nodes for different services
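
A greedy scheduler driven by efficiency values could look roughly like the sketch below. It assumes the efficiency values are already available as plain numbers (the fuzzy-logic estimation is not reproduced here) and that each node hosts at most one service; both are simplifying assumptions, and the service and node names are hypothetical.

```python
def greedy_schedule(efficiency):
    """Assign services to nodes greedily by efficiency value.

    efficiency: dict mapping (service, node) -> efficiency value.
    Assumption: one service per node, one node per service.
    """
    assignment = {}
    used_nodes = set()
    # Visit candidate pairs from the most to the least efficient.
    ranked = sorted(efficiency.items(), key=lambda kv: kv[1], reverse=True)
    for (service, node), _value in ranked:
        if service not in assignment and node not in used_nodes:
            assignment[service] = node
            used_nodes.add(node)
    return assignment


example = {
    ("render", "n1"): 0.9,
    ("render", "n2"): 0.4,
    ("compress", "n1"): 0.8,
    ("compress", "n2"): 0.7,
}
print(greedy_schedule(example))
```

A greedy pass is cheap, which matters under a time constraint, but it does not guarantee the globally best assignment.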

Roadmap • Motivation and Introduction • Parameter Adaptation in the Grid Environment • Budget Constrained Resource Provisioning • Background: Cloud environment • Dynamic resource provisioning algorithm • Framework Design • Experimental evaluation • Power-aware Consolidation of Workflows • Future Work • Conclusion

Background: Cloud Environment • Amazon EC2, Google AppEngine, Microsoft Azure, Magellan ... • Utility-like Computing • On-demand scalability of resources • Resource Cost • Pricing model: Pay-as-you-go • Virtualization • Resource sharing • Customized deployment and easy migration • Assumption: Fine-grained resource allocation (i.e., change CPU, memory on-the-fly) and pricing

Background: Pricing Model • Charged Fees • Base price • Transfer fee • Linear Pricing Model • Exponential Pricing Model
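
The slide names the two pricing models without giving their formulas. The sketch below shows one plausible shape for each, purely as an assumption: the function names, the `unit_rate` and `growth` parameters, and the exact functional forms are hypothetical and are not taken from the slides.

```python
import math


def linear_cost(units, base_price, unit_rate, transfer_fee=0.0):
    """Assumed linear model: cost grows proportionally with allocated units."""
    return base_price + unit_rate * units + transfer_fee


def exponential_cost(units, base_price, unit_rate, growth, transfer_fee=0.0):
    """Assumed exponential model: each extra unit costs more than the last."""
    return base_price + unit_rate * (math.exp(growth * units) - 1.0) + transfer_fee


for units in (0, 4, 10):
    print(units,
          linear_cost(units, base_price=1.0, unit_rate=0.5),
          exponential_cost(units, base_price=1.0, unit_rate=0.5, growth=0.3))
```

Under forms like these, overprovisioning is penalized far more heavily by the exponential model, which is one reason to evaluate a provisioning algorithm under both.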

Problem Description • Adaptive Applications • Adaptive parameters • Benefit • Time constraint • Cloud Computing Environment • Resource budget • Overprovisioning/Underprovisioning • Goal • Maximize the application benefit while satisfying the time constraints and resource budget

Contributions • Dynamic Resource Provisioning Algorithm • Based on multi-input-multi-output feedback control model • Optimization to reduce provisioning overhead • Adaptive and SOA Oriented Framework • Support dynamic virtual CPU and memory allocation based on application requirements

Approach Overview • Resource Provisioning Controller • Multi-input-multi-output (MIMO) feedback control model • Modeling between adaptive parameters and performance metrics • Control policy: reinforcement learning • Resource Model • Map change of parameters to change in CPU/memory allocations • Optimization: avoid frequent resource changes • Flow: Dynamic Resource Provisioning (feedback control) → change to the adaptive parameters → Resource Model (with optimization) → change to CPU/memory allocations

Resource Provisioning Controller • Performance Metrics: satisfy time constraints and resource budget • Multi-Input-Multi-Output Model: relationship between adaptive parameters and performance metrics • Control Policy: decide how to change values of the adaptive parameters

Control Model Formulation -- Performance Metrics • Performance Metrics • Processing progress: ratio between the currently obtained application benefit and the elapsed execution time • Performance/cost ratio: ratio between the currently obtained application benefit and the cost of the resources that have been assigned • Notation • Application benefit obtained at time step k • Elapsed execution time at time step k • Resource cost at time step k
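
The two metrics are simple ratios, so they can be written down directly. The function and argument names below are illustrative choices; the slides' own symbols did not survive extraction.

```python
def processing_progress(benefit, elapsed_time):
    """Benefit obtained so far divided by the elapsed execution time."""
    if elapsed_time <= 0:
        raise ValueError("elapsed_time must be positive")
    return benefit / elapsed_time


def performance_cost_ratio(benefit, resource_cost):
    """Benefit obtained so far divided by the cost of the assigned resources."""
    if resource_cost <= 0:
        raise ValueError("resource_cost must be positive")
    return benefit / resource_cost


# Example values at some time step k (made up for illustration).
print(processing_progress(benefit=30.0, elapsed_time=60.0))
print(performance_cost_ratio(benefit=30.0, resource_cost=15.0))
```

The first metric tracks whether the time constraint is on course to be met; the second tracks whether the budget is being spent effectively.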

Control Model Formulation -- Multi-Input-Multi-Output Model • Auto-Regressive-Moving-Average with Exogenous Inputs (ARMAX) • Second-order model • Model inputs: previous observed performance metrics; previous and current values of adaptive parameters • is ith adaptive parameter at time step k • are updated at the end of every interval
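
For orientation, a generic second-order ARMAX model has the form below. The symbols are assumptions introduced here, since the slide's own equation was lost: y(k) is the vector of performance metrics, u(k) collects the adaptive parameters u_i(k), e(k) is a noise term, and A_j, B_j, C_j are coefficient matrices, presumably the quantities updated at the end of every interval. The slide's exact lags and terms may differ.

```latex
\[
  y(k) \;=\; \sum_{j=1}^{2} A_j \, y(k-j)
        \;+\; \sum_{j=0}^{2} B_j \, u(k-j)
        \;+\; e(k) \;+\; \sum_{j=1}^{2} C_j \, e(k-j),
  \qquad
  u(k) = \bigl[\, u_1(k), \dots, u_n(k) \,\bigr]^{\top}
\]
```

The first sum uses previously observed performance metrics, and the second uses the current and previous values of the adaptive parameters, matching the two input groups named on the slide.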

Control Model Formulation -- Control Policy • : Maximize Application Benefit, subject to the time and resource budget constraints • Reinforcement learning (Q-Learning) • Reward function (annotated terms: action taken at time step k, application benefit) • : Minimize Control Overhead( ) • Proportional-Integral (PI) controller • Update Parameter Values
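
As a reminder of what the Q-Learning part involves, here is a minimal tabular sketch. It shows the standard textbook update rule only; the state and action encodings, the reward values, and the hyperparameters are hypothetical, and the slides' actual reward function and PI controller are not reproduced.

```python
import random
from collections import defaultdict


class QLearner:
    """Tabular Q-learning with epsilon-greedy action selection."""

    def __init__(self, actions, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration probability
        self.q = defaultdict(float)
        self.rng = random.Random(seed)

    def choose(self, state):
        # Explore with probability epsilon, otherwise take the best-known action.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])


# Hypothetical use: actions raise or lower one adaptive parameter, and the
# reward stands in for the change in application benefit.
learner = QLearner(actions=["increase", "decrease"], epsilon=0.0)
learner.update("behind_schedule", "decrease", reward=1.0, next_state="on_schedule")
print(learner.choose("behind_schedule"))
```

In a constrained setting such as this one, actions that would violate the time or budget constraint would have to be filtered out or penalized in the reward; the sketch leaves that out.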

Resource Model • Offline Training • Collect Data Points: • Learn the Relationship Between the Values of the Parameters and CPU/memory Usage • Model Optimization • Avoid frequent change to CPU/memory allocations due to resource cost • Balance global CPU/memory among multiple services
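
One simple way to avoid frequent allocation changes is a dead band: ignore requested changes that are too small to be worth their cost. The sketch below illustrates that idea only. The relative threshold rule is an assumption made here and is not the optimization described in the slides.

```python
def apply_with_dead_band(current, requested, threshold=0.1):
    """Return the allocation to use for the next interval.

    The allocation changes only if the requested value differs from the
    current one by at least `threshold` (as a fraction of the current value).
    """
    if current <= 0:
        # Nothing allocated yet: accept the request as-is.
        return requested
    relative_change = abs(requested - current) / current
    if relative_change < threshold:
        return current
    return requested


cpu_share = 2.0
for request in (2.1, 1.95, 3.0):
    cpu_share = apply_with_dead_band(cpu_share, request)
    print(request, "->", cpu_share)
```

A wider dead band means fewer reconfigurations but a slower reaction to real workload shifts, so the threshold would need tuning per application.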

Framework Design (architecture diagram) • Components: Application Performance Manager, Resource Provisioning Controller, Service Deployment, Service Wrapper • Modules: Performance Analysis, Priority Assignment, Status Query, Application Controller, Resource Model, Model Optimizer • Virtualization Management (Eucalyptus, OpenNebula, ...) • Xen Hypervisors, each hosting multiple VMs

Experimental Setup • Schemes Compared • Work-conserving • Static scheduling • Metrics • Benefit Percentage • Resource Cost • Emulated Cloud Environment • Xen 3.0

Resource Model Validation: Hardware Heterogeneity • Our model predicts CPU cycles and memory usage within 3% of the actual resource usage • Model trained on homogeneous hardware (M1) and on heterogeneous hardware (M2 and M3)

Performance of Dynamic Resource Provisioning Algorithm • Considered both linear and exponential pricing models • Under the linear pricing model, our approach performs 24% worse than Work Conserving

Performance of Dynamic Resource Provisioning Algorithm • Work Conserving costs 66% more than our approach does

Resource Provisioning Overhead • Optimal Execution: ideal resource configurations • Our approach performs 4%, 2%, 2%, 1% and 0.8% worse than the Optimal Execution

Roadmap • Motivation and Introduction • Parameter Adaptation in the Grid Environment • Budget Constrained Resource Provisioning • Power-aware Consolidation of Workflows • Opportunities for consolidation • Workload analysis • Consolidation algorithm • Experimental Evaluation • Future Work • Conclusion

Motivation • Another Critical Issue in Cloud Environment: Power Management • HPC servers consume a lot of energy • Significant adverse impact on the environment • To Reduce Resource and Energy Costs • Server consolidation • Minimize the total power consumption and resource costs without a substantial degradation in performance

Problem Description • Our Target Applications • Workflows with DAG structure • Multiple processing stages • Opportunities for consolidation • Research Problems • Combine parameter adaptation, budget constraints and resource allocation with consolidation and power optimization • Challenge: consolidation without parameter adaptation • Support power-aware parameter adaptation -- future work

Contributions • A power-aware consolidation framework, pSciMapper, based on hierarchical clustering and an optimization search method • pSciMapper is able to reduce the total power consumption by up to 56% with at most a 15% slowdown for the workflow • pSciMapper incurs low overhead and is thus suitable for large-scale scientific workflows

Opportunities for Consolidation: GLFS • GLFS nowcasts and forecasts meteorological information for Lake Erie • GLFS is compute-intensive • Individual tasks could incur low resource usage

Resource Usage of GLFS Task1 <500, 3, 600> <1000, 6, 600> <2000, 12, 1200> <1000, 6, 600>

Resource Usage of GLFS Task2

Observations • Periodic Behavior w.r.t. CPU, memory, disk, and network usage: Time Series • Average Resource Usage is Significantly Smaller than its Peak Value • Dependent on the Values of the Application Parameters and the Characteristics of the Host Server

Power Consumption Analysis • Resource Usage Activity • CPU, memory, disk and network • Server Consolidation • Virtualization • Interference of consolidated workloads

Power Consumption Analysis: Resource Usage • All resource activities impact power consumption • Variation in the CPU utilization has the largest impact • Memory footprint and cache activities also impact the consumed power

Power Consumption Analysis: Virtualization • Virtualization incurs very low power overhead • Contention of CPU cycles • Dynamic CPU provisioning saves power

Power Consumption Analysis: Interference • Consolidating dissimilar workloads incurs a small slowdown in the execution time and large savings in power and resource costs • Consolidating workloads with similar resource requirements significantly increases the execution time
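
The interference observation suggests pairing workloads whose resource usage peaks at different times. The sketch below picks the pair of tasks with the least correlated usage time series. It is a deliberately simple stand-in, not pSciMapper's hierarchical clustering, and the task names and usage traces are made up.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def best_pair_to_consolidate(usage):
    """Return the pair of task names whose usage series are least correlated."""
    pairs = combinations(sorted(usage), 2)
    return min(pairs, key=lambda p: pearson(usage[p[0]], usage[p[1]]))


# Hypothetical CPU utilization traces sampled at four points in time.
cpu_usage = {
    "task_a": [0.9, 0.1, 0.9, 0.1],
    "task_b": [0.1, 0.9, 0.1, 0.9],
    "task_c": [0.8, 0.2, 0.7, 0.3],
}
print(best_pair_to_consolidate(cpu_usage))
```

A fuller treatment would compare CPU, memory, disk, and network series together, since the slides note that all four affect power consumption.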
