Middleware alternatives for storm surge predictions in Windows Azure
Beth Plale, Kavitha Chandrasekar, Milinda Pathridge, Craig Mattocks, Samindra Wijeratne
The Storm Surge • An immediate and dangerous impact of climate change is the change in the strength of storms that form over oceans, making coastal communities increasingly vulnerable to storm surge. • There have already been indications that even modest changes in ocean surface temperature can have a disproportionate effect on hurricane strength and on the damage these storms inflict.
Sea, Lake and Overland Surges from Hurricanes (SLOSH) model: a hydrodynamic coastal ocean model. Small input, small output; 15,000 3-minute jobs.
Plots of maximum water surface elevation from historical Hurricane Andrew (1992) test case generated by SLOSH storm surge prediction model
Storm Surge Application • Sea, Lake, and Overland Surges from Hurricanes (SLOSH) model • Run 48 hours before hurricane landfall • A single run is sensitive to track error, so forecast accuracy falls off beyond 24 hours • Scientists run large-scale ensembles of tasks (14,000 instances) • Input and output data < 10 GB; 3-5 minutes per task • Input contains • Description of the basin (coastal region) being modeled • A track file per task • Runs on Windows
Platform-as-a-Service for Computational Science • Cloud computing is particularly well suited to on-demand high-throughput parallel computing tasks • Platform-as-a-Service (PaaS) should in theory offer an attractive alternative to Infrastructure-as-a-Service for computational science problems • We conduct a performance evaluation of three approaches to executing a high-throughput storm surge application • Sigiri, a large-scale resource abstraction tool • Windows Azure HPC Scheduler • Daytona, an iterative MapReduce runtime for Azure
Steps in workflow • SLOSH pre-processing: • A setup application generates all possible storm tracks; each track is fed as input to one SLOSH ensemble task • SLOSH post-processing: • Storm surge results can be integrated with roadway network and traffic flow information, rainfall amounts, river flow, or wind-driven waves to analyze at-risk coastal regions • SLOSH output can also feed other cloud data services, such as emergency response services • Data stage-in and stage-out • Data transfer between activities • Using the Trident Scientific Workflow Workbench (the stages are sketched below)
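A minimal sketch of how these workflow stages fit together. The function names, file format, and middleware hand-off are hypothetical placeholders for illustration; the actual pipeline is composed of Trident Scientific Workflow Workbench activities and middleware-specific task submission.

```python
# Illustrative sketch of the workflow stages above.  Names and formats are
# hypothetical; the real pipeline is built from Trident workflow activities.

def generate_tracks(basin_file):
    """Pre-processing: enumerate candidate storm tracks for the basin."""
    with open(basin_file) as f:
        return [line.strip() for line in f if line.strip()]

def run_ensemble(basin_file, tracks, submit_task):
    """Submit one SLOSH task per track via the configured middleware."""
    return [submit_task(basin_file, track) for track in tracks]

def post_process(surge_fields):
    """Post-processing: reduce per-track results into at-risk products."""
    return max(surge_fields)   # e.g. a maximum-envelope-of-water style product

if __name__ == "__main__":
    tracks = generate_tracks("basin_andrew.txt")   # input data already staged in
    results = run_ensemble("basin_andrew.txt", tracks,
                           submit_task=lambda basin, track: 0.0)  # stub middleware call
    print("peak surge across ensemble:", post_process(results))
```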
SLOSH on Windows Azure • Uses Azure worker roles. Why? • VM roles require managing multiple VM images for different applications and different configurations • Worker roles can be programmed generically to support different applications • Entails packaging the SLOSH executable with its dependencies as a .zip and uploading it to an Azure blob during the pre-processing step • Worker roles download this .zip during their initialization step • Worker roles execute the SLOSH program according to how each middleware schedules tasks (details forthcoming; see the sketch below)
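The generic worker-role pattern above can be sketched in a few lines. This is a standard-library Python stand-in; the blob URL, package layout, and executable name are assumptions, and the real roles are .NET Azure worker roles rather than Python processes.

```python
# Minimal sketch of the generic worker-role pattern: fetch the SLOSH package
# once at start-up, then run the executable for each task handed to the role.
# The URL, paths, and executable name are hypothetical placeholders.
import io
import subprocess
import urllib.request
import zipfile

PACKAGE_URL = "https://example.blob.core.windows.net/slosh/slosh_package.zip"  # hypothetical

def initialize(work_dir="slosh"):
    """Initialization step: download and unpack the SLOSH executable and dependencies."""
    with urllib.request.urlopen(PACKAGE_URL) as resp:
        zipfile.ZipFile(io.BytesIO(resp.read())).extractall(work_dir)
    return work_dir

def run_task(work_dir, track_file):
    """Execute one SLOSH run for a single storm track (one ensemble member)."""
    subprocess.run([f"{work_dir}/slosh.exe", track_file], check=True)
```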
Middleware models: Job Scheduler Middleware and Resource Abstraction Middleware
User interaction with middleware (diagrams): Resource Abstraction workflow; Daytona workflow
Resource Abstraction Middleware: Sigiri • An abstraction for grids and clouds • Web service layer • Decouples job acceptance from execution and monitoring • Daemons • Manage compute resource interactions • Job submission and monitoring • Cleaning up resources • Efficient allocation of resources • The persistence layer between the web service and the daemons is a MySQL database
Sigiri Azure Daemon • For an ensemble run, the workflow client spawns a request for each task (n requests for n tasks) • The Sigiri web service stores tasks in its persistent job queue • The Sigiri Azure daemon picks up tasks and writes them to an Azure queue • Worker role instances poll Sigiri's Azure queues for incoming execution requests • Upon receipt of a SLOSH request, an Azure worker role: • Executes the SLOSH executable to produce output • Deletes the message from the Azure queue to prevent re-execution of the request • Writes to an Azure queue to notify successful completion of the task • The workflow issues requests for data movement to/from the Azure cloud, acting as a Sigiri client (a sketch of this dispatch loop follows)
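A minimal sketch of the Sigiri dispatch loop just described, with Python's in-process queue.Queue standing in for the Azure job and completion queues; polling and message deletion against real Azure Queue storage are simplified away.

```python
# Sketch of the Sigiri dispatch loop.  queue.Queue stands in for the Azure
# queues; the real worker roles poll Azure Queue storage and delete each
# message after a successful run.
import queue

job_queue = queue.Queue()    # Sigiri Azure daemon writes one message per task
done_queue = queue.Queue()   # worker roles report successful completions here

def daemon_submit(tasks):
    """Sigiri Azure daemon: push each ensemble task onto the job queue."""
    for task in tasks:
        job_queue.put(task)

def worker_role_loop(run_slosh):
    """Worker role: poll for a task, run SLOSH, acknowledge, notify completion."""
    while True:
        try:
            task = job_queue.get(timeout=1)   # poll the incoming queue
        except queue.Empty:
            return                            # no more work
        run_slosh(task)                       # execute the SLOSH run
        job_queue.task_done()                 # 'delete' the message so it is not re-run
        done_queue.put(task)                  # notify successful completion
```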
Job Scheduling Middleware: Azure HPC Scheduler • The workflow client makes a single request to the Azure HPC Scheduler on the cluster's head node • The scheduler spawns n tasks based on the request • SLOSH is executed in parallel by the compute nodes • The head node tracks the tasks in the ensemble • The workflow client polls the job scheduler on the head node for job status via its REST API (see the polling sketch below) • SQL Azure is used for persistence of state • Workflow nodes for data movement use the Windows Azure Management API directly
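The client-side status polling described above might look roughly like the sketch below. The endpoint path, job identifier, and JSON field are assumptions for illustration, not the documented Azure HPC Scheduler REST API.

```python
# Sketch of the workflow client polling the scheduler's head node until the
# ensemble job finishes.  URL and response fields are hypothetical.
import json
import time
import urllib.request

def wait_for_job(head_node, job_id, interval_s=30):
    url = f"https://{head_node}/api/jobs/{job_id}"        # hypothetical status endpoint
    while True:
        with urllib.request.urlopen(url) as resp:
            state = json.load(resp).get("State")          # hypothetical response field
        if state in ("Finished", "Failed", "Canceled"):
            return state
        time.sleep(interval_s)                            # poll again later
```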
Job Scheduling Middleware: Daytona • The workflow client uses the MapReduce client to upload a package containing user-implemented map, reduce, data partitioning, and output writer tasks • The job is submitted to the master node • The controller on the master delegates control to the partitioner to split the input data, then delegates tasks to the map functions • A user-specified number of map tasks are deployed in parallel; each executes SLOSH • The workflow client polls the master node to check completion status of the SLOSH ensemble run • The SLOSH job is implemented as map-only (sketched below) • Workflow nodes for data movement use the Windows Azure Management API directly
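A map-only job of this shape can be sketched as follows. The partitioner and map task are Python stand-ins for the user-implemented Daytona classes, not the Daytona API itself, and a thread pool substitutes for Daytona's worker roles.

```python
# Sketch of the map-only structure: a partitioner splits the track files
# across the requested number of map tasks; each map task runs SLOSH on its
# share and there is no reduce phase.
from concurrent.futures import ThreadPoolExecutor

def partition(track_files, num_maps):
    """Partitioner: deal track files round-robin into num_maps groups."""
    return [track_files[i::num_maps] for i in range(num_maps)]

def map_task(tracks, run_slosh):
    """Map task: run SLOSH once per assigned track."""
    return [run_slosh(track) for track in tracks]

def run_job(track_files, num_maps, run_slosh):
    """Master/controller: fan the partitions out to parallel map tasks."""
    with ThreadPoolExecutor(max_workers=num_maps) as pool:
        futures = [pool.submit(map_task, part, run_slosh)
                   for part in partition(track_files, num_maps)]
        return [f.result() for f in futures]
```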
Strengths and weaknesses • Sigiri • Strengths • Extensible and decoupled architecture • Supports data movement, resource deployment, and resource management • Weaknesses • Using an Azure queue for worker role communication is a bottleneck; a more scalable, distributed mechanism is needed • Additional communication costs because the MySQL persistence layer lies outside the cloud; MySQL could be a bottleneck when querying/inserting a large number of tasks • Azure HPC Scheduler • Strengths • Supports multiple scheduling algorithms (priority-based FCFS, adaptive resource allocation, backfilling) • Web portal for scheduling and monitoring • Weaknesses • Data movement not supported
Strengths and weaknesses (cont.) • Daytona MapReduce • Strengths • The reduce phase is well suited to creating SLOSH output products (MEOW, MOM, shapefiles used in emergency response) • Data movement between map and reduce tasks is supported • Uses WCF (Windows Communication Foundation) for faster communication between worker roles, instead of Azure queues • Weaknesses • Data movement across resources not supported • All middleware approaches use the automatic persistence and replication provided by Azure storage for task-related data storage in the cloud • All approaches required roughly equal effort to get the experimental SLOSH ensemble running
Experimental Evaluation • Test environment: • Sigiri • Azure daemon on Windows 7 Enterprise edition with an Intel Core 2 Duo E8400 (2 cores) and 4 GB of RAM • Sigiri Web Service on Mac OS X Lion with a quad-core 2 GHz Intel Core i7 and 8 GB of RAM • Azure queue used for communication between the Sigiri Azure daemon and the worker roles • SLOSH jobs executed on 'Small' Windows Azure compute instances with Windows Server 2008 R2 as the guest OS • Azure HPC Scheduler • The cluster's head node and compute nodes are 'Small' Windows Azure compute instances with Windows Server 2008 R2 as the guest OS • SQL Azure and Windows Azure storage are used for communication • Daytona • Master and slave nodes use the same 'Small' compute instances • In all three cases, Windows Azure Tables are used for persistent logging
Execution model for an ensemble run (diagram): With Sigiri, the workflow client issues 1…500 requests; 500 jobs are added to Azure Queue storage, and each of the 20…100 worker roles dequeues and works on one job at a time. With Daytona or the Azure HPC Scheduler, the workflow client issues a single request; jobs are tracked in Azure Table storage, and each worker role picks up jobs and executes them one at a time.
Performance Results • Parallel efficiency = Tseq(N) / (T(N, P) × P), where Tseq(N) is the best sequential execution time and T(N, P) is the total execution time for N jobs on P nodes • Visible drop in efficiency at 60 and 80 worker roles: the workers cannot be fully utilized because 500 is not evenly divisible by 60 or 80 • Daytona's lower parallel efficiency is due to the scheduling mechanism used: we used a static task-to-node mapping for allocation • If Daytona is configured to schedule tasks round-robin, it shows parallel efficiency similar to the other middleware solutions
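A quick worked check of this definition, assuming all 500 tasks take roughly the same time (about 3 minutes per SLOSH run): with a static task-to-node mapping, the jobs run in ceil(500/P) waves on P workers, so the per-task time cancels out of the ratio and the efficiency dip at 60 and 80 workers follows directly.

```python
# Worked check of the parallel-efficiency definition, assuming every task
# takes the same time.  500 tasks on P workers run in ceil(500/P) waves.
import math

def parallel_efficiency(n_jobs, n_workers):
    waves = math.ceil(n_jobs / n_workers)     # T(N, P) is proportional to waves
    return n_jobs / (waves * n_workers)       # Tseq(N) is proportional to n_jobs

for p in (20, 40, 50, 60, 80, 100):
    print(p, round(parallel_efficiency(500, p), 2))
# 20, 50, and 100 divide 500 evenly, giving efficiency 1.0;
# 60 and 80 do not (9 and 7 waves), giving roughly 0.93 and 0.89.
```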
Performance Results: Sigiri Latencies • Results suggest that implementing better job distribution and scheduling mechanisms can improve parallel efficiency of Sigiri
Conclusion • Results to date show that Sigiri, the HPC Scheduler, and Daytona are roughly comparable in parallel efficiency when executing SLOSH ensemble runs on Windows Azure • Workflow middleware was examined for scheduling ensemble runs on Azure • The HPC Scheduler and Daytona are efficient and cost-effective • Sigiri has an extensible architecture • Daytona's reduce stage can be explored for creating output products • MEOW, MOM, shapefiles used in emergency response
Related Work • SciCumulus executes pleasingly parallel tasks with CloudSim, a cloud simulator • Evaluates performance in a cloud simulation environment for parametric sweep and data-parallel applications • Using SciCumulus requires replacing workflow components with SciCumulus components at the desktop level • Sigiri, the HPC Scheduler, and Daytona, on the other hand, provide a decoupled architecture for executing parallel jobs through a web service interface • Cloudbus toolkit • A middleware solution for using Amazon cloud infrastructure from the Cloudbus workflow engine • Executes jobs on different cloud resources • Tightly coupled • Thilina et al. analyze the MapReduce, Apache Hadoop, and Microsoft DryadLINQ frameworks for pleasingly parallel biomedical applications • We evaluate Daytona, a MapReduce runtime built on similar principles
Related Work (cont.) • Jaliya et al. discuss executing AzureBlast on Azure • Cirrus, a general parametric sweep service on Azure, is used • Issues such as fault handling (by reconfiguring and resuming jobs) and dynamic scaling are addressed • However, Cirrus was not available for evaluation • AzureBlast, a parallel version of BLAST, was studied to determine best practices for handling large-scale parallelism and large volumes of data on Azure • Simmhan et al. identify the requirements for scientific workflows on the cloud • This provided us with guidance on using the Windows Azure cloud for running scientific applications
Ongoing Work • Distributed provenance and metadata collection • Adopt Twister4Azure instead of Daytona for continued exploration of the suitability of MapReduce on Azure for this application (looking for a volunteer) • Sigiri • Implement multiple queues for faster communication between the Sigiri Azure daemon and the worker roles • Batch job submission support for performance and efficiency, needed for SLOSH ensemble runs that involve thousands of jobs per run • Move the Sigiri Azure daemon to the cloud • Uniform job distribution and scheduling mechanism for the Azure daemon: resource-aware, handling stragglers • Aggregate jobs into a single task to be executed by worker nodes • Move the workflow system to the cloud
Ongoing work (diagram): Desktop or Trident SaaS? Metadata harvest