Network Weather Service: Resource Forecasting for Metacomputing

The Network Weather Service: A Distributed Resource Performance Forecasting Service for Metacomputing • Rich Wolski, Neil T. Spring, and Jim Hayes • Presented by: Mohammad Al-Saeed

Organization • Introduction • Motivation: why the NWS? • The NWS: what is the NWS? • Related work • NWS system architecture • Design goals • System components • NWS components • NWS interface • Conclusion and future work

Motivation • Applications search for the environment that delivers the best performance • Dynamic nature of metacomputing environments • Adaptive applications • Adapt to changing environments • Knowledge needed for adaptation • Resource discovery and allocation

The Network Weather Service • A distributed system for producing short-term deliverable performance forecasts • Goal: dynamically measure and forecast the performance deliverable at the application level from a set of network resources • Measurements currently supported: • Available fraction of CPU time • End-to-end TCP connection time • End-to-end TCP network latency • End-to-end TCP network bandwidth

Related Work • TReno: performance at transport layer using TCP • Pathchar: bandwidth over a path • bprobe/cprobe: bottleneck link speed and competing traffic • Topology-d: uses ping and netperf to find bandwidth between hosts in a group then analyzes this data to find minimum-cost logical topology • ReMoS: network resource monitoring

NWS System Architecture • Design objectives • Scalability: scales to any metacomputing infrastructure • Predictive accuracy: provides accurate measurements and forecasts • Non-intrusiveness: shouldn’t load the resources it monitors • Execution longevity: available at all times • Ubiquity: accessible from everywhere, monitors all resources

System Components • Four different component processes • Persistent State process: handles storage of measurements • Name Server process: directory server for the system • Sensor processes: measure current performance of different resources • Forecaster process: predicts deliverable performance of a resource during a given time

NWS Processes

NWS Components • Persistent State Management • Naming Server • Performance Monitoring: NWS Sensors • CPU Sensor • Network Sensor • Sensor Control • Cliques: hierarchy and contention • Adaptive time-out discovery • Forecasting • Forecaster and forecasting models • Sample forecaster results

Persistent State Management • All NWS processes are stateless • The system state (the measurements) is managed by the Persistent State (PS) process: • Storage & retrieval of measurements • Measurements are time-stamped plain-text strings • Measurements are written to disk immediately and acknowledged • Measurements are stored in a circular queue of tunable size
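The circular-queue storage described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `PersistentStore`, its methods, and the record format are hypothetical, and the sketch keeps records in memory rather than writing them to disk as the real PS process does.

```python
import time
from collections import deque


class PersistentStore:
    """Toy in-memory stand-in for the PS process (hypothetical class)."""

    def __init__(self, capacity):
        # deque(maxlen=...) silently drops the oldest record once the
        # capacity is reached, which gives circular-queue behavior.
        self._queue = deque(maxlen=capacity)

    def store(self, value, timestamp=None):
        """Append one measurement as a time-stamped plain-text string."""
        if timestamp is None:
            timestamp = time.time()
        self._queue.append(f"{timestamp:.3f} {value}")
        return True  # stands in for the acknowledgement sent to the sensor

    def fetch(self, count):
        """Return up to `count` of the most recent records, oldest first."""
        if count <= 0:
            return []
        return list(self._queue)[-count:]
```

The `capacity` argument plays the role of the tunable queue size: a larger value keeps a longer history for the forecaster at the cost of more storage.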

Naming Server • Primitive text-string directory service for the NWS system, provided by the Name Server (NS) process • The only component known system-wide • Information stored includes: • Name-to-IP binding information • Group configuration • Parameters for various processes • Each process must refresh its registration with the name server periodically • Centralized
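Periodic registration refresh can be illustrated with a small directory whose entries expire. The slide only says that registrations must be refreshed; the expiry rule, the `NameServer` class, and the `ttl_seconds` parameter below are assumptions made for this sketch, not the NWS implementation.

```python
class NameServer:
    """Toy directory with expiring registrations (hypothetical class)."""

    def __init__(self, ttl_seconds):
        self._ttl = ttl_seconds
        self._entries = {}  # name -> (address, expiry time)

    def register(self, name, address, now):
        """Register or refresh a binding; a refresh just resets the expiry."""
        self._entries[name] = (address, now + self._ttl)

    def lookup(self, name, now):
        """Return the address bound to `name`, or None if unknown or expired."""
        entry = self._entries.get(name)
        if entry is None:
            return None
        address, expiry = entry
        if now >= expiry:
            # The process stopped refreshing, so its binding is dropped.
            del self._entries[name]
            return None
        return address
```

Passing `now` explicitly keeps the sketch deterministic; a running service would read the clock instead. A process that dies simply stops refreshing, and its binding disappears after one time-to-live period.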

Performance Monitoring • Actual monitoring is performed by a set of sensors • Accuracy vs. intrusiveness • A sensor’s life: { register with the NS; query the NS for parameters; generate conditional test; forever { if conditions are met then { perform test; time-stamp results and send them to the PS; refresh registration with the NS } } }

CPU Sensor • Measures available CPU fraction • Testing tools: • Unix uptime: reports load average in the past x minutes • Unix vmstat: reports idle-, user- and system-time • Active probes • Accuracy: • Results assume a full priority job • Doesn’t know the priority of jobs in the queue
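One simple way to turn a load average into an available CPU fraction is sketched below. The estimator is an assumption of this sketch (the slide does not give a formula): it supposes that a new full-priority job shares the CPU equally with the `load_average` processes already runnable. The function names are hypothetical; `os.getloadavg()` is the standard-library call that returns the same figures `uptime` prints on Unix.

```python
import os


def available_cpu_fraction(load_average):
    """Estimate the CPU share a new full-priority job would receive.

    Assumes equal sharing among runnable processes (sketch assumption).
    """
    if load_average < 0:
        raise ValueError("load average cannot be negative")
    return 1.0 / (load_average + 1.0)


def sample_available_cpu():
    """Read the 1-minute load average (Unix only) and convert it."""
    one_minute, _five, _fifteen = os.getloadavg()
    return available_cpu_fraction(one_minute)
```

With a load average of 1.0 the estimate is 0.5, and with 3.0 it is 0.25. The estimate ignores job priorities, which is the accuracy limitation the slide points out.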

Active Probing Improvements • [Figure: measurements produced using vmstat] • [Figure: measurements produced using uptime]

Network Sensor • Carries out network-related measurements • Testing: using active network probes • Establishes and releases TCP connections • Moves large messages to measure bandwidth and small messages to measure latency • Measures connections with all peer sensors • Problems • Accuracy: depends on the socket interface • Complexity: N² − N tests, collisions (contention)

Network Sensor Control • Sensors are organized into sensor sets called cliques • Each clique is configurable and has one leader • Clique sets are logical, but can be based on physical topology • Leaders are elected using a distributed election protocol • A sensor can participate in many cliques • Advantages • Scalability by organizing cliques in a hierarchy • Reduces the N² − N test count • Accuracy through more frequent tests
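The saving from cliques is easy to check with arithmetic. The sketch below counts directed pairwise tests; the function names and the example layout (20 sensors, four cliques of five, plus a top-level clique with one representative from each) are assumptions chosen for illustration.

```python
def pairwise_tests(n):
    """Directed end-to-end tests among n sensors: N^2 - N."""
    return n * n - n


def clique_tests(clique_sizes):
    """Total tests when sensors only probe peers inside their own cliques."""
    return sum(pairwise_tests(size) for size in clique_sizes)


flat = pairwise_tests(20)                  # every sensor probes every other
layered = clique_tests([5, 5, 5, 5, 4])    # four cliques plus a top-level one
```

Here `flat` is 380 and `layered` is 92, so the hierarchy cuts the number of tests by roughly a factor of four. The trade-off is that paths between sensors in different low-level cliques are no longer measured directly.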

Clique Hierarchy • [Figure: clique hierarchy; labels shown: National, UCSD, UTenn, PCL, SDSC]

Contention • Each leader maintains a clique token (and the time between tokens) • The sensor that holds the token performs all its tests, then passes the token to the next sensor in the list • Adaptive time-out discovery • Tokens have a time-out field • Tokens have sequence numbers • The leader adaptively controls the time-out
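The token mechanism can be sketched as follows. The slide does not say how the leader adapts the time-out, so the rule in `next_timeout` (a multiple of the slowest recent circuit, with a floor) is purely hypothetical, as are all names in the block.

```python
from dataclasses import dataclass


@dataclass
class Token:
    sequence: int   # lets sensors discard stale or duplicate tokens
    timeout: float  # seconds the leader waits before issuing a new token


def run_circuit(members, run_tests):
    """Pass the token around the member list once; only the holder tests."""
    order = []
    for sensor in members:
        run_tests(sensor)     # the holder performs all of its tests
        order.append(sensor)  # then hands the token to the next sensor
    return order


def accept(token, last_seen_sequence):
    """A sensor honors a token only if it is newer than the last one seen."""
    return token.sequence > last_seen_sequence


def next_timeout(circuit_times, margin=2.0, floor=1.0):
    """Hypothetical adaptation rule based on recent circuit durations."""
    if not circuit_times:
        return floor
    return max(floor, margin * max(circuit_times))
```

Because only the token holder probes the network, tests within a clique never collide with each other. If a token is lost, the leader's time-out expires and a token with a higher sequence number replaces it.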

Forecaster Process • A forecasting driver and a set of compile-time prediction modules • Forecasting process: • Fetching required measurements from the PS • Passing the time series to each prediction module • Choosing the best returned prediction • Incorporate sophisticated prediction techniques?
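The driver-plus-modules structure can be sketched as below. Two things are assumptions of this sketch rather than statements from the slides: the three predictors shown (last value, running mean, sliding median) and the selection rule, which picks the predictor with the lowest cumulative absolute error on the measurements seen so far.

```python
from statistics import mean, median


def last_value(history):
    return history[-1]


def running_mean(history):
    return mean(history)


def sliding_median(history, window=5):
    return median(history[-window:])


# Illustrative stand-ins for the compile-time prediction modules.
PREDICTORS = {
    "last_value": last_value,
    "running_mean": running_mean,
    "sliding_median": sliding_median,
}


def forecast(series):
    """Return (predictor name, prediction) for the next measurement."""
    if len(series) < 2:
        raise ValueError("need at least two measurements")
    errors = {name: 0.0 for name in PREDICTORS}
    # Replay the series: each predictor guesses series[i] from series[:i].
    for i in range(1, len(series)):
        history = series[:i]
        for name, predict in PREDICTORS.items():
            errors[name] += abs(predict(history) - series[i])
    best = min(errors, key=errors.get)
    return best, PREDICTORS[best](series)
```

On a series with a single spike, such as `[10, 10, 10, 50, 10, 10, 10]`, the sliding median accumulates the least error and is selected; on a steadily rising series the last-value predictor wins. Adding a new forecasting model in this sketch means adding one entry to `PREDICTORS`.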

Sample Forecaster Results • [Figure: UC Santa Barbara – Kansas State U. recorded bandwidth] • [Figure: UC Santa Barbara – Kansas State U. forecasted bandwidth]

NWS Interface • C API • Quick short-term forecasts for applications • InitForecaster() • RequestForecasts() • CGI interface • Continuous access to NWS forecasts through the web • Interactively produces graphs for performance and forecasts • http://nws.cs.utk.edu

Sample CGI-Generated Graph

Conclusion and Future Work • NWS is scalable, stable and always available • NWS relies on adaptivity to achieve its design goals • NWS is open (sensors and forecasting models can be added) • Current forecasting performs well compared with more sophisticated forecasting techniques • Enhancements • Basing the NS on LDAP • Automatic clique configuration • Forecasting methodologies
