Presentation Transcript


  1. The 11th ACM/IEEE International Conference on Grid Computing, October 26-28, 2010, Brussels, Belgium
A Distributed Storage System Allowing Application Users to Reserve I/O Performance in Advance for Achieving SLA
Yusuke Tanimura, Hidetaka Koie, Tomohiro Kudoh, Isao Kojima, and Yoshio Tanaka
National Institute of Advanced Industrial Science and Technology (AIST), Japan

  2. On Grids/Clouds
• Importance of the Service Level Agreement (SLA)
• A contract between users and the service providers
• End-to-end performance, reliability, etc.
• The I/O performance of the storage system tends to be a critical bottleneck.
• Network bandwidth can be guaranteed by recent technologies such as the lambda path.
[Figure: a service reaches storage over a guaranteed-bandwidth path while the storage itself remains best effort; is performance guaranteed end to end?]

  3. Requirements
• Automatic translation and performance reservation by a broker, driven by analyzed application behaviors and I/O control technologies.
On-going studies:
• QoS of parallel I/O in a distributed storage
• Focused on scheduling and resource allocation
[Figure: the application and I/O library on the storage client reach the storage servers through the storage network; each server runs a local I/O scheduler.]
However, resources are assigned to each application on a first-come (open request), first-served basis.

  4. Our Approach
• The storage system allows application users to reserve I/O performance in advance.
• Explicit throughput (MB/sec) reservation.
• With advance reservation, there is room to negotiate the contract, e.g. the financial charge for a request.
• Features of our design and implementation: a distributed storage which supports:
• Advance performance reservation: user interfaces, protocols, resource allocation, etc.
• Striping I/O with QoS according to the reservation: integration of I/O control techniques.
• Cooperation with network bandwidth reservation and computing resource reservation.

  5. Assumptions and Definitions (1)
• Assumed I/O workload (our current focus):
• Streaming type for a large amount of data
• Not a mixture of read & write in a single access
• Open for read-only, create, or append-only
• Space reservation
• Cooperates with write performance reservation
• Reserved space = “Bucket”: a user’s private space with a name, a start and end time, a space size, and guaranteed read and write throughput.
• Stored data = “Object”
[Figure: timeline of the bucket lifetime; object creation is allowed within it, and each object lifetime lies within it.]
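
The bucket attributes listed above map naturally onto a record type. A minimal sketch follows; the field names and types are illustrative, since the talk does not show Papio's actual structures.

```cpp
#include <cstdint>
#include <ctime>
#include <string>

// Bucket attributes from this slide; names and types are assumptions.
struct BucketReservation {
    std::string   name;         // the user's private space is addressed by name
    std::time_t   start_time;   // bucket lifetime: start
    std::time_t   end_time;     // bucket lifetime: end
    std::uint64_t space_bytes;  // reserved space size
    double        read_mbps;    // guaranteed read throughput (MB/sec)
    double        write_mbps;   // guaranteed write throughput (MB/sec)
};
```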

  6. Assumptions and Definitions (2)
• Performance reservation
• Metrics shown to users: throughput (MB/sec), start and end time, and access type (read for an object; write for a bucket or an object).
• Condition: the space reservation or object creation must come before the performance reservation, and, conversely, the performance reservation must be cancelled first.
• Combined reservation (sketched below)
• Supports 1 space & N performance reservations at once.
• The storage resources are co-allocated so that all the reservations are accepted, or the request is rejected.
[Figure: timeline of the bucket lifetime; write reservations are allowed for the bucket, and read and write (append) reservations are allowed during the object lifetime.]
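
A minimal sketch of the combined-reservation request shape, reusing the BucketReservation sketch from slide 5; all type and function names are illustrative, not Papio's actual API.

```cpp
#include <ctime>
#include <vector>

enum class AccessType { Read, Write };

// One performance reservation, using the metrics listed above.
struct PerformanceReservation {
    AccessType  type;                  // read for an object; write for a bucket/object
    std::time_t start_time, end_time;
    double      throughput_mbps;
};

// Combined reservation: one space reservation plus N performance
// reservations, submitted at once. BucketReservation is the slide 5 sketch.
struct CombinedRequest {
    BucketReservation                   space;
    std::vector<PerformanceReservation> performance;
};

// All-or-nothing co-allocation: returns true only if every reservation
// in the request can be accepted; otherwise the whole request is rejected.
bool try_reserve(const CombinedRequest& req);
```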

  7. Overview Architecture
Our proposed distributed storage system:
• Management server (MGS): reservation management and metadata management for buckets and objects; allocates resources and administers I/O controls according to the reservation.
• Storage servers (SS): each runs an OSD with disk I/O rate control.
• Client node: applications issue commands through the client API library, and reserve requests through a Web Services-based reservation client.
• Storage Resource Manager: accepts reserve requests over a Web Services-based protocol and passes them to the MGS.
• Global Resource Coordinator: co-allocates storage with the Network Resource Manager (network flow control).

  8. Overview Architecture (same architecture diagram as slide 7)

  9. Reservation Interface
• Command-line interface
• Web Services interface (SRM interface)
• A wrapper around the command-line interface
• Based on the GNS-WSI3 protocol
• Polling-based asynchronous operation and two-phase commit: reserve/modify/release request ... (polling) ... commit/abort ... (polling) ...
• We newly defined “ReservationResources_Type” for storage resources.
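
The polling-based two-phase sequence can be pictured as a small client loop. This is a hedged sketch, not the GNS-WSI3 bindings: the operation names and Status values are stand-ins, and only the 100 msec polling interval (reported on slide 21) comes from the talk.

```cpp
#include <chrono>
#include <string>
#include <thread>

// Hypothetical stand-ins for the remote reserve/modify/release,
// polling, and commit/abort operations described above.
enum class Status { Pending, Prepared, Committed, Failed };

std::string reserve_request();                  // phase 1: returns a reservation ID
Status      poll(const std::string& rsv_id);    // asynchronous status check
void        commit(const std::string& rsv_id);  // phase 2: confirm the reservation
void        abort_reservation(const std::string& rsv_id);

Status wait_until_settled(const std::string& rsv_id) {
    for (;;) {
        Status s = poll(rsv_id);
        if (s != Status::Pending) return s;
        // The evaluation on slide 21 polls at a 100 msec interval.
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
}

// Typical sequence: reserve ... (polling) ... commit ... (polling) ...
//   std::string id = reserve_request();
//   if (wait_until_settled(id) == Status::Prepared) commit(id);
//   else abort_reservation(id);
```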

  10. Overview Architecture (same architecture diagram as slide 7)

  11. Client API Library
• Features
• Striping I/O over multiple storage servers
• Uses a fixed I/O size against the storage servers, converting from the application’s I/O size
• Non-POSIX API: create_bucket(), delete_bucket(), create_object(), open_object(), read(), write(), close()
• A reservation ID must be specified in a create or open request for an object.
• The reservation ID is returned as a ticket when the performance reservation request is accepted.
• The management server verifies the reservation by reservation ID and user ID.
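
Only the function names in the list above come from the talk; every signature, the handle-based flow, and the papio namespace below are assumptions made for illustration.

```cpp
#include <cstddef>

namespace papio {  // hypothetical namespace; only the names below appear in the talk

int  create_bucket(const char* bucket, const char* rsv_id);
int  delete_bucket(const char* bucket);
int  create_object(const char* bucket, const char* object, const char* rsv_id);
int  open_object(const char* bucket, const char* object, const char* rsv_id);
long write(int handle, const void* buf, std::size_t len);  // striped internally
long read(int handle, void* buf, std::size_t len);
int  close(int handle);

}  // namespace papio

// A write under a reservation: the ID, returned as a ticket when the
// performance reservation was accepted, must accompany create/open.
void store(const char* rsv_id, const void* data, std::size_t len) {
    papio::create_bucket("mybucket", rsv_id);
    int h = papio::create_object("mybucket", "obj1", rsv_id);  // MGS checks rsv_id + user ID
    papio::write(h, data, len);  // the library stripes fixed-size chunks over the OSDs
    papio::close(h);
}
```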

  12. Overview Architecture (same architecture diagram as slide 7)

  13. Resource Management
• Storage resources: the disk space & throughput of each OSD in a certain period of time.
• Role of the MGS:
• Collect status information from all the OSDs: max. throughput (currently static) and used/free space. Each OSD primarily manages its own disk space.
• Allocate resources according to the reservation request.
• Record allocate/free information in the internal tables.
[Figure: a reserve request from the client reaches the MGS, whose internal tables hold the access reservation info and a cache of the space reservation info; before committing the allocation plan, the MGS exchanges a space reservation request/reply with the OSD on each storage server.]
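
The talk names the two internal tables but not their layout. A minimal sketch using the SQLite 3 C API that the prototype builds on (slide 18); the columns are assumed from the reservation attributes defined on slides 5 and 6.

```cpp
#include <sqlite3.h>

// Create the two internal tables named on this slide. The column
// layout is an assumption; the talk only names the tables.
bool init_mgs_tables(sqlite3* db) {
    const char* ddl =
        "CREATE TABLE IF NOT EXISTS access_reservation ("
        " rsv_id TEXT PRIMARY KEY, user_id TEXT, access_type TEXT,"
        " start_time INTEGER, end_time INTEGER, throughput_mbps REAL);"
        "CREATE TABLE IF NOT EXISTS space_reservation ("
        " rsv_id TEXT, osd_id TEXT, space_bytes INTEGER,"
        " PRIMARY KEY (rsv_id, osd_id));";
    return sqlite3_exec(db, ddl, nullptr, nullptr, nullptr) == SQLITE_OK;
}
```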

  14. Resource Allocation (1)
Each request usually has a time window, a space size, and a performance. The allocation strategy decides whether to balance space, balance workload, or something else.
Input: a set of reservation requests.
1. Check the availability of each OSD: estimate the available space and the available performance in the time window (performance model).
2. Score each OSD and sort the list: normalize and weight the availability (scoring model).
3. Allocate a set of OSDs to the request: assign OSDs from the list according to the score (allocation model), and check to ensure the assigned OSDs are not overused (performance model).
If the check fails, change the striping count and iterate this process (see the sketch below).
Output: a set of OSDs.
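
A compact rendering of the three-step loop under stated assumptions: the availability estimates of step 1 are taken as given inputs, and the inline score and overuse functions are placeholders for the pluggable models of the next slide.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <vector>

struct OsdStatus {
    int    id;
    double avail_mbps;   // step 1 output: available throughput in the window
    double avail_bytes;  // step 1 output: available space in the window
};

// Scoring model placeholder: "normalize and weight" reduced to fixed weights.
inline double score(const OsdStatus& o) {
    return 0.7 * o.avail_mbps + 0.3 * (o.avail_bytes / 1e9);
}

// Performance model placeholder for the overuse check in step 3.
inline bool overused(const std::vector<OsdStatus>& chosen) {
    (void)chosen;
    return false;
}

std::optional<std::vector<int>>
allocate(std::vector<OsdStatus> osds, int max_stripes) {
    // Step 2: score each OSD and sort the candidate list (best first).
    std::sort(osds.begin(), osds.end(),
              [](const OsdStatus& a, const OsdStatus& b) { return score(a) > score(b); });
    // Minimize the striping count (slide 15): try small stripe sets first.
    for (int stripes = 1; stripes <= max_stripes; ++stripes) {
        if (static_cast<std::size_t>(stripes) > osds.size()) break;
        std::vector<OsdStatus> chosen(osds.begin(), osds.begin() + stripes);
        if (!overused(chosen)) {  // step 3: keep the set only if nothing is overused
            std::vector<int> ids;
            for (const OsdStatus& o : chosen) ids.push_back(o.id);
            return ids;
        }
    }
    return std::nullopt;  // no feasible set: the request is rejected
}
```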

  15. Resource Allocation (2)
• The three models (performance, scoring, and allocation models) should be customizable by storage administrators.
• Our simple models in the prototype:
• Read throughput is proportionally shared by multiple accesses. E.g., a total of 200 MB/s shared by 2 processes -> each process can get 90 MB/s, assuming 10% overhead.
• Write access is always exclusive of any other accesses.
• Balancing the I/O workload comes first: the OSD which can provide higher throughput is assigned first (a greedy strategy). Free space is considered second.
• Minimize the striping count and limit the max. striping count.
• The striping size is fixed as a system-wide parameter.
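
The proportional-sharing read model reduces to one formula, shown here with the slide's own numbers as a check.

```cpp
// Equal share per reader minus a fixed overhead factor (10% in the prototype).
// read_share_mbps(200.0, 2) == 90.0, matching the slide's example.
inline double read_share_mbps(double total_mbps, int n_readers,
                              double overhead = 0.10) {
    return (total_mbps / n_readers) * (1.0 - overhead);
}
```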

  16. Overview Architecture (same architecture diagram as slide 7)

  17. I/O Rate Control Framework
• The storage server controls the I/O rate according to the MGS’s instruction.
• Disk I/O scheduling: under development.
• Storage network between the client and the storage servers: PSPacer is integrated into our prototype to configure the target network bandwidth on the Ethernet.
• The instruction is delivered using the capability model:
1. The client sends an open request with the reservation ID to the MGS.
2. The client receives a capability; the MGS shares a key with each OSD.
3. The client sends a connect request to the OSD on the storage server.
4. The OSD verifies the capability.
5. The OSD enforces rate control on this connection.
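
A sketch of the capability flow under the assumption that the shared key is used HMAC-style; the talk only says "capability model", so the token layout and the hmac_sha256 helper are illustrative.

```cpp
#include <string>

// Assumed helper: the talk says only that the MGS and OSD share a key;
// an HMAC over the capability fields is one standard realization.
std::string hmac_sha256(const std::string& key, const std::string& msg);

struct Capability {
    std::string rsv_id;     // reservation presented at open time (step 1)
    std::string object_id;
    double      rate_mbps;  // rate the OSD must enforce (step 5)
    std::string mac;        // MGS signature over the fields above
};

static std::string fields(const Capability& c) {
    return c.rsv_id + "|" + c.object_id + "|" + std::to_string(c.rate_mbps);
}

// Step 2, MGS side: issue the capability for an accepted open request.
Capability issue(const std::string& shared_key, Capability c) {
    c.mac = hmac_sha256(shared_key, fields(c));
    return c;
}

// Step 4, OSD side: verify the capability on the connect request (step 3).
bool verify(const std::string& shared_key, const Capability& c) {
    return hmac_sha256(shared_key, fields(c)) == c.mac;
}
```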

  18. Prototype Implementation
• Papio: our distributed storage software
• Implemented in C++ on Linux
• Uses SQLite version 3 for the internal database of the MGS
• Uses EBOFS (an extent and B+tree based object file system) as our OSD base, with its allocation algorithm extended to support space reservation
• Uses PSPacer for network bandwidth control
• Supports the simple models for resource allocation
• SRM: our reservation agent for Papio, providing the Web Services interface (SRM interface)
• Implemented in Java
• Uses GridARS to support the GNS-WSI3 protocol

  19. Evaluation
• Reservation cost
• Comparison between the command-line and SRM interfaces
• Overheads of the SRM and Papio
• Performance of reserved vs. non-reserved access
• A single occupation strategy
• A multiple occupation strategy
• Experiment environment: 6 machines connected by a Dell PowerConnect 6248 switch

  20. Reservation Cost (1)
• We had 4 experiment cases.
[Figure: four deployment cases. (a) The Web Services-based reservation client and the Storage Resource Manager (SRM) on Node-1 send MGS commands to a dummy MGS on Node-2. (b) The same client and SRM on Node-1 send MGS commands to the real MGS on Node-2. (c) MGS commands go to an MGS on the same Node-1. (d) MGS commands go from Node-1 to an MGS on Node-2.]

  21. Reservation Cost (2)
• In the result, the SRM interface was 3-4 times slower than the command-line interface because of the polling-based operation (100 msec interval).
• The cost is reasonably low and is unlikely to be a bottleneck.

  22. Reserved / non-reserved access (1)
• We measured Client-A’s read access.
• Reserved: Papio applies a single occupation strategy in which each OSD serves only one access, so I/O control is not applied.
• Non-reserved: Client-A conflicts with Client-B’s read or write access on the shared storage servers.
[Figure: four setups, Client-A with 1 stream and Client-A with striping, each in the reserved and the non-reserved case, over the storage servers (SS) and OSDs.]

  23. Reserved / non-reserved access (2)
• Non-reserved access is affected by Client-B’s access.
[Graph: Client-A’s read throughput (MB/sec) in the 1-stream and striping setups; legend: reserved, non-reserved R-R, non-reserved R-W. The reserved access stays at 55 MB/s (55 MB/s x3 with striping).]

  24. Reserved / non-reserved access (3)
• We measured Client-A’s read access.
• Reserved: Papio applies a multiple occupation strategy in which each OSD serves more than one access; I/O control by PSPacer is applied, with a 10% overhead (protocol etc.) estimation.
• Client-A reserves 80 MB/s (80 MB/s x3 with striping) and Client-B reserves 20 MB/s on the same OSDs.
• Non-reserved: Client-A conflicts with Client-B’s read or write access.
[Figure: the same four setups as slide 22, with the 80 MB/s / 20 MB/s split enforced on each storage server in the reserved case.]

  25. Reserved / non-reserved access (4)
• Reserved access achieved the requested I/O throughput.
[Graph: Client-A’s read throughput (MB/sec) in the 1-stream and striping setups; legend: single occupation, non-reserved R-R, reserved (controlled). The controlled reserved access reaches the requested 80 MB/s (80 MB/s x3 with striping).]

  26. Potential Applications
• Constraints
• Require an advance reservation
• Read and append-only access for a large amount of data
• Potential applications (scheduled execution?)
• Multimedia streaming (we gave a demo in August)
• Moving large data between data centers
• Server provisioning
[Figure: a VOD service example. The service provider coordinates and reserves resources (a watch reservation) through the NRM and the SRM; streaming servers read from the Papio storage over an optical path network.]

  27. Related Work
• SRM in OGF
• SLA features: retention policy, access latency
• Automatic configuration to satisfy a given I/O workload
• Hippodrome, MINERVA
• Resource allocation based on performance prediction
• Many existing works for QoS
• Disk I/O scheduling
• Network QoS
• Performance monitoring and feedback-based I/O control
We would like to apply some of these techniques to Papio and achieve a more fine-grained performance guarantee.

  28. Conclusion and Future Work
• Proposed an advance reservation feature, exercised by application users, for storage access.
• A different model from one in which resources are allocated at the time of creating/opening files (on demand).
• Design
• Defined performance metrics and storage resources
• Four key components: the reservation interface, the client API, the resource management framework, and the I/O control framework
• Implemented Papio and the SRM as a prototype and evaluated the basic performance and functions.
• Providing a more sophisticated user interface and a “guarantee” mechanism is left as future work.

  29. Acknowledgement
• Part of this work was supported by the Special Coordination Funds for Promoting Science and Technology of the Japanese Ministry of Education, Culture, Sports, Science and Technology.
