
Neptune: Scalable Replication Management and Programming Support for Cluster-based Network Services

Neptune is a scalable clustering architecture that provides replication management and programming support for large-scale network services with persistent data. It offers a flexible programming model and replica consistency support to address availability and performance tradeoffs.

Presentation Transcript


  1. Neptune: Scalable Replication Management and Programming Support for Cluster-based Network Services Kai Shen, Tao Yang, Lingkun Chu, JoAnne L. Holliday, Douglas K. Kuschner, and Huican Zhu Department of Computer Science University of California, Santa Barbara http://www.cs.ucsb.edu/research/Neptune

  2. Motivations • Availability, incremental scalability, and manageability - key requirements for building large-scale network services. • Challenging for services with frequent persistent data updates. • Existing solutions for managing persistent data: • Pure data partitioning: no availability guarantee; poor at handling runtime hot-spots. • Disk sharing: inherently unscalable; single point of failure. • Replication provided by database vendors: tied to specific database systems; inflexible in consistency. USITS 2001, San Francisco

  3. Neptune Project Goal • Design a scalable clustering architecture for aggregating and replicating network services with persistent data. • Provide a simple and flexible programming model to shield complexity of data replication, service discovery, load balancing, and failover management. • Provide flexible replica consistency support to address availability and performance tradeoffs for different services. USITS 2001, San Francisco

  4. Related Work • TACC, MultiSpace: infrastructure support for cluster-based network services. • DDS: distributed persistent data structure for network services. • Porcupine: cluster-based email service (with commutative updates). • Bayou: weak consistency for wide-area applications. • BEA Tuxedo: platform middleware supporting transactional RPC. USITS 2001, San Francisco

  5. Outline • Motivations & Related Work • System Architecture and Assumptions • Replica Consistency and Failure Recovery • System Implementation and Service Deployments • Experimental Studies USITS 2001, San Francisco

  6. Partitionable Network Services Characteristics of network services: • Information independence. Service data can be divided into independent categories (e.g. discussion groups). • User independence. Data accessed by different users tend to be independent (e.g. email service). Neptune targets partitionable network services: • Service data can be divided into independent partitions. • Each service access can be delivered independently on a single partition; or • Each service access can be aggregated from sub-services, each of which can be delivered independently on a single partition. USITS 2001, San Francisco
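
As a concrete illustration of partitionability, the following minimal C sketch maps a service key (a user or discussion-group identifier) to one of a fixed number of independent partitions. The hash function, the 16-partition count, and the function names are illustrative assumptions, not Neptune's actual partitioning scheme.

    /* Toy partition mapping: accesses for the same key always land on the
     * same partition, so each request can be served independently there.
     * Hash choice and partition count are assumptions for illustration. */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_PARTITIONS 16

    /* FNV-1a hash over the key bytes. */
    static uint32_t hash_key(const char *key)
    {
        uint32_t h = 2166136261u;
        for (; *key; key++) {
            h ^= (uint8_t)*key;
            h *= 16777619u;
        }
        return h;
    }

    static int key_to_partition(const char *key)
    {
        return (int)(hash_key(key) % NUM_PARTITIONS);
    }

    int main(void)
    {
        printf("user 'alice'      -> partition %d\n", key_to_partition("alice"));
        printf("group 'comp.arch' -> partition %d\n", key_to_partition("comp.arch"));
        return 0;
    }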

  7. Conceptual Architecture for a Neptune Service Cluster USITS 2001, San Francisco

  8. Neptune Components Neptune components on the client and server sides: • Neptune Server Module: starts, regulates, and terminates registered service instances, and maintains replica data consistency. • Neptune Client Module: provides location-transparent access to application service clients. USITS 2001, San Francisco

  9. Programming Interfaces Request/Response communications: • Client-side API: (called by service clients) NeptuneCall (CltHandle, Service, Partition, SvcMethod, Request, Response); • Service Interface: (abstract interface that application services implement) SvcMethod (SvcHandle, Partition, Request, Response); Stream-based communications: • Neptune sets up a bi-directional stream between the service client and the service instance. USITS 2001, San Francisco
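
The sketch below shows how these two interfaces might fit together, assuming opaque handle and buffer types that the slide does not define; only the parameter lists follow the slide. The NeptuneCall stub dispatches locally for illustration, whereas the real Neptune client module would perform service discovery, load balancing, and failover.

    /* Hypothetical sketch of the request/response interfaces on both sides.
     * Handle and buffer types are assumptions; only the parameter lists
     * mirror the slide. */
    #include <stdio.h>
    #include <string.h>

    typedef struct { int id; } NeptuneCltHandle;  /* assumed opaque client handle  */
    typedef struct { int id; } NeptuneSvcHandle;  /* assumed opaque service handle */
    typedef struct { char data[256]; int len; } Buffer;  /* assumed message buffer */

    /* Abstract service interface: the application implements one function per
     * service method.  Here, a toy "ViewMessage" handler for a discussion group. */
    static int ViewMessage(NeptuneSvcHandle *svc, int partition,
                           const Buffer *request, Buffer *response)
    {
        (void)svc;
        response->len = snprintf(response->data, sizeof response->data,
                                 "partition %d: body of message %.*s",
                                 partition, request->len, request->data);
        return 0;
    }

    /* Client-side API from the slide.  This stub dispatches locally; the real
     * Neptune client module would pick a replica, balance load, and fail over. */
    static int NeptuneCall(NeptuneCltHandle *clt, const char *service, int partition,
                           const char *svc_method, const Buffer *req, Buffer *resp)
    {
        (void)clt; (void)service;
        if (strcmp(svc_method, "ViewMessage") == 0)
            return ViewMessage(NULL, partition, req, resp);
        return -1;  /* unknown method */
    }

    int main(void)
    {
        NeptuneCltHandle clt = {0};
        Buffer req = {"42", 2}, resp = {{0}, 0};
        NeptuneCall(&clt, "DiscussionGroup", 7, "ViewMessage", &req, &resp);
        printf("%.*s\n", resp.len, resp.data);
        return 0;
    }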

  10. Assumptions • All system modules follow the fail-stop failure model. • Network partitions do not occur inside the service cluster. Neptune does allow persistent data to survive all-node failures. • Atomic execution is supported if each underlying service module ensures atomicity in a stand-alone configuration. USITS 2001, San Francisco

  11. Neptune Replica Consistency Model A service access is called a write if it changes the state of persistent data; it is called a read otherwise. • Level 1: Write-anywhere replication for commutative writes. Writes are accepted at any replica and propagated to peers. E.g. message board (append-only). • Level 2: Primary-secondary replication for ordered writes. Writes are only accepted at the primary node, then ordered and propagated to the secondaries. • Level 3: Primary-secondary replication with staleness control. Soft time-based staleness bound and progressive version delivery. Not strong consistency because writes complete independently at each replica. USITS 2001, San Francisco
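
A minimal sketch of how a write could be routed under each level is given below, assuming a per-partition replica list with a designated primary; the enum names and data structures are illustrative, not Neptune's internals.

    /* Illustrative write routing per consistency level (assumed structures). */
    #include <stdio.h>
    #include <stdlib.h>

    typedef enum {
        LEVEL1_WRITE_ANYWHERE,     /* commutative writes: any replica accepts         */
        LEVEL2_PRIMARY_SECONDARY,  /* ordered writes: primary accepts, then propagates */
        LEVEL3_STALENESS_CONTROL   /* level-2 ordering plus staleness bounds on reads  */
    } ConsistencyLevel;

    typedef struct {
        int replicas[8];   /* node IDs holding this partition        */
        int num_replicas;
        int primary;       /* index into replicas[], levels 2 and 3  */
    } PartitionReplicas;

    /* Pick the node that should accept a write for this partition. */
    static int choose_write_target(const PartitionReplicas *p, ConsistencyLevel level)
    {
        if (level == LEVEL1_WRITE_ANYWHERE)
            return p->replicas[rand() % p->num_replicas];  /* any replica; peers get it later */
        return p->replicas[p->primary];                    /* levels 2 & 3: primary orders it */
    }

    int main(void)
    {
        PartitionReplicas p = { {3, 5, 9, 12}, 4, 1 };
        printf("level-1 write goes to node %d\n", choose_write_target(&p, LEVEL1_WRITE_ANYWHERE));
        printf("level-2 write goes to node %d\n", choose_write_target(&p, LEVEL2_PRIMARY_SECONDARY));
        return 0;
    }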

  12. Soft Time-based Staleness Bound • Semantics: each read serviced at a replica is at most x seconds stale compared to the primary. • Important for services such as on-line auction. • Implementation: • Each replica periodically announces its data version; • The Neptune client module directs requests only to replicas with a fresh enough version. • The bound is soft, depending on network latency, announcement frequency, and intermittent packet losses. USITS 2001, San Francisco
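
The following sketch shows one way the client-side check could work, assuming the client module keeps a short history of the primary's version announcements and accepts a replica whose announced version is at least the version the primary had x seconds ago. The record layouts and the 30-second bound are assumptions.

    /* Client-side staleness filter (assumed bookkeeping, not Neptune's code). */
    #include <stdio.h>
    #include <time.h>

    #define STALENESS_BOUND_SEC 30   /* the bound "x"; illustrative value */
    #define HISTORY_LEN 64

    typedef struct { time_t at; long version; } Announcement;

    typedef struct {
        Announcement history[HISTORY_LEN];  /* primary announcements, oldest first */
        int count;
    } PrimaryHistory;

    /* Version the primary had announced as of time t (0 if no older record). */
    static long primary_version_at(const PrimaryHistory *h, time_t t)
    {
        long v = 0;
        for (int i = 0; i < h->count && h->history[i].at <= t; i++)
            v = h->history[i].version;
        return v;
    }

    /* A replica may serve reads if it is no more than STALENESS_BOUND_SEC behind
     * the primary.  The bound is soft: announcement frequency, network latency,
     * and lost packets all add slack. */
    static int fresh_enough(const PrimaryHistory *h, long replica_version, time_t now)
    {
        return replica_version >= primary_version_at(h, now - STALENESS_BOUND_SEC);
    }

    int main(void)
    {
        time_t now = time(NULL);
        PrimaryHistory h = { { { now - 60, 100 }, { now - 20, 105 } }, 2 };
        printf("replica at version 100: %s\n", fresh_enough(&h, 100, now) ? "eligible" : "too stale");
        printf("replica at version  90: %s\n", fresh_enough(&h,  90, now) ? "eligible" : "too stale");
        return 0;
    }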

  13. Progressive Version Delivery • From each client’s point of view: • Writes are always seen by subsequent reads. • Versions delivered for reads are progressive. • Important for services like on-line auction. • Implementation: • Each replica periodically announces its data version; • Each service invocation returns a version number for the service client to keep as a session variable; • The Neptune client module directs a read to a replica with an announced version >= all previously returned versions. USITS 2001, San Francisco
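
A small sketch of the client-side bookkeeping is shown below: the session records the highest version returned so far, and reads are routed only to replicas whose announced version is at least that high. Struct layouts and the selection function are assumptions for illustration.

    /* Progressive version delivery from the client's side (assumed layouts). */
    #include <stdio.h>

    typedef struct { long last_seen_version; } Session;   /* kept per service client */
    typedef struct { int node; long announced_version; } Replica;

    /* Choose a replica whose announced version is >= every version this session
     * has already seen; returns -1 if none currently qualifies. */
    static int pick_replica(const Session *s, const Replica *reps, int n)
    {
        for (int i = 0; i < n; i++)
            if (reps[i].announced_version >= s->last_seen_version)
                return reps[i].node;
        return -1;
    }

    /* After each invocation, fold the returned version into the session so that
     * later reads never observe an older state (versions are progressive). */
    static void record_result(Session *s, long returned_version)
    {
        if (returned_version > s->last_seen_version)
            s->last_seen_version = returned_version;
    }

    int main(void)
    {
        Session s = {0};
        Replica reps[] = { {3, 104}, {5, 101}, {9, 99} };
        record_result(&s, 102);                     /* a write returned version 102 */
        printf("read goes to node %d\n", pick_replica(&s, reps, 3));  /* node 3     */
        return 0;
    }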

  14. Failure Recovery A REDO log is maintained for each data partition at each replica, with two portions: • Committed portion: completed writes; • Uncommitted portion: writes received but not yet completed. Three-phase recovery for primary-secondary replication (level-2 & level-3): • Synchronize with the underlying service module; • Recover missed writes from the current primary; • Resume normal operations. Only phase one is necessary for write-anywhere replication (level-1). USITS 2001, San Francisco
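
The sketch below outlines the per-partition REDO log and the three recovery phases, assuming illustrative record layouts and function names; only the committed/uncommitted split and the phase order follow the slide.

    /* Per-partition REDO log and recovery phases (illustrative skeleton). */
    #include <stdio.h>

    typedef struct { long seq; /* ... write payload ... */ } WriteRecord;

    typedef struct {
        WriteRecord log[1024];
        int committed_end;    /* log[0 .. committed_end)   : completed writes          */
        int uncommitted_end;  /* log[committed_end .. uncommitted_end): received only  */
    } RedoLog;

    /* Phase 1: reconcile the log with the underlying service module's state,
     * e.g. settle the fate of uncommitted writes (needed at every level). */
    static void sync_with_service_module(RedoLog *l) { (void)l; /* ... */ }

    /* Phase 2: fetch writes this replica missed from the current primary and
     * append them to the committed portion (levels 2 and 3 only). */
    static void recover_missed_writes(RedoLog *l, int primary_node) { (void)l; (void)primary_node; }

    /* Phase 3: rejoin the service pool and resume announcing versions. */
    static void resume_normal_operation(void) { }

    static void recover_partition(RedoLog *l, int primary_node, int write_anywhere)
    {
        sync_with_service_module(l);                 /* phase 1: always       */
        if (!write_anywhere)
            recover_missed_writes(l, primary_node);  /* phase 2: levels 2 & 3 */
        resume_normal_operation();                   /* phase 3               */
    }

    int main(void)
    {
        RedoLog l = { {{0}}, 0, 0 };
        recover_partition(&l, /*primary_node=*/3, /*write_anywhere=*/0);
        puts("partition recovered");
        return 0;
    }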

  15. Outline • Motivations & Related Work • System Architecture and Assumptions • Replica Consistency and Failure Recovery • System Implementation and Service Deployments • Experimental Studies USITS 2001, San Francisco

  16. Prototype System Implementation on a Linux cluster • Service availability and node runtime workload are announced through IP multicast: • multicast once a second; • kept as soft state, expiring in five seconds. • Service instances can run either as processes or threads in the Neptune server runtime environment. • Each Neptune server module maintains a process/thread pool and a waiting queue. USITS 2001, San Francisco
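
A sketch of the soft-state bookkeeping driven by these announcements appears below: an entry is refreshed on every multicast announcement and treated as dead once five seconds pass without one. The table layout and field names are assumptions.

    /* Soft-state membership entry fed by once-a-second multicast announcements. */
    #include <stdio.h>
    #include <time.h>

    #define ANNOUNCE_INTERVAL_SEC 1   /* senders multicast once a second   */
    #define EXPIRY_SEC 5              /* entries expire after five seconds */

    typedef struct {
        int    node;
        double load;          /* runtime workload carried in the announcement */
        time_t last_heard;    /* when the last announcement arrived           */
    } NodeState;

    /* Called whenever an announcement is received over IP multicast. */
    static void refresh(NodeState *n, double load, time_t now)
    {
        n->load = load;
        n->last_heard = now;
    }

    /* Soft state: a node silently drops out of the candidate set once its
     * announcements stop arriving; no explicit failure notification is needed. */
    static int is_alive(const NodeState *n, time_t now)
    {
        return (now - n->last_heard) <= EXPIRY_SEC;
    }

    int main(void)
    {
        time_t now = time(NULL);
        NodeState n = { 7, 0.0, now - 10 };
        refresh(&n, 0.35, now);
        printf("node %d alive: %s (load %.2f)\n", n.node,
               is_alive(&n, now) ? "yes" : "no", n.load);
        return 0;
    }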

  17. Experience with Service Deployments • On-line discussion group • View message headers, view message, and add message. • All three consistency levels can be applied. • Auction • Level 3 consistency with staleness control is used. • Persistent cache • Store key-value pairs (e.g. caching query result). • Level 2 consistency (primary-secondary) is used. Fast prototyping and implementation without worrying about replication/clustering complexities. USITS 2001, San Francisco

  18. Experimental Settings for Performance Evaluation • Synthetic Workloads: • 10% and 50% write percentages; • Balanced workload to assess best-case scalability; • Skewed workload to evaluate the impact of runtime hotspots. • Metric: maximum throughput when at least 98% of client requests are completed in 2 seconds. • Evaluation Environment: • Linux cluster with dual 400MHz Pentium IIs, 512MB/1GB memory, and dual 100Mb/s Ethernet interfaces. • Lucent P550 Ethernet switch with 22Gb/s backplane bandwidth. USITS 2001, San Francisco
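
The throughput metric can be restated as a simple check per load level, as in the sketch below: a load level passes if at least 98% of its requests finish within 2 seconds, and the reported throughput is the highest passing request rate. The thresholds follow the slide; the surrounding measurement code is assumed.

    /* Check whether one measured load level satisfies the latency criterion. */
    #include <stdio.h>

    static int run_passes(const double *latencies_sec, int n)
    {
        int within = 0;
        for (int i = 0; i < n; i++)
            if (latencies_sec[i] <= 2.0)
                within++;
        return within >= (int)(0.98 * n + 0.5);   /* 98% within 2 seconds */
    }

    int main(void)
    {
        double sample[] = { 0.2, 0.4, 1.1, 0.7, 2.5, 0.3, 0.6, 0.9, 1.8, 0.5 };
        printf("this load level %s the latency bound\n",
               run_passes(sample, 10) ? "meets" : "violates");
        return 0;
    }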

  19. Scalability under Balanced Workload • NoRep is about twice as fast as Rep=4 under 50% writes. • Insignificant performance difference across the three consistency levels under a balanced workload. USITS 2001, San Francisco

  20. Skewed Workload • Each skewed workload consists of requests chosen from a set of partitions according to a Zipf distribution. • Define the workload imbalance factor as the proportion of requests directed to the most popular partition. • For a 16-partition service, an imbalance factor of 1/16 indicates a completely balanced workload. • An imbalance factor of 1 means all requests are directed to one partition. USITS 2001, San Francisco
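
The sketch below generates such a skewed workload over 16 partitions with a Zipf-like distribution and reports the resulting imbalance factor. The Zipf exponent is an assumption; the slide does not state it.

    /* Draw requests over partitions with Zipf weights, then compute the share
     * of requests hitting the most popular partition (the imbalance factor). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define NUM_PARTITIONS 16
    #define NUM_REQUESTS   100000

    int main(void)
    {
        double exponent = 1.0;              /* assumed Zipf exponent */
        double weights[NUM_PARTITIONS], total = 0.0;
        long   hits[NUM_PARTITIONS] = {0};

        /* Zipf weights: partition k gets probability proportional to 1/k^s. */
        for (int k = 0; k < NUM_PARTITIONS; k++) {
            weights[k] = 1.0 / pow(k + 1, exponent);
            total += weights[k];
        }

        /* Draw requests by inverting the cumulative distribution. */
        srand(1);
        for (int i = 0; i < NUM_REQUESTS; i++) {
            double u = (rand() / (double)RAND_MAX) * total;
            int k = 0;
            double acc = weights[0];
            while (u > acc && k < NUM_PARTITIONS - 1) {
                k++;
                acc += weights[k];
            }
            hits[k]++;
        }

        /* Imbalance factor: 1/16 is perfectly balanced, 1 means every request
         * goes to a single partition. */
        long max_hits = 0;
        for (int k = 0; k < NUM_PARTITIONS; k++)
            if (hits[k] > max_hits) max_hits = hits[k];
        printf("imbalance factor: %.3f\n", (double)max_hits / NUM_REQUESTS);
        return 0;
    }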

  21. Impact of Workload Imbalance on Replication Degrees • Replication provides dynamic load-sharing for runtime hot-spots (Rep=4 could be up to 3 times as fast as NoRep). 10% writes; level-2 consistency; 8 nodes. USITS 2001, San Francisco

  22. Impact of Workload Imbalance on Consistency Levels 10% writes; Rep degree 4; 8 nodes. • Modest performance difference: • Up to 12% between level-2 and level-3; • Up to 9% between level-1 and level-2. USITS 2001, San Francisco

  23. Failure Recovery for Primary-secondary Replication • Graceful performance degradation. • Performance drops after the three-node failure. • Errors and timeouts trail each recovery (due to write recovery and synchronization overhead). USITS 2001, San Francisco

  24. Conclusions Contributions: • Scalable replication for cluster-based network services; multi-level consistency with staleness control. • A simple programming model to shield replication and clustering complexities from application service authors. Evaluation results: • Replication improves performance for runtime hotspots. • Performance of level-3 consistency is competitive. • Levels 2 and 3 carry extra overhead during failure recovery. http://www.cs.ucsb.edu/research/Neptune USITS 2001, San Francisco
