
Decoupled Storage: “Free the Replicas!”


Presentation Transcript


  1. Decoupled Storage: “Free the Replicas!” Andy Huang and Armando Fox Stanford University

  2. What is decoupled storage (DeStor)? • Goal: application-level persistent storage system for Internet services • Good recovery behavior • Predictable performance • Related projects • Decoupled version of DDS (Gribble) • Federated Array of Bricks (HP Labs) at the application level • Session State Server (Ling), but for persistent state

  3. Outline • Dangers of coupling • Techniques for decoupling • Consequences

  4. ROWA – coupling and recovery don’t mix • Read One (i.e., any) • All copies must be consistent • Availability coupling: data locked during recovery to bring a replica up-to-date • Write All • Writes proceed at the rate of the slowest replica • Performance coupling: system can grind to a halt if one replica degrades • Possible causes of degradation: cache warming and garbage collection
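To make the coupling concrete, here is a minimal Python sketch with made-up latency numbers (an assumption for illustration, not a measurement): a single degraded replica gates every ROWA write, while reads stay fast.

    # Hypothetical per-replica latencies (ms); R3 is degraded, e.g. warming its cache after a restart.
    replica_latency_ms = {"R1": 2, "R2": 3, "R3": 250, "R4": 2}

    # Write All: the write is not done until every replica has acked,
    # so it proceeds at the rate of the slowest replica.
    rowa_write_ms = max(replica_latency_ms.values())   # 250 ms

    # Read One: any single replica can serve the read.
    rowa_read_ms = min(replica_latency_ms.values())    # 2 ms

    print(f"ROWA write: {rowa_write_ms} ms (coupled to the degraded replica)")
    print(f"ROWA read:  {rowa_read_ms} ms")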

  5. Decoupled ROWA – allow replicas to say “No” • Write All (but replicas can say “No”) • Performance coupling: write can complete without waiting for a degraded replica • Availability coupling: allowing stale values eliminates the need for locked data during recovery • Issue: read may return a stale value • Read One (but read all timestamps) • Replicas can say “No” to a read_timestamp request • Use quorums to make sure enough replicas say “Yes”
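A minimal sketch of the replica side of this idea, assuming an in-memory store of (value, timestamp) pairs; the class and method names are illustrative, not from the paper. A degraded replica simply refuses, and the client (shown on the later slides) only needs a majority of “Yes” answers.

    class Replica:
        """Sketch of a replica that may say "No" while degraded (cache warming, GC, recovery)."""

        def __init__(self):
            self.store = {}          # key -> (value, timestamp)
            self.degraded = False    # set while warming the cache, collecting garbage, etc.

        def write(self, key, value, ts):
            if self.degraded:
                return False                      # "No": the client can still succeed via a majority
            current = self.store.get(key)
            if current is None or ts > current[1]:
                self.store[key] = (value, ts)     # keep only the newest version
            return True                           # "Yes"

        def read_timestamp(self, key):
            if self.degraded:
                return None                       # "No" to the timestamp probe as well
            entry = self.store.get(key)
            return entry[1] if entry else 0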

  6. Quorums – use up-to-date information • Quorums 101 • Perform reads and writes on a majority of the replicas • Use timestamps to determine the correct value of a read • Performance coupling • Problem: requests distributed using static information • Consequence: one degraded node can slow down over 50% of writes • Load-balanced quorums • Use current load information to select quorum participants
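A sketch of the load-balancing idea, assuming each replica advertises some current load figure (e.g., queue length, an assumed metric): the quorum is chosen from the least-loaded replicas rather than from a static assignment, so a degraded node is simply left out.

    def pick_quorum(load_by_replica, quorum_size):
        """Choose the quorum_size least-loaded replicas as quorum participants."""
        ranked = sorted(load_by_replica, key=load_by_replica.get)   # least loaded first
        return ranked[:quorum_size]

    # R3 is degraded (long queue), so it is left out of the majority quorum.
    load = {"R1": 3, "R2": 5, "R3": 180, "R4": 4}
    print(pick_quorum(load, quorum_size=3))   # ['R1', 'R4', 'R2']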

  7. DeStor – two ways to look at it • Decoupled ROWA • “Write all” is best-effort, but write to at least a majority • Read majority of timestamps to check staleness • Load-balanced quorums (w/ read optimization) • Use dynamic load information • Read one value and majority of timestamps

  8. DeStor write • Issue write(key,val) to N replicas • Wait for a majority to ack before returning success • Else, timeout and retry or return fail • [Diagram: client C sends write v.7 to replicas R1–R4; R1, R2, and R4 update to v.7 while R3 stays at v.6, and C returns success once a majority has acked.]
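A minimal sketch of this write path in Python, modeling each replica as an in-memory dict (key -> (value, timestamp)) and a degraded replica as one that refuses the request; the function name and data model are illustrative assumptions, not the authors’ code.

    def destor_write(replicas, key, value, ts, degraded=()):
        """Issue write(key, val) to all N replicas; succeed once a majority acks.
        `replicas` is a list of dicts mapping key -> (value, timestamp)."""
        majority = len(replicas) // 2 + 1
        acks = 0
        for i, store in enumerate(replicas):
            if i in degraded:                 # replica says "No"; don't wait for it
                continue
            current = store.get(key)
            if current is None or ts > current[1]:
                store[key] = (value, ts)
            acks += 1
        return acks >= majority               # a real client would time out, retry, or fail on False

    # Four replicas, one degraded: the write still returns success.
    replicas = [{}, {}, {}, {}]
    print(destor_write(replicas, "x", "v7", ts=7, degraded={2}))   # True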

  9. DeStor read • Issue {v, tv} = read(key) to a random replica • Issue get_timestamp(key) to N replicas • Find the most recent timestamp t* in T = {t1, t2, …} • If tv = t*, return v • Else, issue read(key) to a replica with tn = t* • [Diagram: client C reads the value from one replica and timestamps from all; R1, R2, and R4 hold v.7 while R3 holds v.6, so the value v.7 matches the most recent timestamp and is returned.]
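A matching sketch of the read path, under the same illustrative dict-per-replica model as the write sketch: read one value, read a majority of timestamps, and re-read only if the first value turns out to be stale.

    import random

    def destor_read(replicas, key, degraded=()):
        """Read one value, check it against a majority of timestamps, re-read if stale."""
        majority = len(replicas) // 2 + 1
        live = [r for i, r in enumerate(replicas) if i not in degraded]
        if len(live) < majority:
            raise RuntimeError("not enough replicas said 'Yes'")

        # 1. Read the value (and its timestamp) from one live replica.
        value, ts = random.choice(live).get(key, (None, 0))

        # 2. Collect timestamps from the replicas that answered.
        t_star = max(r.get(key, (None, 0))[1] for r in live)

        # 3. If our value is already the freshest, return it; otherwise
        #    fetch the value from a replica holding t*.
        if ts == t_star:
            return value
        freshest = next(r for r in live if r.get(key, (None, 0))[1] == t_star)
        return freshest[key][0]

    replicas = [{"x": ("v7", 7)}, {"x": ("v7", 7)}, {"x": ("v6", 6)}, {"x": ("v7", 7)}]
    print(destor_read(replicas, "x"))   # 'v7'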

  10. Decoupling further – unlock the data • Client-generated physical timestamps • API: single-operation transactions with no partial updates • Assumption: clients operate independently • 2-phase commit – ensures atomicity among replicas • Couples replicas between phases • Locking complicates the implementation and recovery • 2PC not needed for DeStor? • [Diagrams: clients C1 and C2 issue concurrent reads and writes of x and y against replicas R1–R4 without locks.]

  11. Client failure – what can happen w/o locks • Issue: fewer than a majority of replicas are written • R2 and R3 → v.6 • R1 and R2/R3 → v.7 • Serializability • Once v.7 is read, make sure it is the majority • Idea: write v.7 didn’t happen until it was read • [Diagram: a client crashes after writing v.7 to only R1; depending on which replicas a later reader contacts, it sees either v.6 or v.7 as the latest version.]
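One way to realize “write v.7 didn’t happen until it was read” is a read-repair step: before a value that may have reached fewer than a majority (because its writer crashed) is returned to a reader, the reader pushes it to a majority. A hedged sketch, reusing the illustrative dict-per-replica model; the function name is an assumption.

    def repair_to_majority(replicas, key, value, ts):
        """Ensure (value, ts) is held by a majority before it is returned to a reader."""
        majority = len(replicas) // 2 + 1
        holders = sum(1 for r in replicas if r.get(key, (None, 0))[1] >= ts)
        for r in replicas:
            if holders >= majority:
                break
            if r.get(key, (None, 0))[1] < ts:
                r[key] = (value, ts)          # bring another replica up to date
                holders += 1
        return holders >= majority            # only now is it safe to return the value

    # Writer crashed after updating only R1; the first reader of v.7 completes the write.
    replicas = [{"x": ("v7", 7)}, {"x": ("v6", 6)}, {"x": ("v6", 6)}, {"x": ("v6", 6)}]
    print(repair_to_majority(replicas, "x", "v7", 7))   # True; a majority now holds v.7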

  12. Timestamps – loose synchronization is sufficient • Unsynchronized clocks • Issue: client’s writes are “lost” because other writers’ timestamps are always more recent • Why that’s okay: clients are independent, so they can’t differentiate a “lost write” from an overwritten value • Caveat: a user is often behind the client requests • User sees inter-request causality • NTP synchronizes clocks within milliseconds, which is sufficient for human-speed interactions
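A sketch of how such client-generated timestamps might look, assuming NTP-synchronized wall clocks and a hypothetical per-client id used only to break ties between concurrent writers.

    import time

    CLIENT_ID = 17   # hypothetical unique id assigned to this client

    def make_timestamp():
        """Client-generated physical timestamp: wall-clock time plus a client-id tiebreaker.
        Tuples compare lexicographically, so two clients never produce equal timestamps,
        and millisecond-level NTP skew at worst reorders writes that independent clients
        could not distinguish from an immediate overwrite anyway."""
        return (time.time(), CLIENT_ID)

    print(make_timestamp())   # e.g. (1700000000.123456, 17)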

  13. Consequence – behavior is more restricted • Good recovery behavior • Data available throughout crash and recovery • Performance degradation during cache warming doesn’t affect other replicas • Predictable performance • DeStor vs. ROWA: DeStor has better write throughput and latency at the cost of read throughput and latency • Key: better degradation characteristics → more predictable performance

  14. Performance: predictable • Write throughput (Twrite): T1 = throughput of a single replica, D1 = % degradation of one replica, D = % system degradation = [−slope / Tmax] · D1 • ROWA: Tmax = T1, slope = −T1, D = D1 • DeStor: Tmax = (N/Q) · T1 (so T1 ≤ Tmax ≤ 2·T1), slope = −T1/Q = −2·T1/(N+1), D = D1/N • [Plot: write throughput T vs. single-replica degradation D1 from 0 to 1, with curves for N = 3, 5, 7.]
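A small worked example of this write-throughput model, assuming majority quorums with Q = (N+1)/2 and N odd; the concrete numbers are illustrative, not measurements.

    def destor_write_model(T1, N, d1):
        """Peak write throughput and system-wide degradation for DeStor,
        following the formulas on this slide."""
        Q = (N + 1) // 2                       # majority quorum size
        T_max = (N / Q) * T1                   # between T1 and 2*T1
        slope = -T1 / Q                        # = -2*T1 / (N + 1)
        D = (-slope / T_max) * d1              # simplifies to d1 / N
        return T_max, D

    # One replica degraded by 50%: ROWA loses 50% system-wide (D = d1),
    # while DeStor with N = 3 loses only about 17% (d1 / N).
    T_max, D = destor_write_model(T1=1000, N=3, d1=0.5)
    print(T_max, round(D, 3))   # 1500.0 0.167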

  15. Performance: slightly degraded • Read throughput (Tread) • ROWA: Tmax = N·T1, slope = −T1, D = D1/N • DeStor: depends on the overhead of the get_timestamp requests; Tmax = N·T1 − (N/Q)·[overhead], slope = −T1 − (T1/Q)·[overhead], D ≈ D1/N • [Plot: read throughput T vs. D1 from 0 to 1; y-axis marks T1, T2, (N−1)·T1, and N·T1.]

  16. Research issues – once replicas are free… • Next step: simulate ROWA and DeStor • Measure: read and write throughput/latency • Factors: object size, working set size, read-write mix • Opens up new options for system administration • Online repartitioning, scaling, and replica replacement • Raises new issues for performance optimizations • When in-memory replication is persistent enough (non-write-through replicas)

  17. Summary • Application-level persistent storage system • Replication scheme • Write all, wait for majority • Read any, read majority of timestamps • Consequences • Data availability throughout recovery • Predictable performance when replicas degrade
