
StarFish: highly-available block storage


Presentation Transcript


  1. StarFish: highly-available block storage Eran Gabber Jeff Fellin Michael Flaster Fengrui Gu Bruce Hillyer Wee Teck Ng Banu Özden Elizabeth Shriver 2003 USENIX Annual Technical Conference Presenter: D00922019 林敬棋

  2. Introduction • Important data needs to be protected. • Protection is achieved by making replicas. • Replication on remote sites • Reduces the amount of data lost in a failure. • Decreases the time required to recover from a catastrophic site failure.

  3. StarFish • A highly-available, geographically-dispersed block storage system. • Does not require expensive dedicated communication lines to all replicas to achieve high availability. • Achieves good performance even during recovery from a replica failure. • Provides single-owner access semantics.

  4. Architecture • StarFish consists of • One Host Element (HE) • Provides storage virtualization and a read cache. • N Storage Elements (SEs) • Q: write quorum size. • Writes propagate synchronously to a quorum of Q SEs and asynchronously to the rest.
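The synchronous-quorum / asynchronous-propagation write path described above can be sketched in Python. This is an illustrative sketch, not the actual StarFish implementation: the class and method names, and the thread-per-write background propagation, are assumptions for clarity.

```python
import threading

class HostElement:
    """Sketch of the HE write path: synchronous to a quorum of Q
    Storage Elements, asynchronous to the remaining N - Q.
    (Illustrative only; not the actual StarFish code.)"""

    def __init__(self, storage_elements, q):
        assert 1 <= q <= len(storage_elements)
        self.ses = storage_elements   # the N Storage Elements
        self.q = q                    # write quorum size Q

    def write(self, block_no, data):
        # Synchronous phase: the write is acknowledged only after
        # the first Q SEs have committed it.
        for se in self.ses[:self.q]:
            se.write(block_no, data)
        # Asynchronous phase: propagate to the remaining SEs in the
        # background; the host does not wait for these.
        for se in self.ses[self.q:]:
            threading.Thread(target=se.write,
                             args=(block_no, data)).start()
        return True  # acknowledged after the quorum commit
```

With N = 3 and Q = 2 (the recommended setup), a write returns once two SEs hold the block, while the third catches up asynchronously.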

  5. Recommended Setup • N = 3, Q = 2 • MAN: Metropolitan Area Network • WAN: Wide Area Network

  6. Another Deployment

  7. SE Recovery • Write log • HE keeps a circular buffer of recent writes. • Each SE maintains a circular buffer of recent writes on a log disk. • Three types of recovery • Quick recovery • Replay recovery • Full recovery
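The circular write log that enables quick and replay recovery can be sketched as follows. This is a minimal illustration under assumed details (per-write sequence numbers, and falling back to full recovery when the log has wrapped past the SE's last committed write); it is not the StarFish code.

```python
from collections import deque

class WriteLog:
    """Sketch of the circular buffer of recent writes kept by the HE
    (in memory) and by each SE (on a log disk). Capacity is fixed,
    so the oldest entries are overwritten. (Illustrative only.)"""

    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)  # oldest entries fall off
        self.seq = 0                       # monotonically increasing

    def append(self, block_no, data):
        self.seq += 1
        self.buf.append((self.seq, block_no, data))
        return self.seq

    def replay_from(self, last_seq):
        """Entries newer than last_seq, for replay recovery of a
        lagging SE. Returns None if those entries have already been
        overwritten, which forces a full recovery instead."""
        if self.buf and self.buf[0][0] > last_seq + 1:
            return None  # gap in the log: full copy required
        return [e for e in self.buf if e[0] > last_seq]
```

A recovering SE that missed only a few recent writes replays them from the log; one that has been down too long finds a gap and must perform a full recovery.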

  8. Availability and Reliability • Assume that the failure and recovery processes of the network links and SEs are i.i.d. Poisson processes with combined mean failure and recovery rates of λ and μ per second. • Similarly, the HE fails and recovers at Poisson rates λhe and μhe.

  9. Availability • The steady-state probability that at least Q SEs are available. • Derived from the standard machine repairman model.

  10. Machine Repairman Model
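The availability computation can be sketched numerically from the standard machine repairman model: the state is the number of failed SEs, each up SE fails at rate λ, and a single repairman repairs at rate μ, with ρ = λ/μ. The paper's exact variant may differ (e.g., in the number of repairmen); this sketch just shows the shape of the calculation, including the "number of nines" measure used later.

```python
from math import factorial, floor, log10

def repairman_pi(n, rho):
    """Steady-state distribution of the single-repairman machine
    repairman model: pi[k] is the probability that k of the n SEs
    are failed, with rho = lambda/mu. (Standard model; the paper's
    variant is an assumption here.)"""
    w = [factorial(n) // factorial(n - k) * rho**k for k in range(n + 1)]
    total = sum(w)
    return [x / total for x in w]

def availability(n, q, rho):
    """P(at least Q of N SEs up) = P(number failed <= N - Q)."""
    pi = repairman_pi(n, rho)
    return sum(pi[: n - q + 1])

def nines(a):
    """Number of 9s in an availability figure, e.g. 0.999 -> 3."""
    return floor(-log10(1 - a) + 1e-9)  # epsilon guards float error
```

For fixed N, `availability(n, q, rho)` decreases as Q grows, matching the slide's observation that a larger quorum trades availability for reliability.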

  11. Availability (cont.)

  12. Availability (cont.) • Number of 9s: availability is quoted as the count of 9s in the figure (e.g., 99.99% = four nines). • Availability is much higher when N = 2Q + 1. • For fixed N, availability decreases with a larger quorum size. • Increasing the quorum size trades off availability for reliability.

  13. Reliability • The probability of no data loss. • Reliability increases with larger Q. • Two approaches • Make Q > floor(N/2) and require at least Q SEs to be available. • This reduces availability and performance. • Read-only consistency
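The condition Q > floor(N/2) is the majority-quorum condition: any two quorums of size Q must share at least one SE, so every new quorum sees the latest committed write. A brute-force check (an illustrative sketch, not part of the paper) confirms this:

```python
from itertools import combinations

def quorums_intersect(n, q):
    """Check by enumeration that every pair of size-q quorums out of
    n SEs shares at least one SE. True exactly when q > n // 2.
    (Illustrative check of the majority-quorum condition.)"""
    ses = range(n)
    return all(set(a) & set(b)
               for a in combinations(ses, q)
               for b in combinations(ses, q))
```

For the recommended N = 3, Q = 2, any two quorums overlap; with N = 4, Q = 2 they need not, so a write could be missed.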

  14. Read-only Consistency • The system remains available in read-only mode during failures. • Read-only mode obviates the need for Q SEs to be available, since no updates are accepted. • Increases availability.

  15. Availability with Read-only Consistency

  16. Observations • If ρhe = 0, availability is independent of Q. • The system can always recover from the HE. • As ρhe increases, availability increases with Q. • The largest increase occurs from Q = 1 to Q = 2, and is bounded by 3/16 when ρ = 1. • Diminishing gains after Q = 2. • This suggests Q = 2 for practical systems.

  17. Implementation

  18. Performance Measurements • Compared against a direct-attached RAID unit.

  19. Settings • Different network delays • 1, 2, 4, 8, 23, 36, 65 ms • Different bandwidth limitations • 31, 51, 62, 93, 124 Mb/s. • Benchmark: • Micro-benchmark • Read hit • Read miss • Write • PostMark

  20. Effects of network delays and HE cache size • Near SE delay: 4 ms; Far SE delay: 8 ms • No cache misses if HE cache size = 400 MB

  21. Observations • A large HE cache improves performance. • The HE can answer more read requests without communicating with the SEs. • The cache does not affect write requests. • Especially beneficial when the local SE has significant delays. • With Q = 2 and a 400 MB cache, performance is not influenced by the delay to the local SE. • It depends on the near SE instead.

  22. Normal Operation and placement of the far SE • Far SE delay sets: 1-8 (1, 2, 4, 8 ms); 4-12 (4, 8, 12 ms); 23-65 (23, 36, 65 ms) • Bandwidth set 31-124: 31, 51, 62, 93, 124 Mb/s • Local SE delay: 0 ms • N = 3

  23. Normal Operation and placement of the far SE (Cont.) • N = 3 • 8 threads

  24. Normal Operation and placement of the far SE (Cont.)

  25. Observations • Performance is influenced mostly by two parameters • The write quorum size • The delay to the SEs • StarFish provides adequate performance even when one of the SEs is placed in a remote location. • It retains at least 85% of the performance of a direct-attached RAID.

  26. Recovery • Performance degrades more during full recovery than during quick or replay recovery.

  27. Conclusion • The StarFish system shows significant benefits from a third copy of the data at an intermediate distance. • A StarFish system with 3 replicas, a write quorum size of 2, and read-only consistency yields better than 99.9999% availability, assuming an individual Storage Element availability of 99%.
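The concluding figure can be checked with a back-of-the-envelope binomial calculation, assuming independent SE failures (a simplification of the paper's Markov-chain model): with read-only consistency the system stays at least read-only available whenever any single SE is up.

```python
# Sanity check of the concluding availability claim, assuming
# independent SE failures with per-SE availability 0.99.
p = 0.99      # availability of a single Storage Element
n = 3         # number of replicas (N = 3)

# Write availability with quorum Q = 2: at least 2 of 3 SEs up.
full = p**3 + 3 * p**2 * (1 - p)

# With read-only consistency: the system remains (read-only)
# available as long as at least 1 of the 3 SEs is up.
read_only = 1 - (1 - p)**n

print(f"quorum-available:    {full:.6f}")       # 0.999702 (three nines)
print(f"read-only available: {read_only:.6f}")  # 0.999999 (six nines)
```

The read-only figure, 1 - 0.01^3 = 0.999999, reproduces the "better than 99.9999%" claim, while requiring a full write quorum would give only about three nines.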
