Availability in Globally Distributed Storage Systems

Availability in Globally Distributed Storage Systems Derek Weitzel

Failures in the System • Two major components in a node: applications and system

Failures in the System • Similar systems at Nebraska • [Diagram: side-by-side stacks for Nebraska and Google with Application, File Systems, and System layers; components shown: Bigtable, Cluster Scheduler, GFS, Hadoop, Hard Drive; annotations: "Could cause data loss" and "Failure will cause unavailability"]

Unavailability: Defined • Data on a node is unreachable • Detection: periodic heartbeats are missing • Correction: lasts until the node comes back or the system recreates the data
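
The detection step can be sketched as a small heartbeat monitor. This is only an illustration of the idea on the slide: the class name, the method names, and the 30-second timeout are assumptions, not details from the paper or from any real storage system.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical monitor: a node is unavailable once its heartbeat is overdue."""

    timeout_s: float  # assumed threshold; the slides do not give a value
    last_seen: dict = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        # Remember the time of the most recent heartbeat from this node.
        self.last_seen[node] = now

    def unavailable(self, now: float) -> list:
        # Nodes whose last heartbeat is older than the timeout.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_s=30.0)
    monitor.record("node-a", now=0.0)
    monitor.record("node-b", now=25.0)
    print(monitor.unavailable(now=40.0))  # node-a is overdue, node-b is not
```

A node leaves the unavailable list as soon as a new heartbeat is recorded for it, which matches the "lasts until the node comes back" case.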

Unavailability: Measured • [Figure with a "Replication Starts" marker] • Question: After replication starts, why does it take so long to recover?

Node Availability • Storage software restart: software is fast to restart

Node Availability: Time • Node updates (planned reboots) cause the most downtime.

MTTF for Components • Even though disk failure can cause data loss, node failure is much more frequent • Conclusion: Node failure is more important to system availability
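
To make the link between MTTF and availability concrete, the sketch below uses the standard steady-state relation availability = MTTF / (MTTF + MTTR). The MTTF and repair-time values are made-up placeholders chosen only to show the effect; they are not measurements from the paper.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time to failure and mean time to repair."""
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    # Placeholder values (assumptions): a component that fails rarely
    # versus one that fails often, with the same repair time.
    rare_failures = availability(mttf_hours=100_000.0, mttr_hours=1.0)
    frequent_failures = availability(mttf_hours=1_000.0, mttr_hours=1.0)
    print(f"rare failures:     {rare_failures:.6f}")
    print(f"frequent failures: {frequent_failures:.6f}")
```

With equal repair times, the component that fails more often contributes more unavailability, which is the reasoning behind the conclusion on the slide.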

Correlated Failures • A large number of nodes failing in a burst can reduce the effectiveness of replication and encoding schemes • Losing nodes before replication can start can cause data unavailability
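
One simple way to look for bursts in a failure log is to group failures that occur close together in time. The sketch below is an assumption-laden illustration: the function name and the window size are hypothetical, and the slides do not state how a burst is defined.

```python
def group_bursts(failure_times, window_s):
    """Group failure timestamps so that each failure in a group occurs
    within window_s seconds of the previous one in that group."""
    bursts = []
    for t in sorted(failure_times):
        if bursts and t - bursts[-1][-1] <= window_s:
            bursts[-1].append(t)   # close enough to the previous failure
        else:
            bursts.append([t])     # start a new group
    return bursts


if __name__ == "__main__":
    # Timestamps in seconds; both the data and the window are placeholders.
    times = [0, 50, 100, 1000, 1100, 5000]
    for burst in group_bursts(times, window_s=120):
        print(len(burst), burst)
```

Groups with many members are the bursts that can remove several replicas of the same data before re-replication begins.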

Correlated Failures • Rolling reboots of the cluster • Oh s*!t, datacenter on fire! (maybe not that bad)

Coping with Failure • Encoding • Replication • Values highlighted on the slide: 27,000 years and 27.3 M years • 3 replicas is standard in large clusters
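
The trade-off between the two approaches can be illustrated by comparing raw storage overhead and the number of lost pieces each scheme tolerates. This is a sketch under stated assumptions: the function names are hypothetical, and the Reed-Solomon parameters (6 data blocks, 3 code blocks) are an example choice, not a configuration taken from the slides.

```python
def replication_profile(replicas: int):
    """Return (storage overhead, tolerated losses) for plain replication."""
    return float(replicas), replicas - 1


def reed_solomon_profile(data_blocks: int, code_blocks: int):
    """Return (storage overhead, tolerated losses) for an erasure-coded stripe.

    A stripe of data_blocks + code_blocks can be rebuilt from any
    data_blocks surviving blocks, so it tolerates code_blocks losses.
    """
    return (data_blocks + code_blocks) / data_blocks, code_blocks


if __name__ == "__main__":
    print("3 replicas:", replication_profile(3))        # (3.0, 2)
    print("RS(6, 3):  ", reed_solomon_profile(6, 3))    # (1.5, 3)
```

Under these example parameters the encoded stripe stores half as many raw bytes as three replicas while tolerating one more loss, at the cost of more complex reads and recovery.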

Coping with Failure • Cell replication (datacenter replication)

Cell Replication • [Diagram: Cell 1 and Cell 2, with four copies of Block A shown across the two cells]

Modeling Failures We’ve seen the data, now let’s model the behavior.

Modeling Failures • A chunk of data can be in one of many states. • Consider when replication = 3: the states are 3, 2, 1, and 0 available replicas • Lose a replica, but still 2 available • Recovery restores lost replicas • 0 replicas = service unavailable • Each loss of a replica has a probability • The recovery rate is also known

Markov Model • ρ = recovery rate • λ = failure rate • s = block replications • r = minimum replication
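
A chain with these four parameters can be evaluated numerically. The sketch below computes the mean time until fewer than r replicas remain, starting from s replicas. It rests on simplifying assumptions that are not taken from the slides: each replica fails independently at rate λ (so state k loses a replica at rate k·λ), and recovery adds one replica at a time at rate ρ. The function name is hypothetical.

```python
def stripe_mttf(s: int, r: int, failure_rate: float, recovery_rate: float) -> float:
    """Mean time to go from s available replicas to r - 1 (unavailable).

    Assumed birth-death chain: state k -> k - 1 at rate k * failure_rate,
    and state k -> k + 1 at rate recovery_rate for k < s.
    """
    if not 1 <= r <= s:
        raise ValueError("need 1 <= r <= s")
    if failure_rate <= 0:
        raise ValueError("failure_rate must be positive")

    total = 0.0
    step_down = 0.0  # expected time to drop from state k + 1 to state k
    for k in range(s, r - 1, -1):
        fail = k * failure_rate
        rec = recovery_rate if k < s else 0.0
        # First-passage recursion: D_k = (1 + rec * D_{k+1}) / fail
        step_down = (1.0 + rec * step_down) / fail
        total += step_down
    return total


if __name__ == "__main__":
    # Placeholder rates per hour (assumptions, not values from the paper).
    lam, rho = 1.0 / 1000.0, 1.0
    for replicas in (1, 2, 3):
        hours = stripe_mttf(s=replicas, r=1, failure_rate=lam, recovery_rate=rho)
        print(replicas, f"{hours:.3e} hours")
```

Because recovery is much faster than failure in this example, each additional replica multiplies the mean time to unavailability by a large factor, which is the qualitative behavior the model is meant to capture.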

Modeling Failures • Using Markov models, we can find: • Nebraska: 402 years

Modeling Failures • For Multi-Cell Implementations

Paper Conclusions • Given the enormous amount of data from Google, we can say: • Failures are typically short • Node failures can happen in bursts and are not independent • In modern distributed file systems, disk failure is the same as node failure. • Built a Markov model for failures that can accurately reason about past and future availability.

My Conclusions • This paper contributed greatly by showing data from very large-scale distributed file systems. • If Reed-Solomon striping is so much more efficient, why isn’t it used by Google? Hadoop? Facebook? • Complicated code? • Complicated administration?
