Availability in Globally Distributed Storage Systems
This paper explores the critical failures that can affect availability in globally distributed storage systems, such as Google Bigtable and Hadoop. It highlights two major components at a node, their failure risks, and the impact of hard drive failures. The study emphasizes node failure over disk failure in contributing to unavailability and models recovery dynamics using Markov processes. It also addresses correlated failures, planned reboots, and the effectiveness of replication schemes. Robust methods for coping with system failures are discussed to enhance data reliability in vast distributed environments.
Presentation Transcript
Availability in Globally Distributed Storage Systems Derek Weitzel
Failures in the System • Two major components in a node: • Applications • System
Failures in the System • Similar systems at Nebraska • [Diagram comparing the Nebraska and Google stacks. Labels: Bigtable, Cluster Scheduler, Application, GFS, Hadoop, File Systems, System, Hard Drive. Annotations: "Failure will cause unavailability"; "Could cause data loss".]
Unavailability: Defined • Data on a node is unreachable • Detection: • Periodic heartbeats are missing • Correction: • Lasts until node comes back • System recreates the data
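The detection step above (missing periodic heartbeats) can be sketched in a few lines. This is a minimal illustration, not the mechanism used by GFS or Hadoop: the class name `HeartbeatMonitor`, the node names, and the 30-second timeout are all hypothetical, since the slides give no concrete values.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat per node and flags nodes that went quiet."""

    timeout_s: float  # hypothetical threshold; the slides give no value
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        """Store the time of the most recent heartbeat from a node."""
        self.last_seen[node] = now

    def unavailable(self, now: float) -> List[str]:
        """Return nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=30.0)
monitor.record("node-a", now=0.0)
monitor.record("node-b", now=0.0)
monitor.record("node-a", now=25.0)  # node-a keeps reporting, node-b goes quiet
print(monitor.unavailable(now=40.0))  # ['node-b']
```

A node leaves the unavailable list as soon as a new heartbeat is recorded, which matches the "lasts until node comes back" case on the slide.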
Unavailability: Measured • Question: After replication starts, why does it take so long to recover? • [Figure annotation: "Replication Starts"]
Node Availability • Software is fast to restart • [Figure annotation: "Storage Software Restart"]
Node Availability: Time • Node updates (planned reboots) cause the most downtime. • [Figure annotation: "Planned Reboots"]
MTTF for Components • Even though disk failure can cause data loss, node failure is much more frequent • Conclusion: Node failure is more important to system availability
Correlated Failures • A large number of nodes failing in a burst can reduce the effectiveness of replication and encoding schemes • Losing nodes before replication can start can cause unavailability of data
Correlated Failures Rolling Reboots of cluster
Correlated Failures Oh s*!t, datacenter on fire! (maybe not that bad)
Coping with Failure • 3 replicas is standard in large clusters • [Figure labels: Encoding, Replication, 27,000 Years, 27.3 M Years]
Coping with Failure Cell Replication (Datacenter Replication)
Cell Replication Cell 1 Cell 2 Block A Block A Block A Block A
Modeling Failures We’ve seen the data, now let’s model the behavior.
Modeling Failures • A chunk of data can be in one of many states. • Consider when Replication = 3 • [State diagram: replica counts 3, 2, 1, 0, with an arrow labeled "Recovery"] • Lose a replica, but still 2 available • 0 replicas = service unavailable • Each loss of a replica has a probability • The recovery rate is also known
Markov Model • ρ = recovery rate • λ = failure rate • s = block replications • r = minimum replication
Modeling Failures • Using Markov models, we can find: • [Result shown on slide: 402 Years, labeled "Nebraska"]
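As a rough sketch of how a mean time to unavailability can be computed from such a chain, the function below treats the replica count as a birth-death Markov process using the slide's symbols (ρ, λ, s, r). Two assumptions are not stated on the slides: each of the i live replicas fails independently, so state i moves to i-1 at rate i·λ, and recovery moves i to i+1 at a constant rate ρ. Correlated (burst) failures are ignored, and the function name and the example rates are hypothetical.

```python
def mean_time_to_unavailability(s: int, r: int,
                                failure_rate: float,
                                recovery_rate: float) -> float:
    """Expected time for a chunk to drop from s replicas to fewer than r.

    Assumed model (not taken verbatim from the slides):
      state i -> i-1 at rate i * failure_rate (independent replica failures)
      state i -> i+1 at rate recovery_rate, for i < s
    Let h_i be the expected time to first reach state i-1 from state i.
    Then h_s = 1 / (s * lambda) and h_i = (1 + rho * h_{i+1}) / (i * lambda),
    and the answer is h_s + h_{s-1} + ... + h_r.
    """
    if not 1 <= r <= s:
        raise ValueError("need 1 <= r <= s")
    if failure_rate <= 0 or recovery_rate < 0:
        raise ValueError("need failure_rate > 0 and recovery_rate >= 0")

    hit = 0.0    # h_{i+1}; zero before the top state, which has no recovery
    total = 0.0
    for i in range(s, r - 1, -1):
        hit = (1.0 + recovery_rate * hit) / (i * failure_rate)
        total += hit
    return total


# Hypothetical rates in events per year, chosen only to show the call shape.
print(mean_time_to_unavailability(s=3, r=1, failure_rate=2.0, recovery_rate=1000.0))
```

With no recovery (ρ = 0) and two replicas the result reduces to 1/(2λ) + 1/λ, the expected lifetime of the longer-lived of two independent replicas, which is a quick sanity check on the recurrence.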
Modeling Failures • For Multi-Cell Implementations
Paper Conclusions • Given an enormous amount of data from Google, we can say: • Failures are typically short • Node failures can happen in bursts, and are not independent • In modern distributed file systems, disk failure is the same as node failure. • Built a Markov model for failures that accurately reasons about past and future availability.
My Conclusions • This paper contributed greatly by showing data from very large scale distributed file systems. • If Reed-Solomon striping is so much more efficient, why isn’t it used by Google? Hadoop? Facebook? • Complicated code? • Complicated administration?
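One way to read "more efficient" in the question above is raw storage overhead. The sketch below compares 3-way replication with a Reed-Solomon layout; the RS(6, 3) parameters (6 data blocks plus 3 code blocks) are an illustrative assumption, not a figure from the slides, and the function names are hypothetical.

```python
def replication_overhead(replicas: int) -> float:
    """Bytes stored per byte of user data under n-way replication."""
    return float(replicas)


def reed_solomon_overhead(data_blocks: int, code_blocks: int) -> float:
    """Bytes stored per byte of user data for a stripe of data + code blocks."""
    return (data_blocks + code_blocks) / data_blocks


# 3 replicas tolerate the loss of 2 copies at 3.0x storage.
print(replication_overhead(3))        # 3.0
# An assumed RS(6, 3) stripe tolerates the loss of any 3 blocks at 1.5x storage.
print(reed_solomon_overhead(6, 3))    # 1.5
```

The storage saving comes at a cost the slide hints at: rebuilding one lost block under Reed-Solomon requires reading several surviving blocks of the stripe, whereas replication only copies one replica.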