
Fast Crash Recovery in RAMCloud


Presentation Transcript


  1. Fast Crash Recovery in RAMCloud

  2. Motivation • The role of DRAM has been increasing • Facebook used 150TB of DRAM • For 200TB of disk storage • However, there are limitations • DRAM is typically used as cache • Need to worry about consistency and cache misses

  3. RAMCloud • Keeps all data in RAM at all times • Designed to scale to thousands of servers • To host terabytes of data • Provides low-latency (5-10 µs) for small reads • Design goals • High durability and availability • Without compromising performance

  4. Alternatives • 3x replication in DRAM • 3x cost and energy • Doesn't survive power failures • RAMCloud keeps one copy in RAM • Two copies on disk • To achieve good availability • Fast crash recovery (64GB in 1-2 seconds)

  5. RAMCloud Basics • Thousands of off-the-shelf servers • Each with 64GB of RAM • With Infiniband NICs • Remote access below 10 µs

  6. Data Model • Key-value store • Tables of objects • Object • 64-bit ID + byte array (up to 1MB) + 64-bit version number • No atomic updates to multiple objects
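A minimal sketch of this data model in Python (class and field names are illustrative, not RAMCloud's actual types):

```python
from dataclasses import dataclass

MAX_VALUE_BYTES = 1 << 20  # objects carry a byte array of at most 1 MB

@dataclass
class RamObject:
    object_id: int   # 64-bit identifier within its table
    version: int     # 64-bit version number, bumped on every update
    value: bytes     # opaque byte array, up to 1 MB

    def __post_init__(self):
        if len(self.value) > MAX_VALUE_BYTES:
            raise ValueError("object value exceeds the 1 MB limit")
```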

  7. System Structure • A large number of storage servers • Each server hosts • A master, which manages objects in local DRAM and services requests • A backup, which stores copies of objects from other masters on durable storage • A coordinator • Manages config info and object locations • Not involved in most requests

  8. RAMCloud Cluster Architecture (diagram: clients, a coordinator, and storage servers, each running a master and a backup)

  9. More on the Coordinator • Maps objects to servers in units of tablets • Tablets hold consecutive key ranges within a single table • For locality reasons • Small tables are stored on a single server • Large tables are split across servers • Clients can cache tablet info to access servers directly
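A hedged sketch of tablet lookup with client-side caching; the tablet map contents, split points, and helper names below are invented for illustration:

```python
# Coordinator-side tablet map: (table_id, first_key, last_key, owning master).
# Keys are 64-bit; the split points below are purely illustrative.
TABLET_MAP = [
    (1, 0, 2**64 - 1, "master-07"),        # small table: a single tablet on one server
    (2, 0, 2**63 - 1, "master-11"),        # large table split across two servers
    (2, 2**63, 2**64 - 1, "master-23"),
]

def locate(table_id, key, cache):
    """Return the master owning (table_id, key); cache the tablet entry."""
    for t, lo, hi, server in cache:
        if t == table_id and lo <= key <= hi:
            return server                  # served from the client's tablet cache
    for entry in TABLET_MAP:               # in reality, an RPC to the coordinator
        t, lo, hi, server = entry
        if t == table_id and lo <= key <= hi:
            cache.append(entry)            # later reads go to the master directly
            return server
    raise KeyError("no tablet covers this key")

cache = []
print(locate(2, 42, cache))                # coordinator lookup -> "master-11"
print(locate(2, 43, cache))                # cache hit, no coordinator involved
```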

  10. Log-structured Storage • Logging approach • Each master logs data in memory • Log entries are forwarded to backup servers • Backup servers buffer log entries • Battery-backed • Writes complete once all backup servers acknowledge • A backup server flushes its buffer when full • 8MB segments for logging, buffering, and I/O • Each server can handle 300K 100-byte writes/sec
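A simplified sketch of this write path, assuming the buffer-and-flush behavior described above; class and method names are illustrative:

```python
SEGMENT_BYTES = 8 * 1024 * 1024             # 8 MB segments for logging, buffering, and I/O

class Backup:
    """Buffers log entries (battery-backed in the real system), flushes full segments."""
    def __init__(self):
        self.buffer = bytearray()

    def append(self, entry: bytes) -> bool:
        self.buffer += entry
        if len(self.buffer) >= SEGMENT_BYTES:
            self.flush()
        return True                         # acknowledgement returned to the master

    def flush(self):
        # write the buffered segment to disk, then start a fresh one
        self.buffer = bytearray()

class Master:
    def __init__(self, backups):
        self.log = []                       # in-memory log: the primary copy of the data
        self.backups = backups              # backups holding replicas of the current segment

    def write(self, entry: bytes) -> bool:
        self.log.append(entry)
        acks = [b.append(entry) for b in self.backups]
        return all(acks)                    # the write completes only after all backups ack
```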

  11. Recovery • When a server crashes, its DRAM content must be reconstructed • 1-2 second recovery time is good enough

  12. Using Scale • Simple 3 replica approach • Recovery based on the speed of three disks • 3.5 minutes to read 64GB of data • Scattered over 1,000 disks • Takes 0.6 seconds to read 64GB • Centralized recovery master becomes a bottleneck • 10 Gbps network means 1 min to transfer 64GB of data to the centralized master

  13. RAMCloud • Uses 100 recovery masters • Cuts the time down to 1 second
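Rough arithmetic behind the numbers on the last two slides, assuming roughly 100 MB/s of sequential read bandwidth per disk and 10 Gbps NICs:

```python
GB = 1e9
DISK_BW = 100e6                    # assumed per-disk sequential read bandwidth, bytes/s
NIC_BPS = 10e9                     # 10 Gbps network, bits/s
DATA = 64 * GB                     # DRAM contents of the crashed master

print(DATA / (3 * DISK_BW))        # ~213 s (~3.5 min): reading from 3 replica disks
print(DATA / (1000 * DISK_BW))     # ~0.64 s: segments scattered across 1,000 disks
print(DATA * 8 / NIC_BPS)          # ~51 s (~1 min): one centralized recovery master's NIC
print((DATA / 100) * 8 / NIC_BPS)  # ~0.5 s: 100 recovery masters, ~640 MB each
```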

  14. Scattering Log Segments • Ideally uniform, but with more details • Need to avoid correlated failures • Need to account for heterogeneity of hardware • Need to coordinate placement so buffers on individual machines don't overflow • Need to account for changing server membership due to failures
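One way to picture a placement routine that respects these constraints is a randomized choice with rejection; the field names and candidate count below are assumptions, not the actual RAMCloud algorithm:

```python
import random

def choose_backup(backups, seg_bytes, master_rack, replica_racks):
    """Pick a backup for one new segment replica, rejecting unsuitable candidates.

    `backups` is a list of dicts with 'rack', 'free_buffer', and 'write_mbps' keys;
    these names are illustrative, not RAMCloud's actual interface.
    """
    candidates = random.sample(backups, k=min(5, len(backups)))
    ok = [b for b in candidates
          if b["rack"] != master_rack            # avoid failures correlated with the master
          and b["rack"] not in replica_racks     # spread replicas across racks
          and b["free_buffer"] >= seg_bytes]     # don't overflow that machine's buffer
    if not ok:
        return None        # caller retries; cluster membership may have changed meanwhile
    return max(ok, key=lambda b: b["write_mbps"])  # favor faster hardware (heterogeneity)
```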

  15. Failure Detection • Periodic pings to random servers • With 99% chance to detect failed servers within 5 rounds • Recovery • Setup • Replay • Cleanup
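One plausible model behind the 99% figure, assuming every live server pings one randomly chosen peer per round (the exact ping protocol parameters are an assumption here):

```python
N = 1000                                          # cluster size (illustrative)
# Each round, every live server pings one other server chosen uniformly at random.
p_unprobed_round = (1 - 1 / (N - 1)) ** (N - 1)   # ~1/e ~= 0.37: nobody pings the dead server
p_unprobed_5 = p_unprobed_round ** 5              # ~0.007 after five rounds
print(1 - p_unprobed_5)                           # ~0.993, i.e. >99% detected within 5 rounds
```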

  16. Setup • Coordinator finds log segment replicas • By querying all backup servers • Detecting incomplete logs • Logs are self-describing • Starting partition recoveries • Each master periodically uploads a "will" to the coordinator, to be carried out in the event of its demise • Coordinator carries out the will accordingly
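A will can be pictured as a partitioning of the master's tablets into pieces sized for a single recovery master; the structure below is a sketch, not the actual format:

```python
# Sketch of a master's "will": how its tablets should be split into recovery
# partitions if it dies. Partition sizes and the layout are assumptions.
will = {
    "master": "master-07",
    "partitions": [
        {"id": 0, "tablets": [(1, 0, 2**63 - 1)]},                        # (table, first_key, last_key)
        {"id": 1, "tablets": [(1, 2**63, 2**64 - 1), (3, 0, 2**64 - 1)]},
    ],
}

def execute_will(will, idle_masters):
    """Coordinator side: hand each partition in the will to a recovery master."""
    return {p["id"]: m for p, m in zip(will["partitions"], idle_masters)}

print(execute_will(will, ["master-11", "master-23"]))
```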

  17. Replay • Parallel recovery • Six stages of pipelining • At segment granularity • Same ordering of operations on segments to avoid pipeline stalls • Only the primary replicas are involved in recovery
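A heavily simplified, single-threaded view of replay at segment granularity; the real system pipelines six stages across threads, and every helper below is passed in and purely illustrative:

```python
def replay_partition(segment_ids, read_primary_replica, parse_entries, hash_table, new_log):
    """Replay one recovery master's partition, one segment at a time.

    Processing every segment through the same sequence of steps, in the same
    order, keeps pipeline stages busy and avoids stalls; only primary replicas
    are read during recovery.
    """
    for seg_id in sorted(segment_ids):              # fixed ordering across all stages
        raw = read_primary_replica(seg_id)          # fetch the segment from its backup
        for obj in parse_entries(raw):              # split the segment into log entries
            if obj.version >= hash_table.version_of(obj.object_id):
                hash_table.insert(obj)              # rebuild the master's hash table
                new_log.append(obj)                 # write the object into the new log
    new_log.replicate_to_backups()                  # re-replicate the new segments to disk
```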

  18. Cleanup • Get master online • Free up segments from the previous crash

  19. Consistency • Exactly-once semantics • Implementation not yet complete • ZooKeeper handles coordinator failures • Distributed configuration service • With its own replication

  20. Additional Failure Modes • Current focus • Recover DRAM content for a single master failure • Failed backup server • Need to know what segments are lost from the server • Rereplicate those lost segments across remaining disks
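A sketch of how re-replication after a backup failure might look, assuming each master tracks which backups hold replicas of its segments (the method names are hypothetical):

```python
def handle_backup_failure(failed_backup, masters):
    """Re-create the segment replicas that were lost with one backup's storage.

    Illustrative only: each master is assumed to know, per segment, which backups
    hold its replicas, and can stream a fresh copy from its own in-memory log.
    """
    for master in masters:
        lost = [seg for seg, replicas in master.replica_map.items()
                if failed_backup in replicas]
        for seg in lost:
            new_backup = master.pick_backup(exclude=master.replica_map[seg])
            new_backup.store(seg, master.read_segment(seg))   # copy comes from DRAM, not disk
            master.replica_map[seg].remove(failed_backup)
            master.replica_map[seg].append(new_backup)
```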

  21. Multiple Failures • Multiple servers fail simultaneously • Recover each failure independently • Some will involve secondary replicas • Based on projection • With 5,000 servers, recovering 40 masters within a rack will take about 2 seconds • Can’t do much when many racks are blacked out

  22. Cold Start • Complete power outage • Backups will contact the coordinator as they reboot • Need a quorum of backups before starting to reconstruct masters • Current implementation does not perform cold starts

  23. Evaluation • 60-node cluster • Each node • 16GB RAM, 1 disk • Infiniband (25 Gbps) • User level apps can talk to NICs bypassing the kernel

  24. Results • Can recover lost data at 22 GB/s • A crashed server with 35 GB of storage • Can be recovered in 1.6 seconds • Recovery time stays nearly flat from 1 to 20 recovery masters, each talking to 6 disks • 60 recovery masters add only 10 ms to recovery time

  25. Results • Fast recovery significantly reduces the risk of data loss • Assume recovery time of 1 sec • The risk of data loss for a 100K-node cluster is about 10^-5 in one year • A 10x improvement in recovery time improves reliability by 1,000x • Assumes independent failures

  26. Theoretical Recovery Speed Limit • Harder to be faster than a few hundred msec • 150 msec to detect failure • 100 msec to contact every backup • 100 msec to read a single segment from disk

  27. Risks • Scalability study based on a small cluster • Can treat performance glitches as failures • Trigger unnecessary recovery • Access patterns can change dynamically • May lead to unbalanced load
