
Distributed Systems

Distributed Systems, Tutorial 11 – Yahoo! PNUTS. Written by Alex Libov, based on an OSCON 2011 presentation. Winter semester, 2013-2014. Yahoo! PNUTS: a massively parallel and geographically distributed database system for Yahoo!'s web applications.


Presentation Transcript


  1. Distributed Systems, Tutorial 11 – Yahoo! PNUTS. Written by Alex Libov, based on an OSCON 2011 presentation. Winter semester, 2013-2014

  2. Yahoo! PNUTS • A massively parallel and geographically distributed database system for Yahoo!’s web applications • provides data storage organized as hashed or ordered tables • low latency for large numbers of concurrent requests including updates and queries • per-record consistency guarantees

  3. Consistency • Serializability of general transactions is inefficient and often unnecessary • If a user changes an avatar, posts new pictures, or invites several friends to connect, little harm is done if the new avatar is not initially visible to one friend • Many distributed applications go to the other extreme and provide only eventual consistency • That is too weak and inadequate for web applications • PNUTS proposes a consistency model that falls between those two extremes

  4. SYSTEM ARCHITECTURE • Data is organized into tables of records with attributes • In addition to typical data types, “blob” is a valid data type, allowing arbitrary structures inside a record • Data tables are horizontally partitioned into groups of records called tablets • Tablets are scattered across many servers • Each server might have hundreds or thousands of tablets, but each tablet is stored on a single server within a region

  5. Distributed Hash Table • Diagram: the hash space is divided into intervals (e.g., boundaries at 0x0000, 0x2AF3, 0x911F); each interval is one tablet

  6. Distributed Ordered Table • Diagram: each tablet is clustered by key range
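The hash-partitioning idea in slide 5 can be sketched as follows. This is an illustrative toy, not PNUTS code: the 16-bit hash space, the boundary values (taken from the slide's diagram), and the storage-unit names are all assumptions.

```python
import bisect
import hashlib

HASH_SPACE = 1 << 16  # toy 16-bit hash space

# Tablet i covers the interval [boundaries[i], boundaries[i + 1])
boundaries = [0x0000, 0x2AF3, 0x911F, HASH_SPACE]
tablet_servers = ["storage-unit-1", "storage-unit-2", "storage-unit-3"]

def key_to_tablet(key: str) -> str:
    """Hash the record key into the hash space and find its tablet."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16) % HASH_SPACE
    i = bisect.bisect_right(boundaries, h) - 1
    return tablet_servers[i]
```

Because the mapping is a sorted list of interval boundaries, the lookup is a binary search; the same structure works for the ordered table in slide 6 if raw keys replace hash values.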

  7. Query model • PNUTS supports very simple queries sacrificing rich API in favor of response time and overall simplicity • No joins, group-by, etc. • This is stated as future work • The system is designed to work well with queries that read and write single records or small groups of records

  8. PNUTS – Single Region • Tablet controller – a single pair of active/standby servers – maintains the map from database.table.key to tablet to storage unit • Routers – route client requests to the correct storage unit – cache the maps from the tablet controller • Storage units – store records – service get/set/delete requests
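The controller/router relationship above can be sketched in a few lines. All class and field names here are hypothetical, and the interval map is simplified to a single table with string-key lower bounds:

```python
class TabletController:
    """Holds the authoritative key-interval -> storage-unit map."""
    def __init__(self):
        self.version = 1
        # sorted lower bounds: keys >= "a" go to su-1, keys >= "m" to su-2
        self.mapping = [("a", "su-1"), ("m", "su-2")]

    def get_map(self):
        return self.version, list(self.mapping)

class Router:
    """Caches the controller's map and resolves keys locally."""
    def __init__(self, controller):
        self.controller = controller
        self.version, self.cached = controller.get_map()

    def route(self, key):
        unit = None
        for bound, su in self.cached:  # find last bound <= key
            if key >= bound:
                unit = su
        return unit

    def refresh(self):
        # On a stale-map error, re-fetch from the tablet controller
        self.version, self.cached = self.controller.get_map()
```

A real router would detect staleness when a storage unit rejects a misdirected request, then call something like `refresh()`; that error path is assumed, not described on the slide.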

  9. Tablet Splitting & Balancing • Each storage unit has many tablets (horizontal partitions of the table) • A storage unit may become a hotspot • Tablets may grow over time • Overfull tablets split • Shed load by moving tablets to other servers
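The "overfull tablets split" step might look like this minimal sketch; the size threshold and the median split point are illustrative choices, not values from PNUTS.

```python
MAX_RECORDS = 4  # illustrative capacity threshold

def maybe_split(tablet):
    """tablet: list of (key, record) pairs, sorted by key.
    Returns one tablet if under capacity, else two halves split
    at the median key so both stay contiguous key ranges."""
    if len(tablet) <= MAX_RECORDS:
        return [tablet]
    mid = len(tablet) // 2
    return [tablet[:mid], tablet[mid:]]
```

After a split, load balancing is a separate step: either half can be moved to another storage unit, since a tablet lives entirely on one server within a region.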

  10. Consistency Options • Eventual Consistency – low-latency updates and inserts done locally • Record Timeline Consistency – each record is assigned a “master region” – inserts succeed, but updates could fail during outages • Primary Key Constraint + Record Timeline – each tablet and record is assigned a “master region” – inserts and updates could fail during outages • (Diagram: the options form a spectrum trading availability against consistency)

  11. Record Timeline Consistency • One of the replicas is designated as the master, on a per-record basis • All updates to that record are forwarded to the master • If a replica is receiving the majority of write requests for a record, it becomes that record's master • Each update advances the generation of the record
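A minimal sketch of per-record mastership and generations, under stated assumptions: region names and classes are invented, asynchronous propagation of committed writes to the other replicas is omitted, and mastership migration is not modeled.

```python
class Record:
    def __init__(self, value, master="region-1"):
        self.value = value
        self.master = master      # region that owns this record's timeline
        self.generation = 0       # advances with every committed write

class Replica:
    def __init__(self, region, records):
        self.region = region
        self.records = records    # key -> Record (this replica's copies)

    def write(self, key, value, replicas):
        rec = self.records[key]
        if rec.master != self.region:
            # not the master for this record: forward to the master region
            return replicas[rec.master].write(key, value, replicas)
        rec.generation += 1       # each update advances the generation
        rec.value = value
        return rec.generation
```

Because every write for a record funnels through one region, all replicas see the same sequence of generations, which is exactly the timeline guarantee slide 12 illustrates.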

  12. Record Timeline Consistency – example • Transactions: Alice changes status from “Sleeping” to “Awake”; Alice changes location from “Home” to “Work” • Diagram: both Region 1 and Region 2 apply the updates in the same order, (Alice, Home, Sleeping) → (Alice, Home, Awake) → (Alice, Work, Awake) • No replica should see the record as (Alice, Work, Sleeping)

  13. API calls • Read-any • Returns a possibly stale version of the record. • The returned record is always a valid one from the record’s history. • This call has lower latency than other read calls with stricter guarantees • Read-critical(required version) • Returns a version of the record that is strictly newer than, or the same as, the required version. • Read-latest • Returns the latest copy of the record that reflects all writes that have succeeded. • Write • This call gives the same ACID guarantees as a transaction with a single write operation in it. This call is useful for blind writes, e.g., a user updating his status on his profile. • Test-and-set-write(required version) • This call performs the requested write to the record if and only if the present version of the record is the same as the required version.
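The versioned calls above can be sketched against a single in-memory store. Method names mirror the slide; the `Store` class itself and its error behavior (raising when a replica is too stale, instead of waiting or forwarding) are illustrative assumptions.

```python
class Store:
    def __init__(self):
        self.data = {}  # key -> (version, value)

    def write(self, key, value):
        """Blind write: always succeeds, bumps the version."""
        version = self.data.get(key, (0, None))[0] + 1
        self.data[key] = (version, value)
        return version

    def read_latest(self, key):
        return self.data[key]

    def read_critical(self, key, required_version):
        """Return a copy at least as new as required_version."""
        version, value = self.data[key]
        if version >= required_version:
            return version, value
        raise RuntimeError("replica too stale")  # sketch: real call would wait/forward

    def test_and_set_write(self, key, required_version, value):
        """Write only if the current version matches exactly."""
        if self.data.get(key, (0, None))[0] != required_version:
            return False
        self.write(key, value)
        return True
```

Test-and-set-write is how a client implements read-modify-write safely without transactions: read the record and its version, compute the new value, then write conditioned on that version.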

  14. Eventual Consistency • Timeline consistency comes at a price • Writes not originating in the record's master region are forwarded to the master and have longer latency • The mastership of a record can migrate between replicas • When the master region is down, the record is unavailable for writes • Eventual-consistency mode • On conflict, the latest write per field wins • Target customers • Those that externally guarantee no conflicts • Those that understand conflicts and can cope with them
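The per-field "latest write wins" rule can be sketched as a merge function. Integer timestamps are an assumption for brevity; a real system would rely on synchronized or logical clocks.

```python
def merge(rec_a, rec_b):
    """Merge two replica copies of a record.
    Each record maps field -> (timestamp, value); for every field,
    keep the value with the most recent timestamp."""
    merged = dict(rec_a)
    for field, (ts, val) in rec_b.items():
        if field not in merged or ts > merged[field][0]:
            merged[field] = (ts, val)
    return merged
```

Note that this resolves conflicts field by field, not record by record, so two concurrent writes touching different fields of the same record are both preserved.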

  15. Yahoo! Message Broker (YMB) • A topic-based publish/subscribe system • Data updates are considered “committed” when they have been published to YMB • At some point after being committed, the update is asynchronously propagated to other regions and applied to their replicas • YMB guarantees that published messages will be delivered to all topic subscribers even in the presence of single broker-machine failures • This is done by logging the message to multiple disks on different servers: two copies are logged initially, and more copies are logged as the message propagates • The message is not purged from the YMB log until PNUTS has verified that the update is applied to all replicas of the database • YMB provides partial ordering of published messages: messages published to a particular YMB cluster will be delivered to all subscribers in the order they were published
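The two YMB properties the slide stresses — in-order delivery within one cluster, and retention until all replicas have applied an update — can be sketched together. Class and method names are hypothetical; disk-level redundancy is not modeled.

```python
class Broker:
    """Single-cluster pub/sub sketch with YMB-like guarantees."""
    def __init__(self):
        self.log = []            # committed messages, in publish order
        self.subscribers = []
        self.pending_acks = {}   # log index -> replicas that have not applied it

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, msg, replicas):
        idx = len(self.log)
        self.log.append(msg)                 # "committed" once logged
        self.pending_acks[idx] = set(replicas)
        for deliver in self.subscribers:     # delivered in publish order
            deliver(msg)
        return idx

    def ack(self, idx, replica):
        """A replica reports that it has applied message idx."""
        self.pending_acks[idx].discard(replica)
        if not self.pending_acks[idx]:
            del self.pending_acks[idx]       # now safe to purge from the log
```

Because ordering is only guaranteed per cluster, messages published in different regions may interleave arbitrarily, which is why the slide calls this a partial ordering.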

  16. Recovery • Recovering from a failure involves copying lost tablets from another replica • A three-step process: • The tablet controller requests a copy from a particular remote replica (the “source tablet”) • A “checkpoint message” is published to YMB, to ensure that any in-flight updates at the time the copy is initiated are applied to the source tablet • The source tablet is copied to the destination region • To support this recovery protocol, tablet boundaries are kept synchronized across replicas, and tablet splits are conducted by having all regions split a tablet at the same point, coordinated by a two-phase commit between regions
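The three steps above can be sketched as follows. This is a simplified illustration: the tablet controller and YMB are collapsed into direct calls, and the checkpoint is modeled as draining a pending-updates queue into the source tablet before the copy.

```python
class RegionReplica:
    def __init__(self):
        self.tablets = {}   # tablet_id -> dict(key -> value)
        self.pending = {}   # tablet_id -> in-flight (key, value) updates

    def apply_pending_updates(self, tid):
        """Checkpoint: flush in-flight updates into the tablet."""
        for key, value in self.pending.pop(tid, []):
            self.tablets[tid][key] = value

    def snapshot(self, tid):
        return dict(self.tablets[tid])

    def install(self, tid, data):
        self.tablets[tid] = data

def recover_tablet(source, dest, tid):
    # 1. the controller requests a copy of the tablet from `source`
    # 2. checkpoint: in-flight updates are applied to the source tablet
    source.apply_pending_updates(tid)
    # 3. the source tablet is copied to the destination region
    dest.install(tid, source.snapshot(tid))
```

The checkpoint step is what makes the copy safe: without it, updates published before the copy started could reach the source after the snapshot and be lost at the destination.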

  17. For more info http://www.mpi-sws.org/~druschel/courses/ds/papers/cooper-pnuts.pdf
