
PNUTS: Building and running a cloud database system (Brian Cooper)


Presentation Transcript


  1. PNUTS: Building and running a cloud database system (Brian Cooper)

  2. Overview • Building a cloud service • How PNUTS works • “Advanced” features • Lessons learned

  3. Yahoo!
     • Yahoo! has almost 100 properties
         • Mail, Messenger, Finance, Shopping, Sports, OMG! …
         • 20 properties are #1 or #2
     • Yahoo! is #1 in time spent online in U.S. (10.5%)
     • 164 million unique U.S. visitors in January
         • 79 percent of U.S. online audience
     • 598 million unique worldwide visitors in January
         • 48 percent of global online audience
     • This is where we make our money!
         • Users coming to Yahoo! sites and spending time
     • We are focusing on the “audience” side of Yahoo!
         • Not the search engine
     (Jan. 2010, source: ComScore)

  4. CLOUD COMPUTING

  5. Why?
     Two competing needs
     • Accelerating innovation
         • Focus on building your application, not the infrastructure
     • Increasing availability
         • Without infinite hardware and system operators
     How will cloud services help?
     • Cloud services will perform the heavy lifting of scaling & high-availability
     • Focus on horizontal cloud services
         • Platforms to support multiple vertical applications

  6. Requirements for Cloud Services
     • Multi-tenancy
         • Support for multiple, organizationally distant customers
     • Horizontal scaling
         • Add cloud capacity incrementally and transparently as needed by tenants
     • Elasticity
         • Tenants can request and receive resources on-demand, paying for usage
     • Security & Account management
         • Accounts/IDs, authentication, access control; isolate tenants; data security
     • Availability & Operability
         • High availability and reliability over commodity hardware
         • Easy to operate, with few operators; automated monitoring & metering

  7. Cloud Data Management Systems
     • Structured record storage (PNUTS)
         • CRUD
         • Point lookups and short scans
         • Index organized table and random I/Os
         • $ per latency
     • Large data analysis (Hadoop)
         • Scan oriented workloads
         • Focus on sequential disk I/O
         • $ per cpu cycle
     • Blob storage (MObStor)
         • Object retrieval and streaming
         • Scalable file storage
         • $ per GB

  8. What Makes a Cloud Data Service?
     • DBA to the world!
         • Many apps
         • Each with hundreds or thousands of client processes
         • Must automanage: cannot manually tweak knobs
         • Must autobalance: load will constantly shift
     • Massive scalability
         • Scaling up via shared or specialized hardware is infeasible
         • Scale out with commodity hardware: 10,000 or 100,000 servers
         • Failures are the common case
         • Must continue to operate in the face of servers down
         • Must autoscale: plug in new servers and let them go
     These capabilities must be baked in from the start

  9. WHAT IS PNUTS?

  10. Example: social network updates
      • What are my friends up to?
      [Diagram: a friend graph of Brian, Sonja, Jimi, Brandon and Kurt, with update callouts for Sonja and Brandon]

  11. Example: social network updates
      Key  User     Value (truncated on the slide)
      6    Jimi     <ph..
      8    Mary     <re..
      12   Sonja    <ph..
      15   Brandon  <po..
      16   Mike     <ph..
      17   Bob      <re..
      One value shown in full: <photo> <title>Flower</title> <url>www.flickr.com</url> </photo>
      (caveat: not necessarily how our Y! Updates product actually works)

  12. The world has changed
      • Can trade away “standard” DBMS features:
          • Complicated queries
          • Strong transactions
      • But I must have my scalability, flexibility and availability!

  13. The PNUTS Solution
      • Record-orientation: Optimized for low-latency record access
      • Scale out: Add machines to scale throughput
      • Asynchrony: Avoid expensive synchronous operations
      • Consistency model: Hide complexity of asynchronous replication
      • Flexible access: Hashed or ordered, indexes, views; flexible schemas
      • Cloud deployment model: Hosted, managed service
      [VLDB 08]

  14. PNots
      • Not a SQL database
          • Simple queries, simple transaction model
      • Not a parallel processing engine
          • Though it can play well with MapReduce
      • Not a filesystem
          • Record storage, not blob storage
      • Not peer-to-peer
          • We own the servers and can save some complexity
          • Servers organized into natural groups (datacenters)

  15. Data Model

  16. Query Model
      • Simple call API
          • Get
          • Set
          • Delete
          • Scan
          • Getrange
      • Scan and Getrange with predicate
      • Web service (RESTful) API
          • Encode data as JSON
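
The five calls listed on this slide can be illustrated with a toy, single-node table. This is only a sketch under stated assumptions: the class name `RecordTable`, the method signatures, and the half-open range `[start, end)` are invented for illustration and are not the PNUTS client API.

```python
from bisect import bisect_left, insort


class RecordTable:
    """Toy in-memory ordered table; every name here is hypothetical."""

    def __init__(self):
        self._keys = []      # sorted list of keys, used for ordered scans
        self._records = {}   # key -> dict of fields

    def set(self, key, fields):
        if key not in self._records:
            insort(self._keys, key)
        self._records[key] = dict(fields)

    def get(self, key):
        return self._records.get(key)

    def delete(self, key):
        if key in self._records:
            del self._records[key]
            self._keys.pop(bisect_left(self._keys, key))

    def scan(self, predicate=None):
        """Yield (key, record) for every record, optionally filtered."""
        for key in self._keys:
            record = self._records[key]
            if predicate is None or predicate(record):
                yield key, record

    def getrange(self, start, end, predicate=None):
        """Yield (key, record) with start <= key < end, in key order."""
        lo = bisect_left(self._keys, start)
        hi = bisect_left(self._keys, end)
        for key in self._keys[lo:hi]:
            record = self._records[key]
            if predicate is None or predicate(record):
                yield key, record


table = RecordTable()
table.set("apple", {"price": 1})
table.set("banana", {"price": 2})
table.set("grape", {"price": 12})
# Getrange with a predicate: keys in ["a", "h") whose price is below 5.
cheap = [k for k, _ in table.getrange("a", "h", lambda r: r["price"] < 5)]
```

Keeping a sorted key list next to the dictionary is what makes `getrange` cheap here; a hash-organized table would support `get`, `set` and `delete` equally well but would have to fall back to a full scan for ranges.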

  17. Representing sparse data
      $ curl http://pnuts.yahoo.com/PNUTSWebService/V1/get/userTable/yahoo
      {"record":{
          "status":{"code":200,"message":"OK"},
          "metadata":{
              "seq_id":"5",
              "modtime":1234231551,
              "disk_size":89},
          "fields":{
              "addr":{"value":"700 First Ave"},
              "city":{"value":"Sunnyvale"},
              "state":{"value":"CA"}
          }
      }}
      (some details changed to protect the innocent)
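
A client could unpack a response of this shape as follows. The JSON text is copied from the slide; the helper name `flatten_record` and the error handling are assumptions made for this sketch, not part of the web service.

```python
import json

response_body = """
{"record": {
  "status": {"code": 200, "message": "OK"},
  "metadata": {"seq_id": "5", "modtime": 1234231551, "disk_size": 89},
  "fields": {
    "addr": {"value": "700 First Ave"},
    "city": {"value": "Sunnyvale"},
    "state": {"value": "CA"}
  }
}}
"""


def flatten_record(body):
    """Return (seq_id, {field: value}) from a response shaped like the slide's."""
    record = json.loads(body)["record"]
    if record["status"]["code"] != 200:
        raise RuntimeError(record["status"]["message"])
    fields = {name: cell["value"] for name, cell in record["fields"].items()}
    return int(record["metadata"]["seq_id"]), fields


seq_id, fields = flatten_record(response_body)
# Sparse data: a field the record does not carry is simply absent.
zip_code = fields.get("zip")
```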

  18. DISTRIBUTION

  19. Architecture
      [Diagram components: Clients, REST API, Routers, Tablet controller, Log servers, Storage units]

  20. Tablet Splitting and Balancing
      • Each storage unit has many tablets (horizontal partitions of the table)
      • Storage unit may become a hotspot
      • Tablets may grow over time
      • Overfull tablets split
      • Shed load by moving tablets to other servers
      [Diagram labels: Storage unit, Tablet]
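
A minimal sketch of the "overfull tablets split" step, assuming a tablet is just a sorted list of keys. The threshold `max_records` and the choice to split at the median key are assumptions; the slide does not say how "overfull" is decided or where the split point falls.

```python
def split_tablet(tablet, max_records):
    """Split a tablet (a sorted list of keys) in two if it is overfull.

    Returns a list holding one or two tablets. `max_records` is a
    hypothetical threshold chosen only for this illustration.
    """
    if len(tablet) <= max_records:
        return [tablet]
    mid = len(tablet) // 2
    return [tablet[:mid], tablet[mid:]]


tablet = ["apple", "avocado", "banana", "grape", "kiwi"]
parts = split_tablet(tablet, max_records=4)
```

After a split, either half can be moved to another storage unit independently, which is what makes splitting useful for shedding load.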

  21. Tablets: Hash Table
      Boundary  Name        Description                            Price
      0x0000
                Grape       Grapes are good to eat                 $12
                Lime        Limes are green                        $9
                Apple       Apple is wisdom                        $1
                Strawberry  Strawberry shortcake                   $900
      0x2AF3
                Orange      Arrgh! Don’t get scurvy!               $2
                Avocado     But at what price?                     $3
                Lemon       How much did you pay for this lemon?   $1
                Tomato      Is this a vegetable?                   $14
      0x911F
                Banana      The perfect fruit                      $2
                Kiwi        New Zealand                            $8
      0xFFFF
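
Routing a key to a hash-partitioned tablet can be sketched as a binary search over the tablet boundaries shown on this slide. The hash function (the first two bytes of MD5) and the convention that each boundary is the inclusive lower bound of its tablet are assumptions for illustration; the slide specifies neither.

```python
import hashlib
from bisect import bisect_right

# Lower bounds of the three tablets on the slide: [0x0000, 0x2AF3),
# [0x2AF3, 0x911F) and [0x911F, 0xFFFF].
TABLET_LOWER_BOUNDS = [0x0000, 0x2AF3, 0x911F]


def hash16(key):
    """Map a key to 0x0000..0xFFFF (MD5 is an arbitrary choice here)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "big")


def tablet_for_hash(h):
    """Index of the tablet whose interval contains hash value h."""
    return bisect_right(TABLET_LOWER_BOUNDS, h) - 1


def tablet_for_key(key):
    return tablet_for_hash(hash16(key))
```

Because neighbouring keys hash to unrelated values, load spreads evenly across tablets, but a range query has to visit every tablet.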

  22. Tablets: Ordered Table
      Boundary  Name        Description                            Price
      A
                Apple       Apple is wisdom                        $1
                Avocado     But at what price?                     $3
                Banana      The perfect fruit                      $2
                Grape       Grapes are good to eat                 $12
      H
                Kiwi        New Zealand                            $8
                Lemon       How much did you pay for this lemon?   $1
                Lime        Limes are green                        $9
                Orange      Arrgh! Don’t get scurvy!               $2
      Q
                Strawberry  Strawberry shortcake                   $900
                Tomato      Is this a vegetable?                   $14
      Z
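
With an ordered table, a range request only needs the tablets whose key intervals overlap the range. The sketch below uses the boundaries and names from this slide; the function names and the half-open range `[start, end)` are assumptions.

```python
from bisect import bisect_right

# Tablet lower bounds from the slide: [A, H), [H, Q) and [Q, Z].
LOWER_BOUNDS = ["A", "H", "Q"]
TABLETS = [
    ["Apple", "Avocado", "Banana", "Grape"],
    ["Kiwi", "Lemon", "Lime", "Orange"],
    ["Strawberry", "Tomato"],
]


def tablets_for_range(start, end):
    """Indexes of the tablets that may hold keys in [start, end)."""
    first = max(bisect_right(LOWER_BOUNDS, start) - 1, 0)
    last = max(bisect_right(LOWER_BOUNDS, end) - 1, 0)
    return list(range(first, last + 1))


def getrange(start, end):
    """Return keys in [start, end), reading only the overlapping tablets."""
    hits = []
    for index in tablets_for_range(start, end):
        hits.extend(k for k in TABLETS[index] if start <= k < end)
    return hits


fruit = getrange("B", "L")
```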

  23. Accessing Data
      [Diagram: four numbered steps (1 to 4) labeled “Get key k”, “Get key k”, “Record for key k”, “Record for key k”]

  24. Updates
      • Log servers make storage units disposable
      [Diagram: eight numbered steps (1 to 8) among routers, log servers and three storage units (SU); messages are “Write key k”, “Sequence # for key k” and “SUCCESS”]
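
The claim that log servers make storage units disposable can be illustrated by rebuilding a storage unit from the log alone. This is a sketch under assumptions: the classes, the log-before-apply order, and per-key sequence numbers starting at 1 are invented here and do not reproduce the eight-step protocol in the diagram.

```python
class LogServer:
    """Append-only log of writes (a stand-in for a log server)."""

    def __init__(self):
        self.entries = []  # list of (key, seq, value)

    def append(self, key, seq, value):
        self.entries.append((key, seq, value))


class StorageUnit:
    def __init__(self, log):
        self.log = log
        self.records = {}  # key -> (seq, value)

    def write(self, key, value):
        """Log the write first, then apply it; return the new sequence #."""
        seq = self.records.get(key, (0, None))[0] + 1
        self.log.append(key, seq, value)
        self.records[key] = (seq, value)
        return seq

    @classmethod
    def recover(cls, log):
        """Rebuild a replacement storage unit purely from the log."""
        unit = cls(log)
        for key, seq, value in log.entries:
            unit.records[key] = (seq, value)
        return unit


log = LogServer()
su = StorageUnit(log)
su.write("k", "v1")
seq = su.write("k", "v2")
replacement = StorageUnit.recover(log)  # pretend the original unit was lost
```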

  25. ASYNCHRONOUS REPLICATION AND CONSISTENCY

  26. Asynchronous Replication

  27. Global Replication (not necessarily actual Yahoo! datacenters)

  28. Consistency Model
      • Goal: make it easier for applications to reason about updates and cope with asynchrony
      • What happens to a record with primary key “Brian”?
      • We also support an eventual consistency model
      • Applications can choose which kind of table to create
      [Timeline diagram: record inserted, seven updates and a delete, with versions v. 1 through v. 8 over time within Generation 1]

  29. Timeline Model
      • Read
      [Diagram: timeline of versions v. 1 to v. 8 (Generation 1), with stale versions and the current version marked]

  30. Timeline Model
      • Read up-to-date
      [Diagram: same version timeline, stale and current versions marked]

  31. Timeline Model
      • Read ≥ v.6
      [Diagram: same version timeline, stale and current versions marked]

  32. Timeline Model
      • Write
      [Diagram: same version timeline, stale and current versions marked]

  33. Timeline Model
      • Write if = v.7
      • ERROR
      [Diagram: same version timeline, stale and current versions marked]
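
The five operations on the Timeline Model slides (read, read up-to-date, read ≥ a version, write, write if = a version) can be sketched against one master copy and one lagging replica. The class and method names are illustrative labels chosen for this example, and explicit `replicate()` calls stand in for asynchronous replication.

```python
class VersionConflict(Exception):
    pass


class TimelineRecord:
    """One record: a master copy plus one possibly stale replica."""

    def __init__(self, value):
        self.master = (1, value)    # (version, value), always current
        self.replica = (1, value)   # may lag behind the master

    def replicate(self):
        """Replication catching up (triggered by hand in this sketch)."""
        self.replica = self.master

    def read_any(self):
        """Plain read: may return a stale version."""
        return self.replica

    def read_latest(self):
        """Read up-to-date: always the current version."""
        return self.master

    def read_critical(self, min_version):
        """Read >= min_version: use the replica only if it is new enough."""
        if self.replica[0] >= min_version:
            return self.replica
        return self.master

    def write(self, value):
        self.master = (self.master[0] + 1, value)
        return self.master[0]

    def test_and_set_write(self, expected_version, value):
        """Write only if the current version equals expected_version."""
        if self.master[0] != expected_version:
            raise VersionConflict(
                f"expected v.{expected_version}, current is v.{self.master[0]}"
            )
        return self.write(value)


rec = TimelineRecord("v1 data")
for n in range(2, 6):
    rec.write(f"v{n} data")   # master reaches v.5
rec.replicate()               # replica catches up to v.5
for n in range(6, 9):
    rec.write(f"v{n} data")   # master at v.8, replica still at v.5
```

In this state a plain read may see v.5, a read of at least v.6 is served from the master, and a conditional write expecting v.7 fails because the current version is v.8, matching the ERROR shown on the last slide of the sequence.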

  34. Consistency levels
      • Eventual consistency
      • Transactions:
          • Alice changes status from “Sleeping” to “Awake”
          • Alice changes location from “Home” to “Work”
      [Diagram: Region 1 and Region 2 both start at (Alice, Home, Sleeping). One region passes through (Alice, Home, Awake), the other through (Alice, Work, Sleeping), and both end at (Alice, Work, Awake). An “invalid” state is visible along the way, but the final state is consistent.]

  35. Consistency levels
      • Timeline consistency
      • Transactions:
          • Alice changes status from “Sleeping” to “Awake”
          • Alice changes location from “Home” to “Work”
      [Diagram: Region 1 and Region 2 both start at (Alice, Home, Sleeping) and end at (Alice, Work, Awake); the only intermediate state shown is (Alice, Home, Awake)]
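
The difference between the two consistency levels comes down to the order in which each region applies the same two updates. The sketch below replays the Alice example; the helper `apply_all` and the dictionary representation of the record are assumptions for illustration.

```python
def apply_all(state, updates):
    """Apply updates in order, returning every state a reader could observe."""
    seen = [dict(state)]
    for field, value in updates:
        state = {**state, field: value}
        seen.append(state)
    return seen


start = {"location": "Home", "status": "Sleeping"}
wake_up = ("status", "Awake")
go_to_work = ("location", "Work")

# Timeline consistency: every region applies updates in the same order.
region1 = apply_all(start, [wake_up, go_to_work])
region2_timeline = apply_all(start, [wake_up, go_to_work])

# Eventual consistency: a region may apply the same updates in another order.
region2_eventual = apply_all(start, [go_to_work, wake_up])

# The state the slides call "invalid": at work but still asleep.
invalid = {"location": "Work", "status": "Sleeping"}
```

Under the reordered schedule a reader in the second region can observe the invalid state, even though both regions converge to the same final record.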

  36. Mastering
      [Diagram: several copies of the table below. The third column is the record's master region (E, W or C). Some copies show record B mastered in E instead of W.]
      Key  Value  Master
      A    42342  E
      B    42521  W
      C    66354  W
      D    12352  E
      E    75656  C
      F    15677  E

  37. Coping With Failures
      [Diagram: a failure, marked X, and an “OVERRIDE W → E”; three copies of the table below]
      Key  Value  Master
      A    42342  E
      B    42521  W
      C    66354  W
      D    12352  E
      E    75656  C
      F    15677  E
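
Per-record mastering and the override shown here can be sketched as a lookup table from key to master region. The class, its methods, and the choice of record B as the one being overridden are assumptions for illustration; the slide does not say which record the override applies to or how the failure is detected.

```python
class MasteredTable:
    """Per-record master region; regions are the slide's E, W and C."""

    def __init__(self, masters):
        self.masters = dict(masters)   # key -> master region
        self.down = set()              # regions currently unreachable

    def route_write(self, key, local_region):
        """Return (master region, whether the write is local to the caller)."""
        master = self.masters[key]
        if master in self.down:
            raise RuntimeError(f"master region {master} for {key} is down")
        return master, master == local_region

    def override_master(self, key, new_region):
        """Move mastership, for example after the old master region fails."""
        self.masters[key] = new_region


table = MasteredTable(
    {"A": "E", "B": "W", "C": "W", "D": "E", "E": "C", "F": "E"}
)
table.down.add("W")                 # region W fails
table.override_master("B", "E")     # OVERRIDE W -> E, applied to record B
target = table.route_write("B", local_region="E")
```

Record C is still mastered in the failed region in this example, so writes to it are rejected until its mastership is overridden as well.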

  38. “ADVANCED” FEATURES

  39. Ordered tables
      • Ordered tables provide efficient scanning of clustered subranges
          • Time ranges
          • Relationship graphs
          • Hierarchical data
          • Indexes and views

  40. Ordered tables are tricky
      • Hotspots!
      • Solution: Proactive load balancing
          • Move tablets from hot servers to cold servers
          • If necessary, split hot tablets
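
A greedy version of "move tablets from hot servers to cold servers" might look like the sketch below. The load numbers, server names, and the rule for picking which tablet to move are assumptions; the slide does not describe the actual balancing policy.

```python
def rebalance(load_by_server, max_moves=10):
    """Repeatedly move one tablet from the hottest server to the coldest,
    as long as the move narrows the gap between the two.

    load_by_server maps server name -> {tablet name: load}. The argument
    is mutated; the list of (tablet, source, destination) moves is returned.
    """
    moves = []
    for _ in range(max_moves):
        totals = {s: sum(t.values()) for s, t in load_by_server.items()}
        hot = max(totals, key=totals.get)
        cold = min(totals, key=totals.get)
        gap = totals[hot] - totals[cold]
        # Moving a tablet with load l changes the gap to |gap - 2 * l|.
        best = min(
            load_by_server[hot].items(),
            key=lambda item: abs(gap - 2 * item[1]),
            default=None,
        )
        if best is None or abs(gap - 2 * best[1]) >= gap:
            break
        tablet, load = best
        del load_by_server[hot][tablet]
        load_by_server[cold][tablet] = load
        moves.append((tablet, hot, cold))
    return moves


servers = {
    "su1": {"A-H": 70, "H-Q": 20},
    "su2": {"Q-Z": 10},
}
moves = rebalance(servers)
```

In this example one move helps, but the remaining imbalance is caused by a single hot tablet; no further move can improve it, which is the case where the slide's second step, splitting the hot tablet, becomes necessary.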

  41. Parallel scans
      [Diagram: client, scan engine]

  42. Adaptive server allocation
      [Diagram: client, scan engine]

  43. Server scheduling
      [Diagram: client 1, client 2, scan engine]

  44. Indexes and views
      • How to have lots of interesting indexes, without killing performance?
      • Solution: Asynchrony!
          • Indexes updated asynchronously when base table updated
          • Some interesting views can be represented as indexes
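
Asynchronous index maintenance means a client write touches only the base table, and the index catches up later. The sketch below models that with an in-process queue; the class `IndexedTable`, its methods, and the explicit `maintain_index()` call standing in for a background process are all assumptions for illustration.

```python
from collections import deque


class IndexedTable:
    """Base table with a secondary index that is maintained lazily."""

    def __init__(self, index_field):
        self.index_field = index_field
        self.base = {}           # primary key -> record
        self.index = {}          # field value -> set of primary keys
        self.pending = deque()   # (key, old record, new record) not yet indexed

    def set(self, key, record):
        """Client write: touches only the base table, then returns."""
        self.pending.append((key, self.base.get(key), record))
        self.base[key] = record

    def maintain_index(self):
        """Background step: drain queued updates into the index."""
        while self.pending:
            key, old, new = self.pending.popleft()
            if old is not None:
                self.index.get(old[self.index_field], set()).discard(key)
            self.index.setdefault(new[self.index_field], set()).add(key)

    def lookup(self, value):
        return sorted(self.index.get(value, set()))


posts = IndexedTable(index_field="author")
posts.set("post1", {"author": "sonja", "title": "Flower"})
stale = posts.lookup("sonja")     # the index has not caught up yet
posts.maintain_index()
fresh = posts.lookup("sonja")
```

The trade-off is visible in `stale`: writes stay fast because they never wait for the index, but an index lookup can briefly miss a record that the base table already contains.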

  45. View types
      • Index: Remote view table
      [Diagram labels: Base table, ByAuthor view table]

  46. View types
      • Equijoin: Co-clustered remote view tables
          • Each sub-table managed like an index
      [Diagram labels: PostComments view table, Comments table, Posts view table]

  47. Remote view tables
      • A regular table, but updated by the view maintainer (VM) instead of a client
      [Diagram labels: Update, SU, VM, Log server, Log server]

  48. SOME NUMBERS

  49. Performance comparison
      • Setup
          • Six server-class machines
              • 8 cores (2 x quadcore) 2.5 GHz CPUs
              • 8 GB RAM
              • 6 x 146GB 15K RPM SAS drives in RAID 1+0
              • Gigabit ethernet
              • RHEL 4
          • Plus extra machines for clients, routers, controllers, etc.
      • Workloads
          • 120 million 1 KB records = 20 GB per server
          • Write heavy workload: 50/50 read/update
              • Updates write the whole record
          • 50 client processes usually; up to 300 needed to generate higher throughputs
          • Obviously many variations are possible; these are just two points in the space
      • Metrics
          • Latency versus throughput curves
      • Caveats
          • Write performance would be improved for Sherpa, Sharded and Cassandra with a dedicated log disk
          • We tuned each system as well as we knew how
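
The "20 GB per server" figure follows from the record count, the record size and the six servers listed under Setup. The short check below assumes decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes) and an even spread of records with no replication overhead, neither of which the slide states.

```python
records = 120_000_000
record_size_bytes = 1_000   # "1 KB", read as a decimal kilobyte
servers = 6

total_gb = records * record_size_bytes / 1e9
per_server_gb = total_gb / servers
```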

  50. Results
