In his 2003 Vanguard Conference talk, Prof. Eric A. Brewer presented five claims about the design and management of giant-scale systems. He emphasized that scalability is far easier than availability, argued for the primacy of services delivered from the infrastructure, and offered practical guidance on disaster tolerance and online evolution. Brewer also argued that while P2P systems are fascinating, they are overrated. His claims continue to inform current practice in building resilient and scalable systems.
Some Claims about Giant-Scale Systems
Prof. Eric A. Brewer, UC Berkeley
Vanguard Conference, May 5, 2003
Five Claims
• Scalability is easy (availability is hard)
  • How to build a giant-scale system
• Services are King (infrastructure centric)
• P2P is cool but overrated
• XML doesn’t help much
• Need new IT for the Third World
Key New Problems
• Unknown but large growth
  • Incremental & absolute scalability
  • 1000s of components
• Must be truly highly available
  • Hot swap everything (no recovery time)
  • $6M/hour for stocks, $300k/hour for retailers
• Graceful degradation under faults & saturation
• Constant evolution (internet time)
  • Software will be buggy
  • Hardware will fail
• These can’t be emergencies...
Typical Cluster
Scalability is EASY
• Just add more nodes….
  • Unless you want HA
  • Or you want to change the system…
• Hard parts:
  • Availability
  • Overload
  • Evolution
Step 1: Basic Architecture
Step 2: Divide Persistent Data
• Transactional (expensive, must be correct)
• Supporting data
  • HTML, images, applets, etc.
  • Read mostly
  • Publish in snapshots (normally stale)
• Session state
  • Persistent across connections
  • Limited lifetime
  • Can be lost (rarely)
• (All three classes are sketched in code below)
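To make the three classes concrete, here is a minimal sketch of how they might map to storage policies. Everything here is illustrative: the dictionaries, the 30-minute TTL, and the helper names are assumptions, not from the talk.

```python
# Illustrative mapping of the three data classes to storage policies.
import time

# Transactional data: expensive, must be correct. In practice this lives
# behind an ACID database; a dict is shown here only as a placeholder.
transactional = {}

# Supporting data: read-mostly, published in atomic snapshots. Readers may
# see a slightly stale snapshot, which is acceptable by design.
snapshot = {"version": 41, "pages": {"/": "<html>home</html>"}}

def publish(new_pages):
    """Atomic push of a new snapshot: swap a reference, never edit in place."""
    global snapshot
    snapshot = {"version": snapshot["version"] + 1, "pages": new_pages}

# Session state: persists across connections, has a limited lifetime, and
# losing it (rarely) only forces the user to start over.
sessions = {}
SESSION_TTL = 1800  # 30 minutes, illustrative

def put_session(sid, data):
    sessions[sid] = (time.time() + SESSION_TTL, data)

def get_session(sid):
    entry = sessions.get(sid)
    if entry and entry[0] > time.time():
        return entry[1]
    sessions.pop(sid, None)  # expired or lost: tolerable by design
    return None
```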
Step 3: Basic Availability
• a) Depend on layer-7 switches
  • Isolate external names (IP address, port) from specific machines
• b) Automatic detection of problems (sketch below)
  • Node-level checks (e.g. memory footprint)
  • (Remote) app-level checks
• c) Focus on MTTR
  • Easier to fix & test than MTBF
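A minimal sketch of what (b) could look like, assuming each node exposes hypothetical /health/mem and /search endpoints; the URLs, the 2 GB threshold, and the recovery action are all illustrative.

```python
# Hypothetical monitor combining node-level and app-level checks.
import urllib.request

NODES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]  # behind the L7 switch

def node_ok(base):
    """Node-level check: e.g. is the self-reported memory footprint sane?"""
    try:
        with urllib.request.urlopen(base + "/health/mem", timeout=2) as r:
            return int(r.read()) < 2 * 1024**3  # assumed 2 GB ceiling
    except (OSError, ValueError):
        return False

def app_ok(base):
    """(Remote) app-level check: run one real query end to end."""
    try:
        with urllib.request.urlopen(base + "/search?q=probe", timeout=5) as r:
            return r.status == 200
    except OSError:
        return False

for node in NODES:
    if not (node_ok(node) and app_ok(node)):
        # Pull the node from the switch pool and restart it. Because the
        # L7 switch isolates the external name, only MTTR matters here.
        print("restart", node)
```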
Step 4: Overload Strategy
• Goal: degrade service (some) to allow more capacity during overload
• Examples:
  • Simpler pages (less dynamic, smaller size)
  • Fewer options (just the basics)
• Must kick in automatically (sketch below)
  • “Overload mode”
  • Relatively easy to detect
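A toy sketch of the automatic switch into “overload mode”. The QPS threshold and the page-rendering stubs are invented; a real detector would likely also watch queue lengths and response times.

```python
# Toy "overload mode": measure request rate over a one-second sliding
# window and degrade to a simpler page past an assumed threshold.
import time
from collections import deque

OVERLOAD_QPS = 5000   # illustrative capacity threshold
recent = deque()      # timestamps of requests in the last second

def overloaded():
    now = time.time()
    while recent and recent[0] < now - 1.0:
        recent.popleft()
    return len(recent) > OVERLOAD_QPS

def render_full(req):  return "<html>dynamic, personalized page</html>"
def render_basic(req): return "<html>small static page, just the basics</html>"

def handle(req):
    recent.append(time.time())
    return render_basic(req) if overloaded() else render_full(req)
```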
Step 5: Disaster Tolerance
• a) Pick a few locations
  • Independent failure
• b) Dynamic redirection of load
  • Best: client-side control
  • Next: switch all traffic (long routes)
  • Worst: DNS remapping (takes a while)
• c) Target site will get overloaded
  • But you have overload handling
Step 6: Online Evolution
• Goal: rapid evolution without downtime
• a) Publishing model
  • Decouple development from live system
  • Atomic push of content
  • Automatic revert if trouble arises
• b) Three methods
Evolution: Three Approaches
• Flash upgrade
  • Fast reboot into new version
  • Focus on MTTR (< 10 sec)
  • Reduces yield (and uptime)
• Rolling upgrade
  • Upgrade nodes one at a time in a “wave”
  • Temporary 1/n harvest reduction, 100% yield
  • Requires co-existing versions
• “Big Flip”
The Big Flip
• Steps (sketched below):
  1) Take down 1/2 the nodes
  2) Upgrade that half
  3) Flip the “active half” (site upgraded)
  4) Upgrade second half
  5) Return to 100%
• Avoids mixed versions (!)
  • Can replace schema, protocols, ...
• Twice used to change physical location
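The flip lends itself to a short control-loop sketch. The drain/upgrade/activate helpers are hypothetical stand-ins for whatever cluster tooling performs each step; the point is that the two halves never serve traffic with mixed versions.

```python
# Sketch of the big flip with hypothetical cluster-control helpers.
def big_flip(nodes, version, drain, upgrade, activate):
    half = len(nodes) // 2
    idle, active = nodes[:half], nodes[half:]

    drain(idle)                     # 1) take down 1/2 the nodes (50% capacity)
    upgrade(idle, version)          # 2) upgrade that half offline
    drain(active); activate(idle)   # 3) the "flip": swap which half serves;
                                    #    should be one atomic switch change
    upgrade(active, version)        # 4) upgrade the second half
    activate(active)                # 5) back to 100%, one version throughout
```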
The Hat Trick
• Merge the handling of:
  • Disaster tolerance
  • Online evolution
  • Overload handling
• The first two reduce capacity, which then triggers overload handling (!)
Coming of The Infrastructure
Infrastructure Services
• Much simpler devices
  • Lower cost & more functionality
  • Longer battery life
• Data is in the infrastructure
  • Can lose the device
  • Enables groupware
  • Can update/access from home or work
  • Phone book on the web, not in the phone
  • Can use a real PC & keyboard
• Much faster access
  • Surfing is 3-7 times faster
  • Graphics look good
Transformation Examples
• Tailor content for each user & device
[Figure: distillation examples showing size reductions of 6.8x, 65x, and 10x; includes an excerpt from section 1.2, “The Remote Queue Model”]
Infrastructure Services (2)
• Much cheaper overall cost (20x?)
  • Device utilization = 4%, infrastructure = 80%
  • Admin & support costs also decrease
• “Super Convergence” (all -> IP)
  • View PowerPoint slides with teleconference
  • Integrated cell phone, pager, web/email access
  • Map, driving directions, location-based services
• Can upgrade/add services in place!
  • Devices last longer and grow in usefulness
  • Easy to deploy new services => new revenue
Internet Phases (prediction)
• Internet as New Media
  • HTML, basic search
• Consumer Services (today)
  • Shopping, travel, Yahoo!, eBay, tickets, …
• Industrial Services
  • XML, micropayments, spot markets
• Everything in the Infrastructure
  • Store your data in the infrastructure
  • Access anytime/anywhere
P2P Services?
• Not soon…
• Challenges:
  • Untrusted nodes !!
  • Network partitions
  • Much harder to understand behavior
  • Harder to upgrade
• Relatively few advantages…
Better: Smart Clients
• Mostly helps with (sketch below):
  • Load balancing
  • Disaster tolerance
  • Overload
• Can also offload work from servers
• Can also personalize results
  • E.g. mix search results locally
  • Can include private data
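A smart client is easy to sketch: put the mirror list and the failover policy in the client itself. The mirror URLs below are hypothetical.

```python
# Sketch of a smart client: client-side load balancing plus failover,
# covering both overload spreading and disaster tolerance.
import random
import urllib.request

MIRRORS = ["https://east.example.com", "https://west.example.com"]  # hypothetical

def fetch(path):
    for site in random.sample(MIRRORS, k=len(MIRRORS)):  # randomize load
        try:
            with urllib.request.urlopen(site + path, timeout=3) as r:
                return r.read()
        except OSError:
            continue  # mirror down or unreachable: try the next one
    raise RuntimeError("all mirrors unavailable")
```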
The Problem
• Need services for computers to use
  • HTML only works for people
  • Sites depend on human interpretation of ambiguous content
• “Scraping” content is BAD
  • Very error prone
  • No strategy for evolution
• XML doesn’t solve any of these issues!
  • At best: RPC with an extensible schema
Why it is hard…
• The real problem is *social*
  • What do the fields mean?
  • Who gets to decide?
• Doesn’t make evolution better…
  • Two sides still need to agree on a schema
  • Can you ignore stuff you don’t understand?
  • When can a field change? Consequences?
  • At least need a versioning system…
• XML can mislead us to ignore/postpone the real issues!
Plug for new area…
• Bridging the IT gap is the only long-term path to global stability
• Convergence makes it possible:
  • 802.11 wireless ($5/chipset)
  • Systems on a chip (cost, power)
  • Infrastructure services (cost, power)
• Goal: 10-100x reduction in overall cost and power
Five Claims (recap)
• Scalability is easy (availability is hard)
  • How to build a giant-scale system
• Services are King (infrastructure centric)
• P2P is cool but overrated
• XML doesn’t help much
• Need new IT for the Third World
Refinement
• Retrieve part of a distilled object at higher quality
[Figure: image distilled by 60x, with a zoom back in to original resolution]
The CAP Theorem
• Consistency, Availability, Tolerance to network Partitions
• Theorem: you can have at most two of these properties for any shared-data system
Forfeit Partitions (keep Consistency & Availability)
• Examples:
  • Single-site databases
  • Cluster databases
  • LDAP
  • xFS file system
• Traits:
  • 2-phase commit
  • Cache validation protocols
Forfeit Availability (keep Consistency & Partition tolerance)
• Examples:
  • Distributed databases
  • Distributed locking
  • Majority protocols
• Traits:
  • Pessimistic locking
  • Make minority partitions unavailable
Forfeit Consistency (keep Availability & Partition tolerance)
• Examples:
  • Coda
  • Web caching
  • DNS
• Traits:
  • Expirations/leases
  • Conflict resolution
  • Optimistic
These Tradeoffs are Real
• The whole space is useful
• Real internet systems are a careful mixture of ACID and BASE subsystems
  • We use ACID for user profiles and logging (for revenue)
• But there is almost no work in this area
  • Symptom of a deeper problem: the systems and database communities are separate but overlapping (with distinct vocabulary)
CAP Take Homes
• Can have consistency & availability within a cluster (foundation of Ninja), but it is still hard in practice
• OS/networking: good at BASE/availability, but terrible at consistency
• Databases: better at C than availability
• Wide-area databases can’t have both
• Disconnected clients can’t have both
• All systems are probabilistic…
The DQ Principle
• Data/query × queries/sec = constant = DQ
  • For a given node
  • For a given app/OS release
• A fault can reduce the capacity (Q), completeness (D), or both
• Faults reduce this constant linearly, at best (toy numbers below)
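Toy numbers make the principle concrete; the per-node constant below is invented, not from the talk.

```python
# DQ = data-per-query * queries/sec is (at best) conserved per node, so
# the cluster's total DQ scales with the number of live nodes.
DQ_PER_NODE = 1_000  # assumed constant for one node on one app/OS release

def cluster_dq(live_nodes):
    return live_nodes * DQ_PER_NODE

full, faulted = cluster_dq(100), cluster_dq(90)  # lose 10 of 100 nodes
print(faulted / full)  # 0.9: spend it as 90% Q (yield), 90% D (harvest), or a mix
```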
Harvest & Yield
• Yield: fraction of answered queries
  • Related to uptime, but measured by queries, not by time
  • Drop 1 out of 10 connections => 90% yield
  • At full utilization: yield ~ capacity ~ Q
• Harvest: fraction of the complete result
  • Reflects that some of the data may be missing due to faults
  • Replication: maintain D under faults
• DQ corollary: harvest × yield ~ constant (checked below)
  • ACID => choose 100% harvest (reduce Q but 100% D)
  • Internet => choose 100% yield (available but reduced D)
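The corollary's two endpoints are easy to check for one lost node out of 100, assuming data is spread evenly (toy numbers, not Brewer's):

```python
# One fault in a 100-node cluster with data spread evenly: the two
# policies land in different places but preserve harvest * yield.
lost = 1 / 100

acid_harvest, acid_yield = 1.0, 1.0 - lost  # refuse incomplete answers: 100% D
web_harvest,  web_yield  = 1.0 - lost, 1.0  # answer everything: 100% yield

print(acid_harvest * acid_yield)  # 0.99
print(web_harvest * web_yield)    # 0.99: same product, different tradeoff
```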
Harvest Options
1) Ignore lost nodes
  • RPC gives up
  • Forfeit small part of the database
  • Reduce D, keep Q
2) Pair up nodes (RAID-style mirroring)
  • RPC tries alternate
  • Survives one fault per pair
  • Reduce Q, keep D
3) n-member replica groups
• Decide when you care...
Replica Groups
• With n members:
  • Each fault reduces Q by 1/n
  • D stable until the nth fault
  • Added load is 1/(n-1) per fault (checked below)
    • n=2 => double load or 50% capacity
    • n=4 => 133% load or 75% capacity
    • The “load redirection problem”
• Disaster tolerance: better have >3 mirrors
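The load-redirection numbers on this slide follow directly from the 1/(n-1) rule; here is the arithmetic, assuming load was spread evenly before the fault:

```python
# Each fault sends the failed member's share to the n-1 survivors.
for n in (2, 4):
    surviving_load = 1 + 1 / (n - 1)
    print(f"n={n}: survivors run at {surviving_load:.0%} load, "
          f"so plan for {1 / surviving_load:.0%} normal utilization")
# n=2: survivors run at 200% load, so plan for 50% normal utilization
# n=4: survivors run at 133% load, so plan for 75% normal utilization
```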
Graceful Degradation
• Goal: smooth decrease in harvest/yield proportional to faults
  • We know DQ drops linearly
• Saturation will occur
  • High peak/average ratios...
  • Must reduce harvest or yield (or both)
  • Must do admission control!!!
• One answer: reduce D dynamically (arithmetic below)
  • Disaster => redirect load, then reduce D to compensate for the extra load
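The "reduce D dynamically" idea is just arithmetic over the DQ budget. All the numbers below are invented for illustration:

```python
# After a disaster, redirected traffic raises Q; shrink D so that D * Q
# still fits the surviving site's DQ budget and yield stays at 100%.
SITE_DQ = 100_000                  # assumed DQ capacity of the surviving site
own_q, redirected_q = 1_000, 600   # queries/sec: local load, then extra (toy)

d_full = SITE_DQ / own_q                       # data/query at full harvest
d_degraded = SITE_DQ / (own_q + redirected_q)  # what we can afford now
print(f"harvest drops to {d_degraded / d_full:.0%} to absorb the extra load")
```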
Thinking Probabilistically
• Maximize symmetry
  • SPMD + simple replication schemes
• Make faults independent
  • Requires thought
  • Avoid cascading errors/faults
  • Understand redirected load
  • KISS
• Use randomness
  • Makes worst-case and average case the same
  • Ex: Inktomi spreads data & queries randomly
  • Node loss implies a random 1% harvest reduction
Server Pollution
• Can’t fix all memory leaks
  • Third-party software leaks memory and sockets
  • So does the OS sometimes
  • Some failures tie up local resources
• Solution: planned periodic “bounce” (sketch below)
  • Not worth the stress to do any better
  • Bounce time is less than 10 seconds
  • Nice to remove load first…
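A sketch of the planned bounce as a rolling loop. The drain/restart/undrain helpers are hypothetical stand-ins for the cluster tooling, and the 24-hour period is illustrative.

```python
# Rolling "bounce": drain each node, fast-restart it, and move on, so the
# cluster sheds leaked memory/sockets without a visible outage.
import time

def rolling_bounce(nodes, drain, restart, undrain, period_s=24 * 3600):
    for node in nodes:
        drain(node)     # nice to remove load first
        restart(node)   # target: back up in < 10 s
        undrain(node)
        time.sleep(period_s / len(nodes))  # spread bounces across the period
```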
Key New Problems
• Unknown but large growth
  • Incremental & absolute scalability
  • 1000s of components
• Must be truly highly available
  • Hot swap everything (no recovery time allowed)
  • No “night”
• Graceful degradation under faults & saturation
• Constant evolution (internet time)
  • Software will be buggy
  • Hardware will fail
• These can’t be emergencies...
Conclusions
• Parallel programming is very relevant, except…
  • Historically avoids availability
  • No notion of online evolution
  • Limited notions of graceful degradation (checkpointing)
  • Best for CPU-bound tasks
• Must think probabilistically about everything
  • No such thing as a 100% working system
  • No such thing as 100% fault tolerance
  • Partial results are often OK (and better than none)
  • Capacity × Completeness == Constant
Partial checklist
• What is shared? (namespace, schema?)
• What kind of state in each boundary?
• How would you evolve an API?
• Lifetime of references? Expiration impact?
• Graceful degradation as modules go down?
• External persistent names?
• Consistency semantics and boundary?
The Move to Clusters
• No single machine can handle the load
• Only solution is clusters
• Other cluster advantages:
  • Cost: about 50% cheaper per CPU
  • Availability: possible to build HA systems
  • Incremental growth: add nodes as needed
  • Replace whole nodes (easier)
Goals
• Sheer scale
  • Handle 100M users, going toward 1B
  • Largest: AOL web cache, 12B hits/day
• High availability
  • Large cost for downtime
    • $250K per hour for online retailers
    • $6M per hour for stock brokers
• Disaster tolerance?
• Overload handling
• System evolution
• Decentralization?