In his 2003 Vanguard Conference talk, Prof. Eric A. Brewer presented five claims about the design and management of giant-scale systems. He emphasized that scalability is far easier than availability, argued for the primacy of services delivered from the infrastructure, and offered practical guidance on disaster tolerance and online evolution. Brewer also argued that while P2P systems are fascinating, they are overrated. His claims continue to inform current practice in building resilient and scalable systems.
Some Claims about Giant-Scale Systems
Prof. Eric A. Brewer, UC Berkeley
Vanguard Conference, May 5, 2003
Five Claims
• Scalability is easy (availability is hard)
  • How to build a giant-scale system
• Services are King (infrastructure centric)
• P2P is cool but overrated
• XML doesn’t help much
• Need new IT for the Third World
Key New Problems
• Unknown but large growth
  • Incremental & absolute scalability
  • 1000s of components
• Must be truly highly available
  • Hot swap everything (no recovery time)
  • $6M/hour for stocks, $300k/hour for retailers
• Graceful degradation under faults & saturation
• Constant evolution (internet time)
  • Software will be buggy
  • Hardware will fail
• These can’t be emergencies...
Typical Cluster
Scalability is EASY
• Just add more nodes….
  • Unless you want HA
  • Or you want to change the system…
• Hard parts:
  • Availability
  • Overload
  • Evolution
Step 1: Basic Architecture
Step 2: Divide Persistent Data
• Transactional (expensive, must be correct)
• Supporting data
  • HTML, images, applets, etc.
  • Read mostly
  • Publish in snapshots (normally stale)
• Session state
  • Persistent across connections
  • Limited lifetime
  • Can be lost (rarely)
• (All three classes are sketched in code below)
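To make the three classes concrete, here is a minimal sketch of how they might map to storage policies. Everything here is illustrative: the dictionaries, the 30-minute TTL, and the helper names are assumptions, not from the talk.

```python
# Illustrative mapping of the three data classes to storage policies.
import time

# Transactional data: expensive, must be correct. In practice this lives
# behind an ACID database; a dict is shown here only as a placeholder.
transactional = {}

# Supporting data: read-mostly, published in atomic snapshots. Readers may
# see a slightly stale snapshot, which is acceptable by design.
snapshot = {"version": 41, "pages": {"/": "<html>home</html>"}}

def publish(new_pages):
    """Atomic push of a new snapshot: swap a reference, never edit in place."""
    global snapshot
    snapshot = {"version": snapshot["version"] + 1, "pages": new_pages}

# Session state: persists across connections, has a limited lifetime, and
# losing it (rarely) only forces the user to start over.
sessions = {}
SESSION_TTL = 1800  # 30 minutes, illustrative

def put_session(sid, data):
    sessions[sid] = (time.time() + SESSION_TTL, data)

def get_session(sid):
    entry = sessions.get(sid)
    if entry and entry[0] > time.time():
        return entry[1]
    sessions.pop(sid, None)  # expired or lost: tolerable by design
    return None
```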
Step 3: Basic Availability
• a) Depend on layer-7 switches
  • Isolate external names (IP address, port) from specific machines
• b) Automatic detection of problems (sketch below)
  • Node-level checks (e.g. memory footprint)
  • (Remote) app-level checks
• c) Focus on MTTR
  • Easier to fix & test than MTBF
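A minimal sketch of what (b) could look like, assuming each node exposes hypothetical /health/mem and /search endpoints; the URLs, the 2 GB threshold, and the recovery action are all illustrative.

```python
# Hypothetical monitor combining node-level and app-level checks.
import urllib.request

NODES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]  # behind the L7 switch

def node_ok(base):
    """Node-level check: e.g. is the self-reported memory footprint sane?"""
    try:
        with urllib.request.urlopen(base + "/health/mem", timeout=2) as r:
            return int(r.read()) < 2 * 1024**3  # assumed 2 GB ceiling
    except (OSError, ValueError):
        return False

def app_ok(base):
    """(Remote) app-level check: run one real query end to end."""
    try:
        with urllib.request.urlopen(base + "/search?q=probe", timeout=5) as r:
            return r.status == 200
    except OSError:
        return False

for node in NODES:
    if not (node_ok(node) and app_ok(node)):
        # Pull the node from the switch pool and restart it. Because the
        # L7 switch isolates the external name, only MTTR matters here.
        print("restart", node)
```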
Step 4: Overload Strategy
• Goal: degrade service (some) to allow more capacity during overload
• Examples:
  • Simpler pages (less dynamic, smaller size)
  • Fewer options (just the basics)
• Must kick in automatically (sketch below)
  • “Overload mode”
  • Relatively easy to detect
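A toy sketch of the automatic switch into “overload mode”. The QPS threshold and the page-rendering stubs are invented; a real detector would likely also watch queue lengths and response times.

```python
# Toy "overload mode": measure request rate over a one-second sliding
# window and degrade to a simpler page past an assumed threshold.
import time
from collections import deque

OVERLOAD_QPS = 5000   # illustrative capacity threshold
recent = deque()      # timestamps of requests in the last second

def overloaded():
    now = time.time()
    while recent and recent[0] < now - 1.0:
        recent.popleft()
    return len(recent) > OVERLOAD_QPS

def render_full(req):  return "<html>dynamic, personalized page</html>"
def render_basic(req): return "<html>small static page, just the basics</html>"

def handle(req):
    recent.append(time.time())
    return render_basic(req) if overloaded() else render_full(req)
```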
Step 5: Disaster Tolerance
• a) Pick a few locations
  • Independent failure
• b) Dynamic redirection of load
  • Best: client-side control
  • Next: switch all traffic (long routes)
  • Worst: DNS remapping (takes a while)
• c) Target site will get overloaded
  • But you have overload handling
Step 6: Online Evolution
• Goal: rapid evolution without downtime
• a) Publishing model
  • Decouple development from live system
  • Atomic push of content
  • Automatic revert if trouble arises
• b) Three methods
Evolution: Three Approaches
• Flash upgrade
  • Fast reboot into new version
  • Focus on MTTR (< 10 sec)
  • Reduces yield (and uptime)
• Rolling upgrade
  • Upgrade nodes one at a time in a “wave”
  • Temporary 1/n harvest reduction, 100% yield
  • Requires co-existing versions
• “Big Flip”
The Big Flip
• Steps (sketched below):
  1) Take down 1/2 the nodes
  2) Upgrade that half
  3) Flip the “active half” (site upgraded)
  4) Upgrade second half
  5) Return to 100%
• Avoids mixed versions (!)
  • Can replace schema, protocols, ...
• Twice used to change physical location
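The flip lends itself to a short control-loop sketch. The drain/upgrade/activate helpers are hypothetical stand-ins for whatever cluster tooling performs each step; the point is that the two halves never serve traffic with mixed versions.

```python
# Sketch of the big flip with hypothetical cluster-control helpers.
def big_flip(nodes, version, drain, upgrade, activate):
    half = len(nodes) // 2
    idle, active = nodes[:half], nodes[half:]

    drain(idle)                     # 1) take down 1/2 the nodes (50% capacity)
    upgrade(idle, version)          # 2) upgrade that half offline
    drain(active); activate(idle)   # 3) the "flip": swap which half serves;
                                    #    should be one atomic switch change
    upgrade(active, version)        # 4) upgrade the second half
    activate(active)                # 5) back to 100%, one version throughout
```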
The Hat Trick
• Merge the handling of:
  • Disaster tolerance
  • Online evolution
  • Overload handling
• The first two reduce capacity, which then triggers overload handling (!)
Coming of The Infrastructure
Infrastructure Services
• Much simpler devices
  • Lower cost & more functionality
  • Longer battery life
• Data is in the infrastructure
  • Can lose the device
  • Enables groupware
  • Can update/access from home or work
  • Phone book on the web, not in the phone
  • Can use a real PC & keyboard
• Much faster access
  • Surfing is 3-7 times faster
  • Graphics look good
Transformation Examples
• Tailor content for each user & device
[Figure: distillation examples showing size reductions of 6.8x, 65x, and 10x; includes an excerpt from section 1.2, “The Remote Queue Model”]
Infrastructure Services (2)
• Much cheaper overall cost (20x?)
  • Device utilization = 4%, infrastructure = 80%
  • Admin & support costs also decrease
• “Super Convergence” (all -> IP)
  • View PowerPoint slides with teleconference
  • Integrated cell phone, pager, web/email access
  • Map, driving directions, location-based services
• Can upgrade/add services in place!
  • Devices last longer and grow in usefulness
  • Easy to deploy new services => new revenue
Internet Phases (prediction)
• Internet as New Media
  • HTML, basic search
• Consumer Services (today)
  • Shopping, travel, Yahoo!, eBay, tickets, …
• Industrial Services
  • XML, micropayments, spot markets
• Everything in the Infrastructure
  • Store your data in the infrastructure
  • Access anytime/anywhere
P2P Services?
• Not soon…
• Challenges:
  • Untrusted nodes !!
  • Network partitions
  • Much harder to understand behavior
  • Harder to upgrade
• Relatively few advantages…
Better: Smart Clients
• Mostly helps with (sketch below):
  • Load balancing
  • Disaster tolerance
  • Overload
• Can also offload work from servers
• Can also personalize results
  • E.g. mix search results locally
  • Can include private data
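A smart client is easy to sketch: put the mirror list and the failover policy in the client itself. The mirror URLs below are hypothetical.

```python
# Sketch of a smart client: client-side load balancing plus failover,
# covering both overload spreading and disaster tolerance.
import random
import urllib.request

MIRRORS = ["https://east.example.com", "https://west.example.com"]  # hypothetical

def fetch(path):
    for site in random.sample(MIRRORS, k=len(MIRRORS)):  # randomize load
        try:
            with urllib.request.urlopen(site + path, timeout=3) as r:
                return r.read()
        except OSError:
            continue  # mirror down or unreachable: try the next one
    raise RuntimeError("all mirrors unavailable")
```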
The Problem
• Need services for computers to use
  • HTML only works for people
  • Sites depend on human interpretation of ambiguous content
• “Scraping” content is BAD
  • Very error prone
  • No strategy for evolution
• XML doesn’t solve any of these issues!
  • At best: RPC with an extensible schema
Why it is hard…
• The real problem is *social*
  • What do the fields mean?
  • Who gets to decide?
• Doesn’t make evolution better…
  • Two sides still need to agree on a schema
  • Can you ignore stuff you don’t understand?
  • When can a field change? Consequences?
  • At least need a versioning system…
• XML can mislead us to ignore/postpone the real issues!
Plug for new area…
• Bridging the IT gap is the only long-term path to global stability
• Convergence makes it possible:
  • 802.11 wireless ($5/chipset)
  • Systems on a chip (cost, power)
  • Infrastructure services (cost, power)
• Goal: 10-100x reduction in overall cost and power
Five Claims (recap)
• Scalability is easy (availability is hard)
  • How to build a giant-scale system
• Services are King (infrastructure centric)
• P2P is cool but overrated
• XML doesn’t help much
• Need new IT for the Third World
Refinement
• Retrieve part of a distilled object at higher quality
[Figure: image distilled by 60x, with a zoom back in to original resolution]
The CAP Theorem
• Consistency, Availability, Tolerance to network Partitions
• Theorem: you can have at most two of these properties for any shared-data system
Forfeit Partitions (keep Consistency & Availability)
• Examples:
  • Single-site databases
  • Cluster databases
  • LDAP
  • xFS file system
• Traits:
  • 2-phase commit
  • Cache validation protocols
Forfeit Availability (keep Consistency & Partition tolerance)
• Examples:
  • Distributed databases
  • Distributed locking
  • Majority protocols
• Traits:
  • Pessimistic locking
  • Make minority partitions unavailable
Forfeit Consistency (keep Availability & Partition tolerance)
• Examples:
  • Coda
  • Web caching
  • DNS
• Traits:
  • Expirations/leases
  • Conflict resolution
  • Optimistic
These Tradeoffs are Real
• The whole space is useful
• Real internet systems are a careful mixture of ACID and BASE subsystems
  • We use ACID for user profiles and logging (for revenue)
• But there is almost no work in this area
  • Symptom of a deeper problem: the systems and database communities are separate but overlapping (with distinct vocabulary)
CAP Take Homes
• Can have consistency & availability within a cluster (foundation of Ninja), but it is still hard in practice
• OS/networking: good at BASE/availability, but terrible at consistency
• Databases: better at C than availability
• Wide-area databases can’t have both
• Disconnected clients can’t have both
• All systems are probabilistic…
The DQ Principle
• Data/query × queries/sec = constant = DQ
  • For a given node
  • For a given app/OS release
• A fault can reduce the capacity (Q), completeness (D), or both
• Faults reduce this constant linearly, at best (toy numbers below)
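Toy numbers make the principle concrete; the per-node constant below is invented, not from the talk.

```python
# DQ = data-per-query * queries/sec is (at best) conserved per node, so
# the cluster's total DQ scales with the number of live nodes.
DQ_PER_NODE = 1_000  # assumed constant for one node on one app/OS release

def cluster_dq(live_nodes):
    return live_nodes * DQ_PER_NODE

full, faulted = cluster_dq(100), cluster_dq(90)  # lose 10 of 100 nodes
print(faulted / full)  # 0.9: spend it as 90% Q (yield), 90% D (harvest), or a mix
```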
Harvest & Yield
• Yield: fraction of answered queries
  • Related to uptime, but measured by queries, not by time
  • Drop 1 out of 10 connections => 90% yield
  • At full utilization: yield ~ capacity ~ Q
• Harvest: fraction of the complete result
  • Reflects that some of the data may be missing due to faults
  • Replication: maintain D under faults
• DQ corollary: harvest × yield ~ constant (checked below)
  • ACID => choose 100% harvest (reduce Q but 100% D)
  • Internet => choose 100% yield (available but reduced D)
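The corollary's two endpoints are easy to check for one lost node out of 100, assuming data is spread evenly (toy numbers, not Brewer's):

```python
# One fault in a 100-node cluster with data spread evenly: the two
# policies land in different places but preserve harvest * yield.
lost = 1 / 100

acid_harvest, acid_yield = 1.0, 1.0 - lost  # refuse incomplete answers: 100% D
web_harvest,  web_yield  = 1.0 - lost, 1.0  # answer everything: 100% yield

print(acid_harvest * acid_yield)  # 0.99
print(web_harvest * web_yield)    # 0.99: same product, different tradeoff
```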
Harvest Options
1) Ignore lost nodes
  • RPC gives up
  • Forfeit small part of the database
  • Reduce D, keep Q
2) Pair up nodes (RAID-style mirroring)
  • RPC tries alternate
  • Survives one fault per pair
  • Reduce Q, keep D
3) n-member replica groups
• Decide when you care...
Replica Groups
• With n members:
  • Each fault reduces Q by 1/n
  • D stable until the nth fault
  • Added load is 1/(n-1) per fault (checked below)
    • n=2 => double load or 50% capacity
    • n=4 => 133% load or 75% capacity
    • The “load redirection problem”
• Disaster tolerance: better have >3 mirrors
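The load-redirection numbers on this slide follow directly from the 1/(n-1) rule; here is the arithmetic, assuming load was spread evenly before the fault:

```python
# Each fault sends the failed member's share to the n-1 survivors.
for n in (2, 4):
    surviving_load = 1 + 1 / (n - 1)
    print(f"n={n}: survivors run at {surviving_load:.0%} load, "
          f"so plan for {1 / surviving_load:.0%} normal utilization")
# n=2: survivors run at 200% load, so plan for 50% normal utilization
# n=4: survivors run at 133% load, so plan for 75% normal utilization
```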
Graceful Degradation
• Goal: smooth decrease in harvest/yield proportional to faults
  • We know DQ drops linearly
• Saturation will occur
  • High peak/average ratios...
  • Must reduce harvest or yield (or both)
  • Must do admission control!!!
• One answer: reduce D dynamically (arithmetic below)
  • Disaster => redirect load, then reduce D to compensate for the extra load
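The "reduce D dynamically" idea is just arithmetic over the DQ budget. All the numbers below are invented for illustration:

```python
# After a disaster, redirected traffic raises Q; shrink D so that D * Q
# still fits the surviving site's DQ budget and yield stays at 100%.
SITE_DQ = 100_000                  # assumed DQ capacity of the surviving site
own_q, redirected_q = 1_000, 600   # queries/sec: local load, then extra (toy)

d_full = SITE_DQ / own_q                       # data/query at full harvest
d_degraded = SITE_DQ / (own_q + redirected_q)  # what we can afford now
print(f"harvest drops to {d_degraded / d_full:.0%} to absorb the extra load")
```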
Thinking Probabilistically
• Maximize symmetry
  • SPMD + simple replication schemes
• Make faults independent
  • Requires thought
  • Avoid cascading errors/faults
  • Understand redirected load
  • KISS
• Use randomness
  • Makes worst-case and average case the same
  • Ex: Inktomi spreads data & queries randomly
  • Node loss implies a random 1% harvest reduction
Server Pollution
• Can’t fix all memory leaks
  • Third-party software leaks memory and sockets
  • So does the OS sometimes
  • Some failures tie up local resources
• Solution: planned periodic “bounce” (sketch below)
  • Not worth the stress to do any better
  • Bounce time is less than 10 seconds
  • Nice to remove load first…
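A sketch of the planned bounce as a rolling loop. The drain/restart/undrain helpers are hypothetical stand-ins for the cluster tooling, and the 24-hour period is illustrative.

```python
# Rolling "bounce": drain each node, fast-restart it, and move on, so the
# cluster sheds leaked memory/sockets without a visible outage.
import time

def rolling_bounce(nodes, drain, restart, undrain, period_s=24 * 3600):
    for node in nodes:
        drain(node)     # nice to remove load first
        restart(node)   # target: back up in < 10 s
        undrain(node)
        time.sleep(period_s / len(nodes))  # spread bounces across the period
```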
Key New Problems
• Unknown but large growth
  • Incremental & absolute scalability
  • 1000s of components
• Must be truly highly available
  • Hot swap everything (no recovery time allowed)
  • No “night”
• Graceful degradation under faults & saturation
• Constant evolution (internet time)
  • Software will be buggy
  • Hardware will fail
• These can’t be emergencies...
Conclusions
• Parallel programming is very relevant, except…
  • Historically avoids availability
  • No notion of online evolution
  • Limited notions of graceful degradation (checkpointing)
  • Best for CPU-bound tasks
• Must think probabilistically about everything
  • No such thing as a 100% working system
  • No such thing as 100% fault tolerance
  • Partial results are often OK (and better than none)
  • Capacity × Completeness == Constant
Partial checklist
• What is shared? (namespace, schema?)
• What kind of state in each boundary?
• How would you evolve an API?
• Lifetime of references? Expiration impact?
• Graceful degradation as modules go down?
• External persistent names?
• Consistency semantics and boundary?
The Move to Clusters
• No single machine can handle the load
• Only solution is clusters
• Other cluster advantages:
  • Cost: about 50% cheaper per CPU
  • Availability: possible to build HA systems
  • Incremental growth: add nodes as needed
  • Replace whole nodes (easier)
Goals
• Sheer scale
  • Handle 100M users, going toward 1B
  • Largest: AOL web cache, 12B hits/day
• High availability
  • Large cost for downtime
    • $250K per hour for online retailers
    • $6M per hour for stock brokers
• Disaster tolerance?
• Overload handling
• System evolution
• Decentralization?