
Cloud Computing and Data Centers: Overview

Cloud Computing and Data Centers: Overview. What’s Cloud Computing? Data Centers and “Computing at Scale”. Case Studies: the Google File System and the Map-Reduce Programming Model. Optional Material: Google Bigtable. Readings: do the required readings; also do some of the optional readings if interested.


Presentation Transcript


  1. Cloud Computing and Data Centers: Overview • What’s Cloud Computing? • Data Centers and “Computing at Scale” • Case Studies: • Google File System • Map-Reduce Programming Model • Optional Material: • Google Bigtable • Readings: do the required readings; also do some of the optional readings if interested

  2. Why Study Cloud Computing and Data Centers? • Using Google as an example: GFS, MapReduce, etc. • mostly related to distributed systems, not really “networking” stuff • Two primary goals: • they represent part of current and “future” trends: how applications will be serviced, delivered, … • what are the important “new” networking problems? • more importantly, what lessons can we learn in terms of (future) networking design? • closely related, and there are many similar issues/challenges (availability, reliability, scalability, manageability, …) • (but of course, there are also unique challenges in networking)

  3. Internet and Web • Simple client-server model • a number of clients served by a single server • performance determined by “peak load” • doesn’t scale well when the # of clients suddenly increases (a “flash crowd”), e.g., the server crashes • From single server to blade server to server farm (or data center)

  4. Internet and Web … • From “traditional” web to “web services” (or SOA) • no longer simply “file” (or web page) downloads • pages often dynamically generated, with more complicated “objects” (e.g., Flash videos used in YouTube) • HTTP is used simply as a “transfer” protocol • many other “application protocols” layered on top of HTTP • web services & SOA (service-oriented architecture) • A schematic representation of “modern” web services: front-end (web rendering, request routing, aggregators, …) and back-end (database, storage, computing, …)

  5. Data Center and Cloud Computing • Data center: large server farms + data warehouses • not simply for web/web services • managed infrastructure: expensive! • From web hosting to cloud computing • individual web/content providers: must provision for peak load • expensive, and typically resources are under-utilized • web hosting: a third party provides and owns the (server farm) infrastructure, hosting web services for content providers • “server consolidation” via virtualization: each hosted App + Guest OS (under client/web-service control) runs on top of a VMM

  6. Cloud Computing • Cloud computing and cloud-based services: • beyond web-based “information access” or “information delivery” • computing, storage, … • Cloud Computing: NIST Definition "Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction." • Models of Cloud Computing • “Infrastructure as a Service” (IaaS), e.g., Amazon EC2, Rackspace • “Platform as a Service” (PaaS), e.g., Microsoft Azure • “Software as a Service” (SaaS), e.g., Google

  7. Data Centers: Key Challenges • With thousands of servers within a data center: • how to write applications (services) for them? • how to allocate resources, and manage them? • in particular, how to ensure performance, reliability, availability, … • Scale and complexity bring other key challenges • with thousands of machines, failures are the default case! • load-balancing, handling “heterogeneity,” … • data center (server cluster) as a “computer” • “super-computer” vs. “cluster computer” • a single “super-high-performance” and highly reliable computer vs. a “computer” built out of thousands of “cheap & unreliable” PCs • Pros and cons?

  8. Case Studies • Google File System (GFS) • a “file system” (or “OS”) for a “cluster computer” • an “overlay” on top of the “native” OS on individual machines • designed with certain (common) types of applications in mind, and with failures as the default case • Google MapReduce (cf. Microsoft Dryad) • MapReduce: a new “programming paradigm” for certain (common) types of applications, built on top of GFS • Other examples (optional): • BigTable: a (semi-)structured database for efficient key-value queries, etc., built on top of GFS • Amazon Dynamo: a distributed <key, value> storage system; high availability is a key design goal • Google’s Chubby, Sawzall, etc. • Open-source systems: Hadoop, …

  9. Google Scale and Philosophy • Lots of data • copies of the web, satellite data, user data, email and USENET, Subversion backing store • Workloads are large and easily parallelizable • No commercial system big enough • couldn’t afford it if there was one • might not have made appropriate design choices • But truckloads of low-cost machines • 450,000 machines (NYTimes estimate, June 14th 2006) • Failures are the norm • Even reliable systems fail at Google scale • Software must tolerate failures • Which machine an application is running on should not matter • Firm believers in the “end-to-end” argument • Care about perf/$, not absolute machine perf

  10. Typical Cluster at Google • [Figure] Each machine runs Linux with a GFS chunkserver and a scheduler slave, alongside application processes (BigTable servers, user tasks); cluster-wide services include a cluster scheduling master, the GFS master, a lock service, and the BigTable master.

  11. Google: System Building Blocks • Google File System (GFS): • raw storage • (Cluster) Scheduler: • schedules jobs onto machines • Lock service: • distributed lock manager • also can reliably hold tiny files (100s of bytes) w/ high availability • Bigtable: • a multi-dimensional database • MapReduce: • simplified large-scale data processing • ....

  12. Chubby: Distributed Lock Service • {lock/file/name} service • Coarse-grained locks, can store small amount of data in a lock • 5 replicas, need a majority vote to be active • Also an OSDI ’06 Paper

  13. Google File System Key Design Considerations • Component failures are the norm • hardware component failures, software bugs, human errors, power supply issues, … • Solutions: built-in mechanisms for monitoring, error detection, fault tolerance, automatic recovery • Files are huge by traditional standards • multi-GB files are common, billions of objects • most writes (modifications or “mutations”) are “append” • two types of reads: large # of “stream” (i.e., sequential) reads, with small # of “random” reads • High concurrency (multiple “producers/consumers” on a file) • atomicity with minimal synchronization • Sustained bandwidth more important than latency

  14. GFS Architectural Design • A GFS cluster: • a single master + multiple chunkservers per master • running on commodity Linux machines • A file: a sequence of fixed-size chunks (64 MB each) • labeled with 64-bit unique global IDs • stored at chunkservers (as “native” Linux files, on local disk) • each chunk mirrored across (default 3) chunkservers • master server: maintains all metadata • name space, access control, file-to-chunk mappings, garbage collection, chunk migration • why only a single master? (with read-only shadow masters) • simple, and it only answers chunk location queries from clients! • chunk servers (“slaves” or “workers”): • interact directly with clients, perform reads/writes, …
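
  As a rough sketch of what the master’s in-memory metadata might look like (Python, with hypothetical names; not Google’s actual data structures):

    import uuid

    CHUNK_SIZE = 64 * 1024 * 1024  # fixed-size 64 MB chunks

    class MasterMetadata:
        """Illustrative in-memory state of a GFS-like master."""
        def __init__(self):
            self.file_to_chunks = {}   # file path -> ordered list of 64-bit chunk IDs
            self.chunk_locations = {}  # chunk ID -> set of chunkserver addresses (not persisted)
            self.chunk_version = {}    # chunk ID -> version number, to detect stale replicas

        def create_chunk(self, path):
            chunk_id = uuid.uuid4().int & ((1 << 64) - 1)   # 64-bit globally unique ID
            self.file_to_chunks.setdefault(path, []).append(chunk_id)
            self.chunk_version[chunk_id] = 1
            return chunk_id

        def lookup(self, path, chunk_index):
            """What a client asks the master: which chunk is this, and where are its replicas?"""
            chunk_id = self.file_to_chunks[path][chunk_index]
            return chunk_id, self.chunk_version[chunk_id], self.chunk_locations.get(chunk_id, set())

  Note that only the first two mappings plus the version numbers are kept; chunk locations are rebuilt by polling chunkservers, as described two slides later.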

  15. GFS Architecture: Illustration • GFS clients • consult master for metadata • typically ask for multiple chunk locations per request • access data from chunkservers • Separation of control and data flows

  16. Chunk Size and Metadata • Chunk size: 64 MB • fewer chunk location requests to the master • client can perform many operations on a chunk • reduces overhead to access a chunk • can establish a persistent TCP connection to a chunkserver • fewer metadata entries • metadata can be kept in memory (at the master) • in-memory data structures allow fast periodic scanning • some potential problems with fragmentation • Metadata • file and chunk namespaces (files and chunk identifiers) • file-to-chunk mappings • locations of a chunk’s replicas

  17. Chunk Locations and Logs • Chunk locations: • the master does not keep a persistent record of chunk locations • it polls chunkservers at startup, and uses heartbeat messages to monitor chunkservers: simplicity! • because of chunkserver failures, it is hard to keep a persistent record of chunk locations • on-demand approach vs. coordination • on-demand wins when changes (failures) are frequent • Operation logs • maintain a historical record of critical metadata changes • namespace and mappings • for reliability and consistency, replicate the operation log on multiple remote machines (“shadow masters”)

  18. Clients and APIs • GFS not transparent to clients • requires clients to perform certain “consistency” verification (using chunk id & version #), make snapshots (if needed), … • APIs: • open, delete, read, write (as expected) • append: at least once, possibly with gaps and/or inconsistencies among clients • snapshot: quickly create copy of file • Separation of data and control: • Issues control (metadata) requests to master server • Issues data requests directly to chunkservers • Caches metadata, but does no caching of data • no consistency difficulties among clients • streaming reads (read once) and append writes (write once) don’t benefit much from caching at client
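
  Because record append is at-least-once, applications typically frame records so readers can verify contents, skip padding, and drop duplicates. A minimal sketch of such a client-side convention (Python; the record format is an assumption, not part of GFS):

    import hashlib
    import struct

    MAGIC = b"REC1"

    def encode_record(payload: bytes, record_id: int) -> bytes:
        """Frame a record so readers can skip padding and drop duplicate appends."""
        body = struct.pack(">Q", record_id) + payload
        checksum = hashlib.md5(body).digest()
        return MAGIC + struct.pack(">I", len(body)) + checksum + body

    def decode_records(region: bytes):
        """Yield record payloads from a file region, skipping padding and duplicated appends."""
        seen, i = set(), 0
        while i + 24 <= len(region):                       # header = 4 magic + 4 length + 16 checksum
            if region[i:i + 4] != MAGIC:
                i += 1                                     # padding/garbage: resynchronize
                continue
            (length,) = struct.unpack(">I", region[i + 4:i + 8])
            checksum = region[i + 8:i + 24]
            body = region[i + 24:i + 24 + length]
            if len(body) == length and hashlib.md5(body).digest() == checksum:
                (record_id,) = struct.unpack(">Q", body[:8])
                if record_id not in seen:                  # drop records appended more than once
                    seen.add(record_id)
                    yield body[8:]
                i += 24 + length
            else:
                i += 1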

  19. System Interaction: Read • Client sends master: • read(file name, chunk index) • Master’s reply: • chunk ID, chunk version#, locations of replicas • Client sends “closest” chunkserver w/replica: • read(chunk ID, byte range) • “closest” determined by IP address on simple rack-based network topology • Chunkserver replies with data
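
  A minimal sketch of that read path from the client’s side (Python; master.lookup and the chunkserver read call are made-up stubs standing in for the real protocol):

    CHUNK_SIZE = 64 * 1024 * 1024

    def network_distance(addr):
        """Placeholder distance metric (assumption); the real system compares rack/IP topology."""
        return hash(addr) % 10

    def gfs_read(master, chunkservers, file_name, offset, length):
        """Illustrative GFS-style read: metadata from the master, data from a replica."""
        chunk_index = offset // CHUNK_SIZE                       # which chunk holds this offset

        # 1. Ask the master for the chunk ID, version, and replica locations.
        chunk_id, version, replicas = master.lookup(file_name, chunk_index)

        # 2. Pick the "closest" replica.
        closest = min(replicas, key=network_distance)

        # 3. Read the byte range directly from that chunkserver; no data flows through the master.
        byte_range = (offset % CHUNK_SIZE, offset % CHUNK_SIZE + length)
        return chunkservers[closest].read(chunk_id, version, byte_range)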

  20. System Interactions: Write and Record Append • Write and Record Append (atomic) • slightly different semantics: record append is “atomic” • The master grants a chunk lease to a chunkserver (the primary), and replies back to the client • Client first pushes data to all chunkservers • pushed linearly: each replica forwards as it receives • pipelined transfer: 13 MB/second with a 100 Mbps network • Then issues a write/append to the primary chunkserver • Primary chunkserver determines the order of updates to all replicas • in record append: the primary chunkserver checks whether the record append would exceed the maximum chunk size • if yes, pad the chunk (and ask secondaries to do the same), and then ask the client to append to the next chunk
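
  A minimal sketch of that write path (Python; all object names and methods here are made-up stubs, not the real GFS interfaces):

    def gfs_write(master, file_name, chunk_index, data):
        """Illustrative GFS-style write: push data to all replicas, then order it via the primary."""
        # The master returns replica handles; the one holding the lease is the primary.
        chunk_id, version, replicas, primary = master.lookup_with_lease(file_name, chunk_index)

        # 1. Data flow: push bytes along a chain of chunkservers, each forwarding as it receives
        #    (pipelined transfer, so total time is roughly one link's transfer time).
        data_id = replicas[0].push(data, forward_to=replicas[1:])

        # 2. Control flow: the primary applies the mutation in a serial order of its choosing
        #    and forwards that order to the secondary replicas.
        return primary.commit(chunk_id, version, data_id)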

  21. Leases and Mutation Order • Lease: • 60-second timeouts; can be extended indefinitely • extension requests are piggybacked on heartbeat messages • after a timeout expires, the master can grant new leases • Use leases to maintain a consistent mutation order across replicas • Master grants the lease to one of the replicas -> the primary • Primary picks a serial order for all mutations • Other replicas follow the primary’s order

  22. Consistency Model • Changes to the namespace (i.e., metadata) are atomic • done by the single master server! • Master uses its log to define a global total order of namespace-changing operations • Relaxed consistency • concurrent changes are consistent but “undefined” • “defined”: after a data mutation, the file region is consistent and all clients see the entire mutation • an append is atomically committed at least once • occasional duplications • All changes to a chunk are applied in the same order to all replicas • Use version numbers to detect missed updates

  23. Master Namespace Management & Logs • Namespace: files and their chunks • metadata maintained as “flat names”, no hard/symbolic links • full path name to metadata mapping • with prefix compression • Each node in the namespace has associated read-write lock (-> a total global order, no deadlock) • concurrent operations can be properly serialized by this locking mechanism • Metadata updates are logged • logs replicated on remote machines • take global snapshots (checkpoints) to truncate logs (but checkpoints can be created while updates arrive) • Recovery • Latest checkpoint + subsequent log files

  24. Replica Placement • Goals: • Maximize data reliability and availability • Maximize network bandwidth • Need to spread chunk replicas across machines and racks • Higher priority to replica chunks with lower replication factors • Limited resources spent on replication

  25. Other Operations • Locking operations • one lock per path, can modify a directory concurrently • to access /d1/d2/leaf, need to lock /d1, /d1/d2, and /d1/d2/leaf • each thread acquires: a read lock on a directory & a write lock on a file • totally ordered locking to prevent deadlocks • Garbage Collection: • simpler than eager deletion due to • unfinished replicated creation, lost deletion messages • deleted files are hidden for three days, then they are garbage collected • combined with other background (e.g., take snapshots) ops • safety net against accidents
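
  A small sketch of the path-locking discipline described above (Python; plain mutexes stand in for the master’s read-write locks):

    import threading

    class NamespaceLocks:
        """Per-path locks acquired in a fixed (sorted) order to prevent deadlock."""
        def __init__(self):
            self._locks = {}                    # full path -> threading.Lock
            self._guard = threading.Lock()

        def _lock_for(self, path):
            with self._guard:
                return self._locks.setdefault(path, threading.Lock())

        def ancestors_and_self(self, path):
            parts = path.strip("/").split("/")
            return ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]

        def acquire(self, path):
            # For /d1/d2/leaf this locks /d1, /d1/d2, and /d1/d2/leaf.  In GFS the ancestors get
            # read locks and the leaf a write lock; here one mutex per path stands in for both.
            ordered = sorted(self.ancestors_and_self(path))   # totally ordered acquisition
            for p in ordered:
                self._lock_for(p).acquire()
            return ordered

        def release(self, ordered):
            for p in reversed(ordered):
                self._locks[p].release()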

  26. Fault Tolerance and Diagnosis • Fast recovery • Master and chunkserver are designed to restore their states and start in seconds regardless of termination conditions • Chunk replication • Data integrity • A chunk is divided into 64-KB blocks • Each with its checksum • Verified at read and write times • Also background scans for rarely used data • Master replication • Shadow masters provide read-only access when the primary master is down
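
  A minimal sketch of per-block checksum verification (Python; CRC32 is an assumption here, the slide does not name a particular checksum):

    import zlib

    BLOCK_SIZE = 64 * 1024   # chunks are checksummed in 64-KB blocks

    def block_checksums(chunk_data: bytes):
        """Compute one checksum per 64-KB block of a chunk."""
        return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
                for i in range(0, len(chunk_data), BLOCK_SIZE)]

    def verified_read(chunk_data: bytes, checksums, offset: int, length: int) -> bytes:
        """Verify only the blocks overlapping the requested range before returning data."""
        first, last = offset // BLOCK_SIZE, (offset + length - 1) // BLOCK_SIZE
        for b in range(first, last + 1):
            block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
            if zlib.crc32(block) != checksums[b]:
                raise IOError("corrupt block %d: report to master, restore from a good replica" % b)
        return chunk_data[offset:offset + length]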

  27. GFS: Summary • GFS is a distributed file system that supports large-scale data processing workloads on commodity hardware • GFS sits at a different point in the design space • component failures as the norm • optimized for huge files • Success: used actively by Google to support its search service and other applications • But performance may not be good for all apps • assumes a read-once, write-once workload (no client caching!) • GFS provides fault tolerance • replicating data (via chunk replication), fast and automatic recovery • GFS has a simple, centralized master that does not become a bottleneck • Semantics not transparent to apps (“end-to-end” principle?) • must verify file contents to avoid inconsistent regions, repeated appends (at-least-once semantics)

  28. Google MapReduce • The problem • many simple operations at Google: grep over data, compute indices, compute summaries, etc. • but the input data is large, really large: the whole Web, billions of pages • Google has lots of machines (clusters of 10K, etc.) • many computations over VERY large datasets • the question is: how do you use a large # of machines efficiently? • Can reduce the computational model down to two steps • Map: take one operation, apply it to many, many data tuples • Reduce: take the results, aggregate them • MapReduce • a generalized interface for massively parallel cluster processing

  29. MapReduce Programming Model • Intuitively just like map/reduce in functional languages (Scheme, Lisp, Haskell, etc.) • Map: initial parallel computation • map (in_key, in_value) -> list(out_key, intermediate_value) • In: a set of key/value pairs • Out: a set of intermediate key/value pairs • Note: keys might change during Map • Reduce: aggregation of intermediate values by key • reduce (out_key, list(intermediate_value)) -> list(out_value) • combines all intermediate values for a particular key • produces a set of merged output values (usually just one)
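
  To make the implicit “group by intermediate key” (shuffle) step explicit, here is a toy, single-process driver for this model (a minimal sketch in Python; the function and variable names are my own, not Google’s):

    from collections import defaultdict

    def run_mapreduce(map_fn, reduce_fn, inputs):
        """Toy, single-process MapReduce driver (an illustration, not Google's implementation).
        map_fn(in_key, in_value) yields (out_key, intermediate_value) pairs;
        reduce_fn(out_key, values) returns the merged output for that key."""
        # Map phase: apply map_fn to every input record.
        intermediate = defaultdict(list)
        for in_key, in_value in inputs:
            for out_key, value in map_fn(in_key, in_value):
                intermediate[out_key].append(value)        # "shuffle": group by intermediate key
        # Reduce phase: combine all intermediate values for each key.
        return {out_key: reduce_fn(out_key, values)
                for out_key, values in intermediate.items()}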

  30. Example: Word Counting • Goal • Count # of occurrences of each word in many documents • Sample data • Page 1: the weather is good • Page 2: today is good • Page 3: good weather is good • So what does this look like in MapReduce?

    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));
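
  The slide’s pseudocode translates directly into ordinary Python; the snippet below runs it through the toy run_mapreduce driver sketched after slide 29, using the sample pages above:

    pages = [
        ("Page 1", "the weather is good"),
        ("Page 2", "today is good"),
        ("Page 3", "good weather is good"),
    ]

    def wc_map(doc_name, contents):
        for word in contents.split():        # EmitIntermediate(w, 1) for each word
            yield (word, 1)

    def wc_reduce(word, counts):
        return sum(counts)                   # Emit(result)

    print(run_mapreduce(wc_map, wc_reduce, pages))
    # -> {'the': 1, 'weather': 2, 'is': 3, 'good': 4, 'today': 1}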

  31. Map/Reduce in Action • Input: • Page 1: the weather is good • Page 2: today is good • Page 3: good weather is good • Map output: • Worker 1: (the 1), (weather 1), (is 1), (good 1) • Worker 2: (today 1), (is 1), (good 1) • Worker 3: (good 1), (weather 1), (is 1), (good 1) • Fed to reduce (grouped by key): • Worker 1: (the 1) • Worker 2: (is 1), (is 1), (is 1) • Worker 3: (weather 1), (weather 1) • Worker 4: (today 1) • Worker 5: (good 1), (good 1), (good 1), (good 1) • Reduce output: • Worker 1: (the 1) • Worker 2: (is 3) • Worker 3: (weather 2) • Worker 4: (today 1) • Worker 5: (good 4)

  32. Illustration

  33. More Examples • Distributed Grep • Map: emit line if it matches a given pattern P • Reduce: identity function, just copy the result to the output • Count of URL Access Frequency • Map: parses URL access logs, outputs <URL, 1> • Reduce: adds together counts for the same unique URL, outputs <URL, totalCount> • Reverse web-link graph: who links to this page? • Map: go through all source pages, generate all links <target, source> • Reduce: for each target, concatenate all source links <target, list(sources)> • Many more examples, see the paper [MapReduce, OSDI’04]
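
  As one concrete case, the reverse web-link graph fits the same toy run_mapreduce driver sketched earlier; the link data below is made up for illustration:

    links = [
        ("pageA.html", ["target1.html", "target2.html"]),
        ("pageB.html", ["target1.html"]),
    ]

    def reverse_map(source, targets):
        for target in targets:               # emit <target, source> for every outgoing link
            yield (target, source)

    def reverse_reduce(target, sources):
        return list(sources)                 # <target, list(sources)>

    print(run_mapreduce(reverse_map, reverse_reduce, links))
    # -> {'target1.html': ['pageA.html', 'pageB.html'], 'target2.html': ['pageA.html']}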

  34. MapReduce Architecture • [Figure] A single master node coordinating many “worker bees” (the machines running the map and reduce tasks).

  35. MapReduce Operation • [Figure] Initial data is split into 64 MB blocks • map workers compute, results are stored locally • the master is informed of the result locations • the master sends the data locations to the reduce workers • the final output is written

  36. What if Workers Die? • And you know they will… • Masters periodically ping workers • Still alive and working? Good… • If corpse found… • Allocate task to next idle worker (ruthless!) • If Map worker dies, need to recompute all its data, why? • If corpse comes back to life… (zombies!) • Give it a task, and clean slate • What if the Master dies? • Only 1 Master, he/she dies, the whole thing stops • Fairly rare occurrence

  37. What if You Find Stragglers? • Some workers can be slower than others • Faulty hardware • Software misconfiguration / bug • Whatever … • Near completion of task • Master looks at stragglers and their tasks • Assigns “backup” workers to also compute these tasks • Whoever finishes first wins! • Can now leave stragglers behind!
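
  A toy illustration of the backup-task idea (plain Python threads standing in for worker machines; the sleep times are made up):

    import concurrent.futures
    import random
    import time

    def run_with_backup(task, n_copies=2):
        """Run duplicate copies of a straggling task and take whichever finishes first;
        the remaining copies are simply abandoned (a sketch, not the real scheduler)."""
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=n_copies)
        futures = [pool.submit(task) for _ in range(n_copies)]
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        pool.shutdown(wait=False)            # do not wait for the stragglers
        return next(iter(done)).result()

    def flaky_map_task():
        time.sleep(random.choice([0.05, 3.0]))   # sometimes this "worker" is a straggler
        return "partition output"

    print(run_with_backup(flaky_map_task))   # the call returns as soon as the fastest copy finishes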


  39. Optional Materials: Google BigTable

  40. Google Bigtable • Distributed multi-level map • With an interesting data model • Fault-tolerant, persistent • Scalable • Thousands of servers • Terabytes of in-memory data • Petabyte of disk-based data • Millions of reads/writes per second, efficient scans • Self-managing • Servers can be added/removed dynamically • Servers adjust to load imbalance • Key points: • Data Model and Implementation Structure • Tablets, SSTables, compactions, locality groups, … • API and Details: shared logs, compression, replication, …

  41. Basic Data Model • Distributed multi-dimensional sparse map: (row, column, timestamp) -> cell contents • Good match for most of Google’s applications • [Figure] Example: row “www.cnn.com”, column “contents”, with cell versions (“<html>…”) stored at timestamps t1, t2, t3

  42. Rows • A row key is an arbitrary string • Typically 10-100 bytes in size, up to 64 KB. • Every read or write of data under a single row is atomic • Data is maintained in lexicographic order by row key • The row range for a table is dynamically partitioned • Each partition (row range) is named a tablet • Unit of distribution and load-balancing. • Objective: make read operations single-sited! • E.g., In Webtable, pages in the same domain are grouped together by reversing the hostname components of the URLs: com.google.maps instead of maps.google.com.
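
  A small illustration of the reversed-hostname row keys (plain Python; the example URLs are arbitrary):

    def webtable_row_key(url):
        """Reverse the hostname so pages from the same domain sort next to each other."""
        host, _, path = url.partition("/")
        return ".".join(reversed(host.split("."))) + ("/" + path if path else "")

    urls = ["maps.google.com/index.html", "www.google.com", "news.yahoo.com/world"]
    print(sorted(webtable_row_key(u) for u in urls))
    # -> ['com.google.maps/index.html', 'com.google.www', 'com.yahoo.news/world']
    # Both google.com rows are now adjacent, so they tend to land in the same tablet.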

  43. Columns • Columns have a two-level name structure: family:optional_qualifier, e.g., Language:English • Column family • a column family must be created before data can be stored under a column key • unit of access control • has associated type information • Qualifier gives unbounded columns • additional level of indexing, if desired • [Figure] Example: row “cnn.com” with columns “contents:” (“…”), “anchor:cnnsi.com”, and “anchor:stanford.edu” holding anchor text such as “CNN” and “CNN homepage”

  44. Locality Groups • column families can be assigned to a locality group • Used to organize underlying storage representation for performance • data in a locality group can be mapped in memory, and stored in SSTable • Avoid mingling data, e.g. page contents and page metadata • Can compress locality groups • Bloom Filters on SSTables in a locality group • avoid searching SSTable if bit not set • Tablet movement • Major compaction (with concurrent updates) • Minor compaction (to catch up with updates) without any concurrent updates • Load on new server without requiring any recovery action

  45. Timestamps (64 bit integers) • Used to store different versions of data in a cell • New writes default to current time, but timestamps for writes can also be set explicitly by clients • Assigned by: • Bigtable: real-time in microseconds, • client application: when unique timestamps are a necessity. • Items in a cell are stored in decreasing timestamp order • Application specifies how many versions (n) of data items are maintained in a cell. • Bigtable garbage collects obsolete versions • Lookup options: • “Return most recent K values” • “Return all values in timestamp range (or all values)” • Column families can be marked w/ attributes: • “Only retain most recent K values in a cell” • “Keep values until they are older than K seconds”
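
  A minimal sketch of a versioned cell with a “keep the most recent K versions” garbage-collection policy (Python, illustrative only; Bigtable also supports age-based policies):

    import time

    class Cell:
        """Illustrative versioned cell: values kept in decreasing timestamp order."""
        def __init__(self, max_versions=3):
            self.max_versions = max_versions
            self.versions = []                             # list of (timestamp, value), newest first

        def put(self, value, timestamp=None):
            ts = timestamp if timestamp is not None else int(time.time() * 1e6)  # microseconds
            self.versions.append((ts, value))
            self.versions.sort(key=lambda tv: -tv[0])      # keep decreasing timestamp order
            del self.versions[self.max_versions:]          # garbage-collect obsolete versions

        def get(self, k=1):
            """Return the most recent k values."""
            return [v for _, v in self.versions[:k]]

        def get_range(self, t_start, t_end):
            """Return all values with t_start <= timestamp <= t_end."""
            return [v for ts, v in self.versions if t_start <= ts <= t_end]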

  46. Tablets • Large tables broken into tablets at row boundaries • Tablet holds contiguous range of rows • Clients can often choose row keys to achieve locality • Aim for ~100MB to 200MB of data per tablet • Serving machine responsible for ~100 tablets • Fast recovery: • 100 machines each pick up 1 tablet from failed machine • Fine-grained load balancing • Migrate tablets away from overloaded machine • Master makes load-balancing decisions

  47. Tablets & Splitting • [Figure] A table with columns “language” and “contents”; rows (aaa.com; cnn.com: EN, “<html>…”; cnn.com/sports.html; …; Website.com; …; Zuppa.com/menu.html) are partitioned into tablets at row boundaries.

  48. Tablets & Splitting • [Figure] The same table after further growth: a tablet boundary now falls between Yahoo.com/kids.html and Yahoo.com/kids.html?D, splitting that row range across two tablets.

  49. Table, Tablet and SSTable • Multiple tablets make up the table • SSTables can be shared • Tablets do not overlap, SSTables can overlap • [Figure] Two tablets (one covering roughly aardvark..apple, the other apple_two_E..boat), each backed by several SSTables, with one SSTable shared between the two tablets.

  50. Tablet Representation • [Figure] A tablet consists of a random-access write buffer in memory and an append-only log on GFS (writes go to both), plus a set of SSTable files on GFS (mmap); reads consult the write buffer and the SSTables • SSTable: immutable on-disk ordered map from string -> string • String keys: <row, column, timestamp> triples
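
  A toy sketch of that tablet read/write path, with Python dicts standing in for the in-memory write buffer and the immutable SSTables (the GFS log is elided):

    class Tablet:
        """Illustrative tablet: an in-memory write buffer plus immutable on-disk maps."""
        def __init__(self, sstables):
            self.write_buffer = {}           # random-access, in memory (mutable)
            self.sstables = sstables         # list of immutable maps, newest first

        def write(self, key, value):
            # In the real system the mutation is first appended to the log on GFS.
            self.write_buffer[key] = value

        def read(self, key):
            # Reads consult the write buffer first, then the SSTables from newest to oldest.
            if key in self.write_buffer:
                return self.write_buffer[key]
            for sstable in self.sstables:
                if key in sstable:
                    return sstable[key]
            return None

        def minor_compaction(self):
            # Freeze the write buffer into a new (immutable) SSTable and start a fresh buffer.
            self.sstables.insert(0, dict(self.write_buffer))
            self.write_buffer = {}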
