Mass Data Processing Technology on Large Scale Clusters For the class of Advanced Computer Architecture All course material (slides, labs, etc.) is licensed under the Creative Commons Attribution 2.5 License. Many thanks to Aaron Kimball & Sierra Michels-Slettvet for their original version. Some slides are from the Internet.
Four Papers • Luiz Barroso, Jeffrey Dean, and Urs Hoelzle, Web Search for a Planet: The Google Cluster Architecture, IEEE Micro, 2003 • Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google File System, 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October 2003 • Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI'04: Sixth Symposium on Operating Systems Design and Implementation, San Francisco, CA, December 2004 • Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, Bigtable: A Distributed Storage System for Structured Data, OSDI'06: Seventh Symposium on Operating Systems Design and Implementation, Seattle, WA, November 2006
Computer Speedup Why does single-processor speedup slow down here? And how can we keep improving performance? Moore’s Law: “The density of transistors on a chip doubles every 18 months, for the same cost” (1965) Image: Tom’s Hardware
Distributed Problems • Rendering multiple frames of high-quality animation Image: DreamWorks Animation
Distributed Problems • Simulating several hundred or thousand characters Happy Feet © Kingdom Feature Productions; Lord of the Rings © New Line Cinema
Distributed Problems • Indexing the web (Google) • Simulating an Internet-sized network for networking experiments (PlanetLab) • Speeding up content delivery (Akamai) What is the key attribute that all these examples have in common?
PlanetLab PlanetLab is a global research network that supports the development of new network services. PlanetLab currently consists of 809 nodes at 401 sites.
Parallel vs. Distributed • Parallel computing can mean: • Vector processing of data (SIMD) • Multiple CPUs in a single computer (MIMD) • Distributed computing is multiple CPUs across many computers (MIMD)
A Brief History… 1975-85 • Parallel computing was favored in the early years • Primarily vector-based at first • Gradually more thread-based parallelism was introduced Cray 2 supercomputer (Wikipedia)
A Brief History… 1985-95 • “Massively parallel architectures” start rising in prominence • Message Passing Interface (MPI) and other libraries developed • Bandwidth was a big problem
A Brief History… 1995-Today • Cluster/grid architecture increasingly dominant • Specialized node machines eschewed in favor of commodity off-the-shelf (COTS) technologies • Web-wide cluster software • Companies like Google take this to the extreme (10,000-node clusters)
Distributed System Concepts • Multi-threaded programs • Synchronization • Semaphores, Condition Variables, Barriers • Network Concepts • TCP/IP, Sockets, Ports • RPC, Remote Invocation, RMI • Synchronous, Asynchronous, Non-Blocking • Transaction Processing Systems • P2P, Grid
Semaphores • A semaphore is a flag that can be raised or lowered in a single atomic step • The name comes from the signal flags railroad engineers would check before entering a shared track Only one side of the semaphore can ever show “go” at a time! (Can both show “stop”?)
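The flag analogy above corresponds to a counting semaphore with a single permit. As a minimal sketch (not from the original slides; the class and thread names are made up for illustration), the following Java program uses java.util.concurrent.Semaphore to let only one "train" thread onto the shared track at a time.

```java
import java.util.concurrent.Semaphore;

// Minimal sketch: a binary semaphore (one permit) guarding a shared "track".
// Only one thread may hold the permit at a time; everyone else waits.
public class TrackDemo {
    private static final Semaphore track = new Semaphore(1);

    public static void main(String[] args) {
        Runnable train = () -> {
            try {
                track.acquire();                       // claim the track (may block)
                try {
                    System.out.println(Thread.currentThread().getName() + " is on the track");
                    Thread.sleep(100);                 // simulate crossing the track
                } finally {
                    track.release();                   // free the track for the other train
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        new Thread(train, "train-1").start();
        new Thread(train, "train-2").start();
    }
}
```

A Semaphore created with more than one permit generalizes this to "at most N threads in the critical region at once."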
Barriers • A barrier knows in advance how many threads it should wait for. Threads “register” with the barrier when they reach it, and fall asleep. • Barrier wakes up all registered threads when total count is correct • Pitfall: What happens if a thread takes a long time? Barrier
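The barrier described above maps directly onto Java's CyclicBarrier, which is told the expected thread count up front. The sketch below (worker names and timings are illustrative, not from the slides) starts three workers that arrive at different times; none proceeds until the last one calls await(), which also illustrates the pitfall that one slow thread stalls everyone.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Minimal sketch: three workers "register" at the barrier and sleep
// until the last one arrives; then all are released together.
public class BarrierDemo {
    public static void main(String[] args) {
        final int parties = 3;
        CyclicBarrier barrier = new CyclicBarrier(parties,
                () -> System.out.println("All " + parties + " threads arrived; releasing"));

        for (int i = 0; i < parties; i++) {
            final int id = i;
            new Thread(() -> {
                try {
                    Thread.sleep(100L * id);     // threads reach the barrier at different times
                    System.out.println("worker-" + id + " waiting at barrier");
                    barrier.await();             // blocks until all parties have called await()
                    System.out.println("worker-" + id + " released");
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }
}
```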
The Google Infrastructure • >200,000 commodity Linux servers; • Storage capacity >5 petabytes; • Indexed >8 billion web pages; • Capital and operating costs at a fraction of those of large-scale commercial servers; • Traffic growth 20-30%/month.
Dimensions of a Google Cluster • 359 racks • 31,654 machines • 63,184 CPUs • 126,368 GHz of processing power • 63,184 GB of RAM • 2,527 TB of hard drive space • Approx. 40 million searches/day
Architecture for Reliability • Replication (3x +) for redundancy; • Replication for proximity and response; • Fault tolerant software for cheap hardware. • Policy: Reliability through software architecture, not hardware.
Query Serving Infrastructure • Processing a query may engage 1000+ servers; • Index servers manage the distributed index files; • Document servers access the distributed document data; • Response time < 0.25 seconds from anywhere.
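The slide describes a scatter/gather pattern: one query fans out to many index-server shards in parallel, and the partial results are merged before documents are fetched. The Java sketch below is only a rough illustration of that pattern; the shard names, the queryShard stub, and the thread pool are assumptions for illustration, not Google's actual serving code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

// Rough sketch of scatter/gather query serving (all names hypothetical):
// the query is sent to every index shard in parallel and the partial
// result lists are merged before document servers would be consulted.
public class ScatterGather {
    static List<String> queryShard(String shard, String query) {
        // Placeholder for an RPC to one index server; returns matching doc ids.
        return List.of(shard + ":doc1", shard + ":doc2");
    }

    public static void main(String[] args) throws Exception {
        List<String> shards = List.of("shard-0", "shard-1", "shard-2");
        String query = "large scale clusters";

        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        List<Future<List<String>>> partials = new ArrayList<>();
        for (String shard : shards) {
            partials.add(pool.submit(() -> queryShard(shard, query)));   // scatter
        }

        List<String> merged = new ArrayList<>();
        for (Future<List<String>> f : partials) {
            merged.addAll(f.get());                                      // gather
        }
        pool.shutdown();
        System.out.println("merged hits: " + merged);
        // A real system would now fetch titles/snippets from document servers.
    }
}
```

The overall latency is dominated by the slowest shard, which is one reason replication for proximity and response matters in the previous slide.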
Systems Engineering Principles • Overwhelm problems with computational power; • Impose standard file management; • Manage through standard job scheduling; • Apply simplified data processing discipline.
Scalable Engineering Infrastructure • Goal: Create a very large scale, high performance computing infrastructure • Hardware + software systems that make it easy to build products • Focus on price/performance and ease of use • Enables better products • Allows rapid experimentation with large data sets using very simple programs, so algorithms can be innovated and evolved against real-world data • Scalable serving capacity • Designed to run on lots of cheap, failure-prone hardware • If a service gets a lot of traffic, you simply add servers and bandwidth • Every engineer builds software that scales, monitors itself, and recovers, from the ground up • The net result: every service and every reusable component embodies these properties, and when something succeeds, it has room to fly • Google • GFS, MapReduce and Bigtable are the fundamental building blocks • indices containing more documents • updated more often • faster queries • faster product development cycles • …
Rethinking Development Practices • Build on your own API • Develop the APIs first • Build your own application using the APIs – you know it works! • Decide which of these to expose to external developers • Sampling and Testing • Release early and iterate • Continuous user feedback • Public Beta • Open to all – not just a limited set of users • Potentially years of beta – not a fixed timeline
File Systems Overview • System that permanently stores data • Usually layered on top of a lower-level physical storage medium • Divided into logical units called “files” • Addressable by a filename (“foo.txt”) • Usually supports hierarchical nesting (directories) • A file path joins file & directory names into a relative or absolute address to identify a file (“/home/aaron/foo.txt”)
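A small illustration of the naming concepts above, using Java's java.nio.file API; the paths reuse the slide's /home/aaron/foo.txt example.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Minimal illustration of file names, directories, and relative vs. absolute paths.
public class PathDemo {
    public static void main(String[] args) {
        Path absolute = Paths.get("/home/aaron/foo.txt");    // absolute address of the file
        Path relative = Paths.get("aaron", "foo.txt");        // relative to some base directory

        System.out.println(absolute.getFileName());            // foo.txt
        System.out.println(absolute.getParent());              // /home/aaron
        System.out.println(relative.isAbsolute());             // false
        System.out.println(Paths.get("/home").resolve(relative)); // /home/aaron/foo.txt
    }
}
```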
What Gets Stored • User data itself is the bulk of the file system's contents • Also includes meta-data on a drive-wide and per-file basis: • Drive-wide: available space, formatting info, character set, … • Per-file: name, owner, modification date, physical layout, …
High-Level Organization • Files are organized in a “tree” structure made of nested directories • One directory acts as the “root” • “Links” (symlinks, shortcuts, etc.) provide a simple means of giving one file multiple access paths • Other file systems can be “mounted” and dropped in as sub-hierarchies (other drives, network shares)
Low-Level Organization (1/2) • File data and meta-data stored separately • File descriptors + meta-data stored in inodes • Large tree or table at designated location on disk • Tells how to look up file contents • Meta-data may be replicated to increase system reliability
Low-Level Organization (2/2) • “Standard” read-write medium is a hard drive (other media: CDROM, tape, ...) • Viewed as a sequential array of blocks • Must address ~1 KB chunk at a time • Tree structure is “flattened” into blocks • Overlapping reads/writes/deletes can cause fragmentation: files are often not stored with a linear layout • inodes store all block numbers related to file
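As a toy illustration of the last point, the sketch below models an inode that keeps the list of ~1 KB block numbers for one file and translates a byte offset into the disk block holding it. The field names and block size are assumptions for illustration, and only direct block pointers are shown; real file systems also use indirect blocks.

```java
// Toy sketch of an inode with direct block pointers only (real file
// systems also use indirect blocks); fields and block size are assumed.
public class Inode {
    static final int BLOCK_SIZE = 1024;          // ~1 KB blocks, as on the slide

    String owner;
    long   sizeBytes;
    long   modificationTime;
    int[]  blockNumbers;                         // disk block holding each 1 KB chunk of the file

    // Translate a byte offset within the file into the disk block that holds it.
    int blockFor(long offset) {
        if (offset < 0 || offset >= sizeBytes) {
            throw new IllegalArgumentException("offset outside file");
        }
        return blockNumbers[(int) (offset / BLOCK_SIZE)];
    }
}
```

Because the block numbers need not be contiguous, a fragmented file simply has a scattered blockNumbers array, which is why sequential reads slow down after heavy overlapping writes and deletes.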
Design Considerations • Smaller block size reduces the amount of wasted space • Larger block size increases the speed of sequential reads (may not help random access) • Should the file system be faster or more reliable? • But faster at what: Large files? Small files? Lots of reading? Frequent writers, occasional readers?
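A quick back-of-the-envelope illustration of that trade-off: with larger blocks a small file wastes more space in its final block (internal fragmentation), but needs fewer blocks to read sequentially. The file and block sizes below are arbitrary assumptions, not figures from the slides.

```java
// Back-of-the-envelope illustration of the block-size trade-off
// (file size and block sizes are assumed for illustration).
public class BlockSizeTradeoff {
    public static void main(String[] args) {
        long fileSize = 10_500;                                     // a ~10 KB file
        for (int blockSize : new int[] {1024, 4096, 65536}) {
            long blocks = (fileSize + blockSize - 1) / blockSize;   // blocks needed to store it
            long wasted = blocks * blockSize - fileSize;            // internal fragmentation
            System.out.printf("block=%6d B  blocks=%3d  wasted=%6d B%n",
                              blockSize, blocks, wasted);
        }
    }
}
```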
Distributed Filesystems • Support access to files on remote servers • Must support concurrency • Make varying guarantees about locking, who “wins” with concurrent writes, etc... • Must gracefully handle dropped connections • Can offer support for replication and local caching • Different implementations sit in different places on complexity/feature scale
NFS • First developed in the 1980s by Sun Microsystems • Presents a standard UNIX file-system interface • Network drives are mounted into the local directory hierarchy