Mass Data Processing Technology on Large Scale Clusters

For the class of Advanced Computer Architecture. All course material (slides, labs, etc.) is licensed under the Creative Commons Attribution 2.5 License. Many thanks to Aaron Kimball & Sierra Michels-Slettvet for their original version. Some slides are from the Internet.

Outline

Four Papers • Luiz Barroso, Jeffrey Dean, and Urs Hoelzle, Web Search for a Planet: The Google Cluster Architecture, IEEE Micro, 2003. • Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google File System, 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October, 2003. • Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI'04: Sixth Symposium on Operating Systems Design and Implementation, San Francisco, CA, December, 2004. • Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, Bigtable: A Distributed Storage System for Structured Data, OSDI'06: Seventh Symposium on Operating Systems Design and Implementation, Seattle, WA, November, 2006.

Introduction to Distributed Systems

Computer Speedup Why does the speedup slow down here? How, then, can performance be improved? Moore’s Law: “The density of transistors on a chip doubles every 18 months, for the same cost” (1965) Image: Tom’s Hardware

Scope of Problems

Distributed Problems • Rendering multiple frames of high-quality animation Image: DreamWorks Animation

Distributed Problems • Simulating several hundred or thousand characters Happy Feet © Kingdom Feature Productions; Lord of the Rings © New Line Cinema

Distributed Problems • Indexing the web (Google) • Simulating an Internet-sized network for networking experiments (PlanetLab) • Speeding up content delivery (Akamai) What is the key attribute that all these examples have in common?

PlanetLab PlanetLab is a global research network that supports the development of new network services. PlanetLab currently consists of 809 nodes at 401 sites.

CDN - Akamai

Parallel vs. Distributed • Parallel computing can mean: • Vector processing of data (SIMD) • Multiple CPUs in a single computer (MIMD) • Distributed computing is multiple CPUs across many computers (MIMD)

A Brief History… 1975-85 • Parallel computing was favored in the early years • Primarily vector-based at first • Gradually more thread-based parallelism was introduced Cray 2 supercomputer (Wikipedia)

A Brief History… 1985-95 • “Massively parallel architectures” start rising in prominence • Message Passing Interface (MPI) and other libraries developed • Bandwidth was a big problem

A Brief History… 1995-Today • Cluster/grid architecture increasingly dominant • Special node machines eschewed in favor of COTS technologies • Web-wide cluster software • Companies like Google take this to the extreme (10,000 node clusters)

Top 500, Architecture

Top 500 Trends

Distributed System Concepts • Multi-Thread Program • Synchronization • Semaphores, Condition Variables, Barriers • Network Concepts • TCP/IP, Sockets, Ports • RPC, Remote Invocation, RMI • Synchronous, Asynchronous, Non-Blocking • Transaction Processing System • P2P, Grid

Semaphores • A semaphore is a flag that can be raised or lowered in one step • Semaphores were flags that railroad engineers would use when entering a shared track • Only one side of the semaphore can ever be red! (Can both be green?)
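The shared-track picture can be sketched with a binary semaphore. This is only an illustrative sketch: the names `run_trains`, `track`, and `stats` are made up for the example, and each thread stands in for one train.

```python
import threading

def run_trains(n_trains=8):
    """Let n_trains threads cross a single shared track, one at a time."""
    track = threading.Semaphore(1)   # binary semaphore: one train on the track
    stats_lock = threading.Lock()    # protects the bookkeeping dict only
    stats = {"on_track": 0, "max_on_track": 0}

    def train():
        with track:                  # acquire: raise the flag in one atomic step
            with stats_lock:
                stats["on_track"] += 1
                stats["max_on_track"] = max(stats["max_on_track"],
                                            stats["on_track"])
            # ... the train crosses the shared track here ...
            with stats_lock:
                stats["on_track"] -= 1
        # leaving the with-block releases the semaphore (lowers the flag)

    threads = [threading.Thread(target=train) for _ in range(n_trains)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stats
```

Because the semaphore is created with a count of 1, the recorded maximum occupancy of the track never exceeds one, however many trains are started.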

Barriers • A barrier knows in advance how many threads it should wait for. Threads “register” with the barrier when they reach it, and fall asleep. • Barrier wakes up all registered threads when total count is correct • Pitfall: What happens if a thread takes a long time?
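A minimal sketch of this behavior using Python's standard `threading.Barrier`. The two-phase worker and the function name `run_phases` are assumptions made for illustration. The `timeout` argument is one way to deal with the pitfall: if a thread takes too long to arrive, the waiting threads get a `BrokenBarrierError` instead of sleeping forever.

```python
import threading

def run_phases(n_workers=4):
    """Run two phases; no worker starts phase 2 until all finish phase 1."""
    barrier = threading.Barrier(n_workers)  # thread count is fixed in advance
    log = []
    log_lock = threading.Lock()

    def worker(i):
        with log_lock:
            log.append(("phase1", i))
        # Register with the barrier and sleep until all n_workers have arrived.
        # With a timeout, a straggler breaks the barrier rather than hanging it.
        barrier.wait(timeout=5)
        with log_lock:
            log.append(("phase2", i))

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

In the returned log, every `phase1` entry precedes every `phase2` entry, since the barrier releases nobody until the last worker has arrived.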

Synchronous RPC

Asynchronous RPC

Asynchronous RPC 2: Callbacks
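The three calling styles named on the preceding slides can be sketched as follows. This is not a real RPC stack: `remote_lookup` is a hypothetical local function standing in for a remote procedure, and a thread pool from the standard `concurrent.futures` module plays the role of the network.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def remote_lookup(key):
    """Stand-in for a remote procedure; a real RPC would cross the network."""
    time.sleep(0.01)          # pretend network and server latency
    return key.upper()

def demo():
    results = {}

    # 1. Synchronous: the caller blocks until the reply arrives.
    results["sync"] = remote_lookup("a")

    with ThreadPoolExecutor(max_workers=2) as pool:
        # 2. Asynchronous: submit() returns a Future immediately; the caller
        #    does other work and collects the reply later with result().
        future = pool.submit(remote_lookup, "b")
        _other_work = sum(range(10))
        results["async"] = future.result()

        # 3. Callback: register a function to run once the reply is ready.
        cb_future = pool.submit(remote_lookup, "c")
        cb_future.add_done_callback(
            lambda f: results.__setitem__("callback", f.result()))
    # Leaving the with-block waits for outstanding calls to finish.
    return results
```

The callback style frees the caller from ever polling or blocking on the reply, at the cost of the reply being handled on another thread.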

Google Infrastructure

Early Google System

Spring 2000 Design

Late 2000 Design

Spring 2001 Design

Empty Google Cluster

Three Days Later…

A Picture is Worth…

The Google Infrastructure • >200,000 commodity Linux servers; • Storage capacity >5 petabytes; • Indexed >8 billion web pages; • Capital and operating costs at fraction of large scale commercial servers; • Traffic growth 20-30%/month.

Dimensions of a Google Cluster • 359 racks • 31,654 machines • 63,184 CPUs • 126,368 GHz of processing power • 63,184 GB of RAM • 2,527 TB of hard drive space • Approx. 40 million searches/day
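Dividing these totals gives a feel for the individual machine. The sketch below is plain arithmetic on the slide's numbers, assuming the RAM and disk figures are gigabytes and terabytes (decimal) and that searches are spread evenly over the day. The variable names are illustrative only.

```python
# Totals taken from the slide.
racks, machines, cpus = 359, 31_654, 63_184
total_ghz = 126_368
ram_gb = 63_184
disk_tb = 2_527
searches_per_day = 40_000_000

# Per-unit averages.
machines_per_rack = machines / racks
cpus_per_machine = cpus / machines
ghz_per_cpu = total_ghz / cpus
ram_gb_per_machine = ram_gb / machines
disk_gb_per_machine = disk_tb * 1000 / machines       # assumes 1 TB = 1000 GB
searches_per_second = searches_per_day / (24 * 60 * 60)

for name, value in [
    ("machines per rack", machines_per_rack),
    ("CPUs per machine", cpus_per_machine),
    ("GHz per CPU", ghz_per_cpu),
    ("GB of RAM per machine", ram_gb_per_machine),
    ("GB of disk per machine", disk_gb_per_machine),
    ("searches per second", searches_per_second),
]:
    print(f"{name}: {value:.2f}")
```

The averages come out to about 88 machines per rack, two 2 GHz CPUs and about 2 GB of RAM per machine, and roughly 80 GB of disk per machine, serving about 460 searches per second. This is commodity hardware, in line with the policy of reliability through software rather than hardware.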

Architecture for Reliability • Replication (3x +) for redundancy; • Replication for proximity and response; • Fault tolerant software for cheap hardware. • Policy: Reliability through software architecture, not hardware.

Query Serving Infrastructure • Processing a query may engage 1000+ servers; • Index Servers manage distributed files; • Document Servers access distributed data; • Response time < 0.25 seconds anywhere.

Systems Engineering Principles • Overwhelm problems with computational power; • Impose standard file management; • Manage through standard job scheduling; • Apply simplified data processing discipline.

Scalable Engineering Infrastructure • Goal: Create very large scale, high performance computing infrastructure • Hardware + software systems to make it easy to build products • Focus on price/performance, and ease of use • Enables better products • Allows rapid experimentation on large data sets with very simple programs, so algorithms can be innovated and evolved with real-world data • Scalable serving capacity • Designed to run on lots of cheap, failure-prone hardware • If a service gets a lot of traffic, you simply add servers and bandwidth • Every engineer creates software that scales, monitors itself, and recovers, from the ground up • The net result is that every service and every reusable component embodies these properties, and when something succeeds, it has room to fly • Google • GFS, MapReduce and Bigtable are the fundamental building blocks • indices containing more documents • updated more often • faster queries • faster product development cycles • …

Rethinking Development Practices • Build on your own API • Develop the APIs first • Build your own application using the APIs – you know it works! • Decide which of these to expose to external developers • Sampling and Testing • Release early and iterate • Continuous User Feedback • Public Beta • Open to all – not to a limited set of users • Potentially years of beta – not a fixed timeline

Distributed File Systems and The Google File System

Outline

File Systems Overview • System that permanently stores data • Usually layered on top of a lower-level physical storage medium • Divided into logical units called “files” • Addressable by a filename (“foo.txt”) • Usually supports hierarchical nesting (directories) • A file path joins file & directory names into a relative or absolute address to identify a file (“/home/aaron/foo.txt”)
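How a path joins directory and file names into an absolute or relative address can be shown with Python's standard `pathlib`. The names come from the slide's own example. `PurePosixPath` only manipulates the path text and never touches a real disk.

```python
from pathlib import PurePosixPath

# Join directory names and a filename into an absolute path.
home = PurePosixPath("/home") / "aaron"
absolute = home / "foo.txt"

# The same file addressed relative to /home.
relative = PurePosixPath("aaron") / "foo.txt"

print(absolute)                  # the full address from the root directory
print(absolute.parts)            # root, nested directories, then the filename
print(absolute.is_absolute(), relative.is_absolute())
print(absolute.relative_to("/home") == relative)
```

An absolute path starts at the root directory, while a relative path only identifies a file once a starting directory is known.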

What Gets Stored • User data itself is the bulk of the file system's contents • Also includes meta-data on a drive-wide and per-file basis: • Drive-wide: available space, formatting info, character set, ... • Per-file: name, owner, modification date, physical layout, ...

High-Level Organization • Files are organized in a “tree” structure made of nested directories • One directory acts as the “root” • “links” (symlinks, shortcuts, etc) provide simple means of providing multiple access paths to one file • Other file systems can be “mounted” and dropped in as sub-hierarchies (other drives, network shares)

Low-Level Organization (1/2) • File data and meta-data stored separately • File descriptors + meta-data stored in inodes • Large tree or table at designated location on disk • Tells how to look up file contents • Meta-data may be replicated to increase system reliability

Low-Level Organization (2/2) • “Standard” read-write medium is a hard drive (other media: CDROM, tape, ...) • Viewed as a sequential array of blocks • Must address ~1 KB chunk at a time • Tree structure is “flattened” into blocks • Overlapping reads/writes/deletes can cause fragmentation: files are often not stored with a linear layout • inodes store all block numbers related to file
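The relationship between inodes, blocks, and fragmentation can be illustrated with a toy model. Everything here (`ToyDisk`, `Inode`, the first-free-block allocation policy) is a simplifying assumption for the sketch and does not describe how any particular file system works.

```python
BLOCK_SIZE = 1024  # bytes; the "~1 KB chunk" addressed at a time

class ToyDisk:
    """A disk viewed as a sequential array of fixed-size blocks."""
    def __init__(self, n_blocks):
        self.blocks = [None] * n_blocks      # None marks a free block

    def allocate(self, chunk):
        i = self.blocks.index(None)          # first free block (ValueError if full)
        self.blocks[i] = chunk
        return i

class Inode:
    """Per-file meta-data: name, size, and the block numbers holding the data."""
    def __init__(self, name, size):
        self.name, self.size, self.block_numbers = name, size, []

def write_file(disk, name, data):
    inode = Inode(name, len(data))
    for off in range(0, len(data), BLOCK_SIZE):
        inode.block_numbers.append(disk.allocate(data[off:off + BLOCK_SIZE]))
    return inode

def read_file(disk, inode):
    # The inode tells us which blocks to fetch, and in what order.
    return b"".join(disk.blocks[i] for i in inode.block_numbers)

def delete_file(disk, inode):
    for i in inode.block_numbers:
        disk.blocks[i] = None
    inode.block_numbers = []

disk = ToyDisk(8)
a = write_file(disk, "a", b"A" * 2 * BLOCK_SIZE)   # takes blocks 0 and 1
b = write_file(disk, "b", b"B" * BLOCK_SIZE)       # takes block 2
c = write_file(disk, "c", b"C" * BLOCK_SIZE)       # takes block 3
delete_file(disk, b)                               # frees block 2
d_data = b"D" * (BLOCK_SIZE + 10)
d = write_file(disk, "d", d_data)                  # fills the hole, then block 4
```

After the delete, file `d` ends up in blocks 2 and 4 with file `c` sitting between them: a fragmented, non-linear layout. The file can still be read back correctly, because the inode records every block number in order.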

Fragmentation

Design Considerations • Smaller inode size reduces amount of wasted space • Larger inode size increases speed of sequential reads (may not help random access) • Should the file system be faster or more reliable? • But faster at what: Large files? Small files? Lots of reading? Frequent writers, occasional readers?

Distributed Filesystems • Support access to files on remote servers • Must support concurrency • Make varying guarantees about locking, who “wins” with concurrent writes, etc... • Must gracefully handle dropped connections • Can offer support for replication and local caching • Different implementations sit in different places on complexity/feature scale

NFS • First developed in 1980s by Sun • Presented with standard UNIX FS interface • Network drives are mounted into local directory hierarchy
