

  1. The MapReduce Environment: Distributed File Systems, Overview of the DFS Ecology, MapReduce and Hadoop. Jeffrey D. Ullman, Stanford University

  2. Distributed File Systems: Chunking, Replication, Distribution on Racks

  3. Commodity Clusters • Datasets can be very large. • Tens to hundreds of terabytes. • Cannot process on a single server. • Standard architecture emerging: • Cluster of commodity Linux nodes (compute nodes). • Gigabit Ethernet interconnect. • How to organize computations on this architecture? • Mask issues such as hardware failure.

  4. Cluster Architecture. [Diagram: racks of 16-64 compute nodes, each node with its own CPU, memory, and disk; a switch per rack, giving 1 Gbps between any pair of nodes in a rack, and a 2-10 Gbps backbone between racks.]

  5. Stable Storage • First-order problem: if nodes can fail, how can we store data persistently? • Answer: Distributed File System. • Provides global file namespace. • Examples: Google GFS, Colossus; Hadoop HDFS. • Typical usage pattern: • Huge files. • Data is rarely updated in place. • Reads and appends are common.

  6. Distributed File System • Chunk Servers. • File is split into contiguous chunks, typically 64 MB. • Each chunk replicated (usually 2x or 3x). • Try to keep replicas in different racks. • Alternative: erasure coding. • Master Node for a file. • Stores metadata, location of all chunks. • Possibly replicated.
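
To make chunking and rack-aware replication concrete, here is a minimal Python sketch. The 64 MB chunk size comes from the slide; the placement policy, rack names, and function names are illustrative assumptions, not GFS or HDFS internals:

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB, the typical chunk size cited above

def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a file's bytes into contiguous chunks (the last may be short)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def place_replicas(chunk_id: int, racks: list, replication: int = 3):
    """Hypothetical rack-aware placement: the replicas of one chunk go to
    'replication' distinct racks (assumes replication <= len(racks))."""
    start = chunk_id % len(racks)
    return [racks[(start + k) % len(racks)] for k in range(replication)]

racks = ["rack-A", "rack-B", "rack-C", "rack-D"]
print(place_replicas(0, racks))   # ['rack-A', 'rack-B', 'rack-C']
```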

  7. Compute Nodes • Organized into racks. • Intra-rack connection typically gigabit speed. • Inter-rack connection faster by a small factor.

  8. [Diagram: file chunks spread across racks of compute nodes.]

  9. [Diagram: 3-way replication of files, with copies on different racks.]

  10. Above the DFS: MapReduce, Key-Value Stores, SQL Implementations

  11. The New Stack, from top to bottom: SQL implementations, e.g., PIG (relational algebra), HIVE; object store (key-value store), e.g., BigTable, HBase, Cassandra; MapReduce, e.g., Hadoop; Distributed File System.

  12. MapReduce Systems • MapReduce (Google) and open-source (Apache) equivalent Hadoop. • Important specialized parallel computing tool. • Cope with compute-node failures. • Avoid restart of the entire job.

  13. Key-Value Stores • BigTable (Google), HBase, Cassandra (Apache), Dynamo (Amazon). • Each row is a key plus values over a flexible set of columns. • Each column component can be a set of values. • Example: structure of the Web. • Key is a URL. • One column is a set of URLs – those linked to by the page represented by the key. • A second column is the set of URLs linking to the key.
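
As a concrete picture of the web-structure example, a hedged sketch of one row; the column names links_to and linked_by are invented for illustration:

```python
# One row of a hypothetical web-structure table: the key is a URL, and
# each column component is a set of values (here, sets of URLs).
row = {
    "key": "http://example.com/page",
    "links_to":  {"http://example.com/a", "http://example.com/b"},  # pages this URL links to
    "linked_by": {"http://other.org/x"},                            # pages linking to this URL
}
print(len(row["links_to"]), "outlinks;", len(row["linked_by"]), "inlink")
```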

  14. SQL-Like Systems • PIG – Yahoo! implementation of relational algebra. • Translates to a sequence of map-reduce operations, using Hadoop. • Hive – open-source (Apache) implementation of a restricted SQL, called QL, over Hadoop.

  15. SQL-Like Systems – (2) • Sawzall – Google implementation of parallel select + aggregation, but using C++. • Dremel – (Google) a real, though restricted, SQL over a column-oriented store. • F1 – (Google) row-oriented and conventional, but at massive scale. • Scope – Microsoft implementation of restricted SQL.

  16. MapReduce: Formal Definition, Implementation, Fault-Tolerance, Examples (Word Count, Join)

  17. MapReduce • Input: a set of key-value pairs. • User supplies two functions: • map(k,v) -> set(k1,v1) • reduce(k1, list(v1)) -> set(v2) • Technically, the input consists of key-value pairs of some type, but usually only the value is important. • (k1,v1) is an intermediate key-value pair. • Output is the set of (k1,v2) pairs.
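
Rendered as Python type hints, the two user-supplied functions look roughly like this (a sketch; the alias names MapFn and ReduceFn are made up):

```python
from typing import Callable, Iterable, TypeVar

K, V = TypeVar("K"), TypeVar("V")          # input key and value types
K1, V1, V2 = TypeVar("K1"), TypeVar("V1"), TypeVar("V2")

# map(k, v) -> set(k1, v1): one input pair to many intermediate pairs
MapFn = Callable[[K, V], Iterable[tuple[K1, V1]]]

# reduce(k1, list(v1)) -> set(v2): one key plus all its values to outputs
ReduceFn = Callable[[K1, list[V1]], Iterable[V2]]
```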

  18. Map Tasks and Reduce Tasks • MapReduce job = • Map function (inputs -> key-value pairs) + • Reduce function (key and list of values -> outputs). • Map and Reduce Tasks apply Map or Reduce function to (typically) many of their inputs. • Unit of parallelism.

  19. Behind the Scenes • The Map tasks generate key-value pairs. • Each takes one or more chunks of input from the distributed file system. • The system takes all the key-value pairs from all the Map tasks and sorts them by key. • Then, it forms key-(list-of-associated-values) pairs and passes each key-(value-list) pair to one of the Reduce tasks.
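
A single-process Python sketch of this map-sort-group-reduce pipeline; the function run_job is invented for illustration, and the real system distributes each step across many nodes and tolerates failures:

```python
from itertools import groupby
from operator import itemgetter

def run_job(map_fn, reduce_fn, inputs):
    # 1. Map tasks turn input pairs into intermediate key-value pairs.
    pairs = [kv for k, v in inputs for kv in map_fn(k, v)]
    # 2. The system sorts all intermediate pairs by key...
    pairs.sort(key=itemgetter(0))
    # 3. ...forms key-(list of values) pairs, and feeds each group
    #    to one Reduce call.
    results = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        results.extend(reduce_fn(key, [v for _, v in group]))
    return results
```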

  20. MapReduce Pattern. [Diagram: input from DFS flows into Map tasks, which emit "key"-value pairs; these are grouped by key and routed to Reduce tasks, whose output is written to DFS.]

  21. Example: Word Count • We have a large file of documents, which are sequences of words. • Count the number of times each distinct word appears in the file.

  22. Word Count Using MapReduce

map(key, value):
  // key: document name; value: text of the document
  FOR (each word w in value)
    emit(w, 1);

reduce(key, value-list):
  // key: a word; value-list: an iterator over counts
  result = 0;
  FOR (each count v on value-list)
    result += v;
  emit(key, result);
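
The same program as runnable Python; the whitespace tokenizer and the tiny inline driver are stand-ins for what the real system supplies:

```python
from collections import defaultdict

def map_fn(doc_name, text):
    for w in text.split():                 # emit (word, 1) for each occurrence
        yield (w, 1)

def reduce_fn(word, counts):
    yield (word, sum(counts))              # total occurrences of this word

# Minimal driver playing the role of the system's group-by-key step.
groups = defaultdict(list)
for name, text in [("d1", "the cat sat"), ("d2", "the cat ran")]:
    for k, v in map_fn(name, text):
        groups[k].append(v)
print([out for w in groups for out in reduce_fn(w, groups[w])])
# [('the', 2), ('cat', 2), ('sat', 1), ('ran', 1)]
```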

  23. Distributed Execution Overview. [Diagram: the User Program forks a Master and Worker processes; the Master assigns Map and Reduce tasks; Map workers read input chunks and write intermediate results to local disk; Reduce workers do remote reads and sorts, then write Output Files 0 and 1 to the DFS.]

  24. Data Management • Input and final output are stored in the distributed file system. • Scheduler tries to schedule Map tasks “close” to physical storage location of input data – preferably at the same node. • Intermediate results are stored on local file storage of Map and Reduce workers.

  25. The Master Task • Maintain task status: (idle, active, completed). • Idle tasks get scheduled as workers become available. • When a Map task completes, it sends the Master the location and sizes of its intermediate files, one for each Reduce task. • Master pushes location of intermediates to Reduce tasks. • Master pings workers periodically to detect failures.
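
A hedged sketch of that bookkeeping; the status values come from the slide, while the class and function names are invented:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    status: str = "idle"                 # idle -> active -> completed
    worker: str = ""                     # node currently running the task
    pending_inputs: list = field(default_factory=list)  # for Reduce tasks

def on_map_complete(map_task, file_locations, reduce_tasks):
    """A Map task reports one intermediate file per Reduce task; the Master
    records completion and pushes each location to the matching Reduce task."""
    map_task.status = "completed"
    for reduce_task, loc in zip(reduce_tasks, file_locations):
        reduce_task.pending_inputs.append(loc)

def on_ping_timeout(task):
    """A worker that stops answering pings is presumed dead; its task
    becomes idle again so it can be rescheduled elsewhere."""
    if task.status == "active":
        task.status, task.worker = "idle", ""
```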

  26. How Many Map and Reduce Tasks? • Rule of thumb: Use several times more Map tasks and Reduce tasks than the number of compute nodes available. • Minimizes skew caused by different tasks taking different amounts of time. • One DFS chunk per Map task is common.

  27. Combiners • Often a Map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k. • E.g., popular words in Word Count. • Can save communication time by applying Reduce function to values with the same key at the Map task. • Called a combiner. • Works only if Reduce function is commutative and associative.
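
A sketch of a combiner for Word Count: the (commutative and associative) Reduce function, here summation, is applied to map-side output before anything crosses the network. The function name combine is illustrative:

```python
from collections import defaultdict

def combine(map_output):
    """Apply the Reduce function (sum) per key at the Map task, shrinking
    e.g. [('the',1), ('the',1), ('cat',1)] to [('the',2), ('cat',1)]."""
    partial = defaultdict(int)
    for word, count in map_output:
        partial[word] += count
    return list(partial.items())

print(combine([("the", 1), ("the", 1), ("cat", 1)]))  # [('the', 2), ('cat', 1)]
```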

  28. Partition Function • We need to assure that records with the same intermediate key end up at the same Reduce task. • System uses a default partition function, e.g., hash(key) mod R, if there are R Reduce tasks. • Sometimes useful to override. • Example: hash(hostname(URL)) mod R ensures URLs from a host end up at the same Reduce task and therefore appear together in the output.
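
The two partition functions just mentioned, sketched in Python; Python's built-in hash() stands in for whatever hash the system actually uses (it is consistent within one run, which is all a job needs):

```python
from urllib.parse import urlparse

def default_partition(key, R):
    """Default: records with equal keys land on the same Reduce task."""
    return hash(key) % R

def host_partition(url, R):
    """Override: all URLs from one host go to the same Reduce task."""
    return hash(urlparse(url).hostname) % R

R = 8
print(host_partition("http://example.com/a", R) ==
      host_partition("http://example.com/b", R))   # True: same host, same task
```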

  29. Coping With Failures • MapReduce is designed to deal with compute nodes failing to execute a task. • Re-executes failed tasks, not whole jobs. • Failure modes: • Compute-node failure (e.g., disk crash). • Rack communication failure. • Software failures, e.g., a task requires Java n; node has Java n-1.

  30. Things MapReduce is Good At • Matrix-matrix and matrix-vector multiplication. • One step of the PageRank iteration was the original application. • Relational algebra operations. • We'll do an example of the join. • Many other "embarrassingly parallel" operations.
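
As a sketch of the matrix-vector case: Map pairs each matrix entry m_ij with v_j and keys the product by row i; Reduce sums each row. This assumes the vector v is small enough to ship to every Map task; the inline driver stands in for the system's grouping step:

```python
from collections import defaultdict

# Compute x = M v, with M given as sparse entries (i, j, m_ij).
def map_fn(entry, v):
    i, j, m_ij = entry
    yield (i, m_ij * v[j])          # key = row index, value = one term of the sum

def reduce_fn(i, terms):
    yield (i, sum(terms))           # x_i = sum over j of m_ij * v_j

M = [(0, 0, 2.0), (0, 1, 1.0), (1, 1, 3.0)]
v = [10.0, 100.0]

groups = defaultdict(list)
for entry in M:
    for k, val in map_fn(entry, v):
        groups[k].append(val)
print(sorted(kv for i in groups for kv in reduce_fn(i, groups[i])))
# [(0, 120.0), (1, 300.0)]
```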

  31. Review of Terminology • MapReduce job = • Map function (inputs -> key-value pairs) + • Reduce function (key and list of values -> outputs). • Map and Reduce Tasks apply Map or Reduce function to (typically) many of their inputs. • Unit of parallelism. • Mapper = application of the Map function to a single input. • Reducer = application of the Reduce function to a single key-(list of values) pair.

  32. Example: Natural Join • Join of R(A,B) with S(B,C) is the set of tuples (a,b,c) such that (a,b) is in R and (b,c) is in S. • Mappers need to send R(a,b) and S(b,c) to the same reducer, so they can be joined there. • Mapper output: key = B-value, value = relation and other component (A or C). • Example: R(1,2) -> (2, (R,1)); S(2,3) -> (2, (S,3)).

  33. Mapping Tuples. Mapper for R(1,2): R(1,2) -> (2, (R,1)). Mapper for R(4,2): R(4,2) -> (2, (R,4)). Mapper for S(2,3): S(2,3) -> (2, (S,3)). Mapper for S(5,6): S(5,6) -> (5, (S,6)).

  34. Grouping Phase • There is a reducer for each key. • Every key-value pair generated by any mapper is sent to the reducer for its key.

  35. Mapping Tuples. [Diagram: the pairs (2, (R,1)), (2, (R,4)), and (2, (S,3)) from the mappers for R(1,2), R(4,2), and S(2,3) all go to the reducer for B = 2; the pair (5, (S,6)) from the mapper for S(5,6) goes to the reducer for B = 5.]

  36. Constructing Value-Lists • The input to each reducer is organized by the system into a pair: • The key. • The list of values associated with that key.

  37. The Value-List Format. Reducer for B = 2 receives (2, [(R,1), (R,4), (S,3)]); reducer for B = 5 receives (5, [(S,6)]).

  38. The Reduce Function for Join • Given key b and a list of values that are either (R, ai) or (S, cj), output each triple (ai, b, cj). • Thus, the number of outputs made by a reducer is the product of the number of R’s on the list and the number of S’s on the list.

  39. Output of the Reducers. Reducer for B = 2 turns (2, [(R,1), (R,4), (S,3)]) into (1,2,3) and (4,2,3); reducer for B = 5 receives (5, [(S,6)]) and, having no R-values on its list, produces nothing.
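
Putting slides 32-39 together, a hedged single-process Python sketch of the natural join; the driver dict plays the role of the grouping phase, and the function names are invented:

```python
from collections import defaultdict

def map_tuple(relation, tup):
    if relation == "R":             # R(a, b): key on b, remember a
        a, b = tup
        yield (b, ("R", a))
    else:                           # S(b, c): key on b, remember c
        b, c = tup
        yield (b, ("S", c))

def reduce_join(b, values):
    r_vals = [x for tag, x in values if tag == "R"]
    s_vals = [x for tag, x in values if tag == "S"]
    for a in r_vals:                # one output per (R-value, S-value) pair
        for c in s_vals:
            yield (a, b, c)

groups = defaultdict(list)
for rel, tup in [("R", (1, 2)), ("R", (4, 2)), ("S", (2, 3)), ("S", (5, 6))]:
    for k, v in map_tuple(rel, tup):
        groups[k].append(v)

for b in sorted(groups):
    print(list(reduce_join(b, groups[b])))
# [(1, 2, 3), (4, 2, 3)]   <- reducer for B = 2
# []                       <- reducer for B = 5: no R-tuples, no output
```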
