Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters. Hung-chih Yang 1, Ali Dasdan 1, Ruey-Lung Hsiao 2, D. Stott Parker 2. Yahoo! 1; Computer Science Department, UCLA 2. SIGMOD 2007, Beijing, China. Presented by Jongheum Yeon, 2009. 08. 13.

Outline • Introduction • Map-Reduce • Map-Reduce-Merge • Conclusions

Introduction • New data-processing systems should consider alternatives to big, traditional databases • Map-Reduce does a good job, in a limited context, with extraordinary simplicity • Map-Reduce-Merge tries to extend the applicability of Map-Reduce without giving up too much of that simplicity

Introduction (cont’d) [Figure: comparison of data-processing software stacks, layer by layer] • Application: SQL; Sawzall; ≈SQL; LINQ, SQL • Language: Parallel Databases; Sawzall; Pig, Hive; DryadLINQ, Scope • Execution: Map-Reduce; Hadoop; Dryad • Storage: GFS, BigTable; HDFS, S3; Cosmos, Azure, SQL Server

Map-Reduce : Motivation • Many special-purpose tasks operate on and produce large amounts of data • Crawled documents, web requests, etc. • Inverted indices, summaries, other kinds of derived data • The work must be distributed across a large number of machines to finish in a reasonable time • Parallelize the computation • Distribute the data • These extra concerns obscure the original computation

Map-Reduce : Benefits • Automatic parallelization and distribution • User code complexity and size reduced • Transparent fault-tolerance • I/O scheduling • Fine grained partitioning of tasks • Dynamically scheduled on available workers • Status and monitoring

Map-Reduce : Programming Model • Input & Output: each a set of key/value pairs • Programmer specifies two functions: • map (in_key, in_value) -> list (out_key, intermediate_value) • Processes input key/value pair • Produces set of intermediate pairs • reduce (out_key, list(intermediate_value)) -> list (out_value) • Produces a set of merged output values (usually just one)
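The two user-supplied functions can be sketched with a small in-memory driver. This is only an illustration of the programming model, not a distributed implementation; the word-count task, the driver name `run_map_reduce`, and the sample documents are assumptions made for the example.

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    """map (in_key, in_value) -> list (out_key, intermediate_value).
    Here in_key is a document name (unused) and in_value is its text."""
    return [(word, 1) for word in in_value.split()]

def reduce_fn(out_key, intermediate_values):
    """reduce (out_key, list(intermediate_value)) -> list (out_value).
    Usually emits just one merged value per key."""
    return [sum(intermediate_values)]

def run_map_reduce(inputs, map_fn, reduce_fn):
    """Hypothetical single-process driver: map every input pair,
    group intermediate values by key, then reduce each group."""
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, value in map_fn(in_key, in_value):
            groups[out_key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

# Illustrative input: (document name, document text) pairs.
counts = run_map_reduce([("doc1", "a b a"), ("doc2", "b c")], map_fn, reduce_fn)
```

In a real deployment the grouping step is performed by the framework across machines; only `map_fn` and `reduce_fn` are written by the programmer.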

Map-Reduce : Data Flow [Figure: blocks of input data feed Map tasks, whose output feeds Reduce tasks]

Map-Reduce : Data Flow [Figure: input pairs (Key1, Value1) pass through Map, which emits (KeyA, ValueX), (KeyB, ValueY), and (KeyB, ValueZ); Reduce then outputs A=X and B=Y,Z] • Map: generates a new key and its value • Reduce: integrates the values that share the same key

Map-Reduce : Architecture [Figure: a Master coordinating Workers that run Map and Reduce tasks, with data read from and written to GFS]

Map-Reduce : Architecture • Master • Assigns each map/reduce task and maintains its state • Propagates intermediate files to reduce tasks • Worker • Executes Map or Reduce tasks at the request of the Master

Map-Reduce : Distributed Processing [Figure: an input file is split across several Map tasks; each Map task writes an intermediate file partitioned into regions 1, 2, …, R; the output file consists of Output 1, Output 2, …, Output R]

Map-Reduce : Example • Inverted Index • DocID=1: "Page of the IDS lab" • DocID=2: "Page of the IDB lab" • [Figure: the resulting inverted index]

Map-Reduce : Example (cont’d) [Figure: the Map-Reduce data flow, annotated with the input data to Map and the output of Map]

Map-Reduce : Example (cont’d) [Figure: the Map-Reduce data flow, annotated with the shuffle and reduce steps] • Shuffle • Collects pairs with the same key and conveys them to Reduce • Reduce writes the final result
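The inverted-index example can be sketched end to end as map, shuffle, and reduce steps. This is a minimal single-process sketch; the function names and the whitespace tokenization are assumptions, and the two short documents are simplified English stand-ins for the slide's sample pages.

```python
from collections import defaultdict

def map_doc(doc_id, text):
    """Map: emit one (word, doc_id) pair per word in the document."""
    return [(word.lower(), doc_id) for word in text.split()]

def shuffle(pairs):
    """Shuffle: collect values that share a key so Reduce sees them together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_postings(word, doc_ids):
    """Reduce: produce the sorted, de-duplicated posting list for one word."""
    return sorted(set(doc_ids))

# Illustrative documents keyed by DocID.
docs = {1: "IDS lab page", 2: "IDB lab page"}
pairs = [pair for doc_id, text in docs.items() for pair in map_doc(doc_id, text)]
index = {word: reduce_postings(word, ids) for word, ids in shuffle(pairs).items()}
```

Words that occur in both documents end up with both DocIDs in their posting list, which is the final result written by Reduce.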

Map-Reduce : Example (cont’d) • Other Examples • Distributed Grep • Count URL Access Frequency • Map emits <URL, 1> • Reduce emits <URL, total count> • Reverse Web-Link Graph • Map emits <target, source> • Reduce emits <target, list(source)>
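Two of these examples can be sketched with the same grouping helper. The access log, the link table, and the helper name `group_by_key` are illustrative assumptions; the grouping stands in for the framework's shuffle.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Stand-in for the shuffle: gather all values emitted under each key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Count URL access frequency: map emits <URL, 1>, reduce sums the ones.
access_log = ["/a", "/b", "/a"]  # illustrative request log
url_counts = {url: sum(ones)
              for url, ones in group_by_key((url, 1) for url in access_log).items()}

# Reverse web-link graph: map emits <target, source> for every link,
# reduce concatenates the sources pointing at each target.
links = {"p1": ["p2", "p3"], "p2": ["p3"]}  # source page -> pages it links to
link_pairs = [(target, source)
              for source, targets in links.items() for target in targets]
reverse_graph = {target: sorted(sources)
                 for target, sources in group_by_key(link_pairs).items()}
```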

Map-Reduce-Merge • Map-Reduce is an extremely simple model, but it applies only in a limited context • Map-Reduce mainly handles homogeneous datasets • Relational operators are hard to implement with Map-Reduce (especially join operations) • Map-Reduce-Merge tries to keep the simplicity of Map-Reduce while extending it to be more complete

Map-Reduce-Merge • Adds a Merge phase to the Map-Reduce algorithm • Allows processing of multiple heterogeneous datasets • Like Map and Reduce, the Merge phase is implemented by the developer • Example: • Two datasets: department and employee • Goal: compute each employee’s bonus based on individual rewards and the department bonus adjustment

Map-Reduce-Merge • Example • Match keys on dept_id in tables
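The employee/department example might look like the sketch below. The record layouts, the sample values, and the rule that a bonus equals the summed rewards multiplied by the department adjustment are all assumptions made for illustration; the slide only states that keys are matched on dept_id. For brevity the merger uses a dictionary lookup rather than a sorted scan.

```python
from collections import defaultdict

# Hypothetical input datasets.
employee_rewards = [  # (emp_id, dept_id, reward)
    ("e1", "d1", 100.0),
    ("e1", "d1", 50.0),
    ("e2", "d2", 200.0),
]
department_adjustments = [("d1", 1.5), ("d2", 0.5)]  # (dept_id, bonus_adjustment)

def reduce_employees(records):
    """Map-Reduce over employees: total rewards per (dept_id, emp_id)."""
    totals = defaultdict(float)
    for emp_id, dept_id, reward in records:
        totals[(dept_id, emp_id)] += reward
    return sorted(totals.items())

def reduce_departments(records):
    """Map-Reduce over departments: one adjustment per dept_id, sorted."""
    return sorted(records)

def merge_on_dept_id(employee_output, department_output):
    """Merge: match the two reduced outputs on dept_id and compute the bonus."""
    adjustment = dict(department_output)
    return {emp_id: total * adjustment[dept_id]
            for (dept_id, emp_id), total in employee_output
            if dept_id in adjustment}

bonuses = merge_on_dept_id(reduce_employees(employee_rewards),
                           reduce_departments(department_adjustments))
```

The point of the sketch is that the two inputs have different schemas, so each gets its own Map-Reduce pass before the Merge step combines them.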

Map-Reduce-Merge: Extending Map-Reduce • Changes the output of the Reduce phase and adds a Merge phase • The original phases • 1. Map: (k1, v1) → [(k2, v2)] • 2. Reduce: (k2, [v2]) → [v3] • become: • 1. Map: (k1, v1) → [(k2, v2)] • 2. Reduce: (k2, [v2]) → (k2, [v3]), so the key is kept in the output • 3. Merge: ((k2, [v3]), (k3, [v4])) → (k4, [v5])
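The three signatures can be read as the following sketch. The driver names `map_reduce` and `merge_outputs`, the nested-loop pairing of reduced outputs, and the sample data are assumptions for illustration; a real system would decide which outputs to pair far more selectively.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Map: (k1, v1) -> [(k2, v2)], then Reduce: (k2, [v2]) -> (k2, [v3]).
    The reducer keeps its key so a later Merge step can match on it."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    return [reducer(k2, values) for k2, values in sorted(groups.items())]

def merge_outputs(left, right, merger):
    """Merge: ((k2, [v3]), (k3, [v4])) -> (k4, [v5]).
    Naive pairing of every left output with every right output;
    the merger returns None for pairs it does not want to combine."""
    merged = []
    for left_item in left:
        for right_item in right:
            result = merger(left_item, right_item)
            if result is not None:
                merged.append(result)
    return merged

identity_mapper = lambda key, value: [(key, value)]
sum_reducer = lambda key, values: (key, [sum(values)])

def add_matching(left_item, right_item):
    """Example merger: combine the sums of entries whose keys match."""
    (k2, v3), (k3, v4) = left_item, right_item
    return (k2, [v3[0] + v4[0]]) if k2 == k3 else None

left = map_reduce([("x", 1), ("x", 2), ("y", 5)], identity_mapper, sum_reducer)
right = map_reduce([("x", 10), ("z", 7)], identity_mapper, sum_reducer)
result = merge_outputs(left, right, add_matching)
```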

Map-Reduce-Merge • Additional user-definable operations • Merger: analogous to the map and reduce definitions; defines the logic of the merge operation • Processor: processes data from an individual source • Partition selector: decides which data should go to which merger • Configurable iterator: defines how to step through each of the lists as they are merged
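One way the four user-definable pieces could fit together is sketched below. Every function name, signature, and data value here is hypothetical and chosen only to illustrate the roles listed above; the actual interfaces are not given in these slides.

```python
def processor(partition):
    """Processor: per-source processing; here it just sorts one partition by key."""
    return sorted(partition)

def partition_selector(merger_id, num_partitions):
    """Partition selector: which partitions of the other source a merger reads.
    Assumed policy: merger i reads only partition i."""
    return [merger_id] if merger_id < num_partitions else []

def configurable_iterator(left, right):
    """Configurable iterator: how to step through the two lists.
    Assumed policy: advance whichever side holds the smaller key."""
    i = j = 0
    while i < len(left) and j < len(right):
        yield left[i], right[j]
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            i += 1
            j += 1

def merger(left_item, right_item):
    """Merger: user logic applied to each pair the iterator presents."""
    if left_item[0] == right_item[0]:
        return (left_item[0], (left_item[1], right_item[1]))
    return None

# Illustrative reducer outputs, two partitions per source.
left_parts = [[("b", 2), ("a", 1)], [("m", 3)]]
right_parts = [[("a", 10), ("c", 30)], [("m", 40), ("n", 50)]]

merged = []
for merger_id, left_part in enumerate(left_parts):
    for part_id in partition_selector(merger_id, len(right_parts)):
        pairs = configurable_iterator(processor(left_part),
                                      processor(right_parts[part_id]))
        merged.extend(m for m in (merger(l, r) for l, r in pairs) if m is not None)
```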

Map-Reduce-Merge

Map-Reduce-Merge : Relational Data Processing • Relational operators can be implemented using the Map-Reduce-Merge model. This includes: • Projection • Aggregation • Generalized selection • Joins • Set union • Set intersection • Set difference • Etc…

Map-Reduce-Merge : Example, Set Union • Each of the two Map-Reduces emits a sorted list of unique elements • The Merge combines the two lists by iterating as follows: • If the current values differ, store the smaller one and advance its iterator by one • If they are equal, store one of them and advance both iterators
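The iteration rule for set union can be sketched as a two-pointer merge. The function name and sample lists are illustrative; copying the leftover tail once one list is exhausted is an added detail that the slide does not spell out.

```python
def merge_union(a, b):
    """Union of two sorted lists of unique elements."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])   # a holds the smaller value: store it, advance a
            i += 1
        elif b[j] < a[i]:
            out.append(b[j])   # b holds the smaller value: store it, advance b
            j += 1
        else:
            out.append(a[i])   # equal: store one copy, advance both
            i += 1
            j += 1
    # Added detail: whatever remains in either list belongs to the union.
    out.extend(a[i:])
    out.extend(b[j:])
    return out

union = merge_union([1, 3, 5], [2, 3, 6, 7])
```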

Map-Reduce-Merge : Example, Set Difference • Given two sets, A and B, we want to compute A-B • Each of the two Map-Reduces emits a sorted list of unique elements • The Merge iterates simultaneously over the two lists: • If A’s value is less than B’s, store A’s value and increment A’s iterator • If B’s value is smaller, increment B’s iterator • If the two are equal, increment both iterators
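The same two-pointer pattern gives a sketch of A-B. The function name and sample lists are illustrative; keeping the rest of A once B is exhausted is an added detail implied by the definition of set difference rather than stated on the slide.

```python
def merge_difference(a, b):
    """Elements of sorted unique list a that do not appear in sorted unique list b."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])   # a's value cannot occur later in b: keep it
            i += 1
        elif b[j] < a[i]:
            j += 1             # b's value is irrelevant to a: skip it
        else:
            i += 1             # present in both sets: drop it
            j += 1
    # Added detail: once b is exhausted, the rest of a survives.
    out.extend(a[i:])
    return out

difference = merge_difference([1, 2, 4, 6], [2, 3, 6])
```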

Map-Reduce-Merge : Example, Sort-Merge Join • Map: partitions records into mutually exclusive key-range buckets; each key range is assigned to a reducer • Reduce: merges the data in each bucket into a sorted set, i.e. sorts the data • Merge: the merger joins the sorted data for each key range
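The three phases of the sort-merge join can be sketched in a single process as follows. The function names, the single range boundary, and the sample records are assumptions for illustration; in a cluster each bucket would be handled by its own reducer and merger.

```python
import bisect

def range_partition(records, boundaries):
    """Map: place each (key, value) in the bucket for its key range.
    boundaries are sorted upper bounds; keys above the last bound go to a final bucket."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for key, value in records:
        buckets[bisect.bisect_left(boundaries, key)].append((key, value))
    return buckets

def sort_bucket(bucket):
    """Reduce: turn one bucket into a sorted set of records."""
    return sorted(bucket)

def merge_join(left, right):
    """Merge: join two sorted buckets that cover the same key range."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Emit every right record with this key, then advance only the left
            # side so that duplicate left keys are matched as well.
            k = j
            while k < len(right) and right[k][0] == left_key:
                out.append((left_key, left[i][1], right[k][1]))
                k += 1
            i += 1
    return out

boundaries = [10]  # two key ranges: keys <= 10 and keys > 10
left = [(12, "c"), (3, "a"), (7, "b")]
right = [(7, "x"), (3, "y"), (15, "z"), (7, "w")]
joined = []
for left_bucket, right_bucket in zip(range_partition(left, boundaries),
                                     range_partition(right, boundaries)):
    joined.extend(merge_join(sort_bucket(left_bucket), sort_bucket(right_bucket)))
```

Because both inputs use the same boundaries, records with equal keys always land in buckets with the same index, so each merger only needs to read one bucket from each side.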

Map-Reduce-Merge : Optimizations • Map-Reduce already optimizes using locality and backup tasks • Optimize the number of connections between the outputs of the Reduce phase and the inputs of the Merge phase (example: set intersection) • Combine two phases into one (example: ReduceMerge)

Conclusions • Map-Reduce-Merge allows us to work on heterogeneous datasets • Map-Reduce-Merge supports joins, which Map-Reduce does not directly support • Next step: develop an SQL-like interface and an optimizer to simplify the development of a Map-Reduce-Merge workflow
