
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Hung-chih Yang (1), Ali Dasdan (1), Ruey-Lung Hsiao (2), D. Stott Parker (2). (1) Yahoo!; (2) Computer Science Department, UCLA. SIGMOD 2007, Beijing, China. Presented by Jongheum Yeon, 2009.08.13.


Presentation Transcript


  1. Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters
     Hung-chih Yang (1), Ali Dasdan (1), Ruey-Lung Hsiao (2), D. Stott Parker (2)
     (1) Yahoo!; (2) Computer Science Department, UCLA
     SIGMOD 2007, Beijing, China
     Presented by Jongheum Yeon, 2009.08.13

  2. Outline
     • Introduction
     • Map-Reduce
     • Map-Reduce-Merge
     • Conclusions

  3. Introduction
     • New data-processing systems should consider alternatives to big, traditional databases
     • Map-Reduce does a good job, in a limited context, with extraordinary simplicity
     • Map-Reduce-Merge tries to extend that applicability without giving up too much simplicity

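  4. Introduction (cont'd)
     [Figure: data-processing stacks compared across four layers (Application, Language, Execution, Storage). Systems named: parallel databases with SQL; Sawzall (≈SQL) on Map-Reduce over GFS and BigTable; Pig and Hive on Hadoop over HDFS and S3; DryadLINQ (LINQ, SQL) and Scope on Dryad over Cosmos, Azure, and SQL Server]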

  5. Map-Reduce : Motivation
     • Many special-purpose tasks operate on and produce large amounts of data
         • Inputs: crawled documents, web requests, etc.
         • Outputs: inverted indices, summaries, and other kinds of derived data
     • The work must be distributed across a large number of machines to finish in a reasonable time
         • Parallelize the computation
         • Distribute the data
     • These extra concerns obscure the original computation

  6. Map-Reduce : Benefits
     • Automatic parallelization and distribution
     • Reduced complexity and size of user code
     • Transparent fault tolerance
     • I/O scheduling
     • Fine-grained partitioning of tasks
     • Tasks are dynamically scheduled on available workers
     • Status and monitoring

  7. Map-Reduce : Programming Model
     • Input and output: each a set of key/value pairs
     • The programmer specifies two functions:
         • map (in_key, in_value) -> list (out_key, intermediate_value)
             • Processes an input key/value pair
             • Produces a set of intermediate pairs
         • reduce (out_key, list(intermediate_value)) -> list (out_value)
             • Produces a set of merged output values (usually just one)
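The two signatures above can be sketched as a single-process word count. This is only an illustration: `run_map_reduce`, the sample documents, and the choice of word counting are assumptions made here, not part of the slides, and a real Map-Reduce runtime distributes these steps across workers.

```python
from collections import defaultdict


def map_fn(in_key, in_value):
    """map(in_key, in_value) -> list(out_key, intermediate_value)."""
    # in_key is a document name, in_value its text; emit (word, 1) per word.
    return [(word, 1) for word in in_value.split()]


def reduce_fn(out_key, intermediate_values):
    """reduce(out_key, list(intermediate_value)) -> list(out_value)."""
    # Usually just one merged output value per key.
    return [sum(intermediate_values)]


def run_map_reduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, value in map_fn(in_key, in_value):
            groups[out_key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


documents = [("doc1", "to be or not to be"), ("doc2", "to do")]
counts = run_map_reduce(documents, map_fn, reduce_fn)
```

Here `counts["to"]` holds the single merged value for the key "to", summed over both documents.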

  8. Map-Reduce : Data Flow
     [Figure: several blocks of input data each feed a Map task, and the Map outputs feed the Reduce tasks]

  9. Map-Reduce : Data Flow
     [Figure: the pairs (KeyA, ValueX), (KeyB, ValueY), and (KeyB, ValueZ) flow through Map and Reduce; the Reduce outputs are A=X and B=Y,Z]
     • Map: generates a new key and its value
     • Reduce: integrates the values that share the same key

  10. Map-Reduce : Architecture
     [Figure: architecture diagram showing a Master, Workers running Map and Reduce tasks, and GFS]

  11. Map-Reduce : Architecture
     • Master
         • Assigns each map/reduce task and maintains its state
         • Propagates the locations of intermediate files to the reduce tasks
     • Worker
         • Executes Map or Reduce tasks at the Master's request

  12. Map-Reduce : Distributed Processing
     [Figure: the input file is split across Map tasks; each Map task writes an intermediate file divided into partitions 1 through R; the Reduce tasks write Output 1 through Output R]

  13. Map-Reduce : Example
     • Inverted index
         • DocID=1: "IDS lab's page"
         • DocID=2: "IDB lab's page"
     [Figure: the resulting inverted index]

  14. Map-Reduce : Example (cont'd)
     [Figure: the data-flow diagram annotated with the input data to Map and the output of Map]

  15. Map-Reduce : Example (cont'd)
     [Figure: the data-flow diagram annotated with the shuffle and reduce steps]
     • Shuffle: collects pairs with the same key and conveys them to Reduce
     • Reduce: writes the final result
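The map, shuffle, and reduce steps of the inverted-index example can be sketched in one process as follows. The two documents are English renderings of the one-line pages from the example; the function names and the lowercasing of words are assumptions made for this illustration.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    # Emit one (word, doc_id) pair per distinct word in the document.
    return [(word, doc_id) for word in sorted(set(text.lower().split()))]


def shuffle(pairs):
    # Collect the values that share a key so each key reaches one reduce call.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(word, doc_ids):
    # The final posting list for a word: the sorted, de-duplicated DocIDs.
    return sorted(set(doc_ids))


docs = {1: "IDS lab's page", 2: "IDB lab's page"}
map_output = [pair for doc_id, text in docs.items() for pair in map_fn(doc_id, text)]
inverted_index = {
    word: reduce_fn(word, doc_ids)
    for word, doc_ids in sorted(shuffle(map_output).items())
}
```

Words that occur in both pages, such as "page", end up with both DocIDs in their posting list, while "ids" and "idb" each point to a single document.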

  16. Map-Reduce : Example (cont'd)
     • Other examples
         • Distributed grep
         • Count of URL access frequency
             • Map emits <URL, 1>
             • Reduce emits <URL, total count>
         • Reverse web-link graph
             • Map emits <target, source>
             • Reduce emits <target, list(source)>
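Two of these examples, URL access counting and the reverse web-link graph, can be sketched with one small in-memory driver. The driver `run_map_reduce`, the log entries, and the page names are hypothetical and exist only to make the key/value shapes concrete.

```python
from collections import defaultdict


def run_map_reduce(records, map_fn, reduce_fn):
    """Hypothetical single-process driver: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# URL access frequency: <URL, 1> -> <URL, total count>
def count_map(_line_no, url):
    return [(url, 1)]


def count_reduce(_url, ones):
    return sum(ones)


# Reverse web-link graph: <target, source> -> <target, list(source)>
def link_map(source, targets):
    return [(target, source) for target in targets]


def link_reduce(_target, sources):
    return sorted(sources)


access_log = [(1, "/home"), (2, "/about"), (3, "/home")]
url_counts = run_map_reduce(access_log, count_map, count_reduce)

pages = [("a.html", ["b.html", "c.html"]), ("b.html", ["c.html"]), ("c.html", ["a.html"])]
reverse_links = run_map_reduce(pages, link_map, link_reduce)
```

Only the map and reduce functions differ between the two jobs; the grouping step in the middle is identical.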

  17. Map-Reduce-Merge
     • Map-Reduce is an extremely simple model, but it applies in a limited context
     • Map-Reduce mainly handles homogeneous datasets
     • Relational operators, especially joins, are hard to implement with Map-Reduce
     • Map-Reduce-Merge tries to keep the simplicity of Map-Reduce while extending it to be more complete

  18. Map-Reduce-Merge
     • Adds a Merge phase to the Map-Reduce algorithm
     • Allows processing of multiple heterogeneous datasets
     • Like Map and Reduce, the Merge phase is implemented by the developer
     • Example:
         • Two datasets: department and employee
         • Goal: compute each employee's bonus from individual rewards and the department's bonus adjustment

  19. Map-Reduce-Merge
     • Example
         • The Merge phase matches keys on dept_id across the two tables

  20. Map-Reduce-Merge: Extending Map-Reduce
     • Changes the Reduce phase and adds a Merge phase
     • Phases in Map-Reduce:
         • 1. Map: (k1, v1) → [(k2, v2)]
         • 2. Reduce: (k2, [v2]) → [v3]
     • Phases in Map-Reduce-Merge:
         • 1. Map: (k1, v1) → [(k2, v2)]
         • 2. Reduce: (k2, [v2]) → (k2, [v3])
         • 3. Merge: ((k2, [v3]), (k3, [v4])) → (k4, [v5])
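The three phases can be sketched with the department/employee bonus example. Everything concrete here is an assumption made for illustration: the record layouts, the sample values, the rule "bonus = summed rewards × department adjustment", and the naive pairing loop at the end, which stands in for whatever strategy a real runtime would use.

```python
from collections import defaultdict


def run_map_reduce(records, map_fn, reduce_fn):
    """Map: (k1, v1) -> [(k2, v2)], then Reduce: (k2, [v2]) -> (k2, [v3])."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Reduce keeps its key so the Merge phase can match on it later.
    return [reduce_fn(k2, v2s) for k2, v2s in sorted(groups.items())]


def emp_map(emp_id, row):
    dept_id, reward = row
    return [((dept_id, emp_id), reward)]


def emp_reduce(key, rewards):
    return (key, [sum(rewards)])  # total individual rewards per employee


def dept_map(dept_id, adjustment):
    return [(dept_id, adjustment)]


def dept_reduce(dept_id, adjustments):
    return (dept_id, [adjustments[-1]])  # one adjustment per department


def merge(emp_entry, dept_entry):
    """Merge: ((k2, [v3]), (k3, [v4])) -> (k4, [v5]), matching on dept_id."""
    (dept_id, emp_id), [total] = emp_entry
    other_dept_id, [adjustment] = dept_entry
    if dept_id != other_dept_id:
        return None
    return (emp_id, [total * adjustment])


employees = [("e1", ("d1", 100)), ("e1", ("d1", 50)), ("e2", ("d2", 200))]
departments = [("d1", 1.5), ("d2", 1.0)]

emp_out = run_map_reduce(employees, emp_map, emp_reduce)
dept_out = run_map_reduce(departments, dept_map, dept_reduce)

bonuses = {}
for emp_entry in emp_out:
    for dept_entry in dept_out:
        merged = merge(emp_entry, dept_entry)
        if merged is not None:
            emp_id, bonus = merged
            bonuses[emp_id] = bonus
```

The key change from plain Map-Reduce is visible in `emp_reduce` and `dept_reduce`: both return their key alongside the reduced values, which is what lets `merge` match the two lineages.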

  21. Map-Reduce-Merge
     • Additional user-definable operations
         • Merger: defines the logic of the merge operation, analogous to the map and reduce definitions
         • Processor: processes the data from one individual source
         • Partition selector: decides which data should go to which merger
         • Configurable iterator: controls how to step through each of the lists as the merge proceeds
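One way to picture how the four operations fit together is the sketch below. It is not the paper's API: the function names, the "merger i reads partition i from both sides" policy, the filtering done by the processor, and the assumption that keys are unique and sorted within each partition are all choices made for this illustration.

```python
def partition_selector(merger_id, left_partitions, right_partitions):
    # Assumed policy: merger i reads reducer partition i from both sources.
    return left_partitions[merger_id], right_partitions[merger_id]


def processor(entries):
    # Per-source step; here it only drops entries whose value list is empty.
    return [(key, values) for key, values in entries if values]


def sort_merge_iterator(left, right):
    """Configurable iterator: step through two key-sorted lists in lockstep,
    yielding the entries whose keys match (keys assumed unique per list)."""
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            yield left[i], right[j]
            i += 1
            j += 1


def merger(left_entry, right_entry):
    # User-defined merge logic: concatenate the value lists of a matching key.
    key, left_values = left_entry
    _, right_values = right_entry
    return (key, left_values + right_values)


def run_merger(merger_id, left_partitions, right_partitions):
    left, right = partition_selector(merger_id, left_partitions, right_partitions)
    left, right = processor(left), processor(right)
    return [merger(l, r) for l, r in sort_merge_iterator(left, right)]


left_partitions = [[("a", [1]), ("b", [2])], [("x", [3])]]
right_partitions = [[("b", [20]), ("c", [30])], [("x", []), ("y", [40])]]
merged_0 = run_merger(0, left_partitions, right_partitions)
merged_1 = run_merger(1, left_partitions, right_partitions)
```

Swapping `sort_merge_iterator` for a nested-loop iterator would change the traversal strategy without touching `merger`, which is the point of keeping the iterator configurable.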

  22. Map-Reduce-Merge

  23. Map-Reduce-Merge : Relational Data Processing
     • Relational operators can be implemented using the Map-Reduce-Merge model, including:
         • Projection
         • Aggregation
         • Generalized selection
         • Joins
         • Set union
         • Set intersection
         • Set difference
         • Etc.

  24. Map-Reduce-Merge : Example, Set Union
     • Each of the two Map-Reduce jobs emits a sorted list of unique elements
     • The Merge combines the two lists by iterating as follows:
         • If the two current values differ, store the smaller one and advance its iterator by one
         • If they are equal, store one of them and advance both iterators
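A minimal sketch of this merge step, assuming the two reduce outputs are already available as sorted Python lists of unique elements. The function name is hypothetical, and the final two `extend` calls, which copy whatever remains once one list is exhausted, are a detail the slide leaves implicit.

```python
def merge_union(a, b):
    """Union of two sorted lists of unique elements."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            result.append(a[i])  # a holds the smaller value
            i += 1
        elif b[j] < a[i]:
            result.append(b[j])  # b holds the smaller value
            j += 1
        else:
            result.append(a[i])  # equal: store one copy, advance both
            i += 1
            j += 1
    # One list is exhausted; the rest of the other belongs to the union.
    result.extend(a[i:])
    result.extend(b[j:])
    return result


union = merge_union([1, 3, 5], [2, 3, 6])
```

Because both inputs are sorted and duplicate-free, the output is too, and each element is visited once.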

  25. Map-Reduce-Merge : Example, Set Difference
     • Given two sets, A and B, we want to compute A − B
     • Each of the two Map-Reduce jobs emits a sorted list of unique elements
     • The Merge iterates over the two lists simultaneously:
         • If A's current value is less than B's, store A's value and advance A's iterator
         • If B's current value is smaller, advance B's iterator
         • If the two are equal, advance both iterators
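The same two-iterator walk can be sketched for A − B, again assuming sorted Python lists of unique elements. The function name is hypothetical, and copying the remainder of A once B is exhausted is an added detail that the slide does not spell out.

```python
def merge_difference(a, b):
    """Elements of sorted list a that do not appear in sorted list b."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            result.append(a[i])  # a[i] cannot occur later in b
            i += 1
        elif b[j] < a[i]:
            j += 1               # b[j] is irrelevant to the rest of a
        else:
            i += 1               # present in both: drop it
            j += 1
    # Anything left in a has no counterpart in b.
    result.extend(a[i:])
    return result


difference = merge_difference([1, 2, 4, 7], [2, 3, 7, 9])
```

Unlike the union, leftover elements of B are simply discarded.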

  26. Map-Reduce-Merge : Example, Sort-Merge Join
     • Map: partitions the records into mutually exclusive buckets by key range; each key range is assigned to a reducer
     • Reduce: merges the data in each bucket into a sorted set, that is, sorts the data
     • Merge: the merger joins the sorted data for each key range
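The three steps can be sketched in one process as below. The range boundaries, the sample records, and all function names are assumptions made for illustration; in a cluster each bucket would be handled by a different reducer and merger.

```python
import bisect

# Hypothetical range boundaries giving three key ranges: <10, 10..19, >=20.
BOUNDARIES = [10, 20]


def map_partition(records):
    """Map: range-partition (key, value) records into disjoint buckets."""
    buckets = [[] for _ in range(len(BOUNDARIES) + 1)]
    for key, value in records:
        buckets[bisect.bisect_right(BOUNDARIES, key)].append((key, value))
    return buckets


def reduce_sort(bucket):
    """Reduce: sort one bucket by key (stable, so ties keep input order)."""
    return sorted(bucket, key=lambda record: record[0])


def merge_join(left, right):
    """Merge: join two key-sorted buckets that cover the same key range."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of equal keys on each side, then pair them up.
            i_end = i
            while i_end < len(left) and left[i_end][0] == left_key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            for _, left_value in left[i:i_end]:
                for _, right_value in right[j:j_end]:
                    out.append((left_key, left_value, right_value))
            i, j = i_end, j_end
    return out


def sort_merge_join(left_records, right_records):
    left_buckets = [reduce_sort(b) for b in map_partition(left_records)]
    right_buckets = [reduce_sort(b) for b in map_partition(right_records)]
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        joined.extend(merge_join(left, right))
    return joined


left_table = [(25, "x"), (3, "a"), (12, "b"), (12, "c")]
right_table = [(12, "B"), (3, "A"), (30, "Z")]
joined = sort_merge_join(left_table, right_table)
```

Using the same boundaries for both inputs is what allows each merger to see only one pair of buckets: matching keys can never land in different key ranges.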

  27. Map-Reduce-Merge : Optimizations
     • Map-Reduce already optimizes using locality and backup tasks
     • Optimize the number of connections between the outputs of the reduce phase and the inputs of the merge phase (example: set intersection)
     • Combine two phases into one (example: ReduceMerge)

  28. Conclusions
     • Map-Reduce-Merge allows us to work on heterogeneous datasets
     • Map-Reduce-Merge supports joins, which Map-Reduce did not support directly
     • Next step: develop an SQL-like interface and an optimizer that simplify the development of a Map-Reduce-Merge workflow
