
MapReduce

MapReduce. By: Jeffrey Dean & Sanjay Ghemawat. Presented by: Warunika Ranaweera. Supervised by: Dr. Nalin Ranasinghe. Paper: MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI '04).


Presentation Transcript


  1. MapReduce By: Jeffrey Dean & Sanjay Ghemawat Presented by: Warunika Ranaweera Supervised by: Dr. Nalin Ranasinghe

  2. Paper MapReduce: Simplified Data Processing on Large Clusters In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI '04) Also appears in the Communications of the ACM (2008)

  3. Authors – Jeffrey Dean • Ph.D. in Computer Science – University of Washington • Google Fellow in Systems and Infrastructure Group • ACM Fellow • Research Areas: Distributed Systems and Parallel Computing

  4. Authors – Sanjay Ghemawat • Ph.D. in Computer Science – Massachusetts Institute of Technology • Google Fellow • Research Areas: Distributed Systems and Parallel Computing

  5. Large Computations • Calculate 30*50. Easy? • Calculate 30*50 + 31*51 + 32*52 + 33*53 + .... + 40*60. A little bit harder?
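As a quick sanity check, a one-line Python sketch of the second sum, assuming the intended pattern is i * (i + 20) for i = 30..40:

    # Sum 30*50 + 31*51 + ... + 40*60, assuming each term is i * (i + 20)
    total = sum(i * (i + 20) for i in range(30, 41))
    print(total)  # 21285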

  6. Large Computations • Simple computation, but a huge data set • Real-world example of a large computation: 20+ billion web pages × ~20 kB per page ≈ 400 TB • One computer reads 30–35 MB/sec from disk • Nearly four months for one machine to read the web
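A rough back-of-the-envelope check of those figures (a sketch only; 35 MB/sec is the upper end of the disk rate quoted on the slide):

    pages = 20e9           # 20+ billion web pages
    page_bytes = 20e3      # ~20 kB per page
    rate = 35e6            # ~30-35 MB/sec read from disk

    total_bytes = pages * page_bytes        # 4e14 bytes, i.e. ~400 TB
    days = total_bytes / rate / 86400
    print(round(days))                      # ~132 days, roughly four months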

  7. Good News: Distributed Computation • Parallelize tasks in a distributed computing environment • The web-page problem is solved in about 3 hours with 1,000 machines

  8. Though, the bad news is... • Complexities in Distributed Computing • How to parallelize the computation? • Coordinate with other nodes • Handling failures • Preserve bandwidth • Load balancing

  9. MapReduce to the Rescue • A platform that hides the messy details of distributed computing: • Parallelization • Fault-tolerance • Data distribution • Load balancing • A programming model • An implementation

  10. MapReduce: Programming Model • Example: Word count • Document: "the quick", "brown fox", "the fox ate", "the mouse" • Mapped: (the, 1), (quick, 1), (brown, 1), (fox, 1), (the, 1), (fox, 1), (ate, 1), (the, 1), (mouse, 1) • Reduced: (the, 3), (quick, 1), (brown, 1), (fox, 2), (ate, 1), (mouse, 1)

  11. Programming Model: Example • Eg: Word count using MapReduce • Input splits "the quick brown fox", "the fox ate", "the mouse" each go to a Map task • Map output / Reduce input: (the, 1), (quick, 1), (brown, 1), (fox, 1) from the first split; (the, 1), (fox, 1), (ate, 1) from the second; (the, 1), (mouse, 1) from the third • Reduce output: (the, 3), (quick, 1), (brown, 1), (fox, 2), (ate, 1), (mouse, 1)

  12. The Map Operation • Input: a text file (document name, document contents) • Output: intermediate key/value pairs, e.g. ("fox", "1")

    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");

  13. The Reduce Operation • Input: a word and the list of counts output by Map, e.g. ("fox", {"1", "1"}) • Output: the accumulated count, e.g. ("fox", "2")

    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));
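The map and reduce routines above are the paper's pseudocode. Below is a minimal, single-process Python sketch of the same word-count logic; the function names and the dictionary-based shuffle step are illustrative stand-ins for what the MapReduce library does, not part of the paper's interface:

    from collections import defaultdict

    def word_count_map(key, value):
        # key: document name, value: document contents
        for word in value.split():
            yield (word, 1)

    def word_count_reduce(key, values):
        # key: a word, values: all counts emitted for that word
        return (key, sum(values))

    def run(documents):
        # Shuffle: group intermediate pairs by key, as the library would
        groups = defaultdict(list)
        for name, contents in documents.items():
            for word, count in word_count_map(name, contents):
                groups[word].append(count)
        return [word_count_reduce(word, counts) for word, counts in groups.items()]

    docs = {"doc1": "the quick brown fox", "doc2": "the fox ate the mouse"}
    print(run(docs))
    # [('the', 3), ('quick', 1), ('brown', 1), ('fox', 2), ('ate', 1), ('mouse', 1)]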

  14. MapReduce in Practice • Reverse Web-Link Graph: several source web pages (Source Web Page 1–5) contain links to a target page (my web page); the goal is to list, for each target, all the sources that point to it

  15. MapReduce in Practice Contd. • Reverse Web-Link Graph • Map emits a (target, source) pair for each link found in a source page: ("My Web", "Source 1"), ("Not My Web", "Source 2"), ("My Web", "Source 3"), ("My Web", "Source 4"), ("My Web", "Source 5") • Reduce concatenates the sources pointing to each target: ("My Web", {"Source 1", "Source 3", .....})
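A minimal Python sketch of the same reverse web-link computation; the page names and the in-memory shuffle are illustrative:

    from collections import defaultdict

    def link_map(source, targets):
        # Emit (target, source) for every outgoing link on a source page
        for target in targets:
            yield (target, source)

    def link_reduce(target, sources):
        # Concatenate all sources that point to this target
        return (target, sorted(sources))

    pages = {"Source 1": ["My Web"], "Source 2": ["Not My Web"], "Source 3": ["My Web"]}
    groups = defaultdict(list)
    for source, targets in pages.items():
        for target, src in link_map(source, targets):
            groups[target].append(src)
    print([link_reduce(t, s) for t, s in groups.items()])
    # [('My Web', ['Source 1', 'Source 3']), ('Not My Web', ['Source 2'])]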

  16. Implementation: Execution Overview • Layers: Input (Split 0–4) → Map workers → Intermediate files → Reduce workers → Output (O/P File 0, O/P File 1) • (1) The user program forks the master and the workers • (2) The master assigns map tasks and reduce tasks to idle workers • (3) Map workers read their input splits • (4) Map workers write intermediate results to local disk • (5) Reduce workers remotely read the intermediate files • (6) Reduce workers write the output files
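In step (4), each map worker splits its intermediate output into R regions, one per reduce task. The paper's default partitioning rule is hash(key) mod R; a small sketch, with Python's built-in hash standing in for the library's hash function:

    R = 2  # number of reduce tasks, hence number of output files

    def partition(key, r=R):
        # Default partitioning function: hash(key) mod R
        return hash(key) % r

    intermediate = [("the", 1), ("quick", 1), ("fox", 1), ("the", 1)]
    regions = {i: [] for i in range(R)}
    for key, value in intermediate:
        regions[partition(key)].append((key, value))
    print(regions)   # each region is later read by one reduce worker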

  17. MapReduce to the Rescue • Complexities in Distributed Computing, to be solved • How to parallelize the computation? → Automatic parallelization using Map & Reduce • Coordinate with other nodes • Handling failures • Preserve bandwidth • Load balancing

  18. Implementation: Parallelization • Restricted programming model • User-specified Map & Reduce functions • 1000s of workers, each running the same user-defined Map/Reduce instruction on a different piece of the data (Worker 1, Worker 2, Worker 3, ...)
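A rough sketch of this idea using Python's multiprocessing, with a few local processes standing in for thousands of workers and the word-count mapper as the user-defined function:

    from multiprocessing import Pool

    def map_task(split):
        # Every worker runs the same user-defined Map function on its own split
        return [(word, 1) for word in split.split()]

    splits = ["the quick brown fox", "the fox ate", "the mouse"]

    if __name__ == "__main__":
        with Pool(processes=3) as pool:
            print(pool.map(map_task, splits))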

  19. MapReduce to the Rescue • Complexities in Distributed Computing, solving... • Automatic parallelization using Map & Reduce • Coordinate with other nodes → Coordinate nodes using a master node • Handling failures • Preserve bandwidth • Load balancing

  20. Implementation: Coordination • Master data structure • The master pushes information (meta-data) between Map workers and Reduce workers

  21. MapReduce to the Rescue • Complexities in Distributed Computing, solving... • Automatic parallelization using Map & Reduce • Coordinate nodes using a master node • Handling failures → Fault tolerance (re-execution) & backup tasks • Preserve bandwidth • Load balancing

  22. Implementation: Fault Tolerance • No response from a worker? • If an ongoing Map or Reduce task: re-execute • If a completed Map task: re-execute (its output sits on the failed machine's local disk) • If a completed Reduce task: remains untouched (its output is already in the global file system) • Master failure (unlikely): restart the computation

  23. Implementation: Backup Tasks • "Straggler": a machine that takes an unusually long time to complete one of the last tasks in the computation • Solution: redundant execution • Near the end of a phase, spawn backup copies of the remaining in-progress tasks • Whichever copy finishes first "wins"
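A toy sketch of redundant execution with Python's concurrent.futures: the same (simulated) task is submitted twice and whichever copy finishes first wins; the random sleep stands in for a straggling machine:

    import concurrent.futures, random, time

    def task(copy_id):
        # One copy may be much slower than the other (a simulated straggler)
        time.sleep(random.choice([0.1, 2.0]))
        return copy_id

    with concurrent.futures.ThreadPoolExecutor() as pool:
        copies = [pool.submit(task, i) for i in range(2)]   # original + backup
        done, pending = concurrent.futures.wait(
            copies, return_when=concurrent.futures.FIRST_COMPLETED)
        print("winner:", next(iter(done)).result())
        # The slower copy's result is simply discarded when it arrives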

  24. MapReduce to the Rescue • Complexities in Distributed Computing, solving... • Automatic parallelization using Map & Reduce • Coordinate nodes using a master node • Fault tolerance (re-execution) & backup tasks • Preserve bandwidth → Save bandwidth through locality • Load balancing

  25. Implementation: Optimize Locality • The same input data is replicated on several machines • If a worker already holds a task's data locally, there is no need to read it from other nodes

  26. MapReduce to the Rescue • Complexities in Distributed Computing, solved • Automatic parallelization using Map & Reduce • Coordinate nodes using a master node • Fault tolerance & backup tasks • Save bandwidth through locality • Load balancing → Load balancing through granularity

  27. Implementation: Load Balancing • Fine-granularity tasks: many more map tasks than machines • 1 worker handles several tasks • Idle workers are quickly assigned new work
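A small sketch of fine-grained task assignment: far more tasks than workers, with idle workers pulling the next task from a shared queue (the counts are illustrative):

    import queue, threading

    tasks = queue.Queue()
    for i in range(20):                      # many more map tasks than workers
        tasks.put(f"split-{i}")

    def worker():
        while True:
            try:
                split = tasks.get_nowait()   # an idle worker picks up new work immediately
            except queue.Empty:
                return
            # ... run the map task on `split` here ...
            tasks.task_done()

    workers = [threading.Thread(target=worker) for _ in range(3)]
    for w in workers: w.start()
    for w in workers: w.join()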

  28. Extensions • Partitioning • Combining (a local reduce on each map worker; see the sketch below) • Skipping bad records • Local execution for debugging • Counters
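For example, a combiner runs a partial reduce on each map worker's output before anything is sent over the network; a sketch (the Counter-based merge is illustrative):

    from collections import Counter

    def map_task(split):
        return [(word, 1) for word in split.split()]

    def combine(pairs):
        # Partial, on-worker reduce: merge repeated keys before the network shuffle
        return list(Counter(word for word, _ in pairs).items())

    print(combine(map_task("the fox ate the mouse")))
    # [('the', 2), ('fox', 1), ('ate', 1), ('mouse', 1)]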

  29. Performance – Backup Tasks • With backup tasks disabled, the computation takes 1283 s instead of 891 s for normal execution: a 44% increase in time • Very long tail: the last stragglers take >300 s to finish

  30. Performance – Fault Tolerance • With 200 worker processes killed, the computation takes 933 s instead of 891 s for normal execution: only a 5% increase in time • Quick failure recovery

  31. MapReduce at Google • Clustering for Google News and Google Product Search • Google Maps • Locating addresses • Rendering map tiles • Google PageRank • Localized search

  32. Current Trends – Hadoop MapReduce • Apache Hadoop MapReduce • Hadoop Distributed File System (HDFS) • Used in: Yahoo! Search • Facebook • Amazon • Twitter • Google

  33. Current Trends – Hadoop MapReduce • Higher-level languages/systems based on Hadoop • Amazon Elastic MapReduce • Available to the general public • Processes data in the cloud • Pig and Hive

  34. Conclusion • A large variety of problems can be expressed as Map & Reduce operations • Restricted programming model • Easy to hide the details of distributed computing • Achieves scalability & programming efficiency
