MapReduce By: Jeffrey Dean & Sanjay Ghemawat Presented by: Warunika Ranaweera Supervised by: Dr. Nalin Ranasinghe
Paper • MapReduce: Simplified Data Processing on Large Clusters • In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI '04) • Also appears in Communications of the ACM (2008)
Authors – Jeffrey Dean • Ph.D. in Computer Science – University of Washington • Google Fellow in Systems and Infrastructure Group • ACM Fellow • Research Areas: Distributed Systems and Parallel Computing
Authors – Sanjay Ghemawat • Ph.D. in Computer Science – Massachusetts Institute of Technology • Google Fellow • Research Areas: Distributed Systems and Parallel Computing
Large Computations • Calculate 30*50 – Easy? • 30*50 + 31*51 + 32*52 + 33*53 + .... + 40*60 – A little harder?
Large Computations • A simple computation, but a huge data set • A real-world example of a large computation: • 20+ billion web pages × 20 kB per page • One computer reads 30–35 MB/sec from disk • Nearly four months to read the web
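As a quick sanity check on those numbers (my arithmetic, rounded): 20 × 10⁹ pages × 20 kB ≈ 400 TB; at ~35 MB/sec, 400 TB ÷ 35 MB/sec ≈ 1.1 × 10⁷ seconds ≈ 130 days – roughly four months for one machine to read the web.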
Good News: Distributed Computation • Parallelize tasks across a distributed computing environment • The web-reading problem is solved in 3 hours with 1000 machines
Though, the bad news is... • Complexities in distributed computing: • How to parallelize the computation? • How to coordinate with other nodes? • How to handle failures? • How to preserve bandwidth? • How to balance load?
MapReduce to the Rescue • A platform that hides the messy details of distributed computing: • Parallelization • Fault tolerance • Data distribution • Load balancing • MapReduce is both a programming model and an implementation
MapReduce: Programming Model • Example: word count • Document: the quick / brown fox / the fox ate / the mouse • Mapped: (the, 1) (quick, 1) (brown, 1) (fox, 1) (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1) • Reduced: (the, 3) (quick, 1) (brown, 1) (fox, 2) (ate, 1) (mouse, 1)
Programming Model: Example • Word count using MapReduce • Input splits (Map input): “the quick brown fox” | “the fox ate” | “the mouse” • Map output: (the, 1) (quick, 1) (brown, 1) (fox, 1) | (the, 1) (fox, 1) (ate, 1) | (the, 1) (mouse, 1) • Reduce input: the Map output grouped by key • Reduce output: (the, 3) (quick, 1) (brown, 1) (fox, 2) (ate, 1) (mouse, 1)
The Map Operation • Input: a text file – (document name, document contents) • Output: intermediate key/value pairs, e.g. (“fox”, “1”)

    map(String key, String value):
      // key: document name; value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");
The Reduce Operation • Input: a word and the list of counts emitted by Map, e.g. (“fox”, {“1”, “1”}) • Output: the accumulated count, e.g. (“fox”, “2”)

    reduce(String key, Iterator values):
      // key: a word; values: the list of counts for that word
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));
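A minimal runnable sketch of the two operations in Python (not the paper's C++ API; the library's group-by-key shuffle step between the phases is simulated inline here):

    from collections import defaultdict

    def map_fn(name, contents):
        # Map: emit an intermediate (word, 1) pair for every word in the document.
        for word in contents.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        # Reduce: sum all the counts emitted for one word.
        return (word, sum(counts))

    # The shuffle step, done by the MapReduce library between the phases:
    documents = {"doc1": "the quick brown fox", "doc2": "the fox ate the mouse"}
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            intermediate[word].append(count)

    print([reduce_fn(w, c) for w, c in intermediate.items()])
    # [('the', 3), ('quick', 1), ('brown', 1), ('fox', 2), ('ate', 1), ('mouse', 1)]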
MapReduce in Practice • Reverse web-link graph: for a target page, find all source pages that link to it • [Diagram: source web pages 1–5, with sources linking to the target (my web page)]
MapReduce in Practice Contd. • Reverse web-link graph • Map emits (target, source) for each link found on a source page: (“My Web”, “Source 1”), (“Not My Web”, “Source 2”), (“My Web”, “Source 3”), (“My Web”, “Source 4”), (“My Web”, “Source 5”) • Reduce concatenates the source list for each target: (“My Web”, {“Source 1”, “Source 3”, .....}) – a Python sketch follows
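A hedged Python sketch of the same job (the page names and link lists are made-up stand-ins for parsed web pages):

    from collections import defaultdict

    def map_links(source, targets):
        # Map: for every link source -> target, emit (target, source).
        for target in targets:
            yield (target, source)

    def reduce_links(target, sources):
        # Reduce: pair each target with the list of all pages linking to it.
        return (target, sorted(sources))

    pages = {"source1": ["myweb"], "source2": ["notmyweb"],
             "source3": ["myweb"], "source4": ["myweb"], "source5": ["myweb"]}
    grouped = defaultdict(list)
    for source, targets in pages.items():
        for target, src in map_links(source, targets):
            grouped[target].append(src)

    print([reduce_links(t, s) for t, s in grouped.items()])
    # [('myweb', ['source1', 'source3', 'source4', 'source5']), ('notmyweb', ['source2'])]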
Implementation: Execution Overview • [Figure: user program, master, and workers across five layers – input (splits 0–4), map, intermediate files, reduce, output (O/P files 0–1)] • (1) The user program forks the master and the workers • (2) The master assigns map tasks and reduce tasks to idle workers • (3) Each map worker reads its input split • (4) Map output is written to local disk • (5) Reduce workers remote-read the intermediate files • (6) Each reduce worker writes one output file
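A toy illustration of this flow, reusing the hypothetical map_fn/reduce_fn from the word-count sketch; R and the hash partitioning stand in for steps (4)–(6), not the real system's local disks and RPCs:

    from collections import defaultdict

    R = 2  # number of reduce tasks, and hence of output files

    def run_job(splits, map_fn, reduce_fn):
        # Steps (3)-(4): each map task reads one split and partitions its
        # output into R regions with hash(key) % R.
        regions = [defaultdict(list) for _ in range(R)]
        for name, contents in splits.items():
            for key, value in map_fn(name, contents):
                regions[hash(key) % R][key].append(value)
        # Steps (5)-(6): each reduce task reads its region, sorts by key,
        # and produces one output "file" (here, a list).
        return [[reduce_fn(k, vs) for k, vs in sorted(region.items())]
                for region in regions]

    print(run_job({"split0": "the quick brown fox", "split1": "the fox ate"},
                  map_fn, reduce_fn))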
MapReduce to the Rescue • Complexities in distributed computing, to be solved: • How to parallelize the computation? → automatic parallelization using Map & Reduce • Coordinate with other nodes • Handling failures • Preserve bandwidth • Load balancing
Implementation: Parallelization • Restricted programming model: user-specified Map & Reduce functions • 1000s of workers run the same user-defined Map/Reduce instructions over different data sets • [Diagram: Workers 1–3, each fed data plus the user-defined Map/Reduce instruction]
MapReduce to the Rescue • Complexities in distributed computing, solving... • Automatic parallelization using Map & Reduce • Coordinate with other nodes → coordinate nodes using a master node • Handling failures • Preserve bandwidth • Load balancing
Implementation: Coordination • Master data structures • The master pushes information (meta-data) between map and reduce workers • [Diagram: master exchanging information with a map worker and a reduce worker]
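The paper describes the master as keeping, for each task, its state (idle, in-progress, or completed) and the identity of its worker machine, plus the locations of completed map tasks' intermediate files. A hypothetical Python rendering of that bookkeeping:

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class Task:
        state: str = "idle"            # idle | in_progress | completed
        worker: Optional[str] = None   # machine the task runs (or ran) on
        locations: list = field(default_factory=list)  # intermediate files

    class Master:
        def __init__(self, n_map, n_reduce):
            self.map_tasks = [Task() for _ in range(n_map)]
            self.reduce_tasks = [Task() for _ in range(n_reduce)]

        def map_completed(self, i, worker, locations):
            # Record where the finished map task left its R intermediate
            # files, so this meta-data can be pushed to the reduce workers.
            self.map_tasks[i] = Task("completed", worker, locations)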
MapReduce to the Rescue • Complexities in distributed computing, solving... • Automatic parallelization using Map & Reduce • Coordinate nodes using a master node • Handling failures → fault tolerance (re-execution) & backup tasks • Preserve bandwidth • Load balancing
Implementation: Fault Tolerance • No response from a worker? • If an in-progress map or reduce task: re-execute • If a completed map task: re-execute, because its output sits on the failed machine's local disk • If a completed reduce task: leave untouched, because its output is already in the global file system • Master failure (unlikely): restart the computation
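Continuing the hypothetical Master sketch above, the re-execution rules might look like this (the method name is mine, not the paper's):

    def handle_worker_failure(self, worker):
        for t in self.map_tasks:
            # Both in-progress and completed map work on the dead machine
            # is lost: the output lived on its local disk.
            if t.worker == worker and t.state in ("in_progress", "completed"):
                t.state, t.worker = "idle", None   # eligible for re-execution
        for t in self.reduce_tasks:
            # Completed reduce output is in the global file system and survives.
            if t.worker == worker and t.state == "in_progress":
                t.state, t.worker = "idle", None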
Implementation: Back Up Tasks • “Straggler”: a machine that takes an unusually long time to complete one of the last tasks in the computation • Solution: redundant execution • Near the end of a phase, spawn backup copies of the remaining in-progress tasks • The copy that finishes first “wins”
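A sketch of that heuristic as another method on the hypothetical Master above; the 5% trigger is an illustrative assumption, as the paper only says backups are scheduled when the operation is close to completion:

    def maybe_schedule_backups(self, schedule):
        # When only a few in-progress tasks remain, schedule duplicates;
        # whichever copy finishes first wins and the other is discarded.
        remaining = [t for t in self.map_tasks if t.state == "in_progress"]
        if remaining and len(remaining) <= 0.05 * len(self.map_tasks):
            for t in remaining:
                schedule(t)   # backup execution on another worker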
MapReduce to the Rescue • Complexities in distributed computing, solving... • Automatic parallelization using Map & Reduce • Coordinate nodes using a master node • Fault tolerance (re-execution) & backup tasks • Preserve bandwidth → save bandwidth through locality • Load balancing
Implementation: Optimize Locality • The same data is stored on several machines (the distributed file system keeps replicas) • If a task is scheduled on a machine that already holds its data, no other node needs to be accessed
MapReduce to the Rescue • Complexities in distributed computing, solved: • Automatic parallelization using Map & Reduce • Coordinate nodes using a master node • Fault tolerance (re-execution) & backup tasks • Save bandwidth through locality • Load balancing through granularity
Implementation: Load Balancing • Fine-grained tasks: many more map tasks than machines • One worker runs several tasks • Idle workers are quickly assigned new work
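For scale (figures reported in the paper): typical runs use M = 200,000 map tasks and R = 5,000 reduce tasks on about 2,000 worker machines – on the order of 100 map tasks per worker, so a machine that finishes early simply picks up the next idle task.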
Extensions • Partitioning • Combining • Skipping bad records • Debugging – local execution • Counters
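Of these, combining is the easiest to show: a combiner runs reduce-style aggregation on each map worker's local output before it crosses the network, so repeated (the, 1) pairs leave the machine as a single (the, n). A hedged sketch reusing the word-count job:

    from collections import defaultdict

    def combine(map_output):
        # Pre-aggregate one map task's output locally, before the shuffle.
        local = defaultdict(int)
        for word, count in map_output:
            local[word] += count
        return sorted(local.items())

    print(combine([("the", 1), ("fox", 1), ("the", 1)]))  # [('fox', 1), ('the', 2)]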
Performance – Back Up Tasks • Without backup tasks, the job takes 1283 s versus 891 s for normal execution – a 44% increase in time • Very long tail: stragglers take >300 s to finish
Performance – Fault Tolerance • With 200 worker processes killed mid-run, the job takes 933 s versus 891 s for normal execution – a 5% increase in time • Quick failure recovery
MapReduce at Google • Clustering for Google News and Google Product Search • Google Maps: locating addresses, rendering map tiles • Google PageRank • Localized search
Current Trends – Hadoop MapReduce • Apache Hadoop MapReduce • Hadoop Distributed File System (HDFS) • Used by Yahoo! Search, Facebook, Amazon, and Twitter
Current Trends – Hadoop MapReduce • Higher-level languages/systems based on Hadoop • Amazon Elastic MapReduce: available to the general public, processes data in the cloud • Pig and Hive
Conclusion • A large variety of problems can be expressed as Map & Reduce computations • The restricted programming model makes it easy to hide the details of distributed computing • Achieves both scalability and programming efficiency