This document provides an overview of MapReduce, a framework that simplifies the processing of large data sets across clusters of computers. It covers essential concepts, including data locality, task granularity, and the use of backup tasks to optimize performance and minimize total computation time. It also highlights Google's implementation of MapReduce, in which computations are scheduled and executed efficiently without requiring the programmer to know anything about distributed systems.
Jonathan Light MapReduce: Simplified Data Processing on Large Clusters
Contents • Abstract • Introduction • Locality • Task Granularity • Backup Tasks • Questions
Abstract: MapReduce • MapReduce processes and generates large data sets • Programs written for MapReduce are automatically parallelized and run on a cluster of computers • The programmer does not need to know how to program distributed systems
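To make the programming model concrete, here is a minimal single-process sketch of the canonical word-count example. The names (map_fn, reduce_fn, run_mapreduce) are illustrative only; in the real framework the map and reduce calls are distributed across the cluster by the runtime rather than run in one loop.

```python
# Minimal single-process sketch of the MapReduce programming model,
# illustrated with word count. Names are illustrative, not from the slides.
from collections import defaultdict

def map_fn(doc_name, contents):
    # Map: emit an intermediate (word, 1) pair for every word in the document.
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: sum all counts emitted for the same word.
    return word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle: group intermediate values by key before the reduce phase.
    groups = defaultdict(list)
    for name, contents in inputs:
        for key, value in map_fn(name, contents):
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]

if __name__ == "__main__":
    docs = [("a.txt", "the quick brown fox"), ("b.txt", "the lazy dog")]
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```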
Abstract: Google’s Implementation • Google uses “a large cluster of commodity machines” • A single job can process many terabytes of data • Nearly 1,000 jobs are run every day
Introduction • Computations have been written to process documents, web logs, and other raw data • The computations are conceptually simple, but the data sets are large • Google developed an abstraction that lets users run distributed computations without knowing anything about distributed computing
Locality • Because network bandwidth is relatively scarce, input data is stored on the local disks of the machines in the cluster • Files are split into 64MB blocks, and each block is replicated on (typically three) different machines
Locality continued • The master tries to schedule a task on a machine that holds a copy of that task's input data • Otherwise, it schedules the task on a machine close to one holding the data • It is therefore possible for a job to consume little or no network bandwidth (see the sketch below)
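A rough sketch of what this locality-aware placement could look like, assuming each 64MB block records which machines hold its replicas. The Block and pick_worker names and the rack-neighbor map are assumptions for illustration, not Google's actual scheduler.

```python
# Hedged sketch of locality-aware scheduling: prefer an idle worker that
# already holds a replica of the input block, then a "nearby" worker.
from dataclasses import dataclass

@dataclass
class Block:
    block_id: int
    replicas: set  # machines holding a copy of this 64MB block

def pick_worker(block, idle_workers, neighbors):
    # 1) Best case: an idle worker already stores a replica -> no network I/O.
    local = idle_workers & block.replicas
    if local:
        return local.pop()
    # 2) Otherwise: an idle worker close to a replica (e.g., on the same rack).
    nearby = {w for w in idle_workers if neighbors.get(w, set()) & block.replicas}
    if nearby:
        return nearby.pop()
    # 3) Fall back to any idle worker, accepting the network transfer.
    return idle_workers.pop() if idle_workers else None
```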
Task Granularity • The map phase is split into M pieces and the reduce phase into R pieces • M and R should be much larger than the number of worker machines • This improves dynamic load balancing and speeds up recovery when a worker fails
Task Granularity continued • R is usually constrained by users, since each reduce task writes its own output file • M is chosen so that each map task processes roughly 16MB to 64MB of input data • Google often runs with M = 200,000 map pieces, R = 5,000 reduce pieces, and 2,000 worker machines
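A back-of-the-envelope sketch of how M and R might be derived from these constraints. The helper name and the reduce-tasks-per-worker multiple are assumptions for illustration; only the 64MB split size and the M = 200,000 / R = 5,000 / 2,000-worker figures come from the slides.

```python
import math

def choose_granularity(total_input_bytes, split_bytes=64 * 2**20,
                       workers=2000, reduce_tasks_per_worker=2.5):
    # M: one map task per input split of roughly the GFS block size (64MB).
    m = math.ceil(total_input_bytes / split_bytes)
    # R: a small multiple of the worker count (illustrative heuristic),
    # kept modest because each reduce task produces its own output file.
    r = int(workers * reduce_tasks_per_worker)
    return m, r

# Roughly 12.5TB of input with 2,000 workers gives about M = 200,000
# map tasks and R = 5,000 reduce tasks, matching the figures above.
print(choose_granularity(200_000 * 64 * 2**20))
```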
Backup Tasks • “Straggler” machines can greatly increase total computation time • Stragglers can arise for many reasons, such as a bad disk or competition for CPU, memory, or network bandwidth • Stragglers can be alleviated with backup tasks
Backup Tasks continued • When the operation is close to completion, the master schedules backup executions of the remaining in-progress tasks • A task is marked complete when either the primary or the backup execution finishes • The mechanism is tuned so that backup tasks add only a few percent of extra computation • An example sort task takes 44% longer when backup tasks are disabled
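A simplified sketch of the backup-task (speculative execution) idea described above. The 5% "close to completion" threshold and the function names are illustrative assumptions, not figures from the paper.

```python
# Hedged sketch of backup tasks: once only a few tasks remain in progress,
# schedule duplicate executions and accept whichever copy finishes first.

def maybe_schedule_backups(tasks, idle_workers, near_completion_fraction=0.05):
    """tasks: dict task_id -> state in {'pending', 'running', 'done'}."""
    remaining = [t for t, state in tasks.items() if state != "done"]
    if len(remaining) > near_completion_fraction * len(tasks):
        return []  # operation not close enough to completion yet
    # Pair each remaining task with an idle worker for a duplicate execution.
    return list(zip(remaining, idle_workers))

def on_task_finished(tasks, task_id):
    # Whichever copy (primary or backup) reports first marks the task complete;
    # later completions of the same task are simply ignored.
    if tasks.get(task_id) != "done":
        tasks[task_id] = "done"
        return True
    return False
```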
Recap • Computations are simple, but the data is large • Network utilization stays low thanks to data locality and locality-aware scheduling • Map and Reduce work is split into many small tasks • Straggler workers can arise, but backup tasks mitigate them