This document provides an overview of MapReduce, a framework that simplifies the processing of large data sets across clusters of computers. It covers essential concepts, including data locality, task granularity, and the use of backup tasks to optimize performance and minimize total computation time. It also highlights Google's implementation of MapReduce, in which computations are scheduled and executed efficiently without requiring the programmer to know anything about distributed systems.
Jonathan Light MapReduce: Simplified Data Processing on Large Clusters
Contents • Abstract • Introduction • Locality • Task Granularity • Backup Tasks • Questions
Abstract: MapReduce • MapReduce processes and generates large data sets • Programs written for MapReduce are automatically parallelized and run on a cluster of computers • The programmer does not need to know how to program distributed systems
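To make the programming model concrete, here is a minimal single-process sketch of the canonical word-count example. The names (map_fn, reduce_fn, run_mapreduce) are illustrative only; in the real framework the map and reduce calls are distributed across the cluster by the runtime rather than run in one loop.

```python
# Minimal single-process sketch of the MapReduce programming model,
# illustrated with word count. Names are illustrative, not from the slides.
from collections import defaultdict

def map_fn(doc_name, contents):
    # Map: emit an intermediate (word, 1) pair for every word in the document.
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: sum all counts emitted for the same word.
    return word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle: group intermediate values by key before the reduce phase.
    groups = defaultdict(list)
    for name, contents in inputs:
        for key, value in map_fn(name, contents):
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]

if __name__ == "__main__":
    docs = [("a.txt", "the quick brown fox"), ("b.txt", "the lazy dog")]
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```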
Abstract: Google’s Implementation • Google uses “a large cluster of commodity machines” • A single job can process many terabytes of data • Nearly 1,000 jobs are run every day
Introduction • Computations have been written to process documents, web logs, and other raw data • The computations are conceptually simple, but the data sets are large • Google developed an abstraction that lets users run distributed computations without knowing anything about distributed computing
Locality • Because network bandwidth is relatively scarce, input data is stored on the local disks of the machines in the cluster • Files are split into 64MB blocks, and each block is replicated on (typically three) different machines
Locality continued • The master tries to schedule a task on a machine that holds a copy of that task's input data • Otherwise, it schedules the task on a machine close to one holding the data • It is therefore possible for a job to consume little or no network bandwidth (see the sketch below)
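A rough sketch of what this locality-aware placement could look like, assuming each 64MB block records which machines hold its replicas. The Block and pick_worker names and the rack-neighbor map are assumptions for illustration, not Google's actual scheduler.

```python
# Hedged sketch of locality-aware scheduling: prefer an idle worker that
# already holds a replica of the input block, then a "nearby" worker.
from dataclasses import dataclass

@dataclass
class Block:
    block_id: int
    replicas: set  # machines holding a copy of this 64MB block

def pick_worker(block, idle_workers, neighbors):
    # 1) Best case: an idle worker already stores a replica -> no network I/O.
    local = idle_workers & block.replicas
    if local:
        return local.pop()
    # 2) Otherwise: an idle worker close to a replica (e.g., on the same rack).
    nearby = {w for w in idle_workers if neighbors.get(w, set()) & block.replicas}
    if nearby:
        return nearby.pop()
    # 3) Fall back to any idle worker, accepting the network transfer.
    return idle_workers.pop() if idle_workers else None
```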
Task Granularity • The map phase is split into M pieces and the reduce phase into R pieces • M and R should be much larger than the number of worker machines • This improves dynamic load balancing and speeds up recovery when a worker fails
Task Granularity continued • R is usually constrained by users, since each reduce task writes its own output file • M is chosen so that each map task processes roughly 16MB to 64MB of input data • Google often runs with M = 200,000 map pieces, R = 5,000 reduce pieces, and 2,000 worker machines
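A back-of-the-envelope sketch of how M and R might be derived from these constraints. The helper name and the reduce-tasks-per-worker multiple are assumptions for illustration; only the 64MB split size and the M = 200,000 / R = 5,000 / 2,000-worker figures come from the slides.

```python
import math

def choose_granularity(total_input_bytes, split_bytes=64 * 2**20,
                       workers=2000, reduce_tasks_per_worker=2.5):
    # M: one map task per input split of roughly the GFS block size (64MB).
    m = math.ceil(total_input_bytes / split_bytes)
    # R: a small multiple of the worker count (illustrative heuristic),
    # kept modest because each reduce task produces its own output file.
    r = int(workers * reduce_tasks_per_worker)
    return m, r

# Roughly 12.5TB of input with 2,000 workers gives about M = 200,000
# map tasks and R = 5,000 reduce tasks, matching the figures above.
print(choose_granularity(200_000 * 64 * 2**20))
```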
Backup Tasks • “Straggler” machines can greatly increase total computation time • Stragglers can arise for many reasons, such as a bad disk or competition for CPU, memory, or network bandwidth • Stragglers can be alleviated with backup tasks
Backup Tasks continued • When the operation is close to completion, the master schedules backup executions of the remaining in-progress tasks • A task is marked complete when either the primary or the backup execution finishes • The mechanism is tuned so that backup tasks add only a few percent of extra computation • An example sort task takes 44% longer when backup tasks are disabled
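A simplified sketch of the backup-task (speculative execution) idea described above. The 5% "close to completion" threshold and the function names are illustrative assumptions, not figures from the paper.

```python
# Hedged sketch of backup tasks: once only a few tasks remain in progress,
# schedule duplicate executions and accept whichever copy finishes first.

def maybe_schedule_backups(tasks, idle_workers, near_completion_fraction=0.05):
    """tasks: dict task_id -> state in {'pending', 'running', 'done'}."""
    remaining = [t for t, state in tasks.items() if state != "done"]
    if len(remaining) > near_completion_fraction * len(tasks):
        return []  # operation not close enough to completion yet
    # Pair each remaining task with an idle worker for a duplicate execution.
    return list(zip(remaining, idle_workers))

def on_task_finished(tasks, task_id):
    # Whichever copy (primary or backup) reports first marks the task complete;
    # later completions of the same task are simply ignored.
    if tasks.get(task_id) != "done":
        tasks[task_id] = "done"
        return True
    return False
```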
Recap • Computations are simple, but the data is large • Network utilization stays low thanks to data locality and locality-aware scheduling • Map and Reduce work is split into many small tasks • Straggler workers can arise, but backup tasks mitigate them