MapReduce: Simplified Data Processing on Large Clusters

MapReduce: Simplified Data Processing on Large Clusters
Department of Computer Science, Kim Jeong-su (김정수)

Contents
1. MapReduce
2. Implementation
3. Refinements
4. Performance
5. Experience
6. Conclusions
References

1. MapReduce 1) What is MapReduce?
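As a rough illustration of the programming model this slide asks about, the sketch below counts words with a user-supplied map function and reduce function. It is a minimal single-process stand-in, not the distributed implementation: the names map_fn, reduce_fn, and run_mapreduce and the sample documents are assumptions made for this example.

```python
from collections import defaultdict

def map_fn(doc_name, text):
    """Map: emit an intermediate (word, 1) pair for every word in the document."""
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: merge all values that share the same key into one total."""
    return word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Keys are processed in sorted order, one reduce call per distinct key.
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))

# Hypothetical input: (document name, document contents) pairs.
docs = [("a.txt", "to be or not to be"), ("b.txt", "to do")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
print(word_counts)
```

The user writes only map_fn and reduce_fn; grouping values by key is the part a real MapReduce system performs across machines.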

1. MapReduce 2) Why is MapReduce needed?
- Parallelizing the computation
- Distributing the data
- Handling failures, which otherwise requires complex code
- Running large-scale computations efficiently on large clusters

2. Implementation
1) Execution Overview
2) Master Data Structures
3) Fault Tolerance
- Worker Failure
- Master Failure
- Semantics in the Presence of Failures
4) Locality
5) Task Granularity
6) Backup Tasks
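The master data structure and worker-failure items above can be sketched as follows. In the original MapReduce paper, the master keeps a state (idle, in-progress, or completed) and a worker identity for each task; when a worker fails, its in-progress tasks and its completed map tasks go back to idle, because map output lives on the failed machine's local disk, while completed reduce output is already in the global file system. The class and function names here are assumptions for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"

@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: State = State.IDLE
    worker: Optional[str] = None   # identity of the assigned worker machine

def handle_worker_failure(tasks, failed_worker):
    """Reset the failed worker's tasks so the master can reschedule them."""
    for task in tasks:
        if task.worker != failed_worker:
            continue
        # Completed map output sat on the failed machine's local disk.
        lost_map_output = task.kind == "map" and task.state is State.COMPLETED
        if task.state is State.IN_PROGRESS or lost_map_output:
            task.state = State.IDLE
            task.worker = None

# Hypothetical task table with two workers, "w1" and "w2".
tasks = [
    Task("map", State.COMPLETED, "w1"),
    Task("reduce", State.COMPLETED, "w1"),
    Task("reduce", State.IN_PROGRESS, "w1"),
    Task("map", State.IN_PROGRESS, "w2"),
]
handle_worker_failure(tasks, "w1")
states = [t.state for t in tasks]
```

After the call, only the completed reduce task keeps its state and its worker; the other two tasks of w1 are idle again, and w2 is untouched.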

3. Refinements
1) Partitioning Function
2) Ordering Guarantees
3) Combiner Function
4) Input and Output Types
5) Side-effects
6) Skipping Bad Records
7) Local Execution
8) Status Information
9) Counters
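Two of the refinements listed above are easy to show in a few lines. The original paper describes the default partitioning function as hash(key) mod R, and the combiner as a partial merge of map output on the map worker before it is sent over the network. The sketch below assumes word-count style (word, count) pairs; the choice of zlib.crc32 as the hash and the value R = 4 are assumptions for this example, not details from the slides.

```python
import zlib
from collections import Counter

R = 4  # number of reduce tasks (hypothetical value)

def partition(key, num_reduce_tasks=R):
    """hash(key) mod R, using a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks

def combine(pairs):
    """Combiner: partially merge (word, count) pairs on the map worker."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

# Hypothetical output of one map task before it crosses the network.
map_output = [("the", 1), ("cat", 1), ("the", 1), ("the", 1)]
combined = combine(map_output)
# Each surviving key is routed to exactly one of the R reduce tasks.
buckets = {word: partition(word) for word, _ in combined}
```

The combiner shrinks four pairs to two here, and the partition function guarantees that every occurrence of a key ends up at the same reduce task.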

4. Performance 1) Grep
- Searches for a rare three-character pattern
- The pattern occurs in 92,337 records
- Input split into pieces of approximately 64 MB
- M = 15000, R = 1

4. Performance 2) Sort
- Fewer than 50 lines of user code
- Approximately 1 terabyte of data
- Input split into pieces of approximately 64 MB
- M = 15000, R = 4000
- The top graph shows the rate at which input is read.
- The middle graph shows the rate at which data is sent over the network to the reduce tasks.
- The bottom graph shows the rate at which sorted data is written to the final output files.
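The value M = 15000 follows roughly from the input size and the 64 MB split size. The quick check below assumes the input described in the original paper, 10^10 records of 100 bytes each (a figure that does not appear on the slides), and treats 64 MB as 64 * 2^20 bytes.

```python
import math

RECORDS = 10**10           # number of input records (from the paper, not the slides)
RECORD_BYTES = 100         # bytes per record (from the paper, not the slides)
SPLIT_BYTES = 64 * 2**20   # one 64 MB input split, assuming binary megabytes

total_bytes = RECORDS * RECORD_BYTES               # about 1 terabyte
num_splits = math.ceil(total_bytes / SPLIT_BYTES)  # one map task per split
print(total_bytes, num_splits)
```

Under these assumptions the input needs a little under 15,000 splits, which is consistent with the rounded M = 15000 on the slide.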

4. Performance 3) Effect of Backup Tasks
- With backup tasks disabled, 5 straggler reduce tasks remain after almost all other tasks have finished.
- The whole computation took 1283 seconds.
- That is a 44% increase in elapsed time over the normal run.

4. Performance 4) Machine Failures
- 200 worker processes were killed during the computation.
- The negative input rate in the top graph reflects completed map work that was lost and had to be re-executed.
- Total execution time was only 5% higher than in the normal run.
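The 44% and 5% figures on the last two slides can be checked against the elapsed times. The 1283 seconds comes from the slides; the 891 seconds for the normal sort run and the 933 seconds for the run with killed workers are taken from the original paper and are not stated on the slides. The function name is an assumption for this example.

```python
NORMAL_SECONDS = 891       # normal sort run (from the paper, not the slides)
NO_BACKUP_SECONDS = 1283   # sort with backup tasks disabled (from the slides)
FAILURES_SECONDS = 933     # sort with 200 workers killed (from the paper)

def overhead_percent(elapsed, baseline=NORMAL_SECONDS):
    """Extra elapsed time relative to the baseline, as a rounded percentage."""
    return round(100 * (elapsed - baseline) / baseline)

no_backup_overhead = overhead_percent(NO_BACKUP_SECONDS)
failure_overhead = overhead_percent(FAILURES_SECONDS)
print(no_backup_overhead, failure_overhead)
```

Both percentages are measured against the same normal run, so they are directly comparable: losing backup tasks costs far more time than losing 200 workers.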

5. Experience 1) Benefits of using the MapReduce system
- Source code is simpler because MapReduce hides fault tolerance, distribution, and parallelization.
- The indexing process is easy to change.
- MapReduce handles many problems automatically (machine failures, slow machines, etc.).

6. Conclusions
1) MapReduce can be used by programmers even if they have no experience with parallel and distributed systems.
2) A large variety of problems are easily expressible as MapReduce computations.
3) The paper's authors developed an implementation of MapReduce that scales to large clusters comprising thousands of machines.

References
1. http://www.prapps.net/, Mazdah's personal blog, Bigdata section

Thank you
