
MapReduce: Simplified Data Processing on Large Clusters


Presentation Transcript


  1. MapReduce: Simplified Data Processing on Large Clusters. Department of Computer Science, 김정수

  2. Table of Contents 1. MapReduce 2. Implementation 3. Refinements 4. Performance 5. Experience 6. Conclusions References

  3. 1. MapReduce 1) What is MapReduce?

  4. 1. MapReduce 1) What is MapReduce?
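
The two slides above pose the question without a textual answer, so the programming model can be sketched with the classic word-count example. This is a minimal single-process illustration, not the distributed implementation: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and the driver simply groups intermediate values by key in memory.

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    """Emit (word, 1) for every word in the document."""
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    """Sum all counts emitted for one word."""
    return sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy driver: run map, group values by key, then run reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Reduce is called once per distinct key, in sorted key order.
    return {k: reduce_fn(k, groups[k]) for k in sorted(groups)}

docs = [("a.txt", "the quick fox"), ("b.txt", "the lazy dog the end")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
```

The user writes only `map_fn` and `reduce_fn`; in a real deployment the grouping step is the part the library distributes across machines.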

  5. 1. MapReduce 2) Why is MapReduce needed? - Parallelizing the computation - Distributing the data - Handling failures, which otherwise requires complex code - Running large-scale computations efficiently on large clusters

  6. 2. Implementation 1) Execution Overview 2) Master Data Structures 3) Fault Tolerance - Worker Failure - Master Failure - Semantics in the Presence of Failures 4) Locality 5) Task Granularity 6) Backup Tasks
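
The "Master Data Structures" and "Worker Failure" items can be illustrated with a small sketch. It assumes the behavior described in the original MapReduce paper: the master tracks each task as idle, in-progress, or completed, and on a worker failure it resets that worker's in-progress tasks and its completed map tasks to idle (map output lives on the failed machine's local disk), while completed reduce tasks are left alone. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"

@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: State = State.IDLE
    worker: Optional[str] = None   # worker identity for non-idle tasks

def handle_worker_failure(tasks, failed_worker):
    """Reset the tasks that must be re-executed after a worker dies."""
    for t in tasks:
        if t.worker != failed_worker:
            continue
        lost_map_output = t.kind == "map" and t.state is State.COMPLETED
        if t.state is State.IN_PROGRESS or lost_map_output:
            t.state, t.worker = State.IDLE, None

tasks = [
    Task("map", State.COMPLETED, "w1"),
    Task("map", State.IN_PROGRESS, "w2"),
    Task("reduce", State.COMPLETED, "w1"),
    Task("reduce", State.IN_PROGRESS, "w1"),
]
handle_worker_failure(tasks, "w1")
```

After the call, the two tasks that depended on `w1` and cannot be recovered are idle again and eligible for rescheduling; the completed reduce task keeps its state.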

  7. 3. Refinements 1) Partitioning Function 2) Ordering Guarantees 3) Combiner Function 4) Input and Output Types 5) Side Effects 6) Skipping Bad Records 7) Local Execution 8) Status Information 9) Counters
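
Two of the refinements listed above, the partitioning function and the combiner function, can be sketched as follows. The sketch assumes a default partitioner of the form hash(key) mod R and a word-count style combiner that pre-sums counts on the map side. `partition` and `combine` are hypothetical names, and `zlib.crc32` stands in for the hash only because Python's built-in `hash()` of strings differs between processes.

```python
import zlib
from collections import Counter

def partition(key: str, num_reduce_tasks: int) -> int:
    """Assign a key to a reduce task using hash(key) mod R."""
    # crc32 gives the same value in every process, unlike hash() on str.
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks

def combine(pairs):
    """Combiner: partially sum (word, count) pairs before they are sent."""
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return sorted(partial.items())

map_output = [("the", 1), ("fox", 1), ("the", 1), ("the", 1)]
combined = combine(map_output)                      # fewer pairs cross the network
buckets = {word: partition(word, 4) for word, _ in combined}
```

The combiner is only valid here because addition is commutative and associative; the partitioner guarantees that every occurrence of a key reaches the same reduce task.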

  8. 4. Performance 1) Grep - Searches for a three-character pattern - The pattern occurs in 92,337 records - Input split into 64 MB pieces - M = 15000, R = 1
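
A grep job of the kind benchmarked above can be expressed as a map function that emits matching lines and an identity reduce function. The following is a minimal sketch under those assumptions; `grep_map`, `grep_reduce`, and the sample input are hypothetical and not taken from the slides.

```python
import re

def grep_map(filename, contents, pattern):
    """Emit each line of the input that matches the pattern."""
    regex = re.compile(pattern)
    for line in contents.splitlines():
        if regex.search(line):
            yield filename, line

def grep_reduce(key, lines):
    """Identity reduce: copy the matching lines to the output unchanged."""
    return list(lines)

sample = "abc one\nno match here\nxxabcxx two"
matches = [line for _, line in grep_map("in.txt", sample, "abc")]
grep_output = grep_reduce("in.txt", matches)
```

Because the reduce step does no real work, a single reduce task (R = 1) is enough to collect the output.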

  9. 4. Performance 2) Sort - Fewer than 50 lines of user code - Approximately 1 TB of data - Input split into 64 MB pieces - M = 15000, R = 4000 - Top graph shows the rate at which input is read. - Middle graph shows the rate at which data is sent over the network to the reduce tasks. - Bottom graph shows the rate at which sorted data is written to the final output files.
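
The figures on this slide are mutually consistent, which a quick calculation shows. The sketch assumes "64 MB" means 64 MiB (2^20 bytes each) and "1 TB" means 10^12 bytes; with decimal megabytes the total would be 0.96 TB instead, which is still approximately 1 TB.

```python
MIB = 1024 ** 2                 # assumption: 64 MB is read as 64 MiB
split_bytes = 64 * MIB          # size of one input split
num_map_tasks = 15000           # M, one map task per split
total_bytes = num_map_tasks * split_bytes
total_tb = total_bytes / 10 ** 12
```

Here `total_bytes` is 1,006,632,960,000, so 15000 splits of 64 MB cover roughly 1 TB of input.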

  10. 4. Performance 3) Effect of Backup Tasks - With backup tasks disabled, 5 stragglers remain after almost all other tasks have finished. - The run took 1283 seconds. - This is a 44% increase in elapsed time.
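
The two numbers on this slide imply the duration of the normal run with backup tasks enabled. The calculation below is a sketch derived only from the slide's figures; the variable names are hypothetical.

```python
elapsed_no_backup = 1283        # seconds, from the slide
increase = 0.44                 # 44% longer than the normal run
implied_baseline = elapsed_no_backup / (1 + increase)
extra_seconds = elapsed_no_backup - implied_baseline
```

The implied baseline is about 891 seconds, so the handful of stragglers cost roughly 392 extra seconds.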

  11. 4. Performance 4) Machine Failures - 200 worker processes were killed. - The negative input rate in the top graph marks completed map work that was lost and had to be re-executed. - Total execution time was only 5% higher than in the normal run.

  12. 5. Experience 1) Benefits of using the MapReduce system - Source code is simpler because MapReduce hides fault tolerance, distribution, and parallelization. - MapReduce makes it easy to change the indexing process. - MapReduce handles many problems (machine failures, slow machines, etc.) automatically.

  13. 6. Conclusions 1) MapReduce can be used by programmers even if they have no experience with parallel and distributed systems. 2) A large variety of problems are easily expressible as MapReduce computations. 3) We have developed an implementation of MapReduce that scales to large clusters comprising thousands of machines.

  14. References 1. http://www.prapps.net/, Mazdah's personal blog, Bigdata section

  15. Thank you
