
Incoop: MapReduce for Incremental Computation



Presentation Transcript


1. Incoop: MapReduce for Incremental Computation
Pramod Bhatotia, Alexander Wieder, Rodrigo Rodrigues, Umut A. Acar, Rafael Pasquini
Max Planck Institute for Software Systems (MPI-SWS)
ACM SOCC 2011

2. Large-scale data processing
• Need to process growing large data-sets
• Use of distributed and data-parallel computing
• MapReduce: de-facto data processing paradigm
  • Simple, yet powerful programming model
  • Widely adopted to support various services

3. Incremental data processing
• Applications repeatedly process evolving data-sets
  • For search, PageRank is re-computed for every new crawl
• Online data-sets evolve slowly
  • Successive input data-sets change by 0.1% to 10%
• Need for incremental computations
  • Instead of re-computing from scratch

4. Incremental data processing
• Recent proposals for incremental processing
  • “MapReduce (…) cannot process small updates individually as they rely on creating large batches for efficiency.”
  • Google Percolator [OSDI’10]
  • CBP (Yahoo!/UCSD) [SOCC’10]
• Drawbacks of these systems
  • Adopt a new programming model
  • Require implementation of dynamic algorithms

5. Goals
• Retain the simplicity of bulk data processing systems
• Achieve the efficiency of incremental processing
• Can we meet these goals for MapReduce?
  • Take an unmodified MapReduce application
  • Automatically adapt it to handle incremental input changes

6. Incoop: Incremental MapReduce
• Incremental bulk data processing
  • Transparent
  • Efficient
• Inspired by algorithms/PL research
  • Provable asymptotic gains
• Efficient implementation based on Hadoop

7. Outline
• Motivation
• Incoop design
• Challenges faced
• Evaluation

8. Background: MapReduce
• Example input: a a b a b c c b
• Map(input) { foreach word in input: output (word, 1); }
• Reduce(key, list(v)) { print key + SUM(v); }
• Map emits (a,1), (a,1), (a,1), (b,1), … for the example; Reduce receives (a, <1,1,1>) → a = 3 in the output
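To make the pseudocode concrete, here is a minimal word-count job written against the standard Hadoop MapReduce Java API (class and field names are illustrative); an unmodified application of exactly this shape is what Incoop takes as input.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum all counts received for a key and emit the total.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}
```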

9. Basic design
• Basic principle: “self-adjusting computation”
  • Break the computation into sub-computations
  • Memoize the results of sub-computations
  • Track dependencies between input and computation
  • Re-compute only the parts affected by changes
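The following is a deliberately tiny sketch, not Incoop's actual code, of the memoization half of this principle: results of sub-computations are cached under a key derived from their input, so an unchanged sub-computation is looked up instead of re-executed. The class and method names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Toy memoization table: keys would be content hashes of a task's input,
// values the task's (persisted) output.
public class MemoCache<K, V> {
  private final Map<K, V> results = new ConcurrentHashMap<>();

  // Re-run the task only if no result is memoized for this input key;
  // otherwise reuse the stored result.
  public V runOrReuse(K inputKey, Function<K, V> task) {
    return results.computeIfAbsent(inputKey, task);
  }
}
```

Tracking dependencies and propagating changes through the dependence graph, the other half of the principle, is what the rest of the talk addresses.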

10. Basic design
• Changes propagate through the dependence graph
• [Figure: dependence graph — read input → Map tasks → Reduce tasks → write output]

11. Outline
• Motivation
• Incoop design
• Challenges faced
• Evaluation

12. Challenges
• Stability: how to efficiently handle insertions/deletions in the input?
• Granularity: how to perform fine-grained updates to the output?
• Scheduling: how to minimize data movement?

13. Challenge 1: Stability
• Stability: small changes in the input lead to small changes in the dependence graph
• Stable dependence graph → efficient change propagation
• Is the basic approach stable?

14. Challenge 1: Stability
• [Figure: dependence graph — read input → Map tasks → Reduce tasks → write output]

15. Challenge 1: Stability
• [Figure: dependence graph — read input → Map tasks → Reduce tasks → write output (continued)]

16. Challenge 1: Stability
• Solution: content-based chunking
  • Avoid partitioning at fixed offsets
  • Instead, use a property of the input contents
• Example: assuming a chunk boundary after each “b”: a a b | a b | c c b | a

17. Challenge 1: Stability
• Incremental HDFS
  • Upon file write, compute the Rabin fingerprint of a sliding window over the contents
  • Fingerprint matches a pattern → chunk boundary
• Content-based chunking addresses stability
• The probability of finding the pattern controls granularity
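As a rough illustration of the mechanism, not the Incremental HDFS implementation itself, the sketch below scans a buffer with a rolling hash over a fixed-size window and places a chunk boundary whenever the fingerprint's low-order bits match a pattern. The paper uses Rabin fingerprints; the polynomial hash, window size, and mask here are simplified, illustrative choices, and details such as minimum/maximum chunk sizes are omitted.

```java
import java.util.ArrayList;
import java.util.List;

public class ContentChunker {
  private static final int WINDOW = 48;            // sliding-window size in bytes
  private static final long MASK = (1L << 13) - 1; // ~8 KiB expected chunk size
  private static final long PRIME = 1000003L;      // base of the rolling polynomial hash

  // Returns the offsets (exclusive) at which chunk boundaries are placed.
  public static List<Integer> boundaries(byte[] data) {
    List<Integer> cuts = new ArrayList<>();
    long pow = 1;                                   // PRIME^(WINDOW-1): weight of the oldest byte
    for (int i = 1; i < WINDOW; i++) {
      pow *= PRIME;
    }
    long hash = 0;
    for (int i = 0; i < data.length; i++) {
      hash = hash * PRIME + (data[i] & 0xff);       // slide the new byte in
      if (i >= WINDOW) {
        hash -= (data[i - WINDOW] & 0xff) * pow * PRIME; // slide the oldest byte out
      }
      // A boundary is declared when the low-order bits of the fingerprint match
      // the pattern (here: all zeros); the mask width controls chunk granularity.
      if (i >= WINDOW - 1 && (hash & MASK) == 0) {
        cuts.add(i + 1);
      }
    }
    return cuts;
  }
}
```

Because boundaries depend only on the bytes inside the window, an insertion or deletion shifts at most the chunks near the edit, which is exactly the stability property the slide describes.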

18. Challenge 2: Granularity
• Coarse-grained change propagation can be inefficient
  • Even for a small input change, large tasks need to be recomputed
• Not an issue for Map tasks
  • Incremental HDFS controls their granularity
• Difficult to address for reducers
  • A Reducer processes all values for a given key
  • Depends exclusively on the computation and the input

19. Challenge 2: Granularity
• [Figure: dependence graph — read input → Map tasks → Reduce task → write output]

20. Challenge 2: Granularity
• Leverage Combiners: pre-processing for Reduce
  • Co-located with the Map task
  • Preprocess Map outputs
  • Meant to reduce bandwidth

21. Background: Combiners
• Example input: a a b a b c c b
• Combine(set of <k,v>) { foreach distinct k: output(<k, SUM(v)>); }
• Combiners pre-aggregate Map outputs, e.g. emitting (a,2) and (a,1) instead of three separate (a,1) pairs; the final Reduce output is still a = 3
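In Hadoop, a combiner is just a Reducer subclass registered on the job; a minimal combiner matching the slide's pseudocode could look like this (class name illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Pre-aggregates map output locally, so fewer (word, count) pairs are shuffled.
public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int partial = 0;
    for (IntWritable v : values) {
      partial += v.get();
    }
    context.write(key, new IntWritable(partial));
  }
}
```

Registered with `job.setCombinerClass(WordCountCombiner.class)`, it runs on the map side; Incoop additionally runs combiners on the reduce side, as the next slide describes.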

22. Challenge 2: Granularity
• Contraction tree
  • Run Combiners at the Reducer site as well
  • Use them to break up the Reduce work
• [Figure: a tree of Combiners feeding the Reduce task]
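A language-level sketch of the contraction-tree idea, again illustrative rather than Incoop's code: apply an associative combine function level by level over fixed-size groups, so the reduce work becomes a tree of small sub-computations whose partial results can be memoized; a change in one leaf then only forces recombination along a single path to the root.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

public class ContractionTree {
  // Combines `values` level by level, `fanIn` items at a time, until one result remains.
  // Assumes a non-empty list and fanIn >= 2. Each element added to `next` corresponds to
  // a small combiner sub-task whose result would be memoized.
  public static <V> V contract(List<V> values, int fanIn, BinaryOperator<V> combine) {
    List<V> level = values;
    while (level.size() > 1) {
      List<V> next = new ArrayList<>();
      for (int i = 0; i < level.size(); i += fanIn) {
        V acc = level.get(i);
        int end = Math.min(i + fanIn, level.size());
        for (int j = i + 1; j < end; j++) {
          acc = combine.apply(acc, level.get(j));
        }
        next.add(acc);
      }
      level = next;
    }
    return level.get(0);
  }
}
```

For word count, `contract(counts, 32, Integer::sum)` computes the same total as a single monolithic reduce, but in small, memoizable pieces.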

23. Challenge 2: Granularity
• [Figure: dependence graph — read input → Map tasks → Reduce task → write output]

24. Challenge 2: Granularity
• [Figure: dependence graph — read input → Map tasks → contraction tree → Reduce task → write output]

25. Challenge 3: Scheduling
• The scheduler determines where to run each task
  • Based on input data locality and machine availability
• New variable for incremental computation
  • Location of previously computed, memoized results
• Memoization-aware scheduling
  • Prevents unnecessary data movement

26. Challenge 3: Scheduling
• Drawback of purely memoization-aware scheduling
  • Strict scheduling leads to a straggler effect
• New hybrid scheduling algorithm
  • Data locality to exploit memoized results
  • Flexibility to prevent the straggler effect
• More details are in the paper
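The sketch below captures the spirit of such a hybrid policy under hypothetical interfaces (Node, queueLength, and the imbalance threshold are assumptions): prefer the machine that already holds memoized results for a task, but fall back to the least-loaded machine when that preference would create a straggler. The actual algorithm and its parameters are described in the paper.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class HybridScheduler {
  // Illustrative threshold: how much longer the affinity node's queue may be
  // before locality is sacrificed to avoid creating a straggler.
  private static final int MAX_QUEUE_IMBALANCE = 3;

  interface Node {
    boolean hasMemoizedResultsFor(String taskId);
    int queueLength();
  }

  public Node pickNode(String taskId, List<Node> nodes) {
    Node leastLoaded = nodes.stream()
        .min(Comparator.comparingInt(Node::queueLength))
        .orElseThrow(() -> new IllegalArgumentException("no nodes available"));
    Optional<Node> affinity = nodes.stream()
        .filter(n -> n.hasMemoizedResultsFor(taskId))
        .min(Comparator.comparingInt(Node::queueLength));
    // Prefer the node holding memoized results unless it is significantly more
    // loaded than the least-loaded node; then trade locality for load balance.
    if (affinity.isPresent()
        && affinity.get().queueLength() - leastLoaded.queueLength() <= MAX_QUEUE_IMBALANCE) {
      return affinity.get();
    }
    return leastLoaded;
  }
}
```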

27. Summary
• Incoop enables incremental processing for MapReduce applications
• Incoop's design includes
  • Incremental HDFS for stable input partitioning
  • A contraction tree for fine-grained updates
  • A scheduler for memoized-data locality and straggler mitigation

28. Outline
• Motivation
• Incoop design
• Challenges faced
• Evaluation

29. Evaluating Incoop
• Goal: determine how Incoop works in practice
  • What are the performance benefits?
  • How effective are the optimizations?
  • What are the overheads?
• Methodology
  • Incoop implementation based on Hadoop
  • Applications (compute- and data-intensive)
  • Wikipedia and synthetic data-sets
  • Cluster of 20 machines

30. Performance gains
• For incremental changes
  • Speedups ranging from 1000x down to 1.5x for input changes of 0% to 25%
  • Compute-intensive applications perform better than data-intensive ones

31. Optimization: Contraction tree
• [Figures: (a) k-nearest-neighbor classifier, (b) co-occurrence matrix]

32. Optimization: Scheduler
• Run-time savings with the modified scheduler:
  • Around 30% less time for data-intensive applications
  • Around 15% less time for compute-intensive applications

33. Overhead: Performance
• Run-time overhead of up to 22% for the first run
• Incurred only once; subsequent runs are fast

34. Overhead: Storage
• Space usage of up to 9x the input size for memoization
• Garbage collection keeps storage consumption bounded

35. Case studies
• Implemented two use cases of incremental processing
  • Incremental log processing (Flume)
  • Continuous query processing (Pig)
• Both benefit transparently from Incoop
• Refer to the paper for details

36. Conclusions
• Incremental processing of MapReduce computations
  • Transparent
  • Efficient
• Good performance, with a modest overhead for the first run
