MC2: Map Concurrency Characterization for MapReduce on the Cloud

Mohammad Hammoud and Majd Sakr


Presentation Transcript


  1. MC2: Map Concurrency Characterization for MapReduce on the Cloud. Mohammad Hammoud and Majd Sakr

  2. Hadoop MapReduce
     • MapReduce is now a pervasive data processing framework on the cloud
     • Hadoop is an open source implementation of MapReduce
     • Hadoop MapReduce incorporates two phases, Map and Reduce, which encompass multiple Map and Reduce tasks
     [Figure: a dataset stored in HDFS is divided into splits (Split 0 to Split 3, one HDFS block each); each split feeds a Map task that emits partitions; Reduce tasks consume the partitions through the Shuffle, Merge, and Reduce stages and write their output to HDFS. The Map Phase and Reduce Phase are labeled.]
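The two-phase dataflow on this slide can be imitated in a few lines of plain Python. This is a toy, single-process sketch and not Hadoop code: `map_task`, `partition`, and `run_job` are hypothetical names, word counting is only an example workload, and the CRC32 partitioner is an assumption chosen so that results are deterministic.

```python
from collections import defaultdict
from zlib import crc32

def map_task(split):
    """Emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in split.split()]

def partition(key, num_reducers):
    """Pick the Reduce task for a key (deterministic hash partitioner)."""
    return crc32(key.encode("utf-8")) % num_reducers

def run_job(splits, num_reducers):
    # Map phase: one Map task per split; output is bucketed per Reduce task.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for key, value in map_task(split):
            partitions[partition(key, num_reducers)][key].append(value)
    # Reduce phase: each Reduce task sums the values grouped under each key.
    result = {}
    for bucket in partitions:
        for key, values in bucket.items():
            result[key] = sum(values)
    return result

counts = run_job(["a b a", "b c", "a"], num_reducers=2)
```

Each string in the input list plays the role of one split, so the example runs three Map tasks and two Reduce tasks.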

  3. How to Effectively Configure Hadoop?
     • Hadoop has more than 190 configuration parameters
     • 10-20 parameters can have a significant impact on job performance
     • A main challenge that faces Hadoop users on the cloud:
        • Running MapReduce applications in the most economical way
        • While still achieving good performance
     • The burden falls on Hadoop users to effectively configure Hadoop
     • Hadoop’s default configuration is not necessarily optimal
     • Several-fold speedup/slowdown between tuned and default Hadoop

  4. Map Tasks and Map Concurrency
     • Among the influential configuration parameters in Hadoop are:
        • Number of Map Tasks
           • Determined by the number of HDFS blocks
        • Number of Map Slots
           • Allocated to run Map Tasks
     • Map Concurrency = Map Tasks / Map Slots
     [Figure: a core switch and two rack switches connect a JobTracker and five TaskTrackers (TaskTracker1 to TaskTracker5) that run Map tasks. Steps shown: "Request a Map Task" and "Schedule a Map Task at an Empty Map Slot on TaskTracker1".]
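The definition above can be worked through numerically. The sketch below is only illustrative: the function names are hypothetical, and the dataset size, block size, and slots per TaskTracker are assumed values that do not come from the slides (only the five TaskTrackers match the figure).

```python
def num_map_tasks(dataset_bytes, block_bytes):
    """One Map task per HDFS block; a partial last block still needs a task."""
    return (dataset_bytes + block_bytes - 1) // block_bytes

def map_concurrency(map_tasks, map_slots):
    """Map Concurrency = Map Tasks / Map Slots."""
    return map_tasks / map_slots

MB = 1024 * 1024
tasks = num_map_tasks(10 * 1024 * MB, 64 * MB)   # assumed 10 GB dataset, 64 MB blocks
slots = 5 * 2                                    # 5 TaskTrackers, assumed 2 Map slots each
concurrency = map_concurrency(tasks, slots)
```

With these assumed values the job has 160 Map tasks for 10 Map slots, which gives a Map concurrency of 16.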

  5. Impact of Map Concurrency
     [Figure: Tuned Hadoop versus Default Hadoop for WordCount-CD, Sobel, K-Means, and Sort.]
     • Observations:
        • Map concurrency has a strong impact on Hadoop performance
        • Hadoop’s default Map concurrency settings are not optimal
        • For effective execution, Hadoop might require different Map concurrencies for different applications

  6. Our Work
     • We propose MC2:
        • A simple, fast and static “utility” program
        • Which predicts the best Map Concurrency for any given MapReduce application
     • MC2 is based on a mathematical model which exploits two main MapReduce internal characteristics:
        • Map Setup Time (the total overhead for setting up all map tasks in a job)
        • Early Shuffle (the process of shuffling intermediate data while the Map phase is still running)

  7. Talk Roadmap
     • Characterizing Map Concurrency
        • Map Concurrency ≤ 1
        • Map Concurrency > 1
     • A Mathematical Model for Predicting Runtimes of MR jobs
     • The MC2 Predictor
     • Quantitative Evaluation
     • Concluding Remarks

  9. Concurrency with a Single Map Wave
     • The maximum number of concurrent Map tasks is bounded by the total number of Map slots in a Hadoop cluster
     • We refer to the maximum set of concurrent Map tasks as a Map Wave
     • MST = Map Setup Time
     [Figure: three schedules of Map tasks over six Map slots (MS1 to MS6), ending at times t/2 + 2MST, t + MST, and t/2 + MST.]
     • Fill as many Map slots as possible within a Map wave
     • More Parallelism & Better Utilization
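The end times quoted in the figure can be reproduced with a small timing model. The sketch assumes, beyond what the slide states, that the total Map work `total_work` (the slide's t) divides evenly among the tasks, that every task pays one MST, and that waves run back to back; the function names are hypothetical.

```python
import math

def map_waves(map_tasks, map_slots):
    """Number of waves needed to run all Map tasks on the available slots."""
    return math.ceil(map_tasks / map_slots)

def map_phase_end(total_work, map_tasks, map_slots, mst):
    """End time of the Map phase under the even-split assumption."""
    per_task_time = total_work / map_tasks
    return map_waves(map_tasks, map_slots) * (mst + per_task_time)

t, mst = 120, 5
one_task = map_phase_end(t, map_tasks=1, map_slots=6, mst=mst)    # t + MST
two_tasks = map_phase_end(t, map_tasks=2, map_slots=6, mst=mst)   # t/2 + MST
six_tasks = map_phase_end(t, map_tasks=6, map_slots=6, mst=mst)   # t/6 + MST
two_waves = map_phase_end(t, map_tasks=4, map_slots=2, mst=mst)   # t/2 + 2MST
```

Under these assumptions, filling all six slots in a single wave gives the earliest finish, while splitting the same work into a second wave adds another MST.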

  10. Concurrency with Multiple Map Waves
     • What are the tradeoffs as the number of Map waves is varied?
     [Figure: Map tasks on four Map slots (MS1 to MS4), followed by Shuffle & Merge and Reduce on two Reduce slots (RS1, RS2), shown in three panels:]
        • One Map Wave: Map Time = t; Map Setup Time = MST; (-) No Phase Overlap
        • Two Map Waves: Map Time = t = t/2 + t/2; (-) Map Setup Time = 2MST; (+) With Phase Overlap
        • Four Map Waves: Map Time = t = t/4 + t/4 + t/4 + t/4; (-) Map Setup Time = 4MST; (+) With More Phase Overlap
     • As the number of Map waves is increased:
        • (-) Map Setup Time increases: Cost
        • (+) Data shuffling starts earlier (i.e., an earlier Early Shuffle): Opportunity
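The cost and the opportunity can be put side by side with a short calculation. This sketch assumes that shuffling can begin only once the first Map wave has finished, which is consistent with the "No Phase Overlap" label for a single wave but is a simplification; `wave_tradeoff` is a hypothetical name.

```python
def wave_tradeoff(waves, t, mst):
    """Return (total setup time, Map phase end, window for overlapped shuffle)."""
    total_setup = waves * mst            # cost: one MST per wave
    map_end = total_setup + t            # Map Time stays t regardless of waves
    first_wave_end = mst + t / waves     # assumed earliest start of Early Shuffle
    overlap = map_end - first_wave_end   # opportunity: shuffle hidden under Map
    return total_setup, map_end, overlap

t, mst = 100, 5
one = wave_tradeoff(1, t, mst)
two = wave_tradeoff(2, t, mst)
four = wave_tradeoff(4, t, mst)
```

Going from one to four waves raises the setup cost from MST to 4MST, while the overlap window grows from zero to most of the Map phase.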

  11. When to Trigger Early Shuffle?
     • Early Shuffle can be activated earlier by increasing the number of Map Waves
     • The preferred point at which the early shuffle process should be activated varies across applications
     • The more data an application shuffles, the earlier the early shuffle process must be triggered
     • With larger shuffle data, a larger number of map waves is preferred
     • We devise a mathematical model that allows locating the best number of map waves for any given MR application

  13. A Mathematical Model (1)
     [Figure: job timeline over Map slots MS1 to MS4 and Reduce slots RS1 and RS2 (Shuffle & Merge, then Reduce). Runtime is divided into Total Map Setup Time (MST), Hidden Shuffle Time (HST), Exposed Shuffle Time (EST), and Reduce Time.]
     • Assumptions:
        • Map tasks start and finish at similar times
        • Time impact of speculative execution is masked
        • Ignore slow Mappers and Reducers
        • Map time is typically longer than Map Setup Time

  14. A Mathematical Model (2)
     [Figure: the same job timeline as on slide 13 (Total Map Setup Time, Hidden Shuffle Time, Exposed Shuffle Time, Reduce Time, Runtime), accompanied by equations (1) to (4); the equation bodies are not preserved in this transcript.]

  16. MC2: Map Concurrency Characterization
     • Our mathematical model can be utilized to predict the best number of map waves for any given MR application
        • Fix all the model’s factors except the “Number of Map Waves”
        • Measure Runtime for a range of map wave numbers
        • Select the minimum Runtime
     [Figure: MC2 takes Shuffle Data, Shuffle Rate, Single Map Wave Time, MST, Reduce Time, and Initial Map Slots Number as inputs; computes Total MST, HST, EST, and Runtime; and outputs the Sweet Spot.]
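The sweep described on this slide can be sketched as follows. The formulas inside `predict_runtime` are assumptions made for illustration and should not be read as the authors' equations, which this transcript does not preserve: the sketch assumes one MST per wave, a constant Map time, early shuffle starting when the first wave ends, and a constant shuffle rate. All function and parameter names are hypothetical.

```python
def predict_runtime(waves, mst, single_wave_time, shuffle_data, shuffle_rate, reduce_time):
    """Estimate job runtime for a given number of map waves (assumed model)."""
    total_mst = waves * mst                             # setup cost grows with waves
    map_end = total_mst + single_wave_time              # map work itself stays constant
    first_wave_end = mst + single_wave_time / waves     # assumed start of early shuffle
    shuffle_time = shuffle_data / shuffle_rate
    hst = min(shuffle_time, map_end - first_wave_end)   # shuffle hidden under the Map phase
    est = shuffle_time - hst                            # shuffle exposed after the Map phase
    return total_mst + single_wave_time + est + reduce_time

def best_map_waves(max_waves, **factors):
    """Sweep 1..max_waves and return the wave count with the smallest estimate."""
    return min(range(1, max_waves + 1), key=lambda w: predict_runtime(w, **factors))

# Fix every factor except the number of map waves (illustrative values).
factors = dict(mst=5, single_wave_time=100, shuffle_rate=10, reduce_time=50)
heavy_shuffle = best_map_waves(8, shuffle_data=800, **factors)
no_shuffle = best_map_waves(8, shuffle_data=0, **factors)
```

With these illustrative numbers the shuffle-heavy job prefers more than one wave, while the job with no shuffle data prefers a single wave, matching the qualitative trend stated earlier in the talk. The sketch sweeps the wave count directly and does not use the Initial Map Slots Number input.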

  18. Quantitative Methodology
     • We evaluate MC2 on:
        • A private cloud with 14 machines
        • Amazon EC2 with 20 large instances
     • We use Apache Hadoop 0.20.2
     • We use various benchmarks with different dataset sizes

  19. Results: WordCount-CE

  20. Results: K-Means

  21. Results: Sort

  22. Results: WordCount-CD

  23. Results: Sobel

  24. MC2 Results: Summary
     • MC2 correctly predicts the best numbers of map waves for WordCount-CE, K-Means, Sort, WordCount-CD and Sobel on our private cloud and on Amazon EC2
     • Even when a misprediction occurs, the predicted sweet spot is typically very close to the observed minimum
     [Figure: Runtime speedups provided by MC2 versus default Hadoop.]

  26. Concluding Remarks
     • We observed a strong dependency between map concurrency and MapReduce performance
     • We found that a good map concurrency configuration can be determined by leveraging two main MapReduce characteristics: data shuffling and map setup time (MST)
     • We developed a mathematical model that exploits data shuffling and MST, and built MC2, which uses the model to predict the best map concurrency for any given MR application
     • MC2 works successfully on a private cloud and on Amazon EC2

  27. Thank You! Questions?