
Automatic Optimization of MapReduce Programs

Michael Cafarella, Eaman Jahani, Christopher Re. August 2011.


Presentation Transcript


  1. Automatic Optimization of MapReduce Programs • Michael Cafarella, Eaman Jahani, Christopher Re • August 2011

  2. MapReduce is victorious • Google statistics: • Hadoop statistics: 7 PB+ Vertica clusters vs. 22 PB+ Cloudera Hadoop clusters [1] • [1] Omer Trajman, Cloudera VP, http://www.dbms2.com/

  3. MapReduce in relational land • Designers' original intention: free-form data • Web-scale indexing/log processing • But many relational workloads [1] • Complex queries/data analysis • Caveat: MR performance lags RDBMS performance • [1] Karmasphere Corporation: A study of Hadoop developers, http://karmasphere.com, 2010

  4. Selection is Slower with MapReduce • Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009

  5. Join is Even Slower • Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009

  6. MR Lags in Relational Land • Stonebraker, DeWitt: "MapReduce has no indexes and therefore has only brute force as a processing option. It will be creamed whenever an index is the better access mechanism." [1] • Query processing tasks • No metadata, semantics, indices • Free-form input is a double-edged sword • [1] MapReduce: a major step backwards, http://databasecolumn.vertica.com/, 2008

  7. Manimal • Manimal is a hybrid system that combines the MapReduce programming model with well-known execution techniques • These techniques are found today only in RDBMSs, but should be in MapReduce, too.

  8. Manimal Approach • Components: MR Engine, Static Analyzer, Optimizer logic, Execution Framework • Example mapper: void map(Text key, WebPage w) { if (w.rank > 10) emit(w.url, w.rank); } • Challenges: • Safely detect query semantic optimizations • How much performance gain? • [Diagram labels: bytecode (*.class), optimization opportunities, execution path; result shown: SELECTION from B+Tree index on W.RANK]
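
To make the "SELECTION from B+Tree index on W.RANK" idea concrete, here is a minimal sketch that contrasts a brute-force scan with an index lookup for the predicate `w.rank > 10`. It is not Manimal's code: `java.util.TreeMap` stands in for a B+Tree, and the class `IndexedSelectionSketch`, its methods, and the sample URLs are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: a sorted in-memory map stands in for a B+Tree index on rank.
class IndexedSelectionSketch {
    static final class WebPage {
        final String url;
        final int rank;
        WebPage(String url, int rank) { this.url = url; this.rank = rank; }
    }

    // Baseline: evaluate the predicate on every record, as plain MapReduce would.
    static List<String> scan(List<WebPage> pages, int threshold) {
        List<String> out = new ArrayList<>();
        for (WebPage w : pages) {
            if (w.rank > threshold) out.add(w.url);
        }
        return out;
    }

    // Index build: group pages by rank in sorted key order.
    static TreeMap<Integer, List<WebPage>> buildIndex(List<WebPage> pages) {
        TreeMap<Integer, List<WebPage>> index = new TreeMap<>();
        for (WebPage w : pages) {
            index.computeIfAbsent(w.rank, r -> new ArrayList<>()).add(w);
        }
        return index;
    }

    // Indexed selection: visit only keys strictly greater than the threshold.
    static List<String> indexed(TreeMap<Integer, List<WebPage>> index, int threshold) {
        List<String> out = new ArrayList<>();
        for (List<WebPage> bucket : index.tailMap(threshold, false).values()) {
            for (WebPage w : bucket) out.add(w.url);
        }
        return out;
    }

    public static void main(String[] args) {
        List<WebPage> pages = List.of(
            new WebPage("a.example", 5),
            new WebPage("b.example", 11),
            new WebPage("c.example", 10),
            new WebPage("d.example", 42));
        System.out.println(scan(pages, 10));
        System.out.println(indexed(buildIndex(pages), 10));
    }
}
```

Both methods select the same pages; the indexed version returns them in rank order and never touches records at or below the threshold, which is where the saving would come from on large inputs.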

  9. Manimal Contributions • Our Manimal system: • Detects safe relational optimizations in users’ compiled MapReduce programs • Our results: • Runs with unmodified MapReduce code • Runs up to 11x faster on the same code • Provides a framework for more optimizations

  10. Outline • Introduction • Execution Framework • Optimization/Analyzer Examples • Experiments • Analyzer recall • Performance gain • Related Work and Conclusion

  11. Execution framework • public void map(Text key, WebPage w, OutputCollector<Text, LongWritable> out) { if (w.rank > 10) emit(w.url, w.rank); }

  12. Execution Framework • Stages: Analyzer, Optimizer, Execution • User program shown as bytecode: varload ‘value’, invokevirtual, astore ‘text’, …, ifeq, …

  13. Execution Framework • Stages: Analyzer, Optimizer, Execution • Analyzer in: user program (bytecode: varload ‘value’, invokevirtual, astore ‘text’, …, ifeq, …) • Analyzer out: optimization descriptor (SELECTf, w.rank>10) and index-generation program: void map(k, w) { out.set(indexedOutputFormat); emit(w.rank, (k,w)); }

  14. Execution Framework • Stages: Analyzer, Optimizer, Execution • Optimizer in: optimization descriptor (SELECTf, w.rank>10), catalog • Optimizer out: execution descriptor (SELECT, “log.1.idx”, w.rank>10)

  15. Execution Framework • Stages: Analyzer, Optimizer, Execution • Execution in: execution descriptor (SELECT, “log.1.idx”, w.rank>10), user program (bytecode: varload ‘value’, invokevirtual, astore ‘text’, …, ifeq, …) • Execution out: program output (shown as: numwords 19519)
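
The hand-off between the stages above can be sketched as follows. This is an illustration only, under assumptions not stated on the slides: the class and field names are hypothetical, the catalog is modeled as a map keyed by "inputFile:field", the input file name `log.1` is inferred from the index name `log.1.idx`, and a missing index is assumed to mean the job runs unoptimized.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the analyzer -> optimizer hand-off; not Manimal's actual classes.
class DescriptorPipelineSketch {
    // What the analyzer reports: an operation, the field it touches, and the predicate.
    static final class OptimizationDescriptor {
        final String op;
        final String field;
        final String predicate;
        OptimizationDescriptor(String op, String field, String predicate) {
            this.op = op; this.field = field; this.predicate = predicate;
        }
    }

    // What the execution stage consumes: the same operation bound to a concrete index file.
    static final class ExecutionDescriptor {
        final String op;
        final String indexFile;
        final String predicate;
        ExecutionDescriptor(String op, String indexFile, String predicate) {
            this.op = op; this.indexFile = indexFile; this.predicate = predicate;
        }
    }

    // Optimizer step: look up an index for (inputFile, field) in the catalog.
    // An empty result means no usable index exists (assumed fallback: run the job as-is).
    static Optional<ExecutionDescriptor> optimize(OptimizationDescriptor d,
                                                  String inputFile,
                                                  Map<String, String> catalog) {
        String indexFile = catalog.get(inputFile + ":" + d.field);
        if (indexFile == null) return Optional.empty();
        return Optional.of(new ExecutionDescriptor(d.op, indexFile, d.predicate));
    }

    public static void main(String[] args) {
        Map<String, String> catalog = new HashMap<>();
        catalog.put("log.1:rank", "log.1.idx");
        OptimizationDescriptor d = new OptimizationDescriptor("SELECT", "rank", "w.rank>10");
        optimize(d, "log.1", catalog).ifPresent(e ->
            System.out.println("(" + e.op + ", " + e.indexFile + ", " + e.predicate + ")"));
    }
}
```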

  16. Outline • Introduction • Execution Framework • Optimization/Analyzer Examples • Experiments • Analyzer recall • Performance gain • Related Work and Conclusion

  17. An Optimization Example • //webpage.java: SCHEMA! class WebPage { String url; int rank; String content; } • //mapper.java void map(Text key, WebPage w) { if (w.url.equals("teaparty.fr")) emit(w.url, 1); } • Data-centric programming idioms == relational ops • PROJECTED view: (url, null, null) • DIRECT-OP on compressed WebPage
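
A small sketch of why the projected view `(url, null, null)` is safe for this mapper: the map logic reads only `url`, so it behaves identically on the full record and on the projection, while the projection carries less data. The class `ProjectionSketch`, the helper names, and the crude size estimate are hypothetical and only illustrate the idea.

```java
// Hypothetical sketch: projection keeps only the fields the mapper actually reads.
class ProjectionSketch {
    static final class WebPage {
        final String url;
        final Integer rank;     // boxed so the projected view can hold null
        final String content;
        WebPage(String url, Integer rank, String content) {
            this.url = url; this.rank = rank; this.content = content;
        }
    }

    // Projected view matching the slide's (url, null, null).
    static WebPage projectUrlOnly(WebPage w) {
        return new WebPage(w.url, null, null);
    }

    // The mapper's condition from the slide: it depends on url and nothing else.
    static boolean mapperEmits(WebPage w) {
        return "teaparty.fr".equals(w.url);
    }

    // Rough payload estimate: characters in the strings plus 4 bytes for a present rank.
    static int approxSize(WebPage w) {
        int size = 0;
        if (w.url != null) size += w.url.length();
        if (w.rank != null) size += 4;
        if (w.content != null) size += w.content.length();
        return size;
    }

    public static void main(String[] args) {
        WebPage full = new WebPage("teaparty.fr", 12, "<html>a long page body</html>");
        WebPage view = projectUrlOnly(full);
        System.out.println(mapperEmits(full) == mapperEmits(view));
        System.out.println(approxSize(view) < approxSize(full));
    }
}
```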

  18. Semantic Extraction • Query semantics are obvious to human readers, but not explicit in the code for the framework • EXTRACT IT! • Static code analysis • Control-flow graph and data-flow graph • Find opportunities: selection, projection, direct op • Safe optimizations: same output

  19. Analyzer: An Example • //webpage.java class WebPage { String url; int rank; String content; } • //mapper.java map(Text key, WebPage w) { if (w.rank > 10) emit(w.url, w.rank); } • Analyzer view (control-flow graph nodes): Fn Entry, w.rank > 10, emit(url, rank), Fn Exit
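
One way to picture the selection check on such a control-flow graph is the sketch below. It assumes an acyclic graph whose edges are labeled with the branch condition that must hold to take them, and it reports the conditions that hold on every path from function entry to an emit node. This is a simplified stand-in for real bytecode analysis, not the analyzer described in the talk; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: find conditions that guard every emit in an acyclic CFG.
class SelectionDetectorSketch {
    static final class Edge {
        final String to;
        final String condition; // null for an unconditional edge
        Edge(String to, String condition) { this.to = to; this.condition = condition; }
    }

    final Map<String, List<Edge>> edges = new HashMap<>();
    final Set<String> emitNodes = new HashSet<>();

    void addEdge(String from, String to, String condition) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(new Edge(to, condition));
    }

    // Conditions true on all entry-to-emit paths; empty if none, or if no emit is reachable.
    Set<String> guardingConditions(String entry) {
        List<Set<String>> perPath = new ArrayList<>();
        walk(entry, new LinkedHashSet<>(), perPath);
        if (perPath.isEmpty()) return Collections.emptySet();
        Set<String> common = new LinkedHashSet<>(perPath.get(0));
        for (Set<String> held : perPath) common.retainAll(held);
        return common;
    }

    // Depth-first walk that records the conditions accumulated when an emit node is reached.
    private void walk(String node, Set<String> held, List<Set<String>> perPath) {
        if (emitNodes.contains(node)) perPath.add(new LinkedHashSet<>(held));
        for (Edge e : edges.getOrDefault(node, Collections.emptyList())) {
            Set<String> next = new LinkedHashSet<>(held);
            if (e.condition != null) next.add(e.condition);
            walk(e.to, next, perPath);
        }
    }

    public static void main(String[] args) {
        SelectionDetectorSketch cfg = new SelectionDetectorSketch();
        cfg.addEdge("entry", "branch", null);
        cfg.addEdge("branch", "emit", "w.rank > 10");
        cfg.addEdge("branch", "exit", "!(w.rank > 10)");
        cfg.addEdge("emit", "exit", null);
        cfg.emitNodes.add("emit");
        System.out.println(cfg.guardingConditions("entry"));
    }
}
```

A condition that survives the intersection is a candidate selection predicate. A real analyzer would additionally have to confirm that the condition depends only on the input record and has no side effects before treating the optimization as safe.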

  20. Current Optimizations • B+-Tree for selections • Projected views • Delta compression on numerics • Direct operation on compressed data • Hadoop compression is not semantics-aware
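
The last two optimizations can be illustrated with a short sketch. The first part shows delta compression of a numeric column (store the first value, then differences). The second part shows one possible form of direct operation on compressed data: with dictionary-coded strings, an equality test compares integer codes without decoding. The slides do not say which encoding is used for direct operation, so the dictionary scheme and all names here are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of delta compression and of an equality test on dictionary codes.
class CompressionSketch {
    // Delta encoding: the first element is kept, each later element stores the difference.
    static long[] deltaEncode(long[] values) {
        long[] out = new long[values.length];
        long prev = 0;
        for (int i = 0; i < values.length; i++) {
            out[i] = values[i] - prev;
            prev = values[i];
        }
        return out;
    }

    // Delta decoding: a running sum restores the original values.
    static long[] deltaDecode(long[] deltas) {
        long[] out = new long[deltas.length];
        long running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            out[i] = running;
        }
        return out;
    }

    // Dictionary coding: each distinct string gets a small integer code.
    static final class Dictionary {
        private final Map<String, Integer> codes = new HashMap<>();

        int encode(String s) {
            Integer code = codes.get(s);
            if (code == null) {
                code = codes.size();
                codes.put(s, code);
            }
            return code;
        }

        // Returns null when the string never occurred in the data.
        Integer codeOf(String s) {
            return codes.get(s);
        }
    }

    // Direct operation: count matches by comparing codes, never decoding the column.
    static int countEqual(int[] encodedColumn, Integer targetCode) {
        if (targetCode == null) return 0;
        int count = 0;
        for (int code : encodedColumn) {
            if (code == targetCode) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        long[] ranks = {100, 103, 103, 110};
        System.out.println(java.util.Arrays.toString(deltaEncode(ranks)));

        Dictionary dict = new Dictionary();
        String[] urls = {"teaparty.fr", "example.org", "teaparty.fr"};
        int[] encoded = new int[urls.length];
        for (int i = 0; i < urls.length; i++) encoded[i] = dict.encode(urls[i]);
        System.out.println(countEqual(encoded, dict.codeOf("teaparty.fr")));
    }
}
```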

  21. Outline • Introduction • Execution Framework • Optimization/Analyzer Examples • Experiments • Analyzer recall • Performance gain • Related Work and Conclusion

  22. Experiments: Analyzer • Tested MapReduce programs from Pavlo et al., SIGMOD ‘09 • Detected 5 out of 8 opportunities • Two misses due to a custom serialization class • Another miss requires knowledge of java.util.Hashtable semantics

  23. Experiments: Performance • Optimize four Web page handling tasks: • Selection (filtering) • Projection (aggregation on subfield of page) • Join (pages to user visits) • User Defined Functions (aggregation) • 5 cluster nodes, 123GB of data

  24. Experiments: Performance

  25. Experiments: Performance

  26. Experiments: Performance • Up to 11x speedup over original Hadoop • Performance comparable to DBMS-X from Pavlo • UDF not detected: running time identical

  27. Outline • Introduction • Execution Framework • Optimization/Analyzer Examples • Experiments • Analyzer recall • Performance gain • Related Work and Conclusion

  28. Related Work • Lots of recent MapReduce activity • Quincy: task scheduling (Isard et al., SOSP 2009) • HadoopDB (Abouzeid et al., PVLDB 2009) • Hadoop++ (Dittrich et al., PVLDB 2010) • HaLoop (Bu et al., PVLDB 2010) • Twister (Ekanayake et al., HPDC 2010) • Starfish (Herodotou et al., CIDR 2011) • Manimal does not introduce new optimizations. It detects and applies existing optimizations to code

  29. Lessons Learned • The Good: We can recognize data processing idioms in real code. Relational operations still exist even in the NoSQL world • The Ugly: When we started this project in 2009, we underestimated interest in writing in higher-level languages (e.g., Pig Latin)

  30. Conclusion • Manimal provides a framework for applying well-known optimization techniques to MapReduce • Automatic optimization of user code • Up to 11x speed increase • Provides a framework for more optimizations
