Automatic Optimization of MapReduce Programs: Michael Cafarella, Eaman Jahani, Christopher Re

Automatic Optimization of MapReduce Programs
Michael Cafarella, Eaman Jahani, Christopher Re
August 2011

MapReduce is victorious
• Google statistics:
• Hadoop statistics: 7 PB+ Vertica clusters vs. 22 PB+ Cloudera Hadoop clusters [1]
[1] Omer Trajman, Cloudera VP, http://www.dbms2.com/

MapReduce in relational land
• Designers' original intention: free-form data
• Web-scale indexing/log processing
• But many relational workloads [1]
• Complex queries/data analysis
• Caveat: MR performance lags RDBMS performance
[1] Karmasphere Corporation: A study of Hadoop developers, http://karmasphere.com, 2010

Selection is Slower with MapReduce
Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009

Join is Even Slower
Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009

MR Lags in Relational Land
• Stonebraker, DeWitt: "MapReduce has no indexes and therefore has only brute force as a processing option. It will be creamed whenever an index is the better access mechanism." [1]
• Query processing tasks
• No metadata, semantics, indices
• Free-form input is a double-edged sword
[1] MapReduce: a major step backwards, http://databasecolumn.vertica.com/, 2008

Manimal
• Manimal is a hybrid system, combining the MapReduce programming model with well-known execution techniques
• These techniques are found today only in RDBMSs, but should be in MapReduce, too.

Manimal Approach
• MR Engine
• Static Analyzer
• Optimizer logic
• Execution Framework
void map(Text key, WebPage w) {
  if (w.rank > 10) emit(w.url, w.rank);
}
• Challenges:
  • Safely detect query semantic optimization
  • How much performance gain?
[Diagram labels: bytecode (*.class), optimization opportunities, execution path: SELECTION from B+Tree index on W.RANK]

Manimal Contributions
• Our Manimal system:
  • Detects safe relational optimizations in users' compiled MapReduce programs
• Our results:
  • Runs with unmodified MapReduce code
  • Runs up to 11x faster on the same code
  • Provides a framework for more optimizations

Outline
• Introduction
• Execution Framework
• Optimization/Analyzer Examples
• Experiments
  • Analyzer recall
  • Performance gain
• Related Work and Conclusion

Execution Framework
public void map(Text key, WebPage w, OutputCollector<Text, LongWritable> out) {
  if (w.rank > 10) emit(w.url, w.rank);
}

Execution Framework
[Diagram: Analyzer, Optimizer, Execution. Input bytecode: varload 'value', invokevirtual, astore 'text', …, ifeq, …]

Execution Framework
[Diagram: Analyzer, Optimizer, Execution. Input bytecode: varload 'value', invokevirtual, astore 'text', …, ifeq, …]
• Analyzer in: user program
• Analyzer out: optimization descriptor (SELECTf, w.rank>10) and index-generation program:
void map(k, w) {
  out.set(indexedOutputFormat);
  emit(w.rank, (k,w))
}
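
The index-generation program on this slide re-keys every record by w.rank so that a later run can range-scan instead of reading the whole input. The following is a minimal in-memory sketch of that idea, not Manimal's implementation: java.util.TreeMap stands in for the on-disk B+Tree, and the names IndexedPage, RankIndexSketch, buildIndex, fullScan and indexScan are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical record type mirroring the slide's WebPage (url, rank only).
class IndexedPage {
    final String url;
    final int rank;
    IndexedPage(String url, int rank) { this.url = url; this.rank = rank; }
}

class RankIndexSketch {
    // Index-generation step: key every page by rank, as in emit(w.rank, (k,w)).
    // A TreeMap is a sorted in-memory stand-in for a B+Tree (assumption).
    static TreeMap<Integer, List<IndexedPage>> buildIndex(List<IndexedPage> pages) {
        TreeMap<Integer, List<IndexedPage>> index = new TreeMap<>();
        for (IndexedPage p : pages) {
            index.computeIfAbsent(p.rank, r -> new ArrayList<>()).add(p);
        }
        return index;
    }

    // Original plan: read every page and apply the user's predicate.
    static List<String> fullScan(List<IndexedPage> pages, int threshold) {
        List<String> out = new ArrayList<>();
        for (IndexedPage p : pages) {
            if (p.rank > threshold) out.add(p.url + "\t" + p.rank);
        }
        return out;
    }

    // Optimized plan: visit only index keys strictly greater than threshold.
    static List<String> indexScan(TreeMap<Integer, List<IndexedPage>> index, int threshold) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, List<IndexedPage>> e : index.tailMap(threshold, false).entrySet()) {
            for (IndexedPage p : e.getValue()) out.add(p.url + "\t" + p.rank);
        }
        return out;
    }
}
```

Both plans return the same records. Only the order differs, because the index scan visits pages in rank order rather than input order.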

Execution Framework
[Diagram: Analyzer, Optimizer, Execution. Input bytecode: varload 'value', invokevirtual, astore 'text', …, ifeq, …]
• Optimizer in: optimization descriptor (SELECTf, w.rank>10), catalog
• Optimizer out: execution descriptor (SELECT,"log.1.idx",w.rank>10)

Execution Framework
[Diagram: Analyzer, Optimizer, Execution. Input bytecode: varload 'value', invokevirtual, astore 'text', …, ifeq, …]
• Execution in: execution descriptor (SELECT,"log.1.idx",w.rank>10), user program
• Execution out: program output (numwords 19519)
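
The three slides above pass a descriptor from stage to stage: the analyzer names the operation, the optimizer binds it to a concrete index from the catalog, and the execution stage runs it. The sketch below illustrates the optimizer step only, under assumed data shapes. The class names, the catalog key format "inputFile|field", and the input file name "log.1" are all hypothetical; only the printed descriptor form comes from the slides.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Analyzer output: which relational operation was found, and on which field.
class OptimizationDescriptor {
    final String op;        // e.g. "SELECT"
    final String field;     // e.g. "w.rank" (hypothetical naming)
    final String predicate; // e.g. "w.rank>10"
    OptimizationDescriptor(String op, String field, String predicate) {
        this.op = op;
        this.field = field;
        this.predicate = predicate;
    }
}

// Optimizer output: the same operation bound to a concrete index file.
class ExecutionDescriptor {
    final String op;
    final String indexFile;
    final String predicate;
    ExecutionDescriptor(String op, String indexFile, String predicate) {
        this.op = op;
        this.indexFile = indexFile;
        this.predicate = predicate;
    }
    @Override
    public String toString() {
        return "(" + op + ",\"" + indexFile + "\"," + predicate + ")";
    }
}

class CatalogOptimizerSketch {
    // Hypothetical catalog: "inputFile|field" -> index file built earlier.
    private final Map<String, String> catalog = new HashMap<>();

    void registerIndex(String inputFile, String field, String indexFile) {
        catalog.put(inputFile + "|" + field, indexFile);
    }

    // An empty result means no usable index: run the unmodified user program.
    Optional<ExecutionDescriptor> plan(String inputFile, OptimizationDescriptor d) {
        String indexFile = catalog.get(inputFile + "|" + d.field);
        if (indexFile == null) return Optional.empty();
        return Optional.of(new ExecutionDescriptor(d.op, indexFile, d.predicate));
    }
}
```

Returning an empty Optional when the catalog has no matching index keeps the fallback explicit: the job still runs, just without the optimization.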

An Optimization Example
// webpage.java: SCHEMA!
class WebPage { String url; int rank; String content; }
// mapper.java
void map(Text key, WebPage w) {
  if (w.url.equals("teaparty.fr")) emit(w.url, 1);
}
• Data-centric programming idioms == relational ops
PROJECTED view: (url,null,null)
DIRECT-OP on compressed WebPage
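
The mapper on this slide reads only w.url, which is why a projected view (url,null,null) is enough to produce the same output. Here is a small sketch of that equivalence. ProjPage and ProjectionSketch are hypothetical names, and the tab-separated output string is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the slide's WebPage schema; rank is boxed so
// that a projected-away field can be represented as null.
class ProjPage {
    final String url;
    final Integer rank;
    final String content;
    ProjPage(String url, Integer rank, String content) {
        this.url = url;
        this.rank = rank;
        this.content = content;
    }
}

class ProjectionSketch {
    // Projected view: keep only the field the mapper reads, (url,null,null).
    static List<ProjPage> projectUrlOnly(List<ProjPage> pages) {
        List<ProjPage> view = new ArrayList<>();
        for (ProjPage p : pages) view.add(new ProjPage(p.url, null, null));
        return view;
    }

    // The slide's mapper: it touches w.url and nothing else.
    static List<String> runMapper(List<ProjPage> pages) {
        List<String> out = new ArrayList<>();
        for (ProjPage w : pages) {
            if ("teaparty.fr".equals(w.url)) out.add(w.url + "\t1");
        }
        return out;
    }
}
```

Running runMapper on the full pages and on projectUrlOnly(pages) yields the same list, while the projected input no longer carries rank or content.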

Semantic Extraction
• Query semantics are obvious to human readers, but not explicit in the code for the framework
• EXTRACT IT!
• Static code analysis
• Control-flow graph and data-flow graph
• Find opportunities: selection, projection, direct op
• Safe optimizations: same output

Analyzer: An Example
// webpage.java
class WebPage { String url; int rank; String content; }
// mapper.java
void map(Text key, WebPage w) {
  if (w.rank > 10) emit(w.url, w.rank);
}
[Analyzer control-flow graph nodes: Fn Entry, w.rank > 10, emit(url,rank), Fn Exit]
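
One way to read the control-flow graph on this slide: a condition is a candidate selection predicate if it guards every path from Fn Entry to an emit. The sketch below walks a hand-built graph and intersects the conditions seen on each such path. It is an illustration under simplifying assumptions (an acyclic graph, emit nodes recognized by label, conditions compared as strings), not the analyzer described in the talk. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class CfgEdge {
    final CfgNode target;
    final String condition; // null when the edge is unconditional
    CfgEdge(CfgNode target, String condition) {
        this.target = target;
        this.condition = condition;
    }
}

class CfgNode {
    final String label;
    final List<CfgEdge> out = new ArrayList<>();
    CfgNode(String label) { this.label = label; }
    void edgeTo(CfgNode target, String condition) {
        out.add(new CfgEdge(target, condition));
    }
}

class SelectionAnalyzerSketch {
    // Returns the conditions that hold on every entry-to-emit path.
    // Assumes an acyclic graph; a loop would need a visited set.
    static Set<String> guardsOfEmit(CfgNode entry) {
        List<Set<String>> perPath = new ArrayList<>();
        walk(entry, new LinkedHashSet<String>(), perPath);
        if (perPath.isEmpty()) return Collections.emptySet();
        Set<String> common = new LinkedHashSet<>(perPath.get(0));
        for (Set<String> conditions : perPath) common.retainAll(conditions);
        return common;
    }

    private static void walk(CfgNode node, Set<String> seen, List<Set<String>> perPath) {
        // Sketch convention: any node whose label starts with "emit" is an emit.
        if (node.label.startsWith("emit")) {
            perPath.add(new LinkedHashSet<>(seen));
            return;
        }
        for (CfgEdge e : node.out) {
            Set<String> next = new LinkedHashSet<>(seen);
            if (e.condition != null) next.add(e.condition);
            walk(e.target, next, perPath);
        }
    }
}
```

An empty result means some path reaches emit unguarded, so a selection index cannot be applied without risking different output.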

Current Optimizations
• B+-Tree for selections
• Projected views
• Delta compression on numerics
• Direct operation on compressed data
• Hadoop compression is not semantics-aware
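
The slide does not say which delta scheme is used. As an illustration only, the sketch below assumes the simplest variant: store the first value, then the difference between each value and its predecessor. Slowly changing numeric fields then become small numbers that a later encoding can store compactly. DeltaCodecSketch and its methods are hypothetical names.

```java
class DeltaCodecSketch {
    // Replace each value by its difference from the previous value.
    // The first element is kept as-is (difference from 0).
    static int[] encode(int[] values) {
        int[] deltas = new int[values.length];
        int previous = 0;
        for (int i = 0; i < values.length; i++) {
            deltas[i] = values[i] - previous;
            previous = values[i];
        }
        return deltas;
    }

    // Rebuild the original values with a running sum over the deltas.
    static int[] decode(int[] deltas) {
        int[] values = new int[deltas.length];
        int running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            values[i] = running;
        }
        return values;
    }
}
```

For example, encode applied to {100, 103, 104, 110} gives {100, 3, 1, 6}, and decode restores the original array.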

Experiments: Analyzer
• Test MapReduce programs from Pavlo, SIGMOD '09:
  • Detected 5 out of 8 opportunities
  • Two misses due to custom serialization class
  • Another miss requires knowledge of java.util.Hashtable semantics

Experiments: Performance
• Optimize four Web page handling tasks:
  • Selection (filtering)
  • Projection (aggregation on subfield of page)
  • Join (pages to user visits)
  • User Defined Functions (aggregation)
• 5 cluster nodes, 123 GB of data

Experiments: Performance
• Up to 11x speedup over original Hadoop
• Performance comparable to DBMS-X from Pavlo
• UDF not detected: running time identical

Related Work
• Lots of recent MapReduce activity
  • Quincy: Task scheduling (Isard et al., SOSP 2009)
  • HadoopDB (Abouzeid et al., PVLDB 2009)
  • Hadoop++ (Dittrich et al., PVLDB 2010)
  • HaLoop (Bu et al., PVLDB 2010)
  • Twister (Ekanayake et al., HPDC 2010)
  • Starfish (Herodotou et al., CIDR 2011)
• Manimal does not introduce new optimizations. It detects and applies existing optimizations to code

Lessons Learned
• The Good: We can recognize data processing idioms in real code. Relational operations still exist even in the NoSQL world
• The Ugly: When we started this project in 2009, we underestimated interest in writing in higher-level languages (e.g., Pig Latin)

Conclusion
• Manimal provides a framework for applying well-known optimization techniques to MapReduce
• Automatic optimization of user code
• Up to 11x speed increase
• Provides a framework for more optimizations
