Automatic Optimization of MapReduce Programs • Michael Cafarella, Eaman Jahani, Christopher Re • August 2011
MapReduce is victorious • Google statistics: • Hadoop statistics: 7 PB+ Vertica clusters vs. 22 PB+ Cloudera Hadoop clusters [1] 1. Omer Trajman, Cloudera VP, http://www.dbms2.com/
MapReduce in relational land • Designers' original intention: free-form data • Web-scale indexing/log processing • But many workloads are relational [1] • Complex queries/data analysis • Caveat: MR performance lags RDBMS performance 1. Karmasphere Corporation: A study of Hadoop developers, http://karmasphere.com, 2010
Selection is Slower with MapReduce Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009
Join is Even Slower Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009
MR Lags in Relational Land • Stonebraker, DeWitt: "MapReduce has no indexes and therefore has only brute force as a processing option. It will be creamed whenever an index is the better access mechanism." [1] • Query processing tasks • No metadata, semantics, indices • Free-form input is a double-edged sword 1. MapReduce: a major step backwards, http://databasecolumn.vertica.com/, 2008
Manimal • Manimal is a hybrid system, combining the MapReduce programming model with well-known execution techniques • These techniques are found today only in RDBMSs, but should be in MapReduce, too.
Manimal Approach • MR Engine • Static Analyzer • Optimizer logic • Execution Framework • Example input: void map(Text key, WebPage w) { if (w.rank > 10) emit(w.url, w.rank); } • Challenges: • Safely detect query semantic optimization • How much performance gain? • [Diagram labels: bytecode (*.class), optimization opportunities, execution path; example result: SELECTION from B+Tree index on W.RANK]
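The selection in the map() function on this slide could, in principle, be answered from a sorted index rather than a full scan. The following is a minimal sketch of that idea, assuming java.util.TreeMap as a stand-in for a B+Tree; the class IndexedSelection and its methods are hypothetical and are not taken from Manimal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a TreeMap stands in for a B+Tree index on rank.
class IndexedSelection {
    // Sorted index from rank to the URLs of all pages with that rank.
    private final TreeMap<Integer, List<String>> byRank = new TreeMap<>();

    void add(String url, int rank) {
        byRank.computeIfAbsent(rank, r -> new ArrayList<>()).add(url);
    }

    // Same output as "if (w.rank > threshold) emit(w.url, w.rank)",
    // but only index entries with a key above the threshold are visited.
    List<String> selectRankGreaterThan(int threshold) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> e
                : byRank.tailMap(threshold, false).entrySet()) {
            for (String url : e.getValue()) {
                out.add(url + "\t" + e.getKey());
            }
        }
        return out;
    }
}
```

The exclusive tailMap call mirrors the strict ">" in the predicate; pages at or below the threshold are never read.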
Manimal Contributions • Our Manimal system: • Detect safe relational optimizations in users’ compiled MapReduce programs • Our results: • Runs with unmodified MapReduce code • Runs up to 11x faster on same code • Provides framework for more optimizations
Outline • Introduction • Execution Framework • Optimization/Analyzer Examples • Experiments • Analyzer recall • Performance gain • Related Work and Conclusion
Execution framework • public void map(Text key, WebPage w, OutputCollector<Text, LongWritable> out) { if (w.rank > 10) emit(w.url, w.rank); }
Execution Framework • Stages: Analyzer, Optimizer, Execution • Input bytecode: varload 'value', invokevirtual, astore 'text', …, ifeq, …
Execution Framework: Analyzer • Analyzer in: user program (bytecode: varload 'value', invokevirtual, astore 'text', …, ifeq, …) • Analyzer out: optimization descriptor (SELECTf, w.rank>10) and index-generation program: void map(k, w) { out.set(indexedOutputFormat); emit(w.rank, (k,w)) }
Execution Framework: Optimizer • Optimizer in: optimization descriptor (SELECTf, w.rank>10), catalog • Optimizer out: execution descriptor (SELECT, "log.1.idx", w.rank>10)
Execution Framework: Execution • Execution in: execution descriptor (SELECT, "log.1.idx", w.rank>10), user program • Execution out: program output (e.g., numwords 19519)
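The hand-off between the three stages can be pictured as a small data transformation: the optimizer takes an optimization descriptor, consults a catalog of existing indexes, and produces an execution descriptor. The sketch below illustrates this under stated assumptions: the class names, the field-keyed catalog, and the convention of returning null when no index exists are all hypothetical and not taken from Manimal.

```java
import java.util.Map;

// Hypothetical sketch of the descriptor hand-off; these types are not Manimal's.
class DescriptorPipeline {
    // Analyzer output: the detected operation, the field it tests, and the predicate.
    static final class OptimizationDescriptor {
        final String op, field, predicate;
        OptimizationDescriptor(String op, String field, String predicate) {
            this.op = op; this.field = field; this.predicate = predicate;
        }
    }

    // Optimizer output: the same operation bound to a concrete index file.
    static final class ExecutionDescriptor {
        final String op, indexFile, predicate;
        ExecutionDescriptor(String op, String indexFile, String predicate) {
            this.op = op; this.indexFile = indexFile; this.predicate = predicate;
        }
        @Override public String toString() {
            return "(" + op + ",\"" + indexFile + "\"," + predicate + ")";
        }
    }

    // Assumed catalog layout: indexed field name -> index file name.
    // Returns null when no index exists, meaning "run the program unmodified".
    static ExecutionDescriptor optimize(OptimizationDescriptor d,
                                        Map<String, String> catalog) {
        String indexFile = catalog.get(d.field);
        return indexFile == null
                ? null
                : new ExecutionDescriptor(d.op, indexFile, d.predicate);
    }
}
```

With a catalog entry mapping w.rank to log.1.idx, the descriptor (SELECT, w.rank>10) resolves to the execution descriptor shown on the slide.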
An Optimization Example //webpage.java: SCHEMA! class WebPage { String url; int rank; String content; } //mapper.java void map(Text key, WebPage w) { if (w.url.equals("teaparty.fr")) emit(w.url, 1); } • Data-centric programming idioms == relational ops • PROJECTED view: (url, null, null) • DIRECT-OP on compressed WebPage
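Because this mapper reads only the url field, it can run against a projected view that drops rank and content and still produce the same output. A minimal sketch of that equivalence follows; ProjectionSketch and its methods are hypothetical names, and 0 stands in for the null rank since the field is a primitive int.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a projected view keeps only the field the mapper reads.
class ProjectionSketch {
    static final class WebPage {
        final String url; final int rank; final String content;
        WebPage(String url, int rank, String content) {
            this.url = url; this.rank = rank; this.content = content;
        }
    }

    // Build the view (url, null, null); rank becomes 0 and content becomes null.
    static List<WebPage> projectUrlOnly(List<WebPage> pages) {
        List<WebPage> view = new ArrayList<>();
        for (WebPage p : pages) {
            view.add(new WebPage(p.url, 0, null));
        }
        return view;
    }

    // The mapper from the slide, reduced to a count of the (url, 1) records it emits.
    static int countMatches(List<WebPage> pages, String target) {
        int emitted = 0;
        for (WebPage w : pages) {
            if (target.equals(w.url)) {
                emitted++;
            }
        }
        return emitted;
    }
}
```

Running countMatches on the full pages and on the projected view gives the same count, which is the sense in which the optimization is safe.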
Semantic Extraction • Query semantics are obvious to human readers, but not explicit in the code for the framework • EXTRACT IT! • Static code analysis • Control-flow graph and data-flow graph • Find opportunities: selection, projection, direct op • Safe optimizations: same output
Analyzer: An Example //webpage.java class WebPage { String url; int rank; String content; } //mapper.java map(Text key, WebPage w) { if (w.rank > 10) emit(w.url, w.rank); } • Analyzer output (control-flow graph): Fn Entry → w.rank > 10 → emit(url, rank) → Fn Exit
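One way to read the control-flow graph on this slide: the emit node can only be reached through the condition node, so the condition acts as a selection predicate. The sketch below checks that property (the condition dominates the emit) on a toy graph. It is only an illustration under simplifying assumptions: CfgSketch and its methods are hypothetical, and a real analyzer would also need to check which branch edge leads to the emit and that the skipped code has no side effects.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: does a condition node guard every path to emit()?
class CfgSketch {
    static final class Node {
        final String label;
        final List<Node> succ = new ArrayList<>();
        Node(String label) { this.label = label; }
    }

    // Depth-first search from 'from' to 'target' that never enters 'avoid'.
    static boolean reachableAvoiding(Node from, Node target, Node avoid, Set<Node> seen) {
        if (from == avoid || !seen.add(from)) {
            return false;
        }
        if (from == target) {
            return true;
        }
        for (Node next : from.succ) {
            if (reachableAvoiding(next, target, avoid, seen)) {
                return true;
            }
        }
        return false;
    }

    // 'cond' dominates 'emit' if emit is unreachable from entry once cond is removed.
    static boolean dominates(Node entry, Node cond, Node emit) {
        return !reachableAvoiding(entry, emit, cond, new HashSet<Node>());
    }
}
```

For the graph Fn Entry → w.rank > 10 → emit → Fn Exit (with a second edge from the condition straight to Fn Exit), dominates returns true; adding an edge from Fn Entry directly to emit makes it return false.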
Current Optimizations • B+-Tree for selections • Projected views • Delta compression on numerics • Direct operation on compressed data • Hadoop compression is not semantics-aware
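Delta compression on numerics stores each value as its difference from the previous one, so slowly changing sequences become runs of small numbers. The following is a minimal sketch of the encode/decode round trip; the slides do not specify Manimal's actual scheme, the class DeltaCodec is hypothetical, and the variable-length byte encoding that would turn small deltas into real space savings is omitted.

```java
// Hypothetical sketch of delta encoding for a numeric column.
class DeltaCodec {
    // Replace each value with its difference from the previous value
    // (the first value is stored relative to 0, i.e. unchanged).
    static long[] encode(long[] values) {
        long[] deltas = new long[values.length];
        long prev = 0;
        for (int i = 0; i < values.length; i++) {
            deltas[i] = values[i] - prev;
            prev = values[i];
        }
        return deltas;
    }

    // Rebuild the original values with a running sum over the deltas.
    static long[] decode(long[] deltas) {
        long[] values = new long[deltas.length];
        long prev = 0;
        for (int i = 0; i < deltas.length; i++) {
            prev += deltas[i];
            values[i] = prev;
        }
        return values;
    }
}
```

For example, the sequence 100, 102, 105, 105 encodes to 100, 2, 3, 0, and decoding restores the original values.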
Experiments: Analyzer • Test MapReduce programs from Pavlo et al., SIGMOD '09 • Detected 5 out of 8 opportunities: • Two misses due to custom serialization class • Another miss requires knowledge of java.util.Hashtable semantics
Experiments: Performance • Optimize four Web page handling tasks: • Selection (filtering) • Projection (aggregation on subfield of page) • Join (pages to user visits) • User Defined Functions (aggregation) • 5 cluster nodes, 123GB of data
Experiments: Performance • Up to 11x speedup over original Hadoop • Performance comparable to DBMS-X from Pavlo • UDF not detected: running time identical
Related Work • Lots of recent MapReduce activity • Quincy: task scheduling (Isard et al., SOSP 2009) • HadoopDB (Abouzeid et al., PVLDB 2009) • Hadoop++ (Dittrich et al., PVLDB 2010) • HaLoop (Bu et al., PVLDB 2010) • Twister (Ekanayake et al., HPDC 2010) • Starfish (Herodotou et al., CIDR 2011) • Manimal does not introduce new optimizations. It detects and applies existing optimizations to code
Lessons Learned • The Good: We can recognize data processing idioms in real code. Relational operations still exist even in the NoSQL world • The Ugly: When we started this project in 2009, we underestimated interest in writing in higher-level languages (e.g., Pig Latin)
Conclusion • Manimal provides a framework for applying well-known optimization techniques to MapReduce • Automatic optimization of user code • Up to 11x speed increase • Provides a framework for more optimizations