Performance Comparison of Code Generated Expression Evaluation in Shark vs Hive
This document outlines the code generation (CG) examples and performance comparison between CG.ExprEval and Hive.ExprEval, highlighting why CG.ExprEval is faster. Key points include common inefficiencies in Hive.ExprEval, such as repeated evaluations of sub-expressions, unnecessary type conversions, and extensive runtime function calls. The design and major class diagram of the CG system are discussed, along with various implemented user-defined functions (UDFs) and future work plans to enhance features and support for collection types. Important settings for enabling/disabling features in Hive are also mentioned.
Presentation Transcript
Code Gen of ExprEval in Shark
hao.cheng@intel.com
Outline
• CG examples
• Performance comparison (CG ExprEval vs. Hive ExprEval)
• CG design & major class diagram
• Implemented UDFs/Generic UDFs
• Future work
CG Examples
Set shark.expr.cg to true or false in hive-site.xml to enable or disable the feature; the default is true.
Performance Comparison (CG ExprEval vs. Hive ExprEval)
Test data: 747,747,840 records / 66,909,023,675 bytes, stored as RC File (with LzoCodec), on 4 slave machines.
Performance Comparison (CG ExprEval vs. Hive ExprEval) (2)
Why is CG ExprEval faster than Hive ExprEval? In Hive ExprEval:
• Common sub-expressions are re-evaluated
  • e.g. in the expression concat(year(date_add(visitDate,7)), '/', month(date_add(visitDate,7)), '/', day(date_add(visitDate,7))), "date_add(visitDate,7)" is evaluated 3 times.
• Data types are checked repeatedly at runtime
  • The parameter types of the "evaluate" method in GenericUDFs are unknown until runtime, so Hive ExprEval has to keep checking the value types inside "evaluate", e.g. GenericUDFOPGreaterThan.evaluate, GenericUDFPrintf.evaluate, etc.
• Unnecessary type conversions
  • e.g. in the expression (duration + 1.03), the variable "duration" is first converted into a new FloatWritable object in Hive ExprEval, which creates many small temporary objects (GenericUDFBridge.conversionHelper).
• Large number of virtual function calls at runtime
  • Hive ExprEval always works through base-class objects, particularly for the UDF objects and the field value objects.
• Java reflection is used to call the UDF evaluate() method
  • Hive ExprEval invokes the UDF (in class GenericUDFBridge) through the Java Reflection API, which causes another performance issue (http://docs.oracle.com/javase/tutorial/reflect/index.html).
CG ExprEval instead generates source code that uses concrete objects and concrete execution branches.
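The common sub-expression point can be illustrated with a minimal Java sketch. It is not code from Shark or Hive: the class CommonSubExprSketch, the dateAdd helper, and the call counter are hypothetical names made up for this illustration, and java.time.LocalDate stands in for Hive's date handling. The "interpreted" method mimics a tree walk in which each of year/month/day evaluates its own copy of date_add(visitDate,7); the "generated" method shows the kind of straight-line code a generator could emit, which evaluates the shared node once.

```java
import java.time.LocalDate;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; none of these names come from Shark or Hive.
final class CommonSubExprSketch {
    // Counts how often the shared sub-expression is evaluated.
    static final AtomicInteger dateAddCalls = new AtomicInteger();

    // Stand-in for date_add(visitDate, days).
    static LocalDate dateAdd(LocalDate d, int days) {
        dateAddCalls.incrementAndGet();
        return d.plusDays(days);
    }

    // Interpreter style: every parent node evaluates its own child subtree,
    // so date_add(visitDate, 7) runs three times per row.
    static String interpreted(LocalDate visitDate) {
        return dateAdd(visitDate, 7).getYear() + "/"
             + dateAdd(visitDate, 7).getMonthValue() + "/"
             + dateAdd(visitDate, 7).getDayOfMonth();
    }

    // Generated-code style: the shared sub-expression is hoisted into a local
    // variable and evaluated once per row.
    static String generated(LocalDate visitDate) {
        LocalDate shifted = dateAdd(visitDate, 7);
        return shifted.getYear() + "/"
             + shifted.getMonthValue() + "/"
             + shifted.getDayOfMonth();
    }

    public static void main(String[] args) {
        LocalDate visitDate = LocalDate.of(2013, 12, 28);
        System.out.println(interpreted(visitDate) + " after " + dateAddCalls.get() + " calls");
        dateAddCalls.set(0);
        System.out.println(generated(visitDate) + " after " + dateAddCalls.get() + " calls");
    }
}
```

Both methods return the same string; only the number of dateAdd calls per row differs (three versus one), which is the saving the slide describes.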
CG Design & Major Class Diagram (2)
• Why not generate the bytecode directly?
  • The generated content is quite complicated; source code is much easier to debug and troubleshoot.
  • The Java compiler can apply further optimizations when it compiles the source code.
• Why generate the evaluation source code from the ExprNodeDesc tree rather than from the Hive ExprNodeEvaluator tree?
  • The ExprNodeEvaluator tree loses some information that may be helpful for further optimization (e.g. evaluating common sub-expressions once).
  • Extracting information from the ExprNodeEvaluator tree is difficult, as most of its variables are protected / private.
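The "generate source, then compile" approach can be sketched as follows. This is an assumption-laden illustration, not the Shark implementation: the slides do not say which compiler is used, so the JDK's standard javax.tools.JavaCompiler API stands in here, and the names SourceGenSketch, GeneratedEval, and generateSource are hypothetical. The generator emits a tiny class specialized for the expression (duration + 1.03) with a primitive float parameter, writes it to a temporary directory, compiles it, and loads the resulting class.

```java
import java.io.IOException;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

// Hypothetical illustration only; not the Shark code generator.
final class SourceGenSketch {
    // Emits readable Java source for "duration + <constant>". The constant is
    // inlined as a literal and the parameter is a primitive float, so the
    // generated code needs no wrapper objects or runtime type checks.
    static String generateSource(String className, double constant) {
        return "public final class " + className + " {\n"
             + "    public static double eval(float duration) {\n"
             + "        return duration + " + constant + ";\n"
             + "    }\n"
             + "}\n";
    }

    // Writes the source to a temp directory, compiles it, loads it, and calls it.
    static double compileAndRun(float duration) {
        String className = "GeneratedEval";
        try {
            Path dir = Files.createTempDirectory("cg-sketch");
            Path src = dir.resolve(className + ".java");
            Files.write(src, generateSource(className, 1.03).getBytes(StandardCharsets.UTF_8));

            // Returns null on a JRE without the compiler module.
            JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
            if (compiler == null) {
                throw new IllegalStateException("a JDK (not just a JRE) is required");
            }
            int status = compiler.run(null, null, null, "-d", dir.toString(), src.toString());
            if (status != 0) {
                throw new IllegalStateException("compilation failed with status " + status);
            }

            try (URLClassLoader loader = new URLClassLoader(new URL[] { dir.toUri().toURL() })) {
                Class<?> cls = loader.loadClass(className);
                // Reflection is used once here for brevity; a real evaluator would
                // more likely load the class once and call it through an interface.
                Method eval = cls.getMethod("eval", float.class);
                return (Double) eval.invoke(null, duration);
            }
        } catch (IOException | ReflectiveOperationException e) {
            throw new IllegalStateException("code generation sketch failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(compileAndRun(2.0f));
    }
}
```

Because the intermediate artifact is ordinary Java source, it can be printed, read, and stepped through in a debugger, which matches the debugging argument made on the slide.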
Implemented UDFs/Generic UDFs
• Supported features:
  • Relational operators (=, !=, <, <=, etc.)
  • Arithmetic operators (+, -, *, /, %, etc.)
  • Logical operators (AND, OR, NOT, etc.)
  • Built-in functions (UDFs) and existing user-defined functions
  • Some of the generic UDFs:
    • GenericUDFBetween
    • GenericUDFPrintf
    • GenericUDFInstr
    • GenericUDFBridge
• Unsupported features:
  • Conditional functions (if/case/when, etc.)
  • Map/Array
  • UDAF
  • UDTF
  • Misc. functions (java_method/reflect/hash, etc.)
Future Work
• Compile the generated Java source once and distribute it across the cluster
• Reuse the generated .class for identical queries
• Support more generic UDFs (case/when/if, etc.)
• Support collection types (Array/Map, etc.)
• Code generation in aggregations