
Hadoop++: Boosting Hadoop Performance with UDFs

This paper explores Hadoop++, a system that boosts task performance in Hadoop by injecting User Defined Functions (UDFs), without altering the framework itself.


Presentation Transcript


  1. CS775/875 Distributed Systems, Spring 2011 "Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing)" Jens Dittrich, Jorge-Arnulfo Quiane-Ruiz, Alekh Jindal, Yagiz Kargin, Vinay Setty, Jorg Schad Saarland University International Max Planck Research School for Computer Science Presenter: Chinmay Lokesh

  2. Introduction • MapReduce processes tasks in a scan-oriented fashion. • Hadoop's performance often does not match that of a well-configured parallel DBMS. • The paper looks into Hadoop++, which boosts task performance without changing the Hadoop framework. • This is achieved by injecting User Defined Functions (UDFs) at predetermined points in the processing pipeline. • Hadoop++ outperforms Hadoop; any change to Hadoop carries over to Hadoop++; and Hadoop++ does not change the Hadoop interface.

  3. DBMS Background • There is an ongoing debate on the advantages and disadvantages of MapReduce versus parallel DBMSs. • Some DBMS vendors integrate MapReduce front ends into their systems; however, the underlying execution engines do not change, and hence these systems remain databases. • HadoopDB can be viewed as a data distribution framework that combines Hadoop with local DBMSs.

  4. Drawbacks of HadoopDB • It forces users to install, configure, and use a parallel DBMS, and it changes the interface to SQL. • HadoopDB locally uses ACID-compliant DBMS engines; however, for MapReduce-style analysis only the indexing and join processing of the local DBMSs are useful. • HadoopDB requires deep changes to glue together the Hadoop and Hive frameworks.

  5. Problems Tackled How can we build a system • that still keeps the interface of MapReduce and Hadoop, • that achieves parallel-DBMS-level performance, and • that does not change the underlying Hadoop framework?

  6. Hadoop Plan – Physical Query Execution • The paper analyzes Yahoo!'s Hadoop version 0.19. • Hadoop is essentially a hard-coded, operator-free physical query execution plan in which ten User Defined Functions (block, split, itemize, mem, map, sh, cmp, grp, combine, and reduce) are injected at predetermined places. • Hadoop's hard-coded query processing pipeline is thereby made explicit and represented as a DB-style physical query execution plan. • The plan is shaped by three user-defined parameters: M (the number of mappers), R (the number of reducers), and P (the number of data nodes). • As an example, the paper shows a plan with 4 mappers, 2 reducers, and 4 nodes: it consists of a data-load subplan L with P subplans H1–H4, plus M mapper subplans and R reducer subplans.
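
For orientation, here is a minimal sketch of where these ten UDF hooks surface in the public API of Hadoop 0.19 (the old org.apache.hadoop.mapred API). The stock classes below merely stand in for the injectable UDFs; the mapping is illustrative, not the paper's own code.

```java
// Illustrative wiring of the Hadoop Plan's UDF hooks onto Hadoop 0.19's
// extension points (assumption: old mapred API, stock stand-in classes).
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class HadoopPlanWiring {
  public static JobConf wire() {
    JobConf job = new JobConf(HadoopPlanWiring.class);
    job.setInputFormat(TextInputFormat.class);          // block/split + itemize (its RecordReader)
    job.setMapperClass(IdentityMapper.class);           // map
    job.setPartitionerClass(HashPartitioner.class);     // sh: pick a reducer per tuple
    job.setOutputKeyComparatorClass(Text.Comparator.class);      // cmp: sort order
    job.setOutputValueGroupingComparator(Text.Comparator.class); // grp: group boundaries
    job.setCombinerClass(IdentityReducer.class);        // combine
    job.setReducerClass(IdentityReducer.class);         // reduce
    job.setNumMapTasks(4);                              // M (a hint in Hadoop)
    job.setNumReduceTasks(2);                           // R
    job.setInt("io.sort.mb", 100);                      // mem: map-side spill buffer size
    return job;
  }
}
```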

  7. Data Load Phase 1. Run a MapReduce job by loading data into HDFS. 2. Partition the input T horizontally into disjoint subsets. 3. These subsets are called blocks. 4. Each block is replicated; the default number of replicas used by Hadoop is 3. 5. Hadoop stores the replicas of the same block on different nodes.
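
A minimal sketch of steps 1 and 4 (assuming a reachable HDFS cluster; the paths are made up for illustration):

```java
// Loading a file into HDFS with the default replication factor of 3, so the
// replicas of each block land on different data nodes.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadIntoHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("dfs.replication", 3);  // replicas per block (HDFS default)
    FileSystem fs = FileSystem.get(conf);
    // HDFS cuts the file into blocks and distributes the replicas.
    fs.copyFromLocalFile(new Path("input.txt"), new Path("/data/input.txt"));
  }
}
```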

  8. Map Phase • Each mapper subplan (e.g., M1–M4) reads a subset of the data called a split. A split is made up of one or more blocks (UDF split). For example, the split assigned to mapper subplan M1 consists of two blocks, both from subplan H1. Subplan M1 unions the input blocks and breaks them into records; UDF itemize divides a split into items. M1 calls map on each item and passes the output to a PPart operator. UDF mem divides the output into spills, each sized at 80% of the available main memory.
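
In Hadoop 0.19 terms, the itemize hook corresponds to the RecordReader produced by an InputFormat. A minimal illustrative sketch (not the paper's code) that treats each text line as one item:

```java
// An InputFormat whose RecordReader plays the role of UDF itemize: it turns
// the byte range of a split into (offset, line) records.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class ItemizingInputFormat extends FileInputFormat<LongWritable, Text> {
  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    // Delegate to the stock line reader: each "item" is one text line,
    // keyed by its byte offset within the file.
    return new LineRecordReader(job, (FileSplit) split);
  }
}
```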

  9. Map Phase (continued) • LPart logically partitions each spill into regions containing data belonging to different reducers; UDF sh (shuffle) determines a reducer for each tuple. Each logical partition is then sorted (Sort) using the sort order defined by UDF cmp, and the data is grouped (SortGrp) according to UDF grp. For each group, MMap calls UDF combine, which produces the output data. Each subplan may yield data of a different size.
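
The sh hook corresponds to Hadoop's Partitioner interface. A minimal sketch using the same hash rule as Hadoop's default HashPartitioner:

```java
// A Partitioner standing in for UDF sh: it decides which reducer receives
// each map-output tuple (old mapred API of Hadoop 0.19).
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class ShufflePartitioner implements Partitioner<Text, Text> {
  @Override
  public void configure(JobConf job) { }  // no job-specific setup needed

  @Override
  public int getPartition(Text key, Text value, int numReducers) {
    // Hash partitioning: tuples with equal keys go to the same reducer.
    return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
  }
}
```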

  10. Shuffle Phase • The shuffle phase redistributes data using the partitioning UDF sh. Consider reducers R1 and R2 as an example: • R1 and R2 fetch data from the mapper subplans using a Fetch operator. • The reducer subplans retrieve their input files entirely from the mapper subplans and store them in main memory in a Buffer. • If the input data does not fit into main memory, those files are stored on disk in the reducer subplans. • The inputs from the different subplans, i.e., the outputs of the map phase, are then merged.
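
A minimal sketch of how this reducer-side buffering can be tuned in stock Hadoop of that era (the property names are from Hadoop 0.x; the values are illustrative, not recommendations):

```java
// Tuning how much reducer heap buffers fetched map outputs before the merge
// spills them to disk (assumption: Hadoop 0.x property names).
import org.apache.hadoop.mapred.JobConf;

public class ShuffleTuning {
  public static JobConf tune() {
    JobConf job = new JobConf();
    // Fraction of reducer heap used to buffer fetched map outputs in memory.
    job.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);
    // Number of file/stream segments merged at once during the merge phase.
    job.setInt("io.sort.factor", 10);
    return job;
  }
}
```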

  11. Reduce Phase • The reduce phase starts once a single output stream is produced. • The result of the Merge is grouped (SortGrp), and for each group MMap calls reduce. • The result is stored on disk (Store). • The MapReduce framework provides one output file per reducer rather than a single result file; the result of a MapReduce job is the union of those files.
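
A minimal sketch of the reduce hook in the old mapred API, summing integer values per group; each reducer writes its own part file, and the job result is the union of those files:

```java
// The framework calls reduce() once per group produced by SortGrp.
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> out, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();           // fold the group into one value
    }
    out.collect(key, new IntWritable(sum)); // goes to this reducer's part file
  }
}
```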

  12. TROJAN INDEX • The Hadoop Plan uses a Scan operator to read data from disk. Hadoop has no index access to the data, as it has no prior knowledge of the schema. The Trojan Index introduces DBMS-style indexing into Hadoop.

  13. Salient Features of Trojan Index 1. No External Library or Engine: Trojan Index integrates indexing capability natively into Hadoop without imposing a distributed SQL-query engine on top of it. 2. Non-Invasive: No change is made to the existing Hadoop framework; the index structure is implemented by providing the right UDFs. 3. Optional Access Path: Trojan Index provides an optional index access path that can be used for selective MapReduce jobs; the scan access path can still be used for all other MapReduce jobs.

  14. Salient Features of Trojan Index (continued) 4. Seamless Splitting: Data indexing adds an index overhead (~8 MB per 1 GB of indexed data) to each data split. The new logical split includes the data as well as the index, and the approach automatically splits indexed data at logical split boundaries. Data and indexes may still be kept in different physical objects, e.g., if the index is not required for a particular task. 5. Partial Index: A Trojan Index need not be built on the entire split; it can be built on any contiguous subset of the split as well. This is helpful when indexing one out of several relations co-grouped in the same split. 6. Multiple Indexes: Several Trojan Indexes can be built on the same split; however, only one of them can be the primary index. During query processing, an appropriate index can be chosen for data access. (A simplified sketch of the index access path is given below.)
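
A much-simplified illustration of the index access path (the paper's actual on-split layout with headers and footers is more involved; the class and field names below are invented for illustration):

```java
// A Trojan Index, in essence: a sorted directory of (key, offset) entries
// stored inside the split itself. A selective job binary-searches it and
// seeks directly to the qualifying region instead of scanning the whole split.
import java.util.Arrays;

public class TrojanIndexSketch {
  private final long[] keys;     // sorted keys of the indexed records
  private final long[] offsets;  // byte offset of each record in the split

  public TrojanIndexSketch(long[] sortedKeys, long[] offsets) {
    this.keys = sortedKeys;
    this.offsets = offsets;
  }

  /** Byte offset at which a scan for keys >= lowKey should start (-1: none). */
  public long seekOffset(long lowKey) {
    int pos = Arrays.binarySearch(keys, lowKey);
    if (pos < 0) pos = -pos - 1;        // first key >= lowKey
    if (pos == keys.length) return -1;  // nothing in this split qualifies
    return offsets[pos];                // index access path: skip ahead
  }
}
```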

  15. TROJAN JOIN • In MapReduce, two datasets are usually joined by re-partitioning: records are partitioned by join key in the map phase and records with the same key are grouped together for the reduce phase. Trojan Join supports a more effective join by assuming that the schema and the expected workload are known in advance. The idea is to co-partition the data at load time; joins are then processed locally within each node at query time, while the data remains free to be grouped on any attribute other than the join attribute in the same MapReduce job.
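
A much-simplified, in-memory illustration of the resulting map-side join (the paper realizes this with a custom split layout; the types and method names below are invented for illustration):

```java
// Because co-partitioning at load time placed R- and S-tuples with the same
// join key into the same split, a mapper can join them locally: no shuffle.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class CoPartitionedJoinSketch {
  /** Joins the R- and S-tuples of one co-partitioned split, map-side. */
  public static List<String> join(Map<Long, List<String>> rTuples,
                                  Map<Long, List<String>> sTuples) {
    List<String> out = new ArrayList<String>();
    for (Map.Entry<Long, List<String>> group : rTuples.entrySet()) {
      List<String> partners = sTuples.get(group.getKey());
      if (partners == null) continue;  // no matching S-tuples in this co-group
      for (String r : group.getValue())
        for (String s : partners)
          out.add(r + "|" + s);        // joined locally within the mapper
    }
    return out;
  }
}
```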

  16. Salient Features of TROJAN JOIN 1. Non-Invasive: No change is made to the existing Hadoop framework; only the internal representation of a data split is changed. 2. Seamless Splitting: When co-grouping the data, three headers are created per data split: two indicating the boundaries of the data belonging to the different relations, and one indicating the boundaries of the logical split. Trojan Join automatically splits data at logical split boundaries that are opaque to the user. 3. Mapper-Side Co-Partitioned Join: Trojan Join allows users to join relations in the map phase itself by exploiting co-partitioned data. This avoids the shuffle phase, which is typically quite costly from a network-traffic perspective. 4. Trojan Index Compatibility: Trojan Indexes may freely be combined with Trojan Joins.

  17. Conclusion • Hadoop++ operates exactly like MapReduce, passing the same key-value tuples to the map and reduce functions. However, similarly to HadoopDB, Hadoop++ also allows: 1. performing index accesses whenever a MapReduce job can exploit indexes, and 2. co-partitioning data so that map tasks can compute join results locally at query time. • The experimental results demonstrate that Hadoop++ can outperform HadoopDB; in contrast to the latter, however, Hadoop++ does not force users to use SQL and DBMSs.
