A Comparison of Join Algorithms for Log Processing in MapReduce

Spyros Blanas, Jignesh M. Patel (University of Wisconsin-Madison) • Eugene J. Shekita, Yuanyuan Tian (IBM Almaden Research Center) • SIGMOD 2010 • August 1, 2010 • Presented by Hyojin Song

Contents • Introduction • Join Algorithms In MapReduce • Experimental Evaluation • Discussion • Conclusion

Introduction(1/3) • Log Processing • An important type of data analysis commonly done with MapReduce • A log of events • Click-stream • Log of phone call records • A sequence of transactions • Used to compute various statistics for business insight • Filtered • Aggregated • Mined for patterns • Log data often needs to be joined with reference data (e.g., user information)

Introduction(2/3) • MapReduce Framework • Used to analyze large volumes of data • Reasons for the success of MapReduce • Simple programming framework • The framework manages parallelization, fault tolerance, and load balancing • Criticisms of MapReduce • Lack of a schema • Lack of a declarative query language • Lack of indexes • Joins are difficult • Not originally designed to combine information from several data sources • Users often fall back on simple but inefficient algorithms to perform joins

Introduction(3/3) • The benefits of MapReduce for log processing • Scalability • China Mobile gathers 5-8TB of phone call records per day • Facebook collects almost 6TB of new log data every day, with 1.7PB in total • Schema free • Flexibility • The format of a log record may change over time • Simple scans are preferable to index scans • Time-consuming work • Graceful fault-tolerance support (unlike a parallel RDBMS) • The goal of this paper • Implement several well-known join strategies in MapReduce • Run comprehensive experiments to compare these join techniques

Contents • Introduction • Join Algorithms In MapReduce (Problem Statement, Repartition Join, Improved Repartition Join, Directed Join, Broadcast Join, Semi-Join, Per-split Semi-Join) • Experimental Evaluation • Discussion • Conclusion

Join Algorithms in MR: Problem Statement • An equi-join between a log table L and a reference table R on a single column, with |L| >> |R| • Performance can be improved further with some preprocessing techniques • The join algorithms are well known in the RDBMS literature • Adapting them to MapReduce is not always straightforward • Crucial implementation details of these join algorithms • Two additional functions are implemented: init() and close() • These are called before and after each map or reduce task

Join Algorithms in MR: 1. Repartition Join • The most commonly used join strategy in the MapReduce framework • L and R are dynamically partitioned on the join key • The corresponding pairs of partitions are joined • Similar to a partitioned sort-merge join in a parallel RDBMS • Example Tables (Log table & User table) • Log table • 500,000 records • Each log record has a lecture name and a grade • User table • 10,000 records • Join key is the student ID

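Join Algorithms in MR: 1. Repartition Join • [Figure: repartition join data flow. Map phase: a split of R or L is read from the distributed file system and intermediate results are written to local disk. Reduce phase: R and L records are held in buffers. Example rows, R: (Song, 2009-0078), (An, 2010-8281), ...; L: (DB, B, 2008-2424), (KRR, A, 2010-8281), (NL, D, 2008-0909), (MLC, 2009-0078), (OPT, A, 2005-3682), ...]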

Join Algorithms in MR: 1. Repartition Join • [Figure: reduce phase. Records are read from local disk into a buffer and the join output is written to an output file in the distributed file system]
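
The standard repartition join can be sketched in a few lines of plain Python. This is only an in-memory simulation under stated assumptions, not Hadoop code: the function names (map_phase, shuffle, reduce_phase, repartition_join) and the toy rows are hypothetical, and shuffle() stands in for the partitioning and grouping that the framework would do.

```python
from collections import defaultdict

def map_phase(table_tag, records, key_of):
    """Tag each record with its source table and emit (join_key, (tag, record))."""
    for rec in records:
        yield key_of(rec), (table_tag, rec)

def shuffle(pairs, num_reducers):
    """Stand-in for the framework: partition on the join key, group values per key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions

def reduce_phase(key, tagged_values):
    """Buffer ALL R and L records of one key, then emit their cross product."""
    buf_r = [rec for tag, rec in tagged_values if tag == "R"]
    buf_l = [rec for tag, rec in tagged_values if tag == "L"]
    for r in buf_r:
        for l in buf_l:
            yield key, r, l

def repartition_join(R, L, num_reducers=2):
    pairs = list(map_phase("R", R, lambda r: r[0]))   # R rows: (student_id, name)
    pairs += list(map_phase("L", L, lambda l: l[2]))  # L rows: (lecture, grade, student_id)
    out = []
    for partition in shuffle(pairs, num_reducers):
        for key, values in partition.items():
            out.extend(reduce_phase(key, values))
    return sorted(out)

# Toy rows (hypothetical), loosely modeled on the example tables.
users = [("2009-0078", "Song"), ("2010-8281", "An")]
logs = [("DB", "B", "2008-2424"), ("KRR", "A", "2010-8281"),
        ("MLC", "C", "2009-0078"), ("KRR", "B", "2009-0078")]
joined = repartition_join(users, logs)
```

Note that reduce_phase holds both buffers for a key in memory at once, which is the buffering problem that arises with skewed data or a small key cardinality.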

Join Algorithms in MR: 1. Repartition Join • Standard Repartition Join • Potential problem • All records for a given join key have to be buffered • The buffer may not fit in memory when • The data is highly skewed • The key cardinality is small • Variants of the standard repartition join are used in Pig, Hive, and Jaql today • They all suffer from the buffering problem • Improved Repartition Join • The output key is changed to a composite of the join key and the table tag • The partitioning and grouping functions are customized • Records from the smaller table R are buffered and L records are streamed to generate the join output

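Join Algorithms in MR: 2. Improved Repartition Join • [Figure: improved repartition join data flow. Map phase: a split of R or L is read from the distributed file system and intermediate results are written to local disk. Reduce phase: labeled with L, R and a buffer. Example rows, R: (Song, 2009-0078), (An, 2010-8281), ...; L: (DB, B, 2008-2424), (KRR, A, 2010-8281), (NL, D, 2008-0909), (MLC, 2009-0078), (OPT, A, 2005-3682), ...]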

Join Algorithms in MR: 2. Improved Repartition Join • [Figure: reduce phase. Records are read from local disk into a buffer and the join output is written to an output file in the distributed file system]
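
A minimal sketch of the improved repartition join, again as a plain-Python simulation rather than Hadoop code. It assumes the tag values are chosen so that R records sort ahead of L records within a join key; R_TAG, L_TAG, the function names and the toy rows are all hypothetical. The sort on the composite key plays the role of the framework sort, partition() is the customized partitioner, and the groupby on the join key alone is the customized grouping.

```python
from itertools import groupby

R_TAG, L_TAG = 0, 1  # assumption: R must sort before L within a join key

def map_phase(tag, records, key_of):
    """Emit ((join_key, tag), record): the output key is a composite."""
    for rec in records:
        yield (key_of(rec), tag), rec

def partition(composite_key, num_reducers):
    """Customized partitioner: route on the join key only, ignoring the tag."""
    return hash(composite_key[0]) % num_reducers

def reduce_partition(sorted_pairs):
    """Customized grouping on the join key: buffer R, then stream L."""
    for join_key, group in groupby(sorted_pairs, key=lambda kv: kv[0][0]):
        buf_r = []
        for (_, tag), rec in group:
            if tag == R_TAG:
                buf_r.append(rec)        # R records of this key arrive first
            else:
                for r in buf_r:          # L records are streamed, never buffered
                    yield join_key, r, rec

def improved_repartition_join(R, L, num_reducers=2):
    pairs = list(map_phase(R_TAG, R, lambda r: r[0]))
    pairs += list(map_phase(L_TAG, L, lambda l: l[2]))
    buckets = [[] for _ in range(num_reducers)]
    for key, rec in pairs:
        buckets[partition(key, num_reducers)].append((key, rec))
    out = []
    for bucket in buckets:
        bucket.sort(key=lambda kv: kv[0])  # stand-in for the framework sort
        out.extend(reduce_partition(bucket))
    return sorted(out)

# Toy rows (hypothetical), loosely modeled on the example tables.
users = [("2009-0078", "Song"), ("2010-8281", "An")]
logs = [("DB", "B", "2008-2424"), ("KRR", "A", "2010-8281"),
        ("MLC", "C", "2009-0078"), ("KRR", "B", "2009-0078")]
joined = improved_repartition_join(users, logs)
```

Only the R records of a single join key are held in memory at any time, so the buffer stays small as long as R has few records per key.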

Join Algorithms in MR: 3. Directed Join • Preprocessing for Repartition Join (Directed Join) • Applies when both L and R have already been partitioned on the join key • L is pre-partitioned on the join key • Then at query time, matching partitions from L and R can be joined directly • A map-only MapReduce job • During the init phase, Ri is retrieved from the DFS if it is not already in local storage • A main-memory hash table is built on Ri
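
The directed join might look roughly like the following simulation. It assumes both tables were pre-partitioned with the same function on the join key; partition_id (based on zlib.crc32), DirectedJoinMapTask and the toy rows are hypothetical names chosen for illustration, and the in-memory lists stand in for partitions stored in the DFS.

```python
import zlib
from collections import defaultdict

def partition_id(join_key, num_partitions):
    """Assumed shared partitioning function for both L and R."""
    return zlib.crc32(join_key.encode("utf-8")) % num_partitions

def pre_partition(records, key_of, num_partitions):
    """Preprocessing step: split a table into partitions on the join key."""
    parts = [[] for _ in range(num_partitions)]
    for rec in records:
        parts[partition_id(key_of(rec), num_partitions)].append(rec)
    return parts

class DirectedJoinMapTask:
    """Map-only task that joins one partition Li with its matching Ri."""
    def __init__(self, r_partition):
        self.r_partition = r_partition   # stands in for Ri stored in the DFS
        self.hash_table = None

    def init(self):
        # Called once before the task: load Ri into a main-memory hash table.
        self.hash_table = defaultdict(list)
        for r in self.r_partition:
            self.hash_table[r[0]].append(r)

    def map(self, l_rec):
        # Probe the hash table with the join key of each L record.
        for r in self.hash_table.get(l_rec[2], []):
            yield l_rec[2], r, l_rec

def directed_join(R, L, num_partitions=3):
    r_parts = pre_partition(R, lambda r: r[0], num_partitions)
    l_parts = pre_partition(L, lambda l: l[2], num_partitions)
    out = []
    for r_i, l_i in zip(r_parts, l_parts):
        task = DirectedJoinMapTask(r_i)
        task.init()
        for l_rec in l_i:
            out.extend(task.map(l_rec))
    return sorted(out)

# Toy rows (hypothetical), loosely modeled on the example tables.
users = [("2009-0078", "Song"), ("2010-8281", "An")]
logs = [("DB", "B", "2008-2424"), ("KRR", "A", "2010-8281"),
        ("MLC", "C", "2009-0078"), ("KRR", "B", "2009-0078")]
joined = directed_join(users, logs)
```

No shuffle is needed at query time because records with equal join keys are guaranteed to sit in partitions with the same index.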

Join Algorithms in MR: 4. Broadcast Join • Broadcast Join • In most applications, |R| << |L| • Instead of moving both R and L across the network, • Broadcast the smaller table R to avoid the network overhead of moving L • A map-only job • Each map task uses a main-memory hash table for either L or R

Join Algorithms in MR: 4. Broadcast Join • Broadcast Join • If R is smaller than a split of L • Build the hash table on R • If R is larger than a split of L • Build the hash table on the split of L • Preprocessing for Broadcast Join • Most nodes in the cluster have a local copy of R in advance • This avoids retrieving R from the DFS in the init() function
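
The choice of build side can be illustrated with a small simulation of one map task per split of L. This is a sketch under assumptions: sizes are compared by record count as a stand-in for bytes, and BroadcastJoinMapTask, broadcast_join and the toy rows are hypothetical. The init() and close() hooks mirror the two additional functions called before and after each task.

```python
from collections import defaultdict

class BroadcastJoinMapTask:
    """Map-only task: joins one split of L with the whole broadcast table R."""
    def __init__(self, R, l_split_size):
        self.R = R
        # Assumption: record count stands in for size in bytes.
        self.build_on_r = len(R) < l_split_size
        self.table = defaultdict(list)
        self.output = []

    def init(self):
        # R smaller than the split: build the hash table on R up front.
        if self.build_on_r:
            for r in self.R:
                self.table[r[0]].append(r)

    def map(self, l_rec):
        if self.build_on_r:
            for r in self.table.get(l_rec[2], []):   # probe R with each L record
                self.output.append((l_rec[2], r, l_rec))
        else:
            self.table[l_rec[2]].append(l_rec)       # build on the split of L

    def close(self):
        # R larger than the split: scan R once and probe the L-side table.
        if not self.build_on_r:
            for r in self.R:
                for l_rec in self.table.get(r[0], []):
                    self.output.append((r[0], r, l_rec))

def broadcast_join(R, l_splits):
    out = []
    for split in l_splits:
        task = BroadcastJoinMapTask(R, len(split))
        task.init()
        for l_rec in split:
            task.map(l_rec)
        task.close()
        out.extend(task.output)
    return sorted(out)

# Toy rows (hypothetical), loosely modeled on the example tables.
users = [("2009-0078", "Song"), ("2010-8281", "An")]
logs = [("DB", "B", "2008-2424"), ("KRR", "A", "2010-8281"),
        ("MLC", "C", "2009-0078"), ("KRR", "B", "2009-0078")]
splits = [logs[:3], logs[3:]]   # first split is larger than R, second is smaller
joined = broadcast_join(users, splits)
```

Building on the smaller side keeps the hash table small in either case, while the larger side is only scanned.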

Join Algorithms in MR: 5. Semi-Join • Semi-Join • In some applications |R| << |L|, yet many records in R are never referenced by L • At Facebook, the user table has hundreds of millions of records • Only a few million unique users are active per hour • Avoid sending over the network the records in R that will not join with L • Preprocessing for Semi-Join • The first two phases of the semi-join can be moved to a preprocessing step
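
A rough sketch of the semi-join idea follows. The slide does not spell out the phases, so this assumes three of them: extract the unique join keys of L, use them to filter R, then join the filtered R with L in the style of a broadcast join. The function name semi_join, the toy rows and the extra unreferenced user are hypothetical.

```python
def semi_join(R, l_splits):
    # Phase 1 (assumed): collect the unique join keys that appear in L.
    l_unique_keys = {l_rec[2] for split in l_splits for l_rec in split}

    # Phase 2 (assumed): keep only the R records referenced by some L record.
    r_filtered = [r for r in R if r[0] in l_unique_keys]

    # Phase 3 (assumed): broadcast the filtered R and join it with each split.
    r_index = {}
    for r in r_filtered:
        r_index.setdefault(r[0], []).append(r)
    out = []
    for split in l_splits:
        for l_rec in split:
            for r in r_index.get(l_rec[2], []):
                out.append((l_rec[2], r, l_rec))
    return r_filtered, sorted(out)

# Toy rows (hypothetical); "Kim" is never referenced by the log.
users = [("2009-0078", "Song"), ("2010-8281", "An"), ("2011-1111", "Kim")]
logs = [("DB", "B", "2008-2424"), ("KRR", "A", "2010-8281"),
        ("MLC", "C", "2009-0078"), ("KRR", "B", "2009-0078")]
r_filtered, joined = semi_join(users, [logs[:2], logs[2:]])
```

Only r_filtered has to travel over the network in the last phase, which is the saving the semi-join is after.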

Join Algorithms in MR: 6. Per-Split Semi-Join • Per-Split Semi-Join • Problem with the semi-join: not every record in the filtered R will join with a particular split Li • Instead, R is filtered separately for each split, so that Li can be joined with Ri directly • Preprocessing for Per-Split Semi-Join • Also benefits from moving its first two phases to a preprocessing step
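
The per-split variant can be sketched the same way, assuming the filtering is simply repeated for each split Li so that each map task receives only its own Ri. The function name per_split_semi_join and the toy rows are hypothetical; the first return value records how many R records each split needed.

```python
def per_split_semi_join(R, l_splits):
    r_sizes, out = [], []
    for l_i in l_splits:
        # Unique join keys of this split only (assumed first phase).
        l_i_keys = {l_rec[2] for l_rec in l_i}
        # Ri: the R records that this particular split references.
        r_i = [r for r in R if r[0] in l_i_keys]
        r_sizes.append(len(r_i))
        # Join Li with Ri directly.
        r_index = {}
        for r in r_i:
            r_index.setdefault(r[0], []).append(r)
        for l_rec in l_i:
            for r in r_index.get(l_rec[2], []):
                out.append((l_rec[2], r, l_rec))
    return r_sizes, sorted(out)

# Toy rows (hypothetical); "Kim" is never referenced by the log.
users = [("2009-0078", "Song"), ("2010-8281", "An"), ("2011-1111", "Kim")]
logs = [("DB", "B", "2008-2424"), ("KRR", "A", "2010-8281"),
        ("MLC", "C", "2009-0078"), ("KRR", "B", "2009-0078")]
r_sizes, joined = per_split_semi_join(users, [logs[:2], logs[2:]])
```

In this toy run each split needs a single R record, whereas a plain semi-join would ship both referenced records to every split.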

Contents • Introduction • Join Algorithms In MapReduce • Experimental Evaluation (Environment, Datasets, MapReduce Time Breakdown, Experimental Results) • Discussion • Conclusion

Experimental Evaluation: 1. Environment • System Specification • All experiments ran on a 100-node cluster • Each node had a single 2.4GHz Intel Core 2 Duo processor • 4GB of DRAM and two SATA disks • Red Hat Enterprise Server 5.2 running Linux 2.6.18 • Network Specification • The 100 nodes were spread across two racks • Each node can execute two map and two reduce tasks concurrently • Each rack had its own gigabit Ethernet switch • The rack-level bandwidth is 32Gb/s • Under full load, 35MB/s cross-rack node-to-node bandwidth • Hadoop version 0.19.0, HDFS (128MB block size)

Experimental Evaluation: 2. Datasets • [Dataset table not preserved]


Experimental Evaluation: 3. MapReduce Time Breakdown • MapReduce Time Breakdown • What transpires during the execution of a MapReduce job • The overhead of the various execution components of MapReduce • Experimental Setup • The standard repartition join algorithm • 500GB log table and 30MB reference table • 1% of the reference table is actually referenced by the log records • 4000 map tasks and 200 reduce tasks • Each node was assigned 40 map and 2 reduce tasks
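
The task counts can be checked with a little arithmetic. The sketch below assumes one map task per 128MB HDFS block, 1GB = 1024MB, an even spread of tasks over the 100 nodes, and that the 30MB reference table is small enough to ignore; the number of map waves is derived from the two concurrent map tasks per node and is not stated on the slide.

```python
log_table_mb = 500 * 1024      # 500GB log table, assuming 1GB = 1024MB
block_size_mb = 128            # HDFS block size
nodes = 100                    # cluster size
map_slots_per_node = 2         # concurrent map tasks per node
reduce_tasks = 200

# Assumption: one map task per HDFS block of the log table.
map_tasks = log_table_mb // block_size_mb

# Assumption: tasks are spread evenly over the nodes.
map_tasks_per_node = map_tasks // nodes
reduce_tasks_per_node = reduce_tasks // nodes

# Derived, not from the slide: rounds of map tasks each node works through.
map_waves_per_node = map_tasks_per_node // map_slots_per_node

print(map_tasks, map_tasks_per_node, reduce_tasks_per_node, map_waves_per_node)
```

Under these assumptions the numbers agree with the slide: 4000 map tasks, 40 map tasks and 2 reduce tasks per node.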

Experimental Evaluation: 3. MapReduce Time Breakdown • Interesting Observations on MapReduce • The map phase was clearly CPU-bound • The reduce phase was limited by the network bandwidth • Caused by writing the three copies of the join result to HDFS • The disk and network activities were moderate and periodic during the map phase • The peaks were related to the output generation in the map task • And to the shuffle phase in the reduce task • The cluster was almost idle for about 30 seconds between the 9 min and 10 min marks • Waiting for the slowest map task • By enabling independent and concurrent map tasks, almost all CPU, disk and network activities can be overlapped

Experimental Evaluation: 4. Experimental Results • [Figures: results with no preprocessing and with preprocessing]


Discussion • Choosing the Right Strategy • Determine the right join strategy for a given circumstance • This provides an important first step toward query optimization

Conclusion • Joining log data with reference data in MapReduce has emerged as an important part of • Analytic operations for enterprise customers • Web 2.0 companies • The paper designs a series of join algorithms on top of MapReduce • Without requiring any modification to the actual framework • It presents many details for efficient implementation • Two additional functions: init() and close() • Practical preprocessing techniques • Future work • Multi-way joins • Indexing methods to speed up join queries • Optimization module (selecting appropriate join algorithms) • New programming models to extend the MapReduce framework
