By Santosh Kumar Nukavarapu

A Comparison of Join Algorithms for Log Processing in MapReduce, by Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao, Eugene J. Shekita, Yuanyuan Tian

Contents • Introduction • Requirement • Log Processing and MapReduce • Join Algorithms in MapReduce i) Overview of the Repartition Join Algorithm ii) Outlook on Broadcast Join, Semi-Join, Per-Split Semi-Join • Experimental Evaluation • Results • Conclusion and Future Work

Introduction • MapReduce is very popular for the analysis of large datasets. Positives: • Its framework hides the details of parallelization, fault tolerance, and load balancing. Negatives: • Ignores many concepts from parallel RDBMSs. • Lacks a declarative language, a solid schema, and indexes.

Requirement • Facebook, Yahoo, Google, and many Web 2.0 companies are highly interested in MapReduce. Why? • Log processing is an important form of data analysis that these companies require. • MapReduce suits this requirement well.

Log Processing and MapReduce What is log processing? • Logs of events such as click-streams, phone call records, or sequences of transactions are collected and stored in flat files. • These files are then processed to compute various statistics and derive business insights. Reasons to use MapReduce for log processing: 1. An extremely large amount of data is involved.

2. Log records do not always follow the same schema. 3. All the log records within a time period are typically analyzed together, making simple scans preferable to index scans. 4. It is important to keep the analysis job going even in the event of failures.

Problem Specification

Assumptions made for our join algorithms in MapReduce • We consider an equi-join between a log table L and a reference table R on a single column: $L \bowtie_{L.k = R.k} R$, with $|L| \gg |R|$. • L, R, and the join result are stored in the DFS. • Scans are used to access L and R. • Each map or reduce task can optionally implement two additional functions: init() and close(). • These functions are called before and after each map or reduce task, respectively.

Algorithm 1: Repartition Join Map phase: • Each map task works on a split of either R or L. • Each map task tags every record with its originating table. • It outputs the extracted join key and the tagged record as a (key, value) pair. • The outputs are then partitioned, sorted, and merged by the framework.

Reduce phase: • All the records for each join key are grouped together and eventually fed to a reducer. • For each join key, the reduce function first separates and buffers the input records into two sets according to the table tag. • It then performs a cross-product between the records in these two sets. Problem with this version of the algorithm: • All the records for a given join key from both L and R have to be buffered. • So a reducer can run out of memory when a join key has many records.
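The map and reduce phases above can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions, not the implementation evaluated in the paper: the function names (map_task, shuffle, reduce_task, repartition_join), the "R"/"L" tag strings, and the sample records are all hypothetical, and shuffle() merely stands in for the partitioning, sorting, and merging that the framework would perform.

```python
from collections import defaultdict
from itertools import product


def map_task(table_tag, records, key_of):
    """Map phase: tag each record with its table and emit (join_key, tagged_record)."""
    for record in records:
        yield key_of(record), (table_tag, record)


def shuffle(mapped_pairs):
    """Stand-in for the framework: group the tagged records by join key, sorted by key."""
    groups = defaultdict(list)
    for key, tagged in mapped_pairs:
        groups[key].append(tagged)
    return dict(sorted(groups.items()))


def reduce_task(key, tagged_records):
    """Reduce phase: buffer both sides by tag, then emit their cross-product."""
    r_buf = [rec for tag, rec in tagged_records if tag == "R"]
    l_buf = [rec for tag, rec in tagged_records if tag == "L"]
    # Both buffers are held in memory, which is the weakness noted above.
    for r, l in product(r_buf, l_buf):
        yield key, r, l


def repartition_join(R, L, key_of_r, key_of_l):
    """Run the map, shuffle, and reduce steps for an equi-join of R and L."""
    mapped = list(map_task("R", R, key_of_r)) + list(map_task("L", L, key_of_l))
    joined = []
    for key, tagged in shuffle(mapped).items():
        joined.extend(reduce_task(key, tagged))
    return joined


# Hypothetical sample data: (key, payload) tuples.
R = [(1, "alice"), (2, "bob")]
L = [(1, "click"), (1, "view"), (3, "click")]
result = repartition_join(R, L, lambda r: r[0], lambda l: l[0])
```

Here only key 1 appears in both tables, so result holds two joined rows. Keys 2 and 3 produce nothing because one of their buffers is empty.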

Improvement to Re-partition join

Experimental Evaluation

Results Picture taken from: A Comparison of Join Algorithms for Log Processing in MapReduce by Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao, Eugene J. Shekita, Yuanyuan Tian.


Conclusion • Joining log data with all kinds of reference data in MapReduce has emerged as an important part of analytic operations for: 1. Enterprise customers 2. Web 2.0 companies • The join methods were evaluated on a 100-node system. • The study shows the unique tradeoffs of these join algorithms in the context of MapReduce. • The study can help an optimizer select the appropriate algorithm based on the data. Future Work • Evaluating methods for multi-way joins. • Exploring indexing methods to speed up join queries. • Designing an optimization module that can automatically select the appropriate join algorithm.

References • Google Labs • A Comparison of Join Algorithms for Log Processing in MapReduce by Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao, Eugene J. Shekita, Yuanyuan Tian. • Wikipedia • ibm.com
