MapReduce: Simplified Data Processing on Large Clusters

Appendix A: Word Frequency
Alex Newton, Billy Coss

Contents
• Abstract
• Introduction
• MapReduce
• Word Frequency Analysis Example Code

Abstract
• MapReduce is a programming model for processing and analyzing large amounts of data
• Map takes input records and emits intermediate key-value pairs, without regard to duplicate keys
• Reduce takes the key-value pairs emitted by Map and merges all values that share a key into a single, condensed result
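The two phases described above can be sketched in a few lines of Python. This is a minimal single-process illustration under stated assumptions, not Google's implementation: `run_mapreduce`, `map_fn`, `reduce_fn`, and the sample log lines are hypothetical names and data chosen for the example.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to every record, group the emitted pairs by key,
    then apply reduce_fn to each key and its list of values."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)  # duplicate keys are kept at this stage
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


def map_fn(log_line):
    # Hypothetical input format "<url> <status>"; emit one (url, 1) pair per line.
    url = log_line.split()[0]
    yield url, 1


def reduce_fn(key, values):
    # Merge all values that share a key into a single result.
    return sum(values)


logs = ["/home 200", "/about 200", "/home 404"]
result = run_mapreduce(logs, map_fn, reduce_fn)
print(result)
```

In a real cluster the grouping step is done by the framework across many machines; here a dictionary stands in for it so the Map and Reduce roles stay visible.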

Introduction
• Data analysts at Google frequently work on extremely large sets of raw data
• Parallel computing is required to process these datasets in a reasonable amount of time
• MapReduce was created as an abstraction that hides the details of parallelization, fault tolerance, data distribution, and load balancing

MapReduce
Image taken from the OSDI '04 presentation by Jeff Dean and Sanjay Ghemawat.

Word Frequency Analysis Example Code
• Code is divided into three functions
  • main
  • WordCounter
  • Adder
• WordCounter is used for the Map function
  • Skips any leading whitespace and then parses words out of the text
  • The word itself is the key; the value is 1
• Adder is used for the Reduce function
  • Iterates through the values emitted for each key and adds the values of the same key together
  • Since each value is 1, this has the effect of counting the number of times a word is used
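The three functions described above might look roughly like this in Python. This is a hedged sketch rather than the appendix code itself: the names `word_counter`, `adder`, and `main` mirror the slide, while the in-memory grouping inside `main` and the sample input strings are assumptions standing in for the MapReduce library and real input files.

```python
from collections import defaultdict


def word_counter(text):
    """Map: skip whitespace, parse out each word, and emit (word, 1)."""
    n = len(text)
    i = 0
    while i < n:
        # Skip any leading whitespace.
        while i < n and text[i].isspace():
            i += 1
        # Advance to the end of the word.
        start = i
        while i < n and not text[i].isspace():
            i += 1
        if start < i:
            yield text[start:i], 1


def adder(word, counts):
    """Reduce: add together all values emitted for the same key."""
    total = 0
    for count in counts:
        total += count
    return total


def main(documents):
    # Assumed stand-in for the library's grouping step: collect values by key.
    grouped = defaultdict(list)
    for text in documents:
        for word, one in word_counter(text):
            grouped[word].append(one)
    return {word: adder(word, counts) for word, counts in grouped.items()}


frequencies = main(["  the quick fox", "the lazy dog  the end"])
print(frequencies)
```

Because every emitted value is 1, `adder` behaves as a per-word counter; words are split on whitespace only, so punctuation and letter case are left untouched in this sketch.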

Sources
J. Dean and S. Ghemawat (2004). MapReduce: Simplified Data Processing on Large Clusters. OSDI '04: 6th Symposium on Operating Systems Design and Implementation, pp. 137, 149. http://research.google.com/archive/mapreduce.html
