MapReduce Programming Model

MapReduce Programming Model • Based on Lin and Dyer’s text: Chapter 3

Job Tracker and Task Tracker • Figure 2.6

Tom White’s Wordcount

MapReduce Model
• A programmer has no control over:
  • Where a mapper or reducer runs (i.e., on which node in the cluster).
  • When a mapper or reducer begins or finishes.
  • Which input key-value pairs are processed by a specific mapper.
  • Which intermediate key-value pairs are processed by a specific reducer.

Techniques for controlling execution and managing data flow
• A programmer does have the ability to:
  • Construct complex data types as keys and values for storage, processing, and communication.
  • Specify initialization code that runs before a map and/or reduce task starts, and termination code that runs after it ends.
  • Preserve state across multiple keys in the mapper and/or the reducer.
  • Control the sort order of intermediate keys.
  • Control the partitioning of the key space, and thus the set of keys a particular reducer will process.
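The last two abilities, controlling partitioning and sort order, can be illustrated with a small in-memory simulation. This is only a sketch: `partition` and `shuffle` are hypothetical names, not Hadoop APIs, and the composite `(word, year)` key is an assumed example of a complex key type.

```python
from collections import defaultdict

def partition(key, num_reducers):
    """Hypothetical partitioner: route a (word, year) key by its word only."""
    word, _year = key
    # Deterministic byte-sum hash so every run gives the same assignment
    # (Python's built-in hash() for str is randomized per process).
    return sum(word.encode("utf-8")) % num_reducers

def shuffle(pairs, num_reducers):
    """Group intermediate pairs by reducer, then sort each reducer's keys."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    # Custom sort order: word ascending, then year descending.
    return {r: sorted(kvs, key=lambda kv: (kv[0][0], -kv[0][1]))
            for r, kvs in buckets.items()}

intermediate = [(("hadoop", 2009), 3), (("map", 2011), 1),
                (("hadoop", 2011), 5), (("map", 2009), 2)]
by_reducer = shuffle(intermediate, num_reducers=2)
for reducer, kvs in sorted(by_reducer.items()):
    print(reducer, kvs)
```

Because the partitioner ignores the year, every key for a given word reaches the same reducer, and the sort order decides the sequence in which that reducer sees them.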

Objective
• Address these issues without creating a bottleneck for scalability.
• The gold standard that MapReduce aims for is linear scalability.
• Storing and manipulating state has the potential to hinder scalability.
• How to improve performance?
  • Make the map and reduce functions efficient.
  • Make the transfer of intermediate data efficient.
  • Aggregation of intermediate data is an important operation for efficiency.
  • Shrink the intermediate key space.
• What else can we do?
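To make the aggregation point concrete, the sketch below counts how many intermediate pairs one mapper's output contains before and after a combiner-style local sum. The function names `map_words` and `combine` are illustrative assumptions, and the framework is not modeled; in Hadoop a combiner is an optional optimization that the framework may run zero or more times.

```python
from collections import Counter

def map_words(doc):
    """Baseline mapper: emit (term, 1) for every token."""
    return [(term, 1) for term in doc.split()]

def combine(pairs):
    """Combiner-style step: sum the counts per term in one mapper's output."""
    totals = Counter()
    for term, count in pairs:
        totals[term] += count
    return list(totals.items())

doc = "a b a c a b"
raw = map_words(doc)      # one pair per token
combined = combine(raw)   # one pair per distinct term
print(len(raw), len(combined))
```

The reducer receives the same totals either way; only the number of pairs that must be transferred changes.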

Mapper
• http://hadoop.apache.org/common/docs/stable/api/org/apache/hadoop/mapreduce/Mapper.html
• http://hadoop.apache.org/common/docs/stable/api/org/apache/hadoop/mapred/package-summary.html
• http://www.slideshare.net/sh1mmer/upgrading-to-the-new-map-reduce-api

Mapper with built-in combiner-v1
class Mapper
  method Map(docid a, doc d)
    H ← new AssociativeArray
    for all term t ∈ doc d do
      H{t} ← H{t} + 1        // Tally counts for entire document
    for all term t ∈ H do
      Emit(term t, count H{t})
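A minimal Python sketch of the v1 pseudocode, assuming whitespace tokenization and using a generator's `yield` in place of `Emit`; the name `map_v1` is hypothetical:

```python
from collections import defaultdict

def map_v1(docid, doc):
    """Tally term counts for one document, then emit one pair per distinct term."""
    counts = defaultdict(int)          # H <- new AssociativeArray
    for term in doc.split():           # for all term t in doc d
        counts[term] += 1              # H{t} <- H{t} + 1
    for term, count in counts.items():
        yield term, count              # Emit(term t, count H{t})

pairs = sorted(map_v1("d1", "to be or not to be"))
print(pairs)
```

The associative array lives only for the duration of one `map_v1` call, so aggregation is limited to repeats within a single document.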

Mapper with built-in combiner-v2
class Mapper
  method Initialize
    H ← new AssociativeArray
  method Map(docid a, doc d)
    for all term t ∈ doc d do
      H{t} ← H{t} + 1        // Tally counts across documents
  method Close
    for all term t ∈ H do
      Emit(term t, count H{t})
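The v2 pattern can be sketched in Python as a class whose state outlives individual `map` calls. The class name and the `run_map_task` driver are hypothetical stand-ins for the framework; in Hadoop's newer `Mapper` API the corresponding hooks are `setup` and `cleanup`.

```python
from collections import defaultdict

class InMapperCombiningMapper:
    def initialize(self):
        # State shared by every map() call in this task.
        self.counts = defaultdict(int)

    def map(self, docid, doc):
        for term in doc.split():
            self.counts[term] += 1     # tally counts across documents
        return []                      # nothing is emitted per document

    def close(self):
        # Emit once per distinct term; sorted only to make the output deterministic.
        return sorted(self.counts.items())

def run_map_task(mapper, split):
    """Hypothetical driver: initialize, map each record, then close."""
    mapper.initialize()
    emitted = []
    for docid, doc in split:
        emitted.extend(mapper.map(docid, doc))
    emitted.extend(mapper.close())
    return emitted

pairs = run_map_task(InMapperCombiningMapper(),
                     [("d1", "a b a"), ("d2", "b c")])
print(pairs)
```

Holding the whole tally in memory until `close` is the trade-off of this design: it emits the fewest pairs, but the associative array grows with the number of distinct terms in the input split, so a real job may need to flush it periodically.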
