Map Reduce - an overview
AGENDA • Understanding MapReduce • Map Reduce - An Introduction • Word count – default • Word count – custom
Map Reduce • Programming model to process large datasets • Supported languages for MR • Java • Ruby • Python • C++ • MapReduce programs are inherently parallel. • More data simply means more machines to analyze it. • No need to change anything in the code.
Understanding MapReduce • Start with WORDCOUNT example • “Do as I say, not as I do”
Understanding MapReduce - pseudo code

    define wordCount as Map<String, long>;

    for each document in documentSet {
        T = tokenize(document);
        for each token in T {
            wordCount[token]++;
        }
    }
    display(wordCount);

• This works as long as the number of documents to process is not very large
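For readers who want to run this, here is a minimal Java sketch of the serial pseudocode above. The documentSet contents, the tokenize() helper, and the class name are illustrative, not from the deck.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Minimal runnable version of the serial word-count pseudocode.
    public class SerialWordCount {
        static String[] tokenize(String document) {
            return document.toLowerCase().split("\\W+");  // split on non-word characters
        }

        public static void main(String[] args) {
            List<String> documentSet = List.of("Do as I say", "not as I do");
            Map<String, Long> wordCount = new HashMap<>();
            for (String document : documentSet) {
                for (String token : tokenize(document)) {
                    if (!token.isEmpty()) {
                        wordCount.merge(token, 1L, Long::sum);  // wordCount[token]++
                    }
                }
            }
            wordCount.forEach((word, count) -> System.out.println(word + "\t" + count));
        }
    }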
Understanding MapReduce - pseudo code • Spam filter • Millions of emails • Word count for analysis • Working from a single computer is time-consuming • Rewrite the program to count from multiple machines
Understanding MapReduce - pseudo code • How do we attain parallel computing? • Each machine computes over a fraction of the documents • Combine the results from all the machines
Understanding MapReduce - pseudo code STAGE 1

    define wordCount as Map<String, long>;

    for each document in documentSubset {
        T = tokenize(document);
        for each token in T {
            wordCount[token]++;
        }
    }
Understanding MapReduce - pseudo code STAGE 2

    define totalWordCount as Multiset;

    for each wordCount received from firstPhase {
        multisetAdd(totalWordCount, wordCount);
    }
    display(totalWordCount);
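For concreteness, stage 2 in plain Java: merging the per-machine maps produced by stage 1. The partialCounts parameter is a hypothetical stand-in for the results shipped from each machine.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class MergeCounts {
        // multisetAdd from the pseudocode: fold each partial map into a running total.
        static Map<String, Long> merge(List<Map<String, Long>> partialCounts) {
            Map<String, Long> totalWordCount = new HashMap<>();
            for (Map<String, Long> wordCount : partialCounts) {
                wordCount.forEach((word, n) -> totalWordCount.merge(word, n, Long::sum));
            }
            return totalWordCount;
        }
    }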
Understanding MapReduce - pseudo code [Diagram: a master node hands out documents to four workers, Comp-1 through Comp-4]
Understanding MapReduce - pseudo code • Problems • STAGE 1 • How the documents are split across machines must be well defined • Network transfer becomes a bottleneck • This is data-intensive, not compute-intensive, processing • So it is better to store the files on the processing machines than to ship them around • BIGGEST FLAW • Storing the words and counts in memory • A disk-based hash table implementation is needed
Understanding MapReduce - pseudo code • Problems • STAGE 2 • Phase 2 has only one machine • That machine is a bottleneck, even though phase 1 is highly distributed • Make phase 2 distributed as well • This needs changes in phase 1 • Partition the phase-1 output (say, based on the first character of the word) • With 26 machines in phase 2, the single disk-based hash table becomes 26 disk-based hash tables • wordcount-a, wordcount-b, wordcount-c, … (a sketch of this partitioning follows)
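A sketch of the first-character partitioning just described. The partitionFor helper is hypothetical, not part of any Hadoop API.

    // Hypothetical helper: route a word to one of numPartitions phase-2 machines
    // by its first character ('a' -> 0 ... 'z' -> 25; anything else -> bucket 0).
    static int partitionFor(String word, int numPartitions) {
        char c = Character.toLowerCase(word.charAt(0));
        int bucket = (c >= 'a' && c <= 'z') ? (c - 'a') : 0;
        return bucket % numPartitions;
    }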
Understanding MapReduce - pseudo code [Diagram: the master and documents feed phase-1 machines Comp-1 … Comp-4, whose output fans out to phase-2 machines Comp-10 … Comp-40]
Understanding MapReduce - pseudo code • After phase 1 • From comp-1: • WordCount-A → comp-10 • WordCount-B → comp-20 • … • Each machine in phase 1 will shuffle its output to different machines in phase 2
Word Count -- retrospection • This is getting complicated • Store files where they are being processed • Write a disk-based hash table, obviating RAM limitations • Partition the phase-1 output • Shuffle the phase-1 output and send it to the appropriate reducer
Word Count -- retrospection • This is more than a lot of work for word count • We haven't even touched fault tolerance • What if comp-1 or comp-10 fails? • So, we need a framework to take care of all these things • We concentrate only on the business logic
Understanding MapReduce - pseudo code [Diagram: documents in HDFS are partitioned across MAPPER machines (Comp-1 … Comp-4); their interim output is shuffled to REDUCER machines (Comp-10 … Comp-40), with the master coordinating]
MapReduce • Mapper • Reducer • The mapper filters and transforms the input • The reducer collects that output and aggregates over it • Extensive research went into arriving at this two-phase strategy
MapReduce • Mapper, Reducer, Partitioner, Shuffling • These work together as a common structure for data processing (a custom partitioner sketch follows)
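For illustration, a minimal custom Partitioner in Hadoop's newer org.apache.hadoop.mapreduce API, mirroring the first-letter idea from earlier. The class name is made up, and real jobs usually rely on the default hash partitioner.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Illustrative custom partitioner: send words to reducers by first letter.
    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            char c = Character.toLowerCase(key.toString().charAt(0));
            int bucket = (c >= 'a' && c <= 'z') ? (c - 'a') : 0;
            return bucket % numPartitions;  // never exceed the configured reducer count
        }
    }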
MapReduce - WordCount • Mapper • <key, line_of_text> : Input • <word, 1> : Output • Reducer • <word, list(1)> : Input • <word, count(list(1))> : Output • (a full Hadoop sketch follows)
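To make those signatures concrete, here is the classic WordCount mapper and reducer in Hadoop's newer org.apache.hadoop.mapreduce API. This mirrors the standard example that ships with Hadoop; it is a sketch, not code taken from this deck.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Mapper: for each input line, emit <word, 1> for every token.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reducer: receives <word, list(1)> and emits <word, sum of the list>.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }
    }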
MapReduce • As said, don't store the data in memory • So keys and values regularly have to be written to disk • They must be serialized • Hadoop provides its own serialization mechanism • Any class used as a key or value has to implement the Writable interface (a sketch follows)
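A minimal custom Writable, assuming a hypothetical value type that carries a single count. Real keys would implement WritableComparable so they can also be sorted during the shuffle.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // Hypothetical value type: serializes a single long count.
    public class CountWritable implements Writable {
        private long count;

        public CountWritable() { }                      // Hadoop needs a no-arg constructor
        public CountWritable(long count) { this.count = count; }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(count);                       // serialize to disk/network
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            count = in.readLong();                      // deserialize in the same field order
        }

        public long get() { return count; }
    }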
Word Count – default • Let's try to execute the following commands • hadoop jar hadoop-examples-0.20.2-cdh3u4.jar wordcount • hadoop jar hadoop-examples-0.20.2-cdh3u4.jar wordcount <input> <output> • What does this code do?
CUSTOM WORD-COUNT • Switch to Eclipse (a driver sketch for that project follows)
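The Eclipse project would pair the mapper and reducer shown earlier with a driver along these lines. This is a sketch assuming the TokenizerMapper/IntSumReducer names from the earlier block and the pre-YARN (0.20.x) Job constructor that matches the CDH3 jar used above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Driver: wires the mapper, reducer, and I/O paths into a job.
    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");  // Job.getInstance(conf, ...) on newer Hadoop
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setCombinerClass(WordCount.IntSumReducer.class);  // local pre-aggregation
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }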