
11. MapReduce


Presentation Transcript


  1. 11. MapReduce

  2. Life in the fast lane … http://www.mapreduce.org/

  3. A Bit of History The exponential growth of data first presented challenges to cutting-edge web businesses such as Google, Yahoo and Amazon. They needed to go through terabytes and petabytes of data to figure out which websites were popular, what books were in demand, and what kinds of ads appealed to people. Existing tools were becoming inadequate to process such large data sets quickly. Google was the first to publicize MapReduce, a system it developed and used to scale its data processing needs.

  4. Data-Intensive Information Processing Applications in the Cloud Data-intensive information processing applications need scalable approaches to processing huge amounts of information (terabytes and even petabytes). In cloud computing, the focus today is mostly on MapReduce, which is presently the most accessible and practical means of computing at this scale. (But other approaches are available as well. In fact, Google is replacing MapReduce with something better: FlumeJava, a Java library designed by Google to provide a simple mechanism for implementing a series of MapReduce operations.)

  5. Motivation – Google Example • 20+ billion web pages x 20KB = 400+ TB • 1 computer reads 30-35 MB/sec from disk => ~4 months to read the web • => ~1,000 hard drives to store the web • Even more time to do something with the data, e.g. Google search!
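
A quick check of the arithmetic: reading 400 TB at roughly 30-35 MB/sec takes about 12 million seconds, which is around 140 days, i.e. roughly 4 months; and at the disk capacities typical at the time (a few hundred GB per drive), storing 400+ TB does indeed need on the order of 1,000 drives.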

  6. The Need • Special-purpose programs to process large amounts of data E.g. crawled documents, Web query logs, etc. • At Google and others (Yahoo!, Amazon, Facebook, …) • Inverted index - mapping from content, such as words, to their locations in a database file • Graph structure of Web documents • Summaries of number of pages/host, set of frequent queries, etc. • Advert optimization • Spam filtering

  7. Commodity Clusters Web data sets are massive - Tens to hundreds of terabytes, petabytes, … Cannot data mine on a single server. Standard architecture emerged: - Cluster of “commodity” Linux nodes with gigabit Ethernet interconnectivity – data center How to organize computations on this architecture? - Mask issues such as hardware failure

  8. MapReduce is a programming model for expressing distributed computations on massive amounts of information, and an execution framework for large-scale information processing on clusters of commodity servers. It was originally developed by Google, and built on well-known principles in parallel and distributed processing dating back several decades. MapReduce has since enjoyed widespread adoption via an open-source implementation called Hadoop, whose development was led by Yahoo!.

  9. MapReduce • A reaction to “RDBMSs do not scale” and administrative costs – the systems community's solution to the big data problem • MapReduce success fueled by – Massively increasing data sizes – Scalability (e.g. a single 4,000-node cluster at Yahoo!) – Declining cost of computing hardware – Sequential access pattern coupled with brute force • MapReduce great for – Extract, Transform and Load problems – Dirty data, weak schema, and access patterns not well suited to indexes – Executing arbitrary or complex functions over all data

  10. Large scale computing for data mining problems on commodity hardware - PCs (aka servers) connected in a network - Process huge datasets on many computers Challenges: How do you distribute computation? - Distributed/parallel programming is hard!! - Machines fail! MapReduce addresses all of the above - Google’s computational/data manipulation model - Elegant way to work with big data

  11. Implications of such a computing environment: • - Single machine performance does not matter. • - Just add more machines for better performance! • Machines break: • - One server may stay up for 3 years (~1,000 days). • - If you have 1,000 servers, expect to lose 1 server/day. • How can we make it easy to write distributed programs?

  12. The power of MapReduce lies in its ability to scale to 100s or 1000s of computers, each with several processor cores. How large an amount of work? - Web-scale data on the order of 100s of GBs to TBs or PBs It is highly likely that an application’s input data set will not fit on a single computer’s hard drive - Hence, a distributed file system (e.g. the Google File System - GFS, or the Hadoop Distributed File System - HDFS) is typically required. Aside: Google is replacing GFS with Colossus.

  13. Idea and Solution • Idea: Bring computation close to the data! • - Do not transport data to CPUs. • Store files multiple times for reliability! • Need: Programming model • - MapReduce – same program operating on many different (i.e. partitioned) data sets. • Infrastructure – File system • - Google: GFS • - Hadoop: HDFS

  14. Stable Storage • Problem: If nodes fail, how to store data persistently? • Answer: Distributed File System with replication. • Provides global file namespace • E.g. Google GFS; Hadoop HDFS • Typical usage pattern - Huge files (100s of GB to TB) • - Data is rarely updated in place • - Reads and appends are common

  15. Distributed File System Chunk Servers: File is split into contiguous chunks - Typically each chunk is 16-128MB - Each chunk replicated (usually 2x or 3x) - Try to keep replicas in different racks! Master node: aka Name Node in Hadoop’s HDFS - Stores metadata about location of chunks - Might be replicated! Client library for file access: - Talks to master to find chunk servers - Connects directly to chunk servers to access data
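
To make the client-library pattern concrete, here is a minimal sketch of reading a file through Hadoop's HDFS client API (the file path is hypothetical). The open() call consults the master (Name Node) for chunk locations, and the returned stream then reads directly from the chunk servers, exactly as described above.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);          // client handle to the distributed file system
    Path file = new Path("/data/docs/part-00000"); // hypothetical HDFS path
    // open() asks the Name Node where the chunks live; the stream it returns
    // then talks directly to the chunk servers holding the data.
    try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}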

  16. Distributed File System Reliable distributed file system for petabyte scale Data kept in “chunks” spread across thousands of machines Each chunk replicated on different machines Seamless recovery from disk or machine failure

  17. Data Distribution In a MapReduce cluster, data is distributed to all the nodes of the cluster as it is being loaded in. An underlying distributed file system (e.g. GFS) splits large data files into chunks which are managed by different nodes in the cluster. Even though the file chunks are distributed across several machines, they form a single namespace.

  18. Overview of MapReduce 1. Read a lot of data 2. Map: Extract something you are interested in 3. Shuffle and Sort 4. Reduce: Aggregate, summarize, filter or transform 5. Write the result Outline stays the same, Map and Reduce change to fit the problem

  19. In MapReduce, chunks are processed in isolation by tasks called Mappers. The outputs from the mappers are denoted as intermediate outputs (IOs) and are brought into a second set of tasks called Reducers. The process of bringing together IOs into a set of Reducers is known as the shuffling process. The Reducers produce the final outputs (FOs). Overall, MapReduce breaks the data flow into two phases: the map phase and the reduce phase.

  20. MapReduce MapReduce is a programming model for processing and generating large data sets. Basically, … Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. MapReduce has been used for large-scale graph processing, text processing, machine learning, and statistical machine translation. The Hadoop open source implementation of MapReduce is extensively used.

  21. Programming Framework

  22. General Idea of MapReduce • Suppose we have a large number of documents in folder docs to analyze • E.g. Count occurrences of words… with Unix commands: • words(docs/*) | sort | uniq -c • where words takes a file and outputs the words in it, one per line. • This captures the essence of MapReduce (see the sketch below) • The great thing is that it is naturally parallelizable • MapReduce comprises two steps • - Map step, followed by • - Reduce step
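
The same essence can be sketched in plain Java (this is not Hadoop, just the word-count pipeline above: split the input into words, group equal words, count each group; the input file name is made up):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordCountEssence {
  public static void main(String[] args) throws Exception {
    // "words": split each line of the (hypothetical) input file into words
    Map<String, Long> counts = Files.lines(Paths.get("docs/sample.txt"))
        .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
        .filter(w -> !w.isEmpty())
        // "sort | uniq -c": group equal words and count each group
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    counts.forEach((word, n) -> System.out.println(n + " " + word));
  }
}

The grouping step is exactly what MapReduce distributes: the Map step produces the words, the shuffle groups them, and the Reduce step does the counting.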

  23. MapReduce Model influenced by functional programming languages, such as Lisp • Defines a Map operation, i.e. tagging elements to be processed together, and • Defines a Reduce function, i.e. a way to combine the individual results to obtain the final answer

  24. MapReduce in Practice When we start a Map/Reduce workflow, the program will split the input into segments, passing each segment to a different machine (VM). Each machine then runs the Map script on the portion of data attributed to it. The purpose of the Map script is to map the data into <key, value> pairs for the Reduce script to aggregate. - So, the Map script (which you write!) takes some input data, and maps it to <key, value> pairs according to your specifications. For example, if we wanted to count word frequencies in a text, we would have <word, count> be our <key, value> pair. - Our Map script, then, would emit a <word, 1> pair for each word in the input stream.

  25. Note that the Map script does no aggregation (i.e. no actual counting) – this is what the Reduce script is for. Emitted <key, value> pairs are then “shuffled”, which basically means that pairs with the same key are grouped and passed to a single machine, which will then run the Reduce script over them. (There is more than one machine running the Reduce script.) The Reduce script (which you also write!) takes a collection of <key, value> pairs and “reduces” them according to the user‐specified Reduce script. In our word count example, we want to count the number of word occurrences so that we can get frequencies. - Thus, we would want our Reduce script to simply sum the values of the collection of <key, value> pairs which have the same key.
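
To see what the shuffle does, here is a toy in-memory sketch (not how Hadoop actually implements it): the <word, 1> pairs emitted by the Map step are grouped by key, so each key reaches one reducer together with all of its values.

import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ShuffleSketch {
  public static void main(String[] args) {
    // Pairs emitted by the Map step, e.g. for "good weather is good"
    List<Map.Entry<String, Integer>> emitted = List.of(
        new SimpleEntry<>("good", 1), new SimpleEntry<>("weather", 1),
        new SimpleEntry<>("is", 1), new SimpleEntry<>("good", 1));

    // "Shuffle": group all values that share the same key
    Map<String, List<Integer>> grouped = emitted.stream()
        .collect(Collectors.groupingBy(Map.Entry::getKey,
            Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

    // "Reduce": one reducer per key sums that key's values
    grouped.forEach((word, ones) ->
        System.out.println(word + " " + ones.stream().mapToInt(Integer::intValue).sum()));
  }
}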

  26. MapReduce Example • Input is a file with one document per record • User specifies the Map function – key = document URL – value = terms that the document contains

  27. • MapReduce library gathers together all pairs with the same key value (shuffle/sort phase) • The user-defined Reduce function combines all the values associated with the same key

  28. In general,

  29. Skeleton Java program for MapReduce

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> { … }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> { … }

  public static void main(String[] args) throws IOException { … }
}

http://developer.yahoo.com/hadoop/tutorial/module4.html#wordcount
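
The elided bodies would look roughly as follows - a sketch in the old org.apache.hadoop.mapred API that the skeleton imports, modeled on the WordCount example in the Yahoo! tutorial linked above:

// Uses the same imports as the skeleton above.
public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      // Emit <word, 1> for every token of the input line
      StringTokenizer tokenizer = new StringTokenizer(value.toString());
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      // Sum all the 1s emitted for this word
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output directory in HDFS
    JobClient.runJob(conf);
  }
}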

  30. Programming Cloud Applications - New Parallel Programming Paradigms: MapReduce and friends • Highly-parallel data-processing • MapReduce - originally designed by Google • Open-source version called Hadoop, developed by Yahoo! • Hadoop written in Java. • Google: • Indexing: a chain of 24 MapReduce jobs • ~200K jobs processing 50PB/month • Yahoo! (Hadoop + Pig) • WebMap: a chain of 100 MapReduce jobs • 280 TB of data, 2500 nodes, 73 hours • Annual Hadoop Summit: 2008 had 300 attendees, now close to 1000 attendees

  31. Commodity Clusters MapReduce is designed to efficiently process large volumes of data by connecting many commodity (i.e. cloud) computers (aka servers) together to work in parallel. A theoretical 1000-CPU machine would cost a very large amount of money, far more than 1000 single-CPU or 250 quad-core machines – so the MapReduce approach is cost effective. MapReduce ties smaller and more reasonably priced machines together into a single cost-effective commodity cluster.

  32. MapReduce via Google • The problem • Many simple operations at Google • Grep over data, compute indexes, compute summaries, etc. • But the input data is large, really large! • The whole Web, billions of pages! • Google has lots of machines (clusters of 10K+, etc.) • Many computations over very large datasets • The question is - How do you use a large number of machines efficiently? • Can reduce the computational model down to two steps … • Map: take one operation, apply it to many, many data tuples • Reduce: take the results and aggregate them • MapReduce • A generalized interface for massively parallel cluster processing

  33. MapReduce in Practice Computation is organized around a (possibly large) number of worker processes. One process is elected as the Master process, the others are slaves. Slaves may be Map processes or Reduce processes. Dataset is broken up into small pieces and distributed to the Map workers.

  34. MapReduce Architecture (figure: the client submits a job to the JobTracker, which assigns Map and Reduce tasks to TaskTrackers; data transfer happens through HDFS) • Each node is part of an HDFS cluster – bring the program to the data. • Input data is stored in HDFS, spread across nodes and replicated. • Programmer submits a job (mapper, reducer, input) to the JobTracker. • JobTracker - Master • Splits input data. • Schedules and monitors the various Map and Reduce tasks. • TaskTrackers - Slaves • Execute Map and Reduce tasks

  35. MapReduce in One Picture

  36. MapReduce Execution Overview

  37. MapReduce Programming Model • Intuitively just like those from functional languages • Scheme, Lisp, Haskell, etc. • Map: initial parallel computation • map (in_key, in_value) -> list(out_key, intermediate_value) • In: a set of key/value pairs • Out: a set of intermediate key/value pairs • Note that keys might change during Map • Reduce: aggregation of intermediate values by key • reduce (out_key, list(intermediate_value)) -> list(out_value) • Combines all intermediate values for a particular key • Produces a set of merged output values (usually just one)
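
To make those signatures concrete, here is a purely illustrative Java sketch of the model's shape (these interfaces are not Hadoop's API, just the two signatures above written as code, plus a tiny word-count instantiation):

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ModelSketch {

  // A minimal immutable pair type for the sketch.
  record Pair<K, V>(K key, V value) {}

  // map (in_key, in_value) -> list(out_key, intermediate_value)
  interface MapFunction<KIn, VIn, KOut, VInter> {
    List<Pair<KOut, VInter>> map(KIn key, VIn value);
  }

  // reduce (out_key, list(intermediate_value)) -> list(out_value)
  interface ReduceFunction<KOut, VInter, VOut> {
    List<VOut> reduce(KOut key, List<VInter> values);
  }

  public static void main(String[] args) {
    // Word count: Map emits <word, 1> for each term in the document.
    MapFunction<String, String, String, Integer> wordMap = (url, contents) ->
        Arrays.stream(contents.split("\\s+"))
              .map(w -> new Pair<>(w, 1))
              .collect(Collectors.toList());

    // Reduce sums all the 1s for one word (a single merged output value).
    ReduceFunction<String, Integer, Integer> wordReduce = (word, counts) ->
        List.of(counts.stream().mapToInt(Integer::intValue).sum());

    System.out.println(wordMap.map("doc1", "good weather is good"));
    System.out.println(wordReduce.reduce("good", List.of(1, 1)));
  }
}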

  38. Isolated Tasks MapReduce divides the workload into multiple independent tasks and schedules them across cluster nodes. The work performed by each task is done in isolation from one another. The amount of communication which can be performed by tasks is mainly limited for scalability and fault tolerance reasons. - The communication overhead required to keep the data on the nodes synchronized at all times would prevent the model from performing reliably and efficiently at large scale.

  39. Dataflow Input and final output are stored on a distributed file system - The scheduler tries to schedule Map tasks “close” to the physical storage location of the input data. Intermediate results are stored on the local file system of the Map and Reduce workers. Output is often the input to another MapReduce task.

  40. Coordination Master data structures: Task status: (idle, in-progress, completed) Idle tasks get scheduled as workers become available When a Map task completes, it sends the Master the locations and sizes of its R intermediate files, one for each Reducer Master pushes this information to Reducers Master “pings” workers periodically to detect failures

  41. Failures Map worker failure - Map tasks completed or in-progress at the worker are reset to idle. - Reduce workers are notified when a task is rescheduled on another worker. Reduce worker failure - Only in-progress tasks are reset to idle. Master failure - The MapReduce job is aborted and the client is notified.

  42. How many Map and Reduce Jobs? M map tasks, R reduce tasks Rule of thumb: Make M and R much larger than the number of nodes in cluster One DFS chunk per Map task is common - Improves dynamic load balancing and speeds recovery from worker failure. Usually R is smaller than M because output is spread across R files

  43. To illustrate the MapReduce programming model, consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, “1”);

The map function emits each word plus an associated count of occurrences (just ‘1’ in this simple example).

  44. reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

The reduce function sums together all counts emitted for a particular word. MapReduce automatically parallelizes and executes the program on a large cluster of commodity machines. The runtime system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication.

  45. Example • Goal: Count the number of occurrences of each word in many documents • Sample data: • Page 1: the weather is good • Page 2: today is good • Page 3: good weather is good • So what does this look like in MapReduce?

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

  46. Map/Reduce in Action (word count on the sample pages)

Input:
  Page 1: the weather is good
  Page 2: today is good
  Page 3: good weather is good

Map phase (one map worker per page):
  Worker 1: (the 1), (weather 1), (is 1), (good 1)
  Worker 2: (today 1), (is 1), (good 1)
  Worker 3: (good 1), (weather 1), (is 1), (good 1)

Shuffle (pairs with the same key are fed to the same reduce worker):
  Worker 1: (the 1)
  Worker 2: (is 1), (is 1), (is 1)
  Worker 3: (weather 1), (weather 1)
  Worker 4: (today 1)
  Worker 5: (good 1), (good 1), (good 1), (good 1)

Reduce phase:
  Worker 1: (the 1)
  Worker 2: (is 3)
  Worker 3: (weather 2)
  Worker 4: (today 1)
  Worker 5: (good 4)
