IDS594 Special Topics in Big Data Analytics


  1. IDS594 Special Topics in Big Data Analytics Week 3

  2. What is MapReduce?

  3. General Framework of a MapReduce Program
  mapper class {
      // map() function
  }
  reducer class {
      // reduce() function
  }
  driver class {
      // parameters about job configurations
  }

  4. MapReduce Pipeline

  5. Mapper • map(inKey, inValue) → list of (intermediateKey, intermediateValue) • The input to the map function is in the form of key-value pairs, even though the input to a MapReduce program is a file or files. • By default, the value is a data record and the key is the offset of the data record from the beginning of the data file.

  6. Mapper • The output consists of a collection of key-value pairs which are the input for the reduce function. • Word count example. • The input to the mapper is each line of the file, while the output from each mapper is a set of key-value pairs where each word is a key and the number 1 is the value. • (5539, “I am taking IDS594 class. This class is fun.”) → (I, 1) (am, 1) (taking, 1) (IDS594, 1) (class, 1) (This, 1) (class, 1) (is, 1) (fun, 1)

  7. Mapper Class
  public class MapperClass extends MapReduceBase
      implements Mapper<inKeyType, inValueType, intermediateKeyType, intermediateValueType> {
    // some variable definitions here;
    public void map(inKeyType inKey, inValueType inValue,
                    OutputCollector<intermediateKeyType, intermediateValueType> output,
                    Reporter reporter) throws IOException {
      // implementation body;
    }
  }

  8. Reducer • reduce(intermediateKey, list(intermediateValue)) → list(outKey, outValue) • Each reduce function processes the intermediate values for a particular key generated by the map function and generates the output. • Each intermediate key is assigned to exactly one reducer, and the keys are processed independently of one another. • The number of reducers is decided by the user. By default, it is 1.

  9. Reducer • Word count example • (I, 1) (am, 1) (taking, 1) (IDS594, 1) (class, 1) (This, 1) (class, 1) (is, 1) (fun, 1) → (I, 1) (am, 1) (taking, 1) (IDS594, 1) (class, 2) (This, 1) (is, 1) (fun, 1)

  10. Reducer Class
  public class ReducerClass extends MapReduceBase
      implements Reducer<intermediateKeyType, intermediateValueType, outKeyType, outValueType> {
    // some variable definitions here;
    public void reduce(intermediateKeyType inKey, Iterator<intermediateValueType> inValues,
                       OutputCollector<outKeyType, outValueType> output,
                       Reporter reporter) throws IOException {
      // implementation body;
    }
  }

  11. MapReduce Process

  12. Driver Class • It is responsible for triggering the MapReduce job in Hadoop.
  public class DriverClass extends Configured implements Tool {
    public int run(String[] args) throws Exception {
      // some configuration statements here…
      return 0;
    }
    public static void main(String[] args) throws Exception {
      int res = ToolRunner.run(new Configuration(), new DriverClass(), args);
      System.exit(res);
    }
  }

  13. Word Count Example

  14. Mapper Class

  15. import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.io.*;
  import org.apache.hadoop.mapred.*;

  public class WordCountMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // map method that tokenizes the input and frames the initial key-value pairs
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      // take one line at a time and tokenize it
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      // iterate through all the words in that line and form the key-value pairs
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        // send to the output collector, which in turn passes the pair to the reducer
        output.collect(word, one);
      }
    }
  }

  16. Reducer Class

  17. import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.*;
  import org.apache.hadoop.mapred.*;

  public class WordCountReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    // the reduce method accepts the key-value pairs from the mappers,
    // aggregates the values by key, and produces the final output
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      // iterate through all the values available for a key, add them together,
      // and emit the key with the sum of its values
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  18. Driver Class

  19. import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.conf.*;
  import org.apache.hadoop.io.*;
  import org.apache.hadoop.mapred.*;
  import org.apache.hadoop.util.*;

  public class WordCount extends Configured implements Tool {
    public int run(String[] args) throws Exception {
      // create a JobConf object and assign a job name for identification purposes
      JobConf conf = new JobConf(getConf(), WordCount.class);
      conf.setJobName("WordCount");
      // set the data types of the output key and value
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      // provide the jar, mapper, and reducer classes
      conf.setJarByClass(WordCount.class);
      conf.setMapperClass(WordCountMapper.class);
      conf.setReducerClass(WordCountReducer.class);
      // the HDFS input and output directories, fetched from the command line
      FileInputFormat.addInputPath(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
      return 0;
    }
    public static void main(String[] args) throws Exception {
      int res = ToolRunner.run(new Configuration(), new WordCount(), args);
      System.exit(res);
    }
  }

  20. Driver Class • Sometimes, programmers put the driver into the main method instead of having a separate driver class, as in the sketch below.
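A minimal sketch (not from the slides) of that style for the word count job, assuming the WordCountMapper and WordCountReducer classes from the previous slides; the class name WordCountMain is made up for the example.

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.*;
  import org.apache.hadoop.mapred.*;

  public class WordCountMain {
    public static void main(String[] args) throws Exception {
      // configure the job directly in main(), without Tool/ToolRunner
      JobConf conf = new JobConf(WordCountMain.class);
      conf.setJobName("WordCount");
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      conf.setMapperClass(WordCountMapper.class);
      conf.setReducerClass(WordCountReducer.class);
      FileInputFormat.addInputPath(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      // submit the job and wait for it to finish
      JobClient.runJob(conf);
    }
  }

One trade-off of this style: without ToolRunner, generic command-line options such as -D mapred.reduce.tasks=15 are not parsed automatically.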

  21. Run MapReduce Programs under Hadoop • Create a jar file, either using Export in Eclipse or using the jar command. • jar -cvf <jar filename> <class file> <class file> … • Move the data to HDFS • ./bin/hadoop fs -copyFromLocal <local path to files> <HDFS path> • -put works as well • Execute the Hadoop job • ./bin/hadoop jar <jar file> <class name> <parameters> • Specify the number of reduce tasks • -D mapred.reduce.tasks=15 • Merge the output • Each reducer will generate its own output file • Use -getmerge to get a merged result

  22. Monitor Jobs • HDFS - http://localhost:50070/ • JobTracker - http://localhost:50030/

  23. Input Files • This is where the data for a MapReduce job is initially stored. • Can be in any format • Line-based log files • Binary format • Multi-line input records • Typically very large – tens of gigabytes or more

  24. InputFormat Class • How these input files are split up and read is defined by the InputFormat class. • Functionalities: • Selects the files or other objects that should be used for input • Defines the InputSplits that break a file into tasks • Provides a factory for RecordReader objects that read the file

  25. Types of InputFormat • TextInputFormat is useful for unformatted data or line-based records like log files. • KeyValueInputFormat is useful for reading the output of one MapReduce job as the input to another. • SequenceFileInputFormat reads special binary files that are specific to Hadoop. • Sequence files are block-compressed and provide direct serialization and deserialization of several arbitrary data types (not just text).
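As a rough illustration (not from the slides), the input format is selected in the driver through the JobConf, using the same old mapred API as the word count example. The class names are the standard ones in org.apache.hadoop.mapred; note that the line-oriented key-value reader is spelled KeyValueTextInputFormat there.

  // inside the driver's run() method, after creating the JobConf
  JobConf conf = new JobConf(getConf(), WordCount.class);
  // TextInputFormat is the default: key = byte offset, value = line of text
  conf.setInputFormat(TextInputFormat.class);
  // or read key<tab>value lines, e.g. the output of a previous MapReduce job
  // conf.setInputFormat(KeyValueTextInputFormat.class);
  // or read Hadoop's block-compressed binary sequence files
  // conf.setInputFormat(SequenceFileInputFormat.class);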

  26. InputSplits • An InputSplit describes a unit of work that comprises a single map task in a MapReduce program. • By default, the FileInputFormat and its descendants break a file up into 64MB chunks (the same size as blocks in HDFS). • mapred.min.split.size in hadoop-site.xml • Overriding this parameter in the JobConf object used in the driver class. • Map tasks are performed in parallel. • mapred.tasktracker.map.tasks.maximum for on-node parallelism
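A small sketch (an assumption, not from the slides) of adjusting the split size from the driver; the property name is the Hadoop 1.x one quoted on the slide, and 134217728 is just an example value (128 MB).

  // inside the driver's run() method
  JobConf conf = new JobConf(getConf(), WordCount.class);
  // ask for larger splits so each map task reads at least 128 MB
  conf.set("mapred.min.split.size", "134217728");

The per-node limit, mapred.tasktracker.map.tasks.maximum, is a TaskTracker setting configured cluster-side (hadoop-site.xml / mapred-site.xml) rather than per job.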

  27. RecordReader

  28. Customized InputFormat • Subclass FileInputFormat rather than implement InputFormat directly. • We have to override the method that returns a RecordReader, an object that can read from the input source: getRecordReader() in the old mapred API (next slide), or createRecordReader() in the new mapreduce API (the NLinesInputFormat example). • Sample input records: Ball, 3.5, 12.7, 9.0 Car, 15, 23.76, 42.23 Device, 0.0, 12.4, -67.1

  29. public class ObjectPositionInputFormat extends FileInputFormat<Text, Point3D> {
    public RecordReader<Text, Point3D> getRecordReader(
        InputSplit input, JobConf job, Reporter reporter) throws IOException {
      reporter.setStatus(input.toString());
      return new ObjPosRecordReader(job, (FileSplit) input);
    }
  }

  30. public class ObjPosRecordReader implements RecordReader<Text, Point3D> {
    private LineRecordReader lineReader;
    private LongWritable lineKey;
    private Text lineValue;

    public ObjPosRecordReader(JobConf job, FileSplit split) throws IOException {
      lineReader = new LineRecordReader(job, split);
      lineKey = lineReader.createKey();
      lineValue = lineReader.createValue();
    }

    public boolean next(Text key, Point3D value) throws IOException {
      // get the next line
      if (!lineReader.next(lineKey, lineValue)) {
        return false;
      }
      // parse the lineValue, which is in the format: objName, x, y, z
      String[] pieces = lineValue.toString().split(",");
      if (pieces.length != 4) {
        throw new IOException("Invalid record received");
      }
      // try to parse the floating point components of the value
      float fx, fy, fz;
      try {
        fx = Float.parseFloat(pieces[1].trim());
        fy = Float.parseFloat(pieces[2].trim());
        fz = Float.parseFloat(pieces[3].trim());
      } catch (NumberFormatException nfe) {
        throw new IOException("Error parsing floating point value in record");
      }
      // now that we know we'll succeed, overwrite the output objects
      key.set(pieces[0].trim()); // objName is the output key
      value.x = fx; value.y = fy; value.z = fz;
      return true;
    }

    public Text createKey() { return new Text(""); }
    public Point3D createValue() { return new Point3D(); }
    public long getPos() throws IOException { return lineReader.getPos(); }
    public void close() throws IOException { lineReader.close(); }
    public float getProgress() throws IOException { return lineReader.getProgress(); }
  }

  31. Another example of a customized InputFormat (this one uses the new mapreduce API)
  public class NLinesInputFormat extends TextInputFormat {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
      return new NLinesRecordReader();
    }
  }

  32. public class NLinesRecordReader extends RecordReader<LongWritable, Text> {
    private final int NLINESTOPROCESS = 3;
    // fields such as key, value, pos, end, in (a LineReader), and maxLineLength are
    // set up in initialize(); that method and the remaining RecordReader methods
    // (getCurrentKey, getCurrentValue, getProgress, close) are omitted on the slide
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
      key.set(pos);
      value.clear();
      final Text endline = new Text("\n");
      int newSize = 0;
      // read up to NLINESTOPROCESS lines and append them all to a single value
      for (int i = 0; i < NLINESTOPROCESS; i++) {
        Text v = new Text();
        while (pos < end) {
          newSize = in.readLine(v, maxLineLength,
              Math.max((int) Math.min(Integer.MAX_VALUE, end - pos), maxLineLength));
          value.append(v.getBytes(), 0, v.getLength());
          value.append(endline.getBytes(), 0, endline.getLength());
          if (newSize == 0) break;
          pos += newSize;
          if (newSize < maxLineLength) break;
        }
      }
      // no more data: signal that there is no further key-value pair
      return newSize != 0;
    }
  }

  33. In the driver class, add the following line: job.setInputFormatClass(NLinesInputFormat.class);
  • The map() function in the Mapper class changes accordingly:
  public void map(LongWritable key, Text value, Context context)
      throws java.io.IOException, InterruptedException {
    String lines = value.toString();
    String[] lineArr = lines.split("\n");
    int lcount = lineArr.length;
    context.write(new Text(new Integer(lcount).toString()), new IntWritable(1));
  }

  34. Mapper • A new instance of Mapper is instantiated in a separate Java process for each map task (InputSplit). • The individual Mappers are not provided a mechanism to communicate with one another. • Two additional parameters in the map() method • OutputCollector: has a method collect() which forwards a (key, value) pair to the reduce phase of the job. • Reporter: provides information about the current task.
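As a small illustrative sketch (not from the slides), here is how both extra parameters might be used inside the word count map() method; the counter group and name strings are made up, and word and one are the fields defined in WordCountMapper.

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    // tell the framework what this task is currently doing
    reporter.setStatus("processing offset " + key.get());
    StringTokenizer tokenizer = new StringTokenizer(value.toString());
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      // forward the intermediate (key, value) pair to the shuffle/reduce phase
      output.collect(word, one);
      // increment a custom counter, visible in the JobTracker web UI
      reporter.incrCounter("WordCount", "TOKENS_EMITTED", 1);
    }
  }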

  35. Partition and Shuffle • Moving map outputs to the reducers is known as shuffling. • A different subset of the intermediate key space is assigned to each reduce node. These subsets (“partitions”) are the inputs to the reduce tasks. • All values for the same key are always reduced together, regardless of which mapper produced them.

  36. Partitioner • Determines which partition a given (key, value) pair will go to. • Determines which reducer instance will receive which intermediate keys and values. • The default partitioner computes a hash value for the key and assigns the partition based on this result. • Hadoop MapReduce determines how many partitions it will divide the data into. It is based on the number of reduce tasks (controlled by the JobConf.setNumReduceTasks() method).

  37. Customize Partitioner • The default partitioner implementation is called HashPartitioner. • It uses the hashCode() method of the key objects, modulo the total number of partitions, to determine which partition to send a given (key, value) pair to.
  public interface Partitioner<K, V> extends JobConfigurable {
    int getPartition(K key, V value, int numPartitions);
  }

  38. More on Partitioner • For most randomly-distributed data, HashPartitioner should result in all partitions being of roughly equal size. • If hashCode() does not provide uniformly distributed values over its range, data may not be sent to the reducers evenly. • Poor partitioning may result in unbalanced workloads and consequently inefficient performance. • Use the JobConf.setPartitionerClass() method to tell Hadoop to use a specific partitioner.
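A minimal sketch (an assumption, not code from the slides) of a custom partitioner in the old mapred API: it routes word count keys by their first character, so all words starting with the same character go to the same reduce task. The class name is made up.

  import org.apache.hadoop.io.*;
  import org.apache.hadoop.mapred.*;

  public class FirstCharPartitioner implements Partitioner<Text, IntWritable> {
    public void configure(JobConf job) {
      // no per-job configuration needed for this example
    }
    public int getPartition(Text key, IntWritable value, int numPartitions) {
      String s = key.toString();
      int firstChar = s.isEmpty() ? 0 : s.charAt(0);
      // mask the sign bit so the modulo result is never negative
      return (firstChar & Integer.MAX_VALUE) % numPartitions;
    }
  }

In the driver, conf.setPartitionerClass(FirstCharPartitioner.class) selects the class and conf.setNumReduceTasks(…) fixes the number of partitions.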

  39. Sort • Each reduce task may process the values associated with several intermediate keys from the map tasks. • The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.

  40. Reducer • The reduce() method receives a key as well as an iterator over all the values associated with that key. • OutputCollector and Reporter are the same as in the map() method.

  41. OutputFormat • The (key, value) pairs provided to the OutputCollector are then written to output files. • Each reducer writes to a separate file in a common output directory. • These files will typically be named part-nnnnn, where nnnnn is the partition id associated with the reduce task. • FileOutputFormat.setOutputPath()

  42. The NullOutputFormat generates no output files and disregards any (key, value) pairs passed to it by the OutputCollector. • It is useful if you are explicitly writing your own output files in the reduce() method, and do not want additional empty output files generated by the Hadoop framework.
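A small sketch (not from the slides) of wiring that up in the driver with the old mapred API:

  import org.apache.hadoop.mapred.lib.NullOutputFormat;

  // in the driver's run() method: suppress the framework-generated part-nnnnn files,
  // useful when the reduce() method writes its results somewhere itself
  conf.setOutputFormat(NullOutputFormat.class);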

  43. Combiner

  44. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers. • A "mini-reduce" process which operates only on data generated by one machine. • Because the combiner may run zero, one, or many times, it should only be used when the reduce operation is commutative and associative, as summation is in word count. • conf.setCombinerClass(Reduce.class); (for the word count job from the earlier slides, this would be conf.setCombinerClass(WordCountReducer.class);)

  45. Data Type • Writable types • IntWritable • LongWritable • FloatWritable • BooleanWritable • … • Your own classes which implement Writable (serialization)

  46. public class Point3D implements Writable {
    public float x;
    public float y;
    public float z;

    public Point3D(float x, float y, float z) {
      this.x = x; this.y = y; this.z = z;
    }
    public Point3D() {
      this(0.0f, 0.0f, 0.0f);
    }
    public void write(DataOutput out) throws IOException {
      out.writeFloat(x);
      out.writeFloat(y);
      out.writeFloat(z);
    }
    public void readFields(DataInput in) throws IOException {
      x = in.readFloat();
      y = in.readFloat();
      z = in.readFloat();
    }
    public String toString() {
      return Float.toString(x) + ", " + Float.toString(y) + ", " + Float.toString(z);
    }
  }

  47. MapReduce Example Finding the intersection of edges among multiple graphs

  48. Data
  Input: Graph_id<tab>source<tab>destination
  graph1<tab>node1<tab>node3
  graph1<tab>node1<tab>node2
  graph1<tab>node4<tab>node6
  graph2<tab>node3<tab>node5
  graph2<tab>node4<tab>node6
  graph2<tab>node7<tab>node3
  graph3<tab>node1<tab>node3
  graph3<tab>node1<tab>node2
  graph3<tab>node2<tab>node7
  Output:
  node1:node2<tab>graph3^graph2^graph1
  node2:node4<tab>graph3^graph2^graph1
  node2:node7<tab>graph3^graph2^graph1
  node3:node5<tab>graph3^graph2^graph1
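The slides stop at the data. As a rough sketch (an assumption about the intended solution, with made-up class names) using the same old mapred API as the earlier examples: the mapper keys each edge as source:destination and emits the graph id as the value, and the reducer keeps only edges reported by every graph.

  import java.io.IOException;
  import java.util.*;
  import org.apache.hadoop.io.*;
  import org.apache.hadoop.mapred.*;

  // each class would live in its own source file
  public class EdgeIntersectionMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      // input line: graphId<tab>source<tab>destination
      String[] parts = value.toString().split("\t");
      if (parts.length != 3) return;
      // key the edge by "source:destination"; the value is the graph id
      output.collect(new Text(parts[1] + ":" + parts[2]), new Text(parts[0]));
    }
  }

  public class EdgeIntersectionReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    private static final int NUM_GRAPHS = 3; // assumed here; pass it via the JobConf in a real job
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      // collect the distinct graphs that contain this edge
      Set<String> graphs = new HashSet<String>();
      while (values.hasNext()) {
        graphs.add(values.next().toString());
      }
      // emit the edge only if it appears in every graph
      if (graphs.size() == NUM_GRAPHS) {
        StringBuilder sb = new StringBuilder();
        for (String g : graphs) {
          if (sb.length() > 0) sb.append("^");
          sb.append(g);
        }
        output.collect(key, new Text(sb.toString()));
      }
    }
  }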
