

  1. IDS594 Special Topics in Big Data Analytics Week 5

  2. Chaining Jobs • Many problems can be solved with MapReduce by writing several MapReduce steps that run in series to accomplish a goal. • Run the same Mapper and Reducer multiple times with slight alterations, such as a change of input and output files. Each iteration can use the previous iteration's output as its input. • Map1 -> Reduce1 -> Map2 -> Reduce2 -> Map3 …

  3. Method 1 • First create the JobConf object "job1" for the first job and set all the parameters, with "input" as the input directory and "temp" as the output directory. Execute this job: JobClient.runJob(job1). • Immediately below it, create the JobConf object "job2" for the second job and set all the parameters, with "temp" as the input directory and "output" as the output directory. Finally, execute the second job: JobClient.runJob(job2). A minimal sketch follows.
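Putting Method 1 together, a driver might look like the sketch below (classic org.apache.hadoop.mapred API). FirstMapper, FirstReducer, SecondMapper, and SecondReducer are placeholder class names, and the key/value classes are assumed to be Text/IntWritable:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ChainDriver {
      public static void main(String[] args) throws Exception {
        // Job 1: reads "input", writes "temp".
        JobConf job1 = new JobConf(ChainDriver.class);
        job1.setJobName("job1");
        job1.setMapperClass(FirstMapper.class);    // placeholder
        job1.setReducerClass(FirstReducer.class);  // placeholder
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job1, new Path("input"));
        FileOutputFormat.setOutputPath(job1, new Path("temp"));
        JobClient.runJob(job1);  // blocks until job 1 completes

        // Job 2: reads job 1's output from "temp", writes "output".
        JobConf job2 = new JobConf(ChainDriver.class);
        job2.setJobName("job2");
        job2.setMapperClass(SecondMapper.class);   // placeholder
        job2.setReducerClass(SecondReducer.class); // placeholder
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job2, new Path("temp"));
        FileOutputFormat.setOutputPath(job2, new Path("output"));
        JobClient.runJob(job2);
      }
    }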

  4. Method 2 • Create two JobConf objects and set all the parameters in them just as in Method 1, except that you don't call JobClient.runJob. • Create two Job objects (org.apache.hadoop.mapred.jobcontrol.Job) with the JobConfs as parameters: Job job1 = new Job(jobconf1); Job job2 = new Job(jobconf2); • Using a JobControl object, specify the job dependencies and then run the jobs (see the sketch below): • JobControl jbcntrl = new JobControl("jbcntrl"); • jbcntrl.addJob(job1); • jbcntrl.addJob(job2); • job2.addDependingJob(job1); • jbcntrl.run();
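One caveat: JobControl.run() occupies the calling thread and does not return on its own when all jobs finish, so a common pattern (a sketch, not the slide's exact code) is to run the controller in its own thread, poll allFinished(), and then stop it:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.jobcontrol.Job;
    import org.apache.hadoop.mapred.jobcontrol.JobControl;

    public class JobControlDriver {
      // jobconf1 and jobconf2 are the two fully configured JobConf objects.
      public static void runChained(JobConf jobconf1, JobConf jobconf2)
          throws Exception {
        Job job1 = new Job(jobconf1);
        Job job2 = new Job(jobconf2);
        job2.addDependingJob(job1);  // job2 starts only after job1 succeeds

        JobControl jbcntrl = new JobControl("jbcntrl");
        jbcntrl.addJob(job1);
        jbcntrl.addJob(job2);

        // Run the controller in its own thread and poll for completion.
        Thread controller = new Thread(jbcntrl);
        controller.start();
        while (!jbcntrl.allFinished()) {
          Thread.sleep(500);
        }
        jbcntrl.stop();
      }
    }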

  5. Method 3 • Using counters and termination conditions: run the job iteratively, and after each pass use a counter to decide whether another iteration is needed (detailed in slides 12–15).

  6. Breadth-First Search • One way of performing BFS is by coloring the nodes and traversing according to the color of the nodes. • white (unvisited), gray (visited), and black (finished) • At the beginning, all nodes are colored white. • The source node is colored gray. • A gray node indicates that it has been visited and its neighbors should be processed. • All the white nodes adjacent to a gray node are changed to gray. • The original gray node is then colored black. • The process continues until there are no more gray nodes to process in the graph.

  7. Single-Source Shortest Paths Using Parallel Breadth-First Search • Input format: • source<tab>adjacency_list|distance_from_the_source|color|parentNode • All edge weights are assumed to be 1.

  8. Sample Input:
1<tab>2,3|0|GRAY|source
2<tab>1,3,4,5|Integer.MAX_VALUE|WHITE|null
3<tab>1,4,2|Integer.MAX_VALUE|WHITE|null
4<tab>2,3|Integer.MAX_VALUE|WHITE|null
5<tab>2|Integer.MAX_VALUE|WHITE|null
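Before looking at the intermediate outputs, here is a sketch (an illustration under the format above, not the course's exact code) of the map step that drives one BFS iteration: a GRAY node emits each neighbor as GRAY at distance + 1 and re-emits itself as BLACK, while WHITE and BLACK nodes pass through unchanged. The reducer (not shown) then merges all copies of each node, keeping the full adjacency list, the shortest distance, and the darkest color, and increments the counter of slide 14 when GRAY nodes remain:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class BFSMapper extends Mapper<Object, Text, Text, Text> {

      @Override
      protected void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        // Record format: id<TAB>adjacency_list|distance|color|parent
        String[] parts = value.toString().split("\t");
        String id = parts[0];
        String[] fields = parts[1].split("\\|");
        String adjacency = fields[0];
        String color = fields[2];

        if (color.equals("GRAY")) {
          long distance = Long.parseLong(fields[1]);  // GRAY nodes have a finite distance
          // Expand the frontier: every neighbor becomes GRAY at distance + 1,
          // with this node recorded as its parent.
          for (String neighbor : adjacency.split(",")) {
            if (!neighbor.isEmpty()) {
              context.write(new Text(neighbor),
                  new Text("|" + (distance + 1) + "|GRAY|" + id));
            }
          }
          // The expanded node itself is finished: re-emit it as BLACK.
          context.write(new Text(id),
              new Text(adjacency + "|" + distance + "|BLACK|" + fields[3]));
        } else {
          // WHITE and BLACK nodes pass through unchanged.
          context.write(new Text(id), new Text(parts[1]));
        }
      }
    }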

  9. Intermediate output 1:
Reducer 1 (part-r-00000):
2<tab>1,3,4,5,|1|GRAY|1
5<tab>2,|Integer.MAX_VALUE|WHITE|null
Reducer 2 (part-r-00001):
3<tab>1,4,2,|1|GRAY|1
Reducer 3 (part-r-00002):
1<tab>2,3,|0|BLACK|source
4<tab>2,3,|Integer.MAX_VALUE|WHITE|null

  10. Intermediate output 2:
Reducer 1 (part-r-00000):
2<tab>1,3,4,5,|1|BLACK|1
5<tab>2,|2|GRAY|2
Reducer 2 (part-r-00001):
3<tab>1,4,2,|1|BLACK|1
Reducer 3 (part-r-00002):
1<tab>2,3,|0|BLACK|source
4<tab>2,3,|2|GRAY|2

  11. Final output:
Reducer 1 (part-r-00000):
2<tab>1,3,4,5,|1|BLACK|1
5<tab>2,|2|BLACK|2
Reducer 2 (part-r-00001):
3<tab>1,4,2,|1|BLACK|1
Reducer 3 (part-r-00002):
1<tab>2,3,|0|BLACK|source
4<tab>2,3,|2|BLACK|2

  12. Counters • Gather statistics about the job, for quality control or for application-level statistics. • Hadoop maintains some built-in counters for every job, which report various metrics for your job. • User-defined counters can be incremented or decremented by the driver, mapper, or reducer. • Counters are defined by a Java enum, which serves to group related counters. • Counters are global.

  13. static enum MoreIterations { numberOfIterations } • "MoreIterations" is the group name for the counter. • "numberOfIterations" is the counter name. • numberOfIterations acts as a global variable shared between the reducer and the driver. • numberOfIterations is incremented if there are more gray nodes to process in the graph.

  14. Reducer • In addition to the required reduce work, you have to add the following: if (outNode.getColor() == Node.Color.GRAY) { context.getCounter(MoreIterations.numberOfIterations).increment(1L); }

  15. public int run(String[] args) throws Exception {
  int iterationCount = 0;  // counter to set the ordinal number of the intermediate outputs
  Job job;
  long terminationValue = 1;
  while (terminationValue > 0) {
    job = getJobConf(args);  // get the job configuration
    String input, output;
    if (iterationCount == 0)
      input = args[0];  // on the first iteration, the input is the first input argument
    else
      input = args[1] + iterationCount;  // otherwise, the previous iteration's output
    output = args[1] + (iterationCount + 1);  // setting the output file
    FileInputFormat.setInputPaths(job, new Path(input));  // setting the input files
    FileOutputFormat.setOutputPath(job, new Path(output));
    job.waitForCompletion(true);  // wait for the job to complete
    Counters jobCntrs = job.getCounters();
    terminationValue = jobCntrs.findCounter(MoreIterations.numberOfIterations).getValue();
    // If the counter's value was incremented in the reducer(s), there are more
    // GRAY nodes to process, implying that the iteration has to continue.
    iterationCount++;
  }
  return 0;
}

  16. Frequent Itemsets and Association Rules

  17. The Market Basket Model • Item • Basket (“transaction”) • Each basket contains a set of items (an itemset). • The number of items in a basket is much smaller than the total number of items. • The number of baskets is very large, bigger than what can fit in main memory.

  18. Real Data Sample (baskets are sets):
1. {Cat, and, dog, bites}
2. {Yahoo, news, claims, a, cat, mated, with, a, dog, and, produced, viable, offspring}
3. {Cat, killer, likely, is, a, big, dog}
4. {Professional, free, advice, on, dog, training, puppy, training}
5. {Cat, and, kitten, training, and, behavior}
6. {Dog, &, Cat, provides, dog, training, in, Eugene, Oregon}
7. {“Dog, and, cat”, is, a, slang, term, used, by, police, officers, for, a, male–female, relationship}
8. {Shop, for, your, show, dog, grooming, and, pet, supplies}

  19. Support Threshold • If I is a set of items {i1, i2, …, ik}, the support for I is the number of baskets of which I is a subset. We say I is frequent if its support is at least a threshold s.

  20. Frequent Singletons • Among the singleton sets, obviously {cat} and {dog} are quite frequent. • “dog”: 7 • “cat”: 6 • “and”: 5 • “a” and “training”: 3 • “for” and “is”: 2 • No other word appears more than once. • 5 frequent singleton itemsets if s = 3: • {dog}, {cat}, {and}, {a}, and {training}.

  21. Frequent Doubletons • A doubleton cannot be frequent unless both items in the set are frequent by themselves. • 5 frequent doubletons if s = 3: counting over the sample baskets, {dog, cat} (5), {dog, and} (4), {cat, and} (4), {dog, a} (3), and {cat, a} (3).

  22. Frequent Triples • In order to be a frequent triple, each pair of elements in the set must be a frequent doubleton. • {dog, a, and} cannot be a frequent itemset, because if it were, then surely {a, and} would be frequent, but it is not. • {dog, cat, a} might be frequent, since its doubletons are all frequent. In fact, it is a frequent triple (s = 3) • As there is only one frequent triple, there can be no frequent quadruples or larger sets.

  23. Applications of Frequent Itemsets • True market basket analysis • Items: products • Baskets: transactions in a retail store • Most famous frequent doubleton: (diapers, beer) • Related concepts • Items: words • Baskets: webpages / blogs / tweets / news articles • Plagiarism • Items: documents • Baskets: sentences

  24. Association Rules • The form of an association rule is I → j, where I is a set of items and j is an item. • If all of the items in I appear in some basket, then j is “likely” to appear in that basket as well. • Confidence formalizes “likely”: • Confidence(I → j) = Support(I ∪ {j}) / Support(I) • Confidence(I → j) != Confidence(j → I)

  25. Example • {cat, dog} → and • Support({cat, dog}) = 5 • Support({cat, dog, and}) = 3 • Confidence = 3/5 • {cat} → kitten • Support({cat, kitten}) = 1 • Support({cat}) = 6 • Confidence = 1/6

  26. Interest of an Association Rule • The interest of an association rule I → j is the difference between its confidence and the fraction of baskets containing j: • Interest(I → j) = Confidence(I → j) − Support(j)/B • where B is the total number of baskets. • A strongly positive or strongly negative interest is “interesting”. • For example, Interest({cat, dog} → and) = 3/5 − 5/8 = −0.025, a value near zero: knowing a basket contains cat and dog tells us little about whether it contains and.

  27. Goal • To find association rules with high confidence • Support for I must be reasonably high • Around 1% of the baskets • Confidence for I → j must be reasonably high • Around 50%

  28. Distributed Algorithm

  29. Representation of Market-Basket Data • Store as a text file • Each line represents a basket (“transaction”) • Within each basket, items are separated by commas. • It takes approximately O(n^k / k!) time to generate all the subsets of size k for a basket with n items, since C(n, k) ≈ n^k / k! when k is small relative to n.

  30. Often, we need only small frequent itemsets, so k never grows beyond 2 (the two-itemset problem) or 3 (the three-itemset problem). • {a} → b • {a, b} → c

  31. Can It Fit the MapReduce Framework? • Confidence({a} → b) = Support({a, b}) / Support({a}) • Each transaction is independent of the others • (key, value) pairs • Similar to a counting problem YES, IT CAN

  32. First Map Function map(LongWritable, Text, Text, IntWritable) { // parse each line to get all items; // for each single item i: // (i, 1) → output; // for each pair of items (i, j): // ({i, j}, 1) → output; } A runnable sketch follows.
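A runnable version of this map function might look like the sketch below (new org.apache.hadoop.mapreduce API). Sorting the basket is an assumption added so that {i, j} and {j, i} collapse into one key:

    import java.io.IOException;
    import java.util.Arrays;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ItemsetMapper extends Mapper<Object, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final Text outKey = new Text();

      @Override
      protected void map(Object key, Text line, Context context)
          throws IOException, InterruptedException {
        // Each input line is one basket: items separated by commas.
        String[] items = line.toString().split(",");
        for (int i = 0; i < items.length; i++) {
          items[i] = items[i].trim();
        }
        Arrays.sort(items);  // canonical order so {i, j} == {j, i}

        for (int i = 0; i < items.length; i++) {
          // Singleton: (i, 1)
          outKey.set(items[i]);
          context.write(outKey, ONE);
          for (int j = i + 1; j < items.length; j++) {
            // Pair: ({i, j}, 1)
            outKey.set(items[i] + "," + items[j]);
            context.write(outKey, ONE);
          }
        }
      }
    }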

  33. First Reduce Function reduce(Text, Iterator<IntWritable>, Text, IntWritable) { // for each key: // sum += value; // output.put(key, sum); }
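A matching reducer, again as a sketch in the new API, simply sums the 1s to get the support count of each itemset key:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ItemsetReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

      private final IntWritable result = new IntWritable();

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values,
          Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
          sum += v.get();  // support count for this itemset
        }
        result.set(sum);
        context.write(key, result);  // e.g. "cat,dog<TAB>5"
      }
    }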

  34. Second Map Function • The input of this map is the output of the first reduce function. • It doesn't do anything except read back the temporary output of the first reduce function (see the sketch below).
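A sketch of that pass-through step: it parses each "itemset<TAB>count" line written by the first job (assuming the default TextOutputFormat) and re-emits it as a (Text, IntWritable) pair:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class PassThroughMapper
        extends Mapper<Object, Text, Text, IntWritable> {

      @Override
      protected void map(Object key, Text line, Context context)
          throws IOException, InterruptedException {
        // Input line: "itemset<TAB>count" from the first job's output.
        String[] parts = line.toString().split("\t");
        context.write(new Text(parts[0]),
            new IntWritable(Integer.parseInt(parts[1])));
      }
    }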

  35. Second Reduce Function reduce(Text, Iterator<IntWritable>, Text, DoubleWritable) { // for each key: if (size of key == 2) { confidence(i → j) = value({i, j}) / value(i); confidence(j → i) = value({i, j}) / value(j); } // output the confidences for i → j and j → i }
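Note that the pseudocode needs the singleton supports value(i) and value(j) while reducing the pair key {i, j}, but those counts arrive under different keys. One common workaround, sketched here as an assumption rather than the course's method, is to load the singleton counts as side data in setup():

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ConfidenceReducer
        extends Reducer<Text, IntWritable, Text, DoubleWritable> {

      // Singleton supports, loaded as side data; how they get here
      // (distributed cache, direct HDFS read) is an assumption.
      private final Map<String, Integer> singletonCounts = new HashMap<>();

      @Override
      protected void setup(Context context) {
        // Hypothetical helper: read "item<TAB>count" lines produced by
        // the first job into singletonCounts, e.g. from the cache files.
        // loadSingletonCounts(context.getCacheFiles(), singletonCounts);
      }

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values,
          Context context) throws IOException, InterruptedException {
        int pairCount = 0;
        for (IntWritable v : values) {
          pairCount += v.get();
        }
        String[] items = key.toString().split(",");
        if (items.length == 2) {  // only pair keys produce rules
          double supportI = singletonCounts.getOrDefault(items[0], 0);
          double supportJ = singletonCounts.getOrDefault(items[1], 0);
          if (supportI > 0) {  // confidence(i -> j)
            context.write(new Text(items[0] + " -> " + items[1]),
                new DoubleWritable(pairCount / supportI));
          }
          if (supportJ > 0) {  // confidence(j -> i)
            context.write(new Text(items[1] + " -> " + items[0]),
                new DoubleWritable(pairCount / supportJ));
          }
        }
      }
    }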

  36. Three Itemsets? • {a} → b • {a, b} → c • What you need to change is the first map function, so that it emits all subsets of size 1, 2, and 3. • List all singletons, doubletons, and triples (see the sketch below). • The second reduce function also needs to change: • {a, b} → c: Support({a, b, c}) / Support({a, b}) • {a, c} → b: Support({a, b, c}) / Support({a, c}) • {b, c} → a: Support({a, b, c}) / Support({b, c})
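The changed inner loops of the first map function might look like this sketch, where items is the sorted basket and emit(key) stands in for context.write(new Text(key), ONE) from the earlier mapper:

    for (int i = 0; i < items.length; i++) {
      emit(items[i]);                                        // singletons
      for (int j = i + 1; j < items.length; j++) {
        emit(items[i] + "," + items[j]);                     // doubletons
        for (int k = j + 1; k < items.length; k++) {
          emit(items[i] + "," + items[j] + "," + items[k]);  // triples
        }
      }
    }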

  37. Thresholds • Set the threshold to get high support • Minimum support: around 1% • Set the threshold to get high confidence • Minimum confidence: around 50%
