Harnessing MapReduce for Machine Learning on Multicore Processors

MapReduce for Machine Learning on Multicore • Cheng-Tao Chu, Sang Kyun Kim, et al. • Stanford University, Stanford CA • Presented by Inna Rytsareva

Outline • What is MapReduce? • Problem Description and Formalization • Statistical Query Model and Summation Form • Architecture (inspired by MapReduce) • Adopted ML Algorithms • Experiments • Future of MapReduce for Machine Learning • Discussion Map-Reduce for Machine Learning on Multicore

Motivation • Problem: lots of data • Example: 20+ billion web pages x 20KB = 400+ terabytes • One computer can read 30-35 MB/sec from disk • ~four months to read the web • ~1,000 hard drives just to store the web • Even more to do something with the data
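The arithmetic behind these figures can be checked with a short sketch. It assumes decimal units (1 KB = 1,000 bytes), the upper read rate of 35 MB/sec, and a perfectly even split across 1,000 machines; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the slide's numbers (decimal units assumed).
PAGES = 20e9          # 20+ billion web pages
PAGE_BYTES = 20e3     # 20 KB per page
READ_RATE = 35e6      # bytes per second read from one disk
MACHINES = 1000       # assumed perfectly even split of the work

total_bytes = PAGES * PAGE_BYTES            # 4e14 bytes, i.e. 400 TB
seconds_one = total_bytes / READ_RATE       # time for a single machine
days_one = seconds_one / 86400
hours_many = seconds_one / MACHINES / 3600  # time with 1000 machines

print(f"one machine: {days_one:.0f} days; {MACHINES} machines: {hours_many:.1f} hours")
```

Under these assumptions a single machine needs about four months, while 1,000 machines need on the order of three hours.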

Motivation • Solution: spread the work over many machines • Same problem with 1000 machines, < 3 hours • But this requires a lot of extra effort: • programming work • communication and coordination • recovering from machine failure • status reporting • debugging • optimization • locality • and all of it repeated for every problem you want to solve

Cluster Computing • Many racks of computers, thousands of machines per cluster • Limited bisection bandwidth between racks http://upload.wikimedia.org/wikipedia/commons/d/d3/IBM_Blue_Gene_P_supercomputer.jpg

MapReduce • A simple programming model that applies to many large-scale computing problems • Hide messy details in MapReduce runtime library: • automatic parallelization • load balancing • network and disk transfer optimization • handling of machine failures • robustness

Programming model • Input & Output: each a set of key/value pairs • Programmer specifies two functions: map() reduce() • map (in_key, in_value) -> list(out_key, intermediate_value) • Processes input key/value pair • Produces set of intermediate pairs • reduce (out_key, list(intermediate_value)) -> list(out_value) • Combines all intermediate values for a particular key • Produces a set of merged output values (usually just one)

Example • Page 1: the weather is good • Page 2: today is good • Page 3: good weather is good.

Map Output • Worker 1: • (the 1), (weather 1), (is 1), (good 1). • Worker 2: • (today 1), (is 1), (good 1). • Worker 3: • (good 1), (weather 1), (is 1), (good 1).

Reduce Input • Worker 1: • (the 1) • Worker 2: • (is 1), (is 1), (is 1) • Worker 3: • (weather 1), (weather 1) • Worker 4: • (today 1) • Worker 5: • (good 1), (good 1), (good 1), (good 1)

Reduce Output • Worker 1: • (the 1) • Worker 2: • (is 3) • Worker 3: • (weather 2) • Worker 4: • (today 1) • Worker 5: • (good 4)
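The three-page example above can be reproduced with a minimal single-process sketch. The function names `map_fn`, `shuffle`, and `reduce_fn` are illustrative and do not belong to any MapReduce library. A real runtime would run the map and reduce calls on different workers.

```python
from collections import defaultdict

def map_fn(page_id, text):
    """Map: emit (word, 1) for every word on one page."""
    return [(word.strip(".").lower(), 1) for word in text.split()]

def shuffle(pairs):
    """Group intermediate values by key, as the runtime does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(word, counts):
    """Reduce: combine all intermediate values for one key."""
    return word, sum(counts)

pages = {
    1: "the weather is good",
    2: "today is good",
    3: "good weather is good.",
}

intermediate = [pair for pid, text in pages.items() for pair in map_fn(pid, text)]
word_counts = dict(reduce_fn(w, c) for w, c in shuffle(intermediate).items())
print(word_counts)
```

The resulting counts match the Reduce Output slide: the 1, is 3, weather 2, today 1, good 4.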

http://4.bp.blogspot.com/_j6mB7TMmJJY/STAYW9gC-NI/AAAAAAAAAGY/lLKo7sBp5i8/s1600-h/P1.png

Fault tolerance • On worker failure: • Detect failure via periodic heartbeats (ping) • Re-execute completed and in-progress map tasks • Re-execute in-progress reduce tasks • Task completion committed through master • On master failure: • Could handle, but don't yet (master failure unlikely)

MapReduce Transparencies • Parallelization • Fault-tolerance • Locality optimization • Load balancing

Suitable for your task if • Have a cluster • Working with large dataset • Working with independent data (or assumed) • Can be cast into map and reduce

References • Original paper J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Communications of the ACM, vol. 51, 2008, pp. 107-113. (http://labs.google.com/papers/mapreduce.html) • On wikipedia (http://en.wikipedia.org/wiki/MapReduce) • Hadoop – MapReduce in Java (http://lucene.apache.org/hadoop/) • Starfish - MapReduce in Ruby (http://rufy.com/starfish/)

Motivations • Industry-wide shift to multicore • No good framework for parallelizing ML algorithms • Goal: develop a general and exact technique for parallel programming of a large class of ML algorithms for multicore processors • Image: http://upload.wikimedia.org/wikipedia/commons/a/af/E6750bs8.jpg

Idea

Valiant Model [Valiant’84] • x is the input • y is a function of x that we want to learn • In Valiant model, the learning algorithm uses randomly drawn examples <x, y> to learn the target function

Statistical Query Model [Kearns’98] • A restriction on Valiant model • A learning algorithm uses some aggregates over the examples, not the individual examples • Given a function f(x,y) over instances (data points x and labels y), a statistical oracle will return an estimate of the expectation of f(x,y) • Any model that computes gradients or sufficient statistics over f(x,y) fits this model • Typically this is achieved by summing over the data.

Summation Form • Aggregate over the data: • Divide the data set into pieces • Compute aggregates on each core • Combine all results at the end
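A minimal sketch of the summation form, assuming the quantity of interest is the average of some function f(x, y) over the data. The helper names `split`, `partial_sum`, and `statistical_query` are hypothetical. The per-piece sums run sequentially here, standing in for work that would be assigned to separate cores.

```python
def split(data, n_pieces):
    """Divide the data set into n_pieces roughly equal pieces."""
    return [data[i::n_pieces] for i in range(n_pieces)]

def partial_sum(f, piece):
    """Aggregate computed on one core: sum of f over its piece."""
    return sum(f(x, y) for x, y in piece)

def statistical_query(f, data, n_cores=4):
    """Estimate E[f(x, y)] by combining per-piece sums at the end."""
    partials = [partial_sum(f, piece) for piece in split(data, n_cores)]
    return sum(partials) / len(data)

# Toy data set with labels y = 2x + 1 (illustrative only).
data = [(x, 2 * x + 1) for x in range(10)]
estimate = statistical_query(lambda x, y: x * y, data)
print(estimate)
```

Because addition is associative, the combined result does not depend on how the data is divided, which is why no approximation is introduced.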

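Example: Linear Regression • Model: $y = \theta^T x$ • Goal: $\theta^* = \arg\min_\theta \sum_{i=1}^{m} (\theta^T x_i - y_i)^2$ • Given m examples (x1, y1), (x2, y2), …, (xm, ym), we write a matrix X with x1, …, xm as rows, and a column vector $Y = (y_1, y_2, \ldots, y_m)^T$ • Solution: $\theta^* = (X^T X)^{-1} X^T Y$ • Parallel computation: $A = X^T X = \sum_{i=1}^{m} x_i x_i^T$ and $b = X^T Y = \sum_{i=1}^{m} x_i y_i$ are both sums over examples, so each core can sum over its own piece of the data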
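A sketch of the parallel computation, assuming the usual normal-equations solution θ = (XᵀX)⁻¹XᵀY with A = XᵀX = Σ xᵢxᵢᵀ and b = XᵀY = Σ xᵢyᵢ. The function names and the small Gauss-Jordan solver are illustrative, written with the standard library only. The mappers run sequentially here in place of separate cores.

```python
def map_partial(piece, n):
    """Mapper: partial sums of x x^T and x y over one piece of the data."""
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for x, y in piece:
        for j in range(n):
            b[j] += x[j] * y
            for k in range(n):
                A[j][k] += x[j] * x[k]
    return A, b

def reduce_partials(partials, n):
    """Reducer: add the partial sums from all mappers."""
    A = [[sum(p[0][j][k] for p in partials) for k in range(n)] for j in range(n)]
    b = [sum(p[1][j] for p in partials) for j in range(n)]
    return A, b

def solve(A, b):
    """Solve A theta = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Toy data: y = 3 + 2 t, with a constant feature for the intercept.
data = [((1.0, float(t)), 3.0 + 2.0 * t) for t in range(8)]
pieces = [data[i::3] for i in range(3)]          # three "cores"
partials = [map_partial(p, 2) for p in pieces]
A, b = reduce_partials(partials, 2)
theta = solve(A, b)
print(theta)
```

Only the small n×n matrix A and the vector b cross between mappers and reducer, so the final solve is cheap compared with the pass over the data.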

Lighter Weight MapReduce for Multicore

Outline • What is MapReduce? • Problem Description and Formalization • Statistical Query Model and Summation Form • Architecture (inspired by MapReduce) • Adopted ML Algorithms • Experiments • Future of MapReduce for Machine Learning • Conclusion and Discussion

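Locally Weighted Linear Regression (LWLR) • Solve: $A\theta = b$, where $A = \sum_{i=1}^{m} w_i (x_i x_i^T)$ and $b = \sum_{i=1}^{m} w_i (x_i y_i)$ • Mappers: one set computes partial sums of A over subgroups of the data, the other set computes partial sums of b • Two reducers sum the partial results into A and b • Finally compute the solution $\theta = A^{-1} b$ • When every $w_i = 1$, this is ordinary least squares.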
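A sketch of the two mapper sets for the one-dimensional case, where A = Σ wᵢxᵢ² and b = Σ wᵢxᵢyᵢ are scalars and θ = b / A. The Gaussian kernel used for the weights and the bandwidth `tau` are assumptions for illustration; the slide does not say how wᵢ is chosen.

```python
import math

def weight(x, query, tau=1.0):
    """Assumed Gaussian kernel weight; larger for points near the query."""
    return math.exp(-((x - query) ** 2) / (2 * tau ** 2))

def map_A(piece, query):
    """First mapper set: partial sum of w_i * x_i * x_i."""
    return sum(weight(x, query) * x * x for x, _ in piece)

def map_b(piece, query):
    """Second mapper set: partial sum of w_i * x_i * y_i."""
    return sum(weight(x, query) * x * y for x, y in piece)

def lwlr(data, query, n_pieces=2):
    pieces = [data[i::n_pieces] for i in range(n_pieces)]
    A = sum(map_A(p, query) for p in pieces)   # reducer for A
    b = sum(map_b(p, query) for p in pieces)   # reducer for b
    return b / A                               # solve A * theta = b

# Toy data on the line y = 2x (illustrative only).
data = [(float(x), 2.0 * x) for x in range(1, 7)]
theta = lwlr(data, query=3.0)
print(theta)
```

Replacing `weight` with a function that always returns 1 gives the unweighted least-squares fit.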

Naïve Bayes (NB) • Goal: estimate P(xj=k|y=1) and P(xj=k|y=0) and P(y) • Computation: count the occurrence of (xj=k, y=1) and (xj=k, y=0), count the occurrence of (y=1) and (y=0) • Mappers: count a subgroup of training samples • Reducer: aggregate the intermediate counts, and calculate the final result
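The counting scheme can be sketched as follows. The names `map_counts` and `reduce_counts` are illustrative, the toy samples are made up, and no smoothing is applied, so this is only the raw maximum-likelihood estimate.

```python
from collections import Counter

def map_counts(piece):
    """Mapper: count (j, k, y) feature occurrences and labels in one subgroup."""
    feature_counts, label_counts = Counter(), Counter()
    for x, y in piece:
        label_counts[y] += 1
        for j, k in enumerate(x):
            feature_counts[(j, k, y)] += 1
    return feature_counts, label_counts

def reduce_counts(partials):
    """Reducer: add up the intermediate counts from all mappers."""
    feature_counts, label_counts = Counter(), Counter()
    for f, l in partials:
        feature_counts.update(f)
        label_counts.update(l)
    return feature_counts, label_counts

# Toy training set: (feature vector x, label y).
samples = [((1, 0), 1), ((1, 1), 1), ((0, 1), 0), ((0, 0), 0), ((1, 1), 0)]
pieces = [samples[:3], samples[3:]]
feat, label = reduce_counts([map_counts(p) for p in pieces])

total = sum(label.values())
p_y1 = label[1] / total                       # P(y=1)
p_x0_is_1_given_y1 = feat[(0, 1, 1)] / label[1]   # P(x0=1 | y=1)
p_x0_is_1_given_y0 = feat[(0, 1, 0)] / label[0]   # P(x0=1 | y=0)
print(p_y1, p_x0_is_1_given_y1, p_x0_is_1_given_y0)
```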

Gaussian Discriminative Analysis (GDA) • Goal: classification of x into classes of y • Assumes each class is Gaussian, with different means but the same covariance • Computation: estimate P(y), the class means, and the shared covariance • Mappers: compute the partial sums for a subgroup of training samples • Reducer: aggregate intermediate results

K-means • Computing the Euclidean distance between sample vectors and centroids • Recalculating the centroids • Divide the computation into subgroups to be handled by map-reduce
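One k-means iteration can be sketched in this style. The function names are hypothetical, and the squared distance is used because it yields the same nearest centroid as the Euclidean distance. Each mapper returns per-centroid vector sums and counts; the reducer turns them into new centroids.

```python
def nearest(point, centroids):
    """Index of the closest centroid (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
    )

def map_assign(piece, centroids):
    """Mapper: per-centroid coordinate sums and counts for one subgroup."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in piece:
        c = nearest(point, centroids)
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += point[d]
    return sums, counts

def reduce_centroids(partials, centroids):
    """Reducer: combine partial sums and recalculate the centroids."""
    dim = len(centroids[0])
    new_centroids = []
    for c in range(len(centroids)):
        n = sum(counts[c] for _, counts in partials)
        if n == 0:                      # keep an empty cluster's old centroid
            new_centroids.append(list(centroids[c]))
            continue
        new_centroids.append(
            [sum(sums[c][d] for sums, _ in partials) / n for d in range(dim)]
        )
    return new_centroids

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
pieces = [points[0::2], points[1::2]]
partials = [map_assign(p, centroids) for p in pieces]
new_centroids = reduce_centroids(partials, centroids)
print(new_centroids)
```

Repeating the map and reduce steps until the centroids stop moving gives the full algorithm.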

Neural Network (NN) • Back-propagation, 3-layer network • Input, middle, 2 output nodes • Goal: compute the weights in the NN by back propagation • Mapper: propagate its set of training data through the network, and propagate errors to calculate the partial gradient for weights • Reducer: sums the partial gradients and does a batch gradient descent to update the weights

Principal Components Analysis (PCA) • Compute the principal eigenvectors of the covariance matrix • Clearly, we can compute the summation form using map-reduce • Express the mean vector as a sum
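A sketch of the summation form for the covariance matrix, assuming the standard identity Σ = (1/m) Σ xᵢxᵢᵀ − μμᵀ with μ = (1/m) Σ xᵢ. The function names are illustrative. The eigenvector computation on the resulting small n×n matrix is left out.

```python
def map_stats(piece, n):
    """Mapper: count, sum of x, and sum of x x^T over one subgroup."""
    s = [0.0] * n
    ss = [[0.0] * n for _ in range(n)]
    for x in piece:
        for j in range(n):
            s[j] += x[j]
            for k in range(n):
                ss[j][k] += x[j] * x[k]
    return len(piece), s, ss

def reduce_cov(partials, n):
    """Reducer: combine partial sums into the mean and covariance matrix."""
    m = sum(count for count, _, _ in partials)
    mu = [sum(s[j] for _, s, _ in partials) / m for j in range(n)]
    cov = [
        [sum(ss[j][k] for _, _, ss in partials) / m - mu[j] * mu[k] for k in range(n)]
        for j in range(n)
    ]
    return mu, cov

# Toy data lying on the line y = 2x (illustrative only).
data = [(1.0, 2.0), (3.0, 6.0), (5.0, 10.0), (7.0, 14.0)]
pieces = [data[:2], data[2:]]
partials = [map_stats(p, 2) for p in pieces]
mu, cov = reduce_cov(partials, 2)
print(mu, cov)
```

For this toy data the mean is (4, 8) and the covariance matrix is [[5, 10], [10, 20]], whose principal eigenvector points along (1, 2).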

Other Algorithms • Logistic Regression • Independent Component Analysis • Support Vector Machine • Expectation Maximization (EM)

Time Complexity • In short: linear speedup with an increasing number of cores

Setup • Compare map-reduce version and sequential version • 10 data sets • Machines: • Dual-processor Pentium-III 700MHz, 1GB RAM • 16-way Sun Enterprise 6000

Dual-Processor SpeedUps

SpeedUp for 2-16 processors (figure) • Bold line: average • Error bars: max/min • Dashed line: variance

Multicore Simulator Results • Multicore simulator over the sensor dataset • Better results – reported for NN & LR • NN • 16 cores 15.5x • 32 cores 29x • 64 cores 54x • LR • 16 cores 15x • 32 cores 29.5x • 64 cores 53x • Could be because of less communication cost
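To see how close these figures are to linear speedup, one can divide each speedup by the core count. This ratio, usually called parallel efficiency, is not reported on the slide; the sketch below only rearranges the numbers listed above.

```python
# Speedups reported on the slide, keyed by algorithm and core count.
speedups = {
    "NN": {16: 15.5, 32: 29.0, 64: 54.0},
    "LR": {16: 15.0, 32: 29.5, 64: 53.0},
}

# Efficiency = speedup / cores; 1.0 would be perfectly linear scaling.
efficiency = {
    algo: {cores: s / cores for cores, s in by_cores.items()}
    for algo, by_cores in speedups.items()
}

for algo, by_cores in efficiency.items():
    for cores, eff in by_cores.items():
        print(f"{algo} {cores} cores: {eff:.1%}")
```

Both algorithms stay above 80% efficiency at 64 cores, with the ratio falling slowly as cores are added.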

Conclusion • Parallelize summation forms • NO change in the underlying algorithm • NO approximation • Use map-reduce on a single machine

Apache Mahout • An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License • http://mahout.apache.org • Why Mahout? • Community • Documentation and Examples • Scalability • the Apache License • Not-specific research-oriented http://dictionary.reference.com/browse/mahout

Focus: Scalable • Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm • Some algorithms won’t scale to massive machine clusters • Others fit logically on a MapReduce framework like Apache Hadoop • Still others will need other distributed programming models • Most Mahout implementations are MapReduce enabled • Work in Progress

Sampling of Who uses Mahout? https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout

Focus: Machine Learning (layer diagram, top to bottom) • Applications • Examples • Genetic, Freq. Pattern Mining, Classification, Clustering, Recommenders • Utilities: Lucene/Vectorizer • Math: Vectors/Matrices/SVD • Collections (primitives) • Apache Hadoop • http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms

Resources • “Mahout in Action” • Owen, Anil, Dunning and Friedman • http://awe.sm/5FyNe • “Introducing Apache Mahout” • http://www.ibm.com/developerworks/java/library/j-mahout/ • “Taming Text” by Ingersoll, Morton, Farris • “Programming Collective Intelligence” by Toby Segaran • “Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank • “Data-Intensive Text Processing with MapReduce” by Jimmy Lin and Chris Dyer

Discussion • What are other alternatives to MapReduce? • What to do if “summation form” is not applicable? • Does the dataset quality affect implementation and performance of parallel machine learning algorithms? • Multicore processors… future?
