Statistical Analysis and Machine Learning using Hadoop
Seungjai Min, Samsung SDS
Knowing that…
• Hadoop/Map-Reduce has been successful in analyzing unstructured web content and social media data
• Another source of big data is semi-structured, machine/device-generated logs, which require non-trivial data massaging and extensive statistical data mining
Question Is Hadoop/Map-Reduce the right framework to implement statistical analysis (more than counting and descriptive statistics) and machine learning algorithms (which involve iterations)?
Answer and Contents of this talk
• Yes, Hadoop/Map-Reduce is the right framework
  • Why is it better than MPI and CUDA?
  • Map-Reduce Design Patterns
  • Data Layout Patterns
• No, but there are better alternatives
  • Spark/Shark (as an example)
  • R/Hadoop (it is neither RHadoop nor Rhive)
Contents
• Programming Models
  • Map-Reduce vs. MPI vs. Spark vs. CUDA (GPU)
• Map-Reduce Design Patterns
  • Privatization Patterns (Summarization / Filtering / Clustering)
  • Data Organization Patterns (Join / Transpose)
• Data Layout Patterns
  • Row vs. Column vs. BLOB
• Summary
  • How to choose the right programming model for your algorithm
Parallel Programming is Difficult
• Too many parallel programming models (languages): Cilk, Brook, Titanium, Co-array Fortran, RapidMind, PVM, CUDA, UPC, OpenMP, MPI, Chapel, P-threads, Fortress, OpenCL, Erlang, X10, Intel TBB
MPI Framework

myN = N / nprocs;                      /* each of the nprocs processes owns a contiguous block of A */
for (i = 0; i < myN; i++) {
    A[i] = initialize(i);
}
left_index  = …;                       /* boundary elements exchanged with the neighboring ranks */
right_index = …;
MPI_Send(&A[left_index],  1, MPI_INT, pid-1, 0, MPI_COMM_WORLD);
MPI_Recv(&A[right_index], 1, MPI_INT, pid+1, 0, MPI_COMM_WORLD, &status);
for (i = 0; i < myN; i++) {
    B[i] = (A[i] + A[i+1]) / 2.0;
}

[Figure: a 400-element array split into four blocks (1-100, 101-200, 201-300, 301-400), one per process]

MPI: the assembly language of parallel programming
Map-Reduce Framework

[Figure: Map-Reduce data flow - input key/value pairs go through Map / Combine / Partition, then Shuffle, then Sort / Reduce, producing output key/value pairs]

Parallel programming for the masses!
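To make the map/shuffle/reduce flow concrete, here is a minimal word-count sketch in the Hadoop Streaming style; the file names mapper.py and reducer.py and the streaming setup are illustrative assumptions, not from the slides.

# mapper.py -- reads raw text on stdin, emits (word, 1) pairs as tab-separated lines
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

# reducer.py -- the shuffle delivers lines sorted by key, so equal words arrive adjacent
import sys
from itertools import groupby

pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
for word, group in groupby(pairs, key=lambda kv: kv[0]):
    print(word + "\t" + str(sum(int(count) for _, count in group)))

A run would look roughly like: hadoop jar hadoop-streaming.jar -input … -output … -mapper mapper.py -reducer reducer.py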
Map-Reduce vs. MPI
• Similarity
  • Programming model
    • Processes, not threads
    • Address spaces are separate (data communication is explicit)
  • Data locality
    • The "owner computes" rule dictates that computations are sent to where the data is, not the other way round
Map-Reduce vs. MPI: Differences
GPGPU (General-Purpose Graphics Processing Units)

[Figure: multi-core CPUs with caches sharing one memory vs. GPUs with local memories and a global memory]

• 10~50 times faster than a CPU if an algorithm fits this model
• Good for embarrassingly parallel algorithms (e.g., image processing)
• Costs ($2K~$3.5K) and performance (2 quad-cores vs. one GPU)
Programming CUDA

cudaArray* cu_array;
texture<float, 2> tex;

// Allocate the array
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaMallocArray(&cu_array, &desc, width, height);

// Copy image data to the array
cudaMemcpyToArray(cu_array, 0, 0, image, width*height*sizeof(float), cudaMemcpyHostToDevice);

// Bind the array to the texture
cudaBindTextureToArray(tex, cu_array);

dim3 blockDim(16, 16, 1);
dim3 gridDim(width / blockDim.x, height / blockDim.y, 1);
kernel<<< gridDim, blockDim, 0 >>>(d_odata, width, height);
cudaUnbindTexture(tex);

__global__ void kernel(float* odata, int width, int height)
{
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    float c = tex2D(tex, x, y);
    odata[y*width + x] = c;
}

Hard to program and debug, hard to find good engineers, hard to maintain code
Design Patterns in Parallel Programming

#pragma omp parallel private(p_sum)
{
    p_sum = 0;                       /* each thread's private partial sum */
    #pragma omp for
    for (i = 1; i <= N; i++) {
        p_sum += A[i];
    }
    #pragma omp critical
    {
        sum += p_sum;                /* combine the private partial sums */
    }
}

Privatization Idiom
Design Patterns in Parallel Programming

#define N 400

#pragma omp parallel for
for (i = 1; i <= N; i++) {
    A[i] = 1;
}

sum = 0;
#pragma omp parallel for reduction(+:sum)
for (i = 1; i <= N; i++) {
    sum += A[i];   // dependency
}
printf("sum = %d\n", sum);

Reduction Idiom
Design Patterns in Parallel Programming

/* sequential form: loop-carried dependence on x */
x = K;
for (i = 0; i < N; i++) {
    A[i] = x++;
}

/* parallel form: x is computed directly from i, so iterations are independent */
x = K;
for (i = 0; i < N; i++) {
    A[i] = x + i;
}

Induction Idiom
Design Patterns in Parallel Programming

• The privatization idiom is a perfect fit for the Map-Reduce framework: each Map task computes its own private partial result, and the Reduce step plays the role of the critical section that combines them.

[Figure: the per-thread partial sums of the privatization idiom map onto Map tasks; the combining step maps onto a Reduce task]

Privatization Idiom
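A hedged sketch of the same idiom in Hadoop Streaming style; the file names and the one-integer-per-line input format are assumptions. Each map task keeps a private running sum, exactly like p_sum, and the single reduce step stands in for the critical section.

# sum_mapper.py -- each map task accumulates a private partial sum of the integers it sees
import sys

partial_sum = 0
for line in sys.stdin:
    if line.strip():
        partial_sum += int(line)
print("sum\t" + str(partial_sum))   # one record per map task

# sum_reducer.py -- combines the partial sums, like the critical section above
import sys

total = sum(int(line.rstrip("\n").split("\t")[1]) for line in sys.stdin)
print("sum\t" + str(total))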
MapReduce Design Patterns
• Summarization patterns
• Filtering patterns
• Data organization patterns
• Join patterns
• Meta-patterns
• Input and output patterns
(From the book MapReduce Design Patterns by Donald Miner & Adam Shook)
Design Patterns

Linear Regression (1-dimension): Y = bX + e

[Figure: Y = (y1 … y5), X = (x1 … x5), and e = (e1 … e5) are column vectors; each observation satisfies yi = b * xi + ei]
Design Patterns

Linear Regression (2-dimension): Y = Xb + e

[Figure: X is an m x n matrix with columns x1 and x2; Y and e are m-vectors and b is an n-vector]

m: # of observations, n: # of dimensions
Design Patterns

Linear Regression (distributed on 4 nodes)

The least-squares estimate is b = (X^T X)^-1 X^T Y, so the distributed work reduces to computing X^T X (an n x n matrix) and X^T Y.

[Figure: X^T X = (n x m) * (m x n); the m rows of X are split across 4 nodes, each node computes its partial n x n product, and the partial products are summed]
Design Patterns

Linear Regression
• (X^T X)^-1 is the inverse of an n x n matrix
• If n^2 is sufficiently small, the inverse can be computed on a single node (e.g., with the Apache Commons Math library)
• n should be kept small to avoid the curse of dimensionality
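A minimal sketch of this distributed normal-equation pattern in Python, with NumPy and a four-way row split standing in for the four nodes; the toy data and the function names are assumptions for illustration only.

import numpy as np

def map_partial(X_block, y_block):
    # each "map" task sees only its rows and emits small n x n and n x 1 partial products
    return X_block.T @ X_block, X_block.T @ y_block

def reduce_solve(partials):
    # the "reduce" step sums the partial products and solves the small n x n system
    XtX = sum(p[0] for p in partials)
    Xty = sum(p[1] for p in partials)
    return np.linalg.solve(XtX, Xty)

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))                      # m = 400 observations, n = 3
b_true = np.array([1.0, -2.0, 0.5])
y = X @ b_true + 0.01 * rng.normal(size=400)

blocks = [map_partial(X[i::4], y[i::4]) for i in range(4)]   # "4 nodes"
print(reduce_solve(blocks))                        # approximately b_true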
Design Patterns

Join (inner join on ID)

Table A (ID, age, name, …):
  100  25  Bob   …
  210  31  John  …
  360  46  Kim   …

Table B (ID, time, dst, …):
  100  7:28  CA  …
  100  8:03  IN  …
  210  4:26  WA  …

Inner join on A.ID = B.ID:
  100  25  Bob   …  100  7:28  CA  …
  100  25  Bob   …  100  8:03  IN  …
  210  31  John  …  210  4:26  WA  …
Design Patterns

Reduce-side Join

[Figure: all records from both tables flow from the Map tasks through the shuffle to the Reduce tasks, which perform the join; even non-matching records such as (360, 46, Kim) cross the network]

• Network overhead: both full datasets are shuffled to the reducers
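A rough Hadoop Streaming-style sketch of a reduce-side join in Python; the convention of prefixing each input record with its table name and the file names are assumptions, not from the slides.

# join_mapper.py -- input lines look like "A<TAB>100,25,Bob" or "B<TAB>100,7:28,CA";
# the join key (ID) becomes the map output key so matching records meet at one reducer
import sys

for line in sys.stdin:
    table, record = line.rstrip("\n").split("\t", 1)
    join_key = record.split(",")[0]
    print(join_key + "\t" + table + "\t" + record)

# join_reducer.py -- records arrive grouped by ID; buffer side A, then cross with side B
import sys
from itertools import groupby

rows = (line.rstrip("\n").split("\t", 2) for line in sys.stdin)
for join_key, group in groupby(rows, key=lambda r: r[0]):
    a_side, b_side = [], []
    for _, table, record in group:
        (a_side if table == "A" else b_side).append(record)
    for a in a_side:                      # inner join: emit every A x B pair per ID
        for b in b_side:
            print(a + "," + b)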
Performance Overhead (1)

[Figure: input key/value pairs go through Map / Combine / Partition, Shuffle, and Reduce, with disk I/O before the map phase and after the reduce phase]

Map-Reduce suffers from disk I/O bottlenecks
Performance Overhead (2)

• Iterative algorithms and Map-Reduce chaining (join, group-by, decision tree)

[Figure: a chain of Map-Reduce jobs; every job writes its intermediate results to disk and the next job reads them back, so each link in the chain pays the disk I/O cost again]
HBase Caching
• HBase provides scanner caching and block caching
• Scanner caching
  • setCaching(int cache);
  • tells the scanner how many rows to fetch at a time
• Block caching
  • setCacheBlocks(true);
• HBase caching helps read/write performance, but it is not sufficient to solve our problem
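The setCaching / setCacheBlocks calls above belong to the Java client. From Python, a roughly equivalent knob is the batch_size argument of happybase's scan; the Thrift host, the table name, and the use of happybase at all are assumptions not mentioned on the slides.

import happybase

connection = happybase.Connection("hbase-thrift-host")   # hypothetical Thrift server
table = connection.table("sensor_readings")              # hypothetical table name

# batch_size plays the role of scanner caching: how many rows each round trip fetches
for row_key, columns in table.scan(batch_size=1000):
    pass  # process one row at a time; block caching remains a server/Java-side setting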
Spark / Shark
• Spark
  • In-memory computing framework
  • An Apache incubator project
  • RDD (Resilient Distributed Datasets): a fault-tolerant framework
  • Targets iterative machine learning algorithms
• Shark
  • Data warehouse for Spark
  • Compatible with Apache Hive
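A small PySpark sketch of why in-memory RDDs help iterative algorithms: the data is parsed once and cached, so every iteration skips the disk re-read that a chained Map-Reduce job would pay. The HDFS path, the two-features-plus-label line format, and the plain gradient-descent update are assumptions for illustration.

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="iterative-sketch")

# parse once, then keep the RDD in memory across iterations
points = sc.textFile("hdfs:///data/points.txt") \
           .map(lambda line: np.array(line.split(), dtype=float)) \
           .cache()

w = np.zeros(2)                     # assume each line is: feature1 feature2 label
for _ in range(10):                 # each pass reuses the cached, already-parsed data
    gradient = points.map(lambda p: p[:2] * (p[:2].dot(w) - p[2])) \
                     .reduce(lambda a, b: a + b)
    w -= 0.1 * gradient
print(w)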
Spark / Shark: Scheduling

[Figure: Hadoop (Map-Reduce) and Spark running side by side on each node, on top of Mesos or Mesos/YARN on Linux]

• No fine-grained scheduling between Hadoop and Spark
  • Mesos: Hadoop dependency
  • YARN
• Stand-alone Spark: no fine-grained scheduling within Spark
Time-Series Data Layout Patterns
• BLOB (uncompressed): the whole series (Ti1 Ti2 Ti3 …) packed into one binary value
  + no conversion   - slow read
• Column: one value per row
  + fast read/write   - slow conversion
• Row: the series stored as one row (Ti1 Ti2 Ti3 Ti4 …)
  + fast read/write   - slow search
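A tiny Python sketch to make the BLOB trade-off concrete (the sample values are made up): the whole series is written and read back as one opaque binary value, so loading needs no per-row conversion, but nothing inside it can be searched without unpacking the entire value.

import struct

series = [21.5, 21.7, 21.6, 22.0, 21.9]             # Ti1, Ti2, ... for one series

# pack the whole series into a single binary value (one row: series_id -> blob)
blob = struct.pack("<%dd" % len(series), *series)

# reading it back requires unpacking the entire value, even to inspect one point
restored = list(struct.unpack("<%dd" % (len(blob) // 8), blob))
assert restored == series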
Time-Series Data Layout Patterns

[Figure: a row-format series (Ti1 Ti2 Ti3 …) and a column-format series being loaded into and unloaded from an RDB; the RDB side is columnar]

When loading/unloading from/to an RDB, it is really important to decide whether to store in column or row format
R and Hadoop
• R is memory-based
  • Cannot process data that does not fit in memory
• R is not thread-safe
  • Cannot run in a multi-threaded environment
• Creating a distributed version of each and every R function
  • Cannot take advantage of the 3,500 R packages that are already built!
Running R from Hadoop

[Figure: 1M time series (t1 … t1M), each with roughly 6000~7000 data points]

• Pros: can re-use R packages with no modification
• Cons: cannot handle data that does not fit into memory
• But do we need a large number of data points per series to predict the future? What if the data are wide and fat?
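One hedged way to realize "calling R from Hadoop" is a streaming map task that pipes its input split into an ordinary R script, so the script and its packages run unmodified on data that fits in one task's memory; the script name forecast.R and the one-series-per-split layout are assumptions, not from the slides.

# r_mapper.py -- hands this map task's input split to an unmodified R script
import subprocess
import sys

result = subprocess.run(
    ["Rscript", "forecast.R"],      # hypothetical script, e.g. fitting a forecasting model
    stdin=sys.stdin,                # the task's input split becomes R's stdin
    capture_output=True,
    text=True,
    check=True,
)
sys.stdout.write(result.stdout)     # whatever R prints becomes this task's map output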
Not-so-big data
• "Nobody ever got fired for using Hadoop on a cluster" (HotCDP '12 paper)
  • The average Map-Reduce-like job handles less than 14 GB
• Time-series analysis for forecasting
  • Sampling every minute for two years to forecast the next year: less than 2M rows
  • It becomes big when sampling at sub-second resolution
Statistical Analysis and Machine Learning Library

[Decision chart]
• Big data, filtering workloads: Map-Reduce
• Big data, chained / iterative workloads: Spark + SQL (Hive / Shark / Impala / …)
• Small-but-many datasets: R on Hadoop
• Small data: R on a single server
Summary
• Map-Reduce is a surprisingly efficient framework for most filter-and-reduce operations
• For data massaging (data pre-processing), in-memory capability with SQL support is a must
• Calling R from Hadoop can be quite useful when analyzing many, but not-so-big, datasets, and it is the fastest way to grow your list of statistical and machine learning functions