Statistical Analysis and Machine Learning using Hadoop

Presentation Transcript


  1. Statistical Analysis and Machine Learning using Hadoop. Seungjai Min, Samsung SDS

  2. Knowing that…
     • Hadoop/Map-Reduce has been successful in analyzing unstructured web content and social media data
     • Another source of big data is semi-structured machine/device-generated logs, which require non-trivial data massaging and extensive statistical data mining

  3. Question: Is Hadoop/Map-Reduce the right framework for implementing statistical analysis (beyond counting and descriptive statistics) and machine learning algorithms (which involve iteration)?

  4. Answer and Contents of this talk
     • Yes, Hadoop/Map-Reduce is the right framework
       • Why is it better than MPI and CUDA?
       • Map-Reduce Design Patterns
       • Data Layout Patterns
     • No, there are better alternatives
       • Spark/Shark (as an example)
       • R/Hadoop (it is neither RHadoop nor RHive)

  5. Contents
     • Programming Models
       • Map-Reduce vs. MPI vs. Spark vs. CUDA (GPU)
     • Map-Reduce Design Patterns
       • Privatization Patterns (Summarization / Filtering / Clustering)
       • Data Organization Patterns (Join / Transpose)
     • Data Layout Patterns
       • Row vs. Column vs. BLOB
     • Summary
       • How to choose the right programming model for your algorithm

  6. Parallel Programming is Difficult
     • Too many parallel programming models (languages): Cilk, Brook, Titanium, Co-array Fortran, RapidMind, PVM, CUDA, UPC, OpenMP, MPI, Chapel, P-threads, Fortress, OpenCL, Erlang, X10, Intel TBB

  7. MPI Framework
     myN = N / nprocs;
     for (i = 0; i < myN; i++) {
         A[i] = initialize(i);
     }
     left_index = …;
     right_index = …;
     /* Exchange boundary elements with the neighbouring processes */
     MPI_Send(&A[left_index], 1, MPI_INT, pid-1, …);
     MPI_Recv(&A[right_index], 1, MPI_INT, pid+1, …);
     for (i = 0; i < myN; i++) {
         B[i] = (A[i] + A[i+1]) / 2.0;
     }
     (Figure: a 400-element array split across four processes: 1-100, 101-200, 201-300, 301-400.)
     MPI: the assembly language of parallel programming

  8. Map-Reduce Framework
     (Figure: three parallel pipelines, each input → Map → key/val → Reduce → output, passing through the stages Map/Combine/Partition, Shuffle, and Sort/Reduce.)
     Parallel programming for the masses!
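
The stages named on this slide can be imitated in a few lines of single-process Python. This is only a sketch of the data flow, not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_job`) and the word-count job are illustrative assumptions, not part of the talk.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record and collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle_phase(pairs):
    """Group values by key and sort the keys, as the shuffle/sort stage does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return [reducer(key, values) for key, values in groups]


def run_job(records, mapper, reducer):
    """Chain the three stages: map, shuffle/sort, reduce."""
    return reduce_phase(shuffle_phase(map_phase(records, mapper)), reducer)


# Hypothetical example job: count words.
def word_mapper(line):
    return [(word, 1) for word in line.split()]


def sum_reducer(key, values):
    return (key, sum(values))


if __name__ == "__main__":
    print(run_job(["a b a", "b c"], word_mapper, sum_reducer))
```

The programmer writes only the mapper and the reducer; partitioning, grouping, and sorting are supplied by the framework, which is the sense in which the model is "for the masses".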

  9. Map-Reduce vs. MPI: Similarities
     • Programming model
       • Processes, not threads
       • Address spaces are separate (data communication is explicit)
     • Data locality
       • The “owner computes” rule dictates that computation is sent to where the data is, not the other way round

  10. Map-Reduce vs. MPI Differences

  11. GPU
      (Figure: multi-core CPUs, each with a cache ($), over shared memory, next to a GPU with local memories over global memory.)
      • GPGPU (General-Purpose Graphics Processing Units)
      • 10~50 times faster than a CPU if an algorithm fits this model
      • Good for embarrassingly parallel algorithms (e.g. image processing)
      • Costs ($2K~$3.5K) and performance (two quad-cores vs. one GPU)

  12. Programming CUDA
      // Written against the legacy texture-reference API that the slide uses.
      texture<float, 2, cudaReadModeElementType> tex;
      __global__ void kernel(float* odata, int width, int height) {
          unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
          unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
          float c = tex2D(tex, x, y);
          odata[y*width+x] = c;
      }
      cudaArray* cu_array;
      cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
      // Allocate array
      cudaMallocArray(&cu_array, &desc, width, height);
      // Copy image data to array (size in bytes)
      cudaMemcpyToArray(cu_array, 0, 0, image, width*height*sizeof(float), cudaMemcpyHostToDevice);
      // Bind the array to the texture
      cudaBindTextureToArray(tex, cu_array);
      dim3 blockDim(16, 16, 1);
      dim3 gridDim(width / blockDim.x, height / blockDim.y, 1);
      kernel<<< gridDim, blockDim, 0 >>>(d_odata, width, height);
      cudaUnbindTexture(tex);
      Hard to program/debug → hard to find good engineers → hard to maintain code

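  13. Design Patterns in Parallel Programming: Privatization Idiom
      sum = 0;
      p_sum = 0;
      /* firstprivate gives each thread its own copy of p_sum, initialized to 0 */
      #pragma omp parallel firstprivate(p_sum)
      {
          #pragma omp for
          for (i = 1; i <= N; i++) {
              p_sum += A[i];
          }
          /* merge the per-thread partial sums, one thread at a time */
          #pragma omp critical
          {
              sum += p_sum;
          }
      }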

  14. Design Patterns in Parallel Programming: Reduction Idiom
      #define N 400
      #pragma omp parallel for
      for (i = 1; i <= N; i++) {
          A[i] = 1;
      }
      sum = 0;
      #pragma omp parallel for reduction(+:sum)
      for (i = 1; i <= N; i++) {
          sum += A[i];   // dependency on sum, handled by the reduction clause
      }
      printf("sum = %d\n", sum);

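  15. Design Patterns in Parallel Programming: Induction Idiom
      // Before: x carries a dependency from one iteration to the next
      x = K;
      for (i = 0; i < N; i++) {
          A[i] = x++;
      }
      // After: each iteration computes its value from i alone
      x = K;
      for (i = 0; i < N; i++) {
          A[i] = x + i;
      }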
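
To check that the two loops of the induction idiom produce the same array, here is a small Python sketch. The function names `fill_sequential` and `fill_independent` are made up for illustration; the second form is the one that can be split across workers, because element `i` depends only on `i` and `K`.

```python
def fill_sequential(n, k):
    """Loop with an induction variable: each step depends on the previous x."""
    a = [0] * n
    x = k
    for i in range(n):
        a[i] = x
        x += 1
    return a


def fill_independent(n, k):
    """Closed form: every element is computed from its own index only."""
    return [k + i for i in range(n)]


if __name__ == "__main__":
    print(fill_sequential(5, 10) == fill_independent(5, 10))
```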

  16. Design Patterns in Parallel Programming: Privatization Idiom
      Perfect fit for the Map-Reduce framework
      sum = 0;
      p_sum = 0;
      #pragma omp parallel firstprivate(p_sum)
      {
          /* Map: each thread sums its own share of A */
          #pragma omp for
          for (i = 1; i <= N; i++) {
              p_sum += A[i];
          }
          /* Reduce: merge the partial sums */
          #pragma omp critical
          {
              sum += p_sum;
          }
      }
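
The correspondence between the privatization idiom and Map-Reduce can be sketched in plain Python: each chunk plays the role of one thread or one map task with its own private partial sum, and a single reduce step merges the partials. The helper names (`split`, `partial_sum`, `total_sum`) and the choice of four chunks are assumptions for illustration only.

```python
from functools import reduce


def split(data, parts):
    """Split data into `parts` contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for p in range(parts):
        end = start + size + (1 if p < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def partial_sum(chunk):
    """Map step: accumulate into a private variable, no sharing needed."""
    p_sum = 0
    for value in chunk:
        p_sum += value
    return p_sum


def total_sum(data, parts=4):
    """Reduce step: merge the private partial sums into one result."""
    partials = [partial_sum(chunk) for chunk in split(data, parts)]
    return reduce(lambda a, b: a + b, partials, 0)


if __name__ == "__main__":
    print(total_sum([1] * 400))
```

Because addition is associative, the partial sums can be merged in any order, which is what lets the map tasks run independently.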

  17. MapReduce Design Patterns
      • Summarization patterns
      • Filtering patterns
      • Data organization patterns
      • Join patterns
      • Meta-patterns
      • Input and output patterns
      Book written by Donald Miner & Adam Shook

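  18. Design Patterns: Linear Regression (1-dimension)
      Y = bX + e
      $$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{pmatrix} = b \cdot \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{pmatrix} + \begin{pmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \\ e_5 \end{pmatrix}$$
      (Figure: plot of y against x showing a point (x_i, y_i) and a line with slope b.)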

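  19. Design Patterns: Linear Regression (2-dimension)
      Y = bX + e
      $$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{pmatrix} = \begin{pmatrix} x_{11} & x_{21} \\ x_{12} & x_{22} \\ x_{13} & x_{23} \\ x_{14} & x_{24} \\ x_{15} & x_{25} \end{pmatrix} \cdot b + \begin{pmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \\ e_5 \end{pmatrix}$$
      m: # of observations (rows of X), n: # of dimensions (columns of X)
      (Figure: plot of y against the axes x1 and x2.)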

  20. Design Patterns: Linear Regression (distributing on 4 nodes)
      $$\underbrace{X^{T}X}_{n \times n} = \underbrace{X^{T}}_{n \times m} \cdot \underbrace{X}_{m \times n}$$

  21. Design Patterns: Linear Regression
      $(X^{T}X)^{-1}$ = inverse of an n × n matrix
      • If n² is sufficiently small → Apache math library
      • n should be kept small → avoid the curse of dimensionality
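
The linear-regression slides can be tied together with a small pure-Python sketch. It assumes the standard least-squares normal equations, b = (XᵀX)⁻¹ Xᵀy, which the slides imply but do not spell out, and it assumes the rows of X are partitioned across nodes so that each node computes a partial XᵀX and Xᵀy that a reducer then sums. All function names are illustrative, and the Gaussian elimination stands in for the math library mentioned on the slide.

```python
from functools import reduce


def partial_products(rows):
    """Map step: X_k^T X_k and X_k^T y_k for one partition of (x, y) rows."""
    n = len(rows[0][0])
    xtx = [[0.0] * n for _ in range(n)]
    xty = [0.0] * n
    for x, y in rows:
        for i in range(n):
            xty[i] += x[i] * y
            for j in range(n):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def combine(a, b):
    """Reduce step: element-wise sum of two partial results."""
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [p + q for p, q in zip(a[1], b[1])]
    return xtx, xty


def solve(matrix, rhs):
    """Solve the small n x n system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    b = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * b[c] for c in range(r + 1, n))
        b[r] = (aug[r][n] - tail) / aug[r][r]
    return b


def fit(partitions):
    """Sum the per-partition products, then solve (X^T X) b = X^T y."""
    xtx, xty = reduce(combine, [partial_products(p) for p in partitions if p])
    return solve(xtx, xty)


if __name__ == "__main__":
    xs = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8), (6, 2), (7, 7), (8, 4)]
    rows = [(x, 2.0 * x[0] + 3.0 * x[1]) for x in xs]   # noise-free y = 2*x1 + 3*x2
    four_nodes = [rows[0:2], rows[2:4], rows[4:6], rows[6:8]]
    print(fit(four_nodes))
```

Only the n × n and n × 1 partial results cross the network, regardless of the number of observations m, which is why keeping n small matters for this pattern.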

  22. Design Patterns: Join
      Table A                         Table B
      ID   age  name  …               ID   time  dst  …
      100  25   Bob   …               100  7:28  CA   …
      210  31   John  …               100  8:03  IN   …
      360  46   Kim   …               210  4:26  WA   …
      Inner join
      A.ID  A.age  A.name  …  B.ID  B.time  B.dst  …
      100   25     Bob     …  100   7:28    CA     …
      100   25     Bob     …  100   8:03    IN     …
      210   31     John    …  210   4:26    WA     …

  23. Design Patterns: Join (Reduce-side Join)
      (Figure: Map tasks read the rows of both tables; rows with the same ID (100, 210, 360) are sent to the same Reduce task, which joins them.)
      Network overhead
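
A reduce-side join can be sketched in single-process Python using the rows from the join slide. The tagging scheme ("A"/"B") and the function names are assumptions made for illustration; in a real job the grouping step is the shuffle, and it is the step that moves every row across the network.

```python
from collections import defaultdict

# Rows taken from the join slide: (ID, age, name) and (ID, time, dst).
table_a = [(100, 25, "Bob"), (210, 31, "John"), (360, 46, "Kim")]
table_b = [(100, "7:28", "CA"), (100, "8:03", "IN"), (210, "4:26", "WA")]


def map_tagged(table, tag):
    """Map step: emit (join key, (source tag, row)) for every row."""
    return [(row[0], (tag, row)) for row in table]


def shuffle(pairs):
    """Shuffle step: bring all rows with the same key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(values):
    """Reduce step: pair every A row with every B row for one key."""
    left = [row for tag, row in values if tag == "A"]
    right = [row for tag, row in values if tag == "B"]
    return [a + b for a in left for b in right]


def reduce_side_join(a, b):
    """Inner join of two tables on their first column."""
    groups = shuffle(map_tagged(a, "A") + map_tagged(b, "B"))
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_join(groups[key]))
    return joined


if __name__ == "__main__":
    for row in reduce_side_join(table_a, table_b):
        print(row)
```

ID 360 has no partner in the second table, so its reducer emits nothing, which matches the inner-join result on the slide.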

  24. Performance Overhead (1)
      (Figure: the Map-Reduce pipeline, input → Map/Combine/Partition → Shuffle → Reduce → output, with disk I/O marked at two points.)
      Map-Reduce suffers from disk I/O bottlenecks

  25. Performance Overhead (2)
      • Iterative algorithms & Map-Reduce chaining
      (Figure: chains of Map and Reduce stages for Join, Groupby, and Decision-Tree jobs, with disk I/O between the chained stages.)

  26. HBase Caching
      • HBase provides scanner caching and block caching
      • Scanner caching
        • setCaching(int cache);
        • tells the scanner how many rows to fetch at a time
      • Block caching
        • setCacheBlocks(true);
      • HBase caching helps read/write performance, but it is not sufficient to solve our problem

  27. Spark / Shark
      • Spark
        • In-memory computing framework
        • An Apache incubator project
        • RDD (Resilient Distributed Datasets)
        • A fault-tolerant framework
        • Targets iterative machine learning algorithms
      • Shark
        • Data warehouse for Spark
        • Compatible with Apache Hive

  28. Spark / Shark: Scheduling
      (Figure: three deployment stacks running Map-Reduce jobs on Spark and Hadoop over Linux, one directly on Linux, one on Mesos, and one on Mesos / YARN.)
      - No fine-grained scheduling between Hadoop and Spark
      - Mesos: Hadoop dependency
      - YARN
      - Stand-alone Spark
      - No fine-grained scheduling within Spark

  29. Time-Series Data Layout Patterns
      BLOB (uncompressed), Column, Row
      (Figure: the samples Ti1 … Ti8 shown in column, row, and binary (BLOB) layouts.)
      • + : no conversion, - : slow read
      • + : fast read/write, - : slow conversion
      • + : fast read/write, - : slow search

  30. Time-Series Data Layout Patterns
      Column, Row
      (Figure: the samples Ti1 … Ti9 in column and row layouts, each loaded from or unloaded to an RDB.)
      • RDB is columnar
      • When loading/unloading from/to an RDB, it is really important to decide whether to store in column or row format

  31. R and Hadoop
      • R is memory-based → cannot process data that does not fit in memory
      • R is not thread-safe → cannot run in a multi-threaded environment
      • Creating a distributed version of each and every R function → cannot take advantage of the 3500 R packages that are already built!

  32. Running R from Hadoop
      (Figure: a 6000~7000 by 1M block of time-series samples t1 … t1M.)
      • Pros: can re-use R packages with no modification
      • Cons: cannot handle data too large to fit into memory
      • But do we need a large number of time-series samples to predict the future? What if the data are wide and fat?

  33. Not so big data
      • “Nobody ever got fired for using Hadoop on a cluster”
        • HotCDP’12 paper
        • Average Map-Reduce-like jobs handle less than 14 GB
      • Time-series analysis for data forecasting
        • Sampling every minute for two years to forecast the next year → fewer than 2M rows
        • It becomes big when sampling at sub-second resolution
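
The row-count claim is easy to check with a few lines of Python. The per-minute case follows the slide; the sub-second case assumes 10 samples per second purely as an example rate, since the slide does not name one, and 365-day years are assumed throughout.

```python
MINUTES_PER_DAY = 24 * 60
DAYS_PER_YEAR = 365  # assumption: leap days ignored


def row_count(years, samples_per_minute):
    """Number of rows produced by one series sampled at a fixed rate."""
    return years * DAYS_PER_YEAR * MINUTES_PER_DAY * samples_per_minute


if __name__ == "__main__":
    per_minute = row_count(2, 1)          # one sample per minute, two years
    sub_second = row_count(2, 10 * 60)    # hypothetical 10 samples per second
    print(per_minute, sub_second)
```

Two years of per-minute samples give 1,051,200 rows, comfortably under 2M, while the assumed 10 Hz rate gives 630,720,000 rows for the same period.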

  34. Statistical Analysis and Machine Learning Library Filtering Chain, Iterative Big Map-Reduce Spark + SQL (Hive / Shark / Impala / …) R on Hadoop Small, but many Small R on a single server

  35. Summary
      • Map-Reduce is a surprisingly efficient framework for most filter-and-reduce operations
      • As for data massaging (data pre-processing), in-memory capability with SQL support is a must
      • Calling R from Hadoop can be quite useful when analyzing many, but not-so-big, data sets, and is the fastest way to grow your list of statistical and machine learning functions

  36. Thank you!
