DisCo: Distributed Co-clustering with Map-Reduce

DisCo: Distributed Co-clustering with Map-Reduce
2008 IEEE International Conference on Data Engineering (ICDE)
S. Papadimitriou, J. Sun (IBM T.J. Watson Research Center, NY, USA)
Tzu-Li Tai, Tse-En Liu, Kai-Wei Chan, He-Chuan Hoh (National Cheng Kung University, Dept. of Electrical Engineering, HPDS Laboratory)

Agenda
• Motivation
• Background: Co-Clustering + MapReduce
• Proposed Distributed Co-Clustering Process
• Implementation Details
• Experimental Evaluation
• Conclusions
• Discussion

Motivation
Fast Growth in Volume of Data
• Google processes 20 petabytes of data per day
• Amazon and eBay handle petabytes of transactional data every day
Highly variant structure of data
• Data sources naturally generate data in impure forms
• Unstructured, semi-structured

Motivation
Problems with Big Data mining for DBMSs
• Significant preprocessing costs for the majority of data mining tasks
• DBMSs lack the performance needed for large amounts of data

Motivation
Why distributed processing can solve the issues:
• MapReduce is independent of the schema or form of the input data
• Many preprocessing tasks are naturally expressible with MapReduce
• Highly scalable with commodity machines

Motivation
Contributions of this paper:
• Presents the whole process for distributed data mining
• Specifically, focuses on the co-clustering mining task and designs a distributed co-clustering method using MapReduce

Background: Co-Clustering
• Also named biclustering, or two-mode clustering
• Input format: a matrix of rows and columns
• Output: co-clusters (sub-matrices) whose rows exhibit similar behavior across a subset of columns
[Figure: two 4*5 matrices]
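To make the output format concrete, here is a minimal Python sketch that splits a small 0/1 matrix into sub-matrices, given one cluster label per row and per column. The matrix follows the 4*5 example used on the later algorithm slides; the function name `co_cluster_blocks` and the final label values are illustrative assumptions, not something the slides define.

```python
def co_cluster_blocks(matrix, row_labels, col_labels):
    """Group matrix cells into sub-matrices keyed by (row cluster, column cluster)."""
    blocks = {}
    for i, row in enumerate(matrix):
        # Split this row's cells by column cluster.
        per_col_cluster = {}
        for j, value in enumerate(row):
            per_col_cluster.setdefault(col_labels[j], []).append(value)
        # Append each piece as one row of the matching sub-matrix.
        for col_cluster, cells in per_col_cluster.items():
            blocks.setdefault((row_labels[i], col_cluster), []).append(cells)
    return blocks


# 4x5 binary matrix: rows 1 and 3 have ones in columns 2, 4, 5;
# rows 2 and 4 have ones in columns 1, 3.
matrix = [
    [0, 1, 0, 1, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 1],
    [1, 0, 1, 0, 0],
]
row_labels = [1, 2, 1, 2]     # assumed final row clusters
col_labels = [1, 2, 1, 2, 2]  # assumed final column clusters

blocks = co_cluster_blocks(matrix, row_labels, col_labels)
for key in sorted(blocks):
    print(key, blocks[key])
```

With these labels every sub-matrix is uniform (all zeros or all ones), which is the "similar behavior across a subset of columns" that co-clustering looks for.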

Background: Co-Clustering
Why Co-Clustering?
Traditional clustering:
[Figure: score table with rows Student A to Student D and columns Social, Science, Chinese, English, Math; rows grouped as A & C and B & D]
• Can only know that students A & C / B & D have similar scores

Background: Co-Clustering
Why Co-Clustering?
Co-clustering:
[Figure: score table (Student A to Student D by Social, Science, Chinese, English, Math) partitioned into Cluster 1 and Cluster 2]
• Cluster 1 (B & D): good at Science + Math
• Cluster 2 (A & C): good at English + Chinese + Social Studies
• Rows that have similar properties for a subset of selected columns

Background: Co-Clustering
Another Co-Clustering Example: Animal Data

Background: MapReduce
The MapReduce Paradigm
[Figure: map tasks feeding reduce tasks]
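As a rough illustration of the paradigm, the sketch below simulates the map, shuffle and reduce phases in a single Python process. The names `run_mapreduce`, `degree_mapper` and `sum_reducer` are made up for this example, and a real framework such as Hadoop distributes these phases across machines instead of running them in one loop.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for a MapReduce job: map, shuffle by key, reduce."""
    intermediate = defaultdict(list)
    for record in records:
        # Map phase: each input record may emit any number of (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle: values sharing an intermediate key are grouped together.
            intermediate[key].append(value)
    output = []
    # Reduce phase: each key and its grouped values go to one reducer call.
    for key in sorted(intermediate):
        output.extend(reducer(key, intermediate[key]))
    return output


def degree_mapper(edge):
    # Emit one count per outgoing edge, keyed by the source node.
    src, _dst = edge
    yield src, 1


def sum_reducer(key, values):
    yield key, sum(values)


edges = [("a", "b"), ("a", "c"), ("b", "c")]
print(run_mapreduce(edges, degree_mapper, sum_reducer))
```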

Distributed Co-Clustering Process
Mining Network Logs to Co-Cluster Communication Behavior

Distributed Co-Clustering Process
The Preprocessing Process
• MapReduce job: extract SrcIP + DstIP and build the adjacency matrix (rows: SrcIP, columns: DstIP)
• MapReduce job: build the adjacency list
• MapReduce job: build the transpose adjacency list
• Each job reads its input from HDFS and writes its output to HDFS
[Figure: 0/1 adjacency matrix with one row per SrcIP address and one column per DstIP address, e.g. a row "0 1 0 1 1 0 0 0 1 1 1 0 …"]
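The preprocessing jobs can be sketched in a few lines of Python. This is a single-process approximation under stated assumptions: the log format ("timestamp src_ip dst_ip bytes"), the sample addresses and all function names are hypothetical, and the extraction step is folded into each job instead of running as a separate job as on the slide.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Tiny in-memory stand-in for one MapReduce job."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Hypothetical log format: "timestamp src_ip dst_ip bytes".
log_lines = [
    "t1 10.0.0.1 10.0.0.9 120",
    "t2 10.0.0.1 10.0.0.8 300",
    "t3 10.0.0.2 10.0.0.9 64",
    "t4 10.0.0.1 10.0.0.9 90",
]


def extract_edge(line):
    # Key by source address: yields the adjacency list (rows of the matrix).
    _ts, src, dst, _size = line.split()
    yield src, dst


def extract_transposed_edge(line):
    # Key by destination address: yields the transpose adjacency list (columns).
    _ts, src, dst, _size = line.split()
    yield dst, src


def build_neighbor_list(key, values):
    # Deduplicate so the result describes a 0/1 adjacency matrix.
    return key, sorted(set(values))


adjacency = map_reduce(log_lines, extract_edge, build_neighbor_list)
transpose = map_reduce(log_lines, extract_transposed_edge, build_neighbor_list)
print(adjacency)
print(transpose)
```

The adjacency list serves the row iterations and the transpose serves the column iterations, which is why both are built up front.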

Distributed Co-Clustering Process
Co-Clustering (Generalized Algorithm)
Goal: co-cluster into 2x2 = 4 sub-matrices (each row label and each column label is 1 or 2)
Random initialization:
• Row labels: r(1) = 1, r(2) = 1, r(3) = 1, r(4) = 2
• Column labels: c(1) = 1, c(2) = 1, c(3) = 1, c(4) = 2, c(5) = 2

Distributed Co-Clustering Process
Co-Clustering (Generalized Algorithm)
Fix column labels, iterate through rows:
• Column labels (fixed): c(1) = 1, c(2) = 1, c(3) = 1, c(4) = 2, c(5) = 2
• Row labels: r(1) = 1, r(2) = 1, r(3) = 1, r(4) = 2
• Result: r(2) changes to 2

Distributed Co-Clustering Process
Co-Clustering (Generalized Algorithm)
Fix row labels, iterate through columns:
• Column labels: c(1) = 1, c(2) = 1, c(3) = 1, c(4) = 2, c(5) = 2
• Result: c(2) changes to 2
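The alternating row and column iterations can be sketched sequentially as follows. The slides do not state the cost function, so this sketch assumes a simple one: the squared distance between a row's density in each column cluster and the average density of the candidate row cluster. Labels are 0-based here (the slides use 1 and 2), and the names `relabel` and `co_cluster` are illustrative.

```python
def relabel(vectors, labels, other_labels, num_groups, num_other_groups):
    """One pass: move each vector to the group with the closest density profile."""
    # Size of each group on the other axis (e.g. columns per column cluster).
    other_sizes = [0] * num_other_groups
    for g in other_labels:
        other_sizes[g] += 1

    def profile(vector):
        # Fraction of ones the vector has inside each group of the other axis.
        counts = [0] * num_other_groups
        for value, g in zip(vector, other_labels):
            counts[g] += value
        return [c / s if s else 0.0 for c, s in zip(counts, other_sizes)]

    profiles = [profile(v) for v in vectors]
    # Group centroids use the labels from before this pass (fixed during the pass).
    centroids = []
    for g in range(num_groups):
        members = [p for p, lab in zip(profiles, labels) if lab == g]
        if members:
            centroids.append([sum(col) / len(members) for col in zip(*members)])
        else:
            centroids.append(None)  # empty group: never chosen in this sketch

    def cost(p, g):
        # Assumed cost: squared distance to the group's centroid.
        if centroids[g] is None:
            return float("inf")
        return sum((a - b) ** 2 for a, b in zip(p, centroids[g]))

    new_labels = []
    for p, current in zip(profiles, labels):
        best = min(range(num_groups), key=lambda g: cost(p, g))
        # Change the label only if it strictly lowers the cost.
        new_labels.append(best if cost(p, best) < cost(p, current) else current)
    return new_labels


def co_cluster(matrix, row_labels, col_labels, num_row_groups, num_col_groups, max_iters=10):
    for _ in range(max_iters):
        # Fix column labels, iterate through rows.
        new_rows = relabel(matrix, row_labels, col_labels, num_row_groups, num_col_groups)
        # Fix row labels, iterate through columns (rows of the transpose).
        columns = [list(col) for col in zip(*matrix)]
        new_cols = relabel(columns, col_labels, new_rows, num_col_groups, num_row_groups)
        if new_rows == row_labels and new_cols == col_labels:
            break
        row_labels, col_labels = new_rows, new_cols
    return row_labels, col_labels


matrix = [
    [0, 1, 0, 1, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 1],
    [1, 0, 1, 0, 0],
]
rows, cols = co_cluster(matrix, [0, 0, 0, 1], [0, 0, 0, 1, 1], 2, 2)
print(rows, cols)
```

Starting from the random initialization on the slides, the first row pass moves row 2 to the second cluster and the first column pass moves column 2 to the second cluster, matching the r(2) = 2 and c(2) = 2 updates shown.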

Distributed Co-Clustering Process
Co-Clustering with MapReduce
• Input (adjacency list): 1 -> 2,4,5; 2 -> 1,3; 3 -> 2,4,5; 4 -> 1,3
• The adjacency list is the input to a MapReduce (MR) job based on parameters

Distributed Co-Clustering Process
Column labels: c(1) = 1, c(2) = 1, c(3) = 1, c(4) = 2, c(5) = 2
Input: 1 -> 2,4,5; 2 -> 1,3; 3 -> 2,4,5; 4 -> 1,3 (one mapper M per record)
Mapper function: for each key-value input,
• Calculate the row statistics from the record and the column labels
• Change the row label if that results in a lower cost
• Emit (r(k), (row statistics, k))
Example, record 1 -> 2,4,5:
• If r(1) = 2, the cost becomes higher, so r(1) = 1
• Emit (r(k), (row statistics, k)) = (1, {(1,2), 1})

Distributed Co-Clustering Process
Column labels: c(1) = 1, c(2) = 1, c(3) = 1, c(4) = 2, c(5) = 2
Input: 1 -> 2,4,5; 2 -> 1,3; 3 -> 2,4,5; 4 -> 1,3 (one mapper M per record)
Mapper function: for each key-value input,
• Calculate the row statistics from the record and the column labels
• Change the row label if that results in a lower cost
• Emit (r(k), (row statistics, k))
Example, record 2 -> 1,3:
• If r(2) = 2, the cost becomes lower, so r(2) = 2
• Emit (r(k), (row statistics, k)) = (2, {(2,0), 2})
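A minimal Python sketch of such a mapper follows, using the slide's 1-based labels. The cost function is an assumption (squared distance between the row's density per column cluster and the density of the candidate row cluster's block); the slides only say that a cost is compared. The parameter names `G`, `row_sizes` and `col_sizes` are hypothetical, and the sketch assumes no cluster is empty.

```python
def row_mapper(k, neighbors, r, c, G, row_sizes, col_sizes):
    """Map one record: k is the row id, neighbors lists its nonzero columns."""
    num_col_groups = len(col_sizes)
    # Row statistics: number of nonzeros the row has in each column cluster.
    stats = [0] * num_col_groups
    for j in neighbors:
        stats[c[j] - 1] += 1

    def cost(label):
        # Assumed cost, not taken from the slides: squared distance between the
        # row's density and the candidate block's density, per column cluster.
        total = 0.0
        for q in range(num_col_groups):
            row_density = stats[q] / col_sizes[q]
            block_density = G[label - 1][q] / (row_sizes[label - 1] * col_sizes[q])
            total += (row_density - block_density) ** 2
        return total

    best = min(range(1, len(row_sizes) + 1), key=cost)
    # Change the row label only if that results in a lower cost.
    new_label = best if cost(best) < cost(r[k]) else r[k]
    return new_label, (tuple(stats), k)


# State before the row iteration, following the slide's example.
r = {1: 1, 2: 1, 3: 1, 4: 2}
c = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2}
row_sizes = [3, 1]     # rows per row cluster
col_sizes = [3, 2]     # columns per column cluster
G = [[4, 4], [2, 0]]   # nonzeros per (row cluster, column cluster) block

print(row_mapper(1, [2, 4, 5], r, c, G, row_sizes, col_sizes))
print(row_mapper(2, [1, 3], r, c, G, row_sizes, col_sizes))
```

Under this assumed cost, the two calls reproduce the emitted pairs shown on the slides: row 1 keeps label 1 with statistics (1,2), and row 2 moves to label 2 with statistics (2,0).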

Distributed Co-Clustering Process
[Diagram: four mappers (M), one per record (1 -> 2,4,5; 2 -> 1,3; 3 -> 2,4,5; 4 -> 1,3), feeding two reducers (R)]

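Distributed Co-Clustering Process
Reducer function: for each key-value input (a row label and the values emitted for it),
• For each value, accumulate all row statistics into one total and take the union of all row indices
• Emit the row label with the accumulated statistics and the union of row indices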
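The reducer step can be sketched like this in Python. The input pairs for rows 1 and 2 are the ones emitted in the mapper example on the earlier slides; the pairs for rows 3 and 4 are inferred here because those rows are identical to rows 1 and 2. The function name `row_reducer` is illustrative.

```python
from collections import defaultdict


def row_reducer(label, values):
    """Reduce one row cluster: values are (row statistics, row id) pairs."""
    total = None
    members = set()
    for stats, row_id in values:
        # Accumulate the per-column-cluster statistics element-wise.
        if total is None:
            total = list(stats)
        else:
            total = [a + b for a, b in zip(total, stats)]
        # Union of all row ids assigned to this cluster.
        members.add(row_id)
    return label, (tuple(total), members)


# Mapper output after the row iteration (rows 3 and 4 are inferred).
mapper_output = [
    (1, ((1, 2), 1)),
    (2, ((2, 0), 2)),
    (1, ((1, 2), 3)),
    (2, ((2, 0), 4)),
]

# Shuffle: group values by intermediate key (the row label).
grouped = defaultdict(list)
for key, value in mapper_output:
    grouped[key].append(value)

results = dict(row_reducer(key, values) for key, values in grouped.items())
print(results)
```

Each reducer output holds what the sync step needs: the updated statistics of one row cluster and the set of rows now assigned to it.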

Distributed Co-Clustering Process
[Diagram: the outputs of the reducers (R) are combined in a "Sync Results" step]

Distributed Co-Clustering Process
Preprocessing:
• MapReduce job: build transpose adjacency list (HDFS to HDFS)
Co-Clustering:
• Start from randomly given labels
• MapReduce job: fix column, row iteration
• MapReduce job: fix row, column iteration
• Sync results: labels synced with best permutation
• Final co-clustering result with best permutations, written to HDFS

Implementation Details
Tuning the number of Reduce Tasks
• The number of reduce tasks is related to the number of intermediate keys during the shuffle and sort phase
• For the co-clustering row-iteration/column-iteration jobs, the number of intermediate keys is either the number of row clusters or the number of column clusters

Implementation Details
[Diagram: four mappers (M), one per record (1 -> 2,4,5; 2 -> 1,3; 3 -> 2,4,5; 4 -> 1,3), emit the row-iteration intermediate keys to two reducers (R)]

Implementation Details
Tuning the number of Reduce Tasks
• So, for the row-iteration/column-iteration jobs, 1 reduce task is enough
• However, some preprocessing tasks, such as graph construction, produce many intermediate keys and need many more reduce tasks
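The difference in intermediate key counts can be seen on the toy example from the earlier slides. This is only an illustrative Python sketch: the variable names are made up, and the edge list is the 4*5 example's adjacency list written as (row, column) pairs.

```python
# Row-iteration job: mappers key their output by row cluster label,
# so the number of distinct keys is bounded by the number of row clusters.
row_labels = {1: 1, 2: 2, 3: 1, 4: 2}  # row id -> cluster label
row_iteration_keys = set(row_labels.values())

# Graph construction job: mappers key their output by source node,
# so the number of distinct keys grows with the number of nodes.
edges = [(1, 2), (1, 4), (1, 5), (2, 1), (2, 3),
         (3, 2), (3, 4), (3, 5), (4, 1), (4, 3)]
graph_construction_keys = {src for src, _dst in edges}

print(len(row_iteration_keys), len(graph_construction_keys))
```

On real data the first count stays at the chosen number of clusters while the second grows with the input, which is why the two kinds of jobs call for different numbers of reduce tasks.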

Implementation Details
The Preprocessing Process
• MapReduce job: extract SrcIP + DstIP and build the adjacency matrix (rows: SrcIP, columns: DstIP)
• MapReduce job: build the adjacency list
• MapReduce job: build the transpose adjacency list
• Each job reads its input from HDFS and writes its output to HDFS
[Figure: 0/1 adjacency matrix with one row per SrcIP address and one column per DstIP address]

Experimental Evaluation
Environment
• 39 nodes in four different blade enclosures
• Gigabit Ethernet
• Blade server
  • CPU: two dual-core (Intel Xeon 2.66GHz)
  • Memory: 8GB
  • OS: Red Hat Enterprise Linux
• Hadoop Distributed File System (HDFS) capacity: 2.4 TB

Experimental Evaluation
Datasets

Experimental Evaluation
Preprocessing ISS Data
Optimal values of each setting:
• Number of map tasks: 6
• Number of reduce tasks: 5
• Input split size: 256MB

Experimental Evaluation
Co-Clustering TREC Data
• From 25 nodes onward, the time per iteration is roughly 20 ± 2 seconds
• This is better than what can be obtained on a single machine with 48GB RAM

Conclusion
• The authors of the paper share the lessons they learnt from data mining experiences with vast quantities of data, particularly in the context of co-clustering, and recommend using a distributed approach
• Designed a general MapReduce approach for co-clustering algorithms
• Showed that the MapReduce co-clustering framework scales well with real-world large datasets (ISS, TREC)

Discussion
• Necessity of the global sync action
• Questionable scalability of DisCo

Discussion
Necessity of the global sync action
Co-clustering flow:
• Start from randomly given labels
• MapReduce job: fix column, row iteration
• MapReduce job: fix row, column iteration
• Sync results: labels synced with best permutation
• Final co-clustering result with best permutations

Discussion
[Diagram: four mappers (M), one per record (1 -> 2,4,5; 2 -> 1,3; 3 -> 2,4,5; 4 -> 1,3), feeding two reducers (R)]

Discussion
Questionable Scalability of DisCo
• For row-iteration jobs (or column-iteration jobs), the number of intermediate keys is fixed to be the number of row clusters (or column clusters)
• This implies that, for a given number of row clusters and column clusters, the reducer size* will increase dramatically as the input matrix gets larger
• Since a single reducer (a key plus its associated values) is sent to one reduce task, the memory capacity of a computing node will be a severe bottleneck for overall performance
*Reference: Upper and Lower Bounds on the Cost of a Map-Reduce Computation, VLDB 2013
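The growth in reducer size can be illustrated with simple arithmetic. The Python sketch below assumes that rows are spread evenly over the row clusters and that no combiner pre-aggregates the mapper output; both are simplifying assumptions made here, and the function name and the cluster count of 10 are hypothetical.

```python
def average_reducer_size(num_rows, num_row_clusters):
    """Average number of mapper outputs that share one intermediate key."""
    # Every row emits exactly one value, keyed by its row cluster label.
    return num_rows / num_row_clusters


# With the number of clusters fixed, the per-key load grows linearly with the input.
for num_rows in (10**3, 10**6, 10**9):
    print(num_rows, average_reducer_size(num_rows, num_row_clusters=10))
```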

Discussion
[Diagram: four mappers (M), one per record (1 -> 2,4,5; 2 -> 1,3; 3 -> 2,4,5; 4 -> 1,3), feeding two reducers (R)]
