
Sparsity-Cognizant Overlapping Co-clustering



Presentation Transcript


1. Sparsity-Cognizant Overlapping Co-clustering
March 11, 2010
Hao Zhu, Dept. of ECE, Univ. of Minnesota, http://spincom.ece.umn.edu
Acknowledgements: G. Mateos, Profs. G. B. Giannakis, N. D. Sidiropoulos, A. Banerjee, and G. Leus
Supported by NSF grants CCF 0830480 and CON 014658, and ARL/CTA grant no. DAAD19-01-2-0011

2. Outline
• Motivation and context
• Problem statement and plaid models
• Sparsity-cognizant overlapping co-clustering (SOC)
• Uniqueness
• Simulated tests
• Conclusions and future research

3. Context
• Co-clustering (biclustering) = two-way clustering
• Clustering: partition the objects (samples, rows) based on a similarity criterion over their attributes (features, columns) [Tan-Steinbach-Kumar '06]
• Co-clustering: simultaneous clustering of objects and attributes [Busygin et al '08]
  • Dense, approximately constant-valued submatrices
  • NP-hard; reduces to ordinary clustering as in k-means
[Figure: objects (rows) × attributes (columns) data matrix]

4. Context
• Co-clustering (biclustering) = two-way clustering
• Clustering: partition the objects (samples, rows) based on a similarity criterion over their attributes (features, columns) [Tan-Steinbach-Kumar '06]
• Co-clustering: simultaneous clustering of objects and attributes [Busygin et al '08]
  • Dense, approximately constant-valued submatrices
  • NP-hard; reduces to ordinary clustering as in k-means
• Application areas
  • Social networks: cohesive subgroups of actors within a network [Wasserman et al '94]
  • Bioinformatics: interpretable biological structure in gene expression data [Lazzeroni et al '02]
  • Internet traffic: dominant host groups with strong interactions [Jin et al '09]

5. Related Work and Our Focus
• Bipartite spectral graph partitioning [Dhillon '01]
  • Matrix factorization based on the SVD
• Orthogonal nonnegative matrix tri-factorization (tNMF) [Ding et al '06]
  • Searches for non-overlapping co-clusters using orthogonality
• Overlapping co-clustering under a Bayesian framework [Fu-Banerjee '09]
  • Probabilistic model for co-cluster membership indicators and parameters
  • EM algorithm for inference and parameter estimation
  • E-step: Gibbs sampling for membership indicator detection
• Plaid models [Lazzeroni et al '02]
  • Superposition of multiple overlapping layers (co-clusters)
  • Greedy layer search: one layer at a time

6. Related Work and Our Focus (cont'd)
• Overlapping: some objects/attributes may relate to multiple co-clusters
• Borrow plaid-model features
  • Partial co-clustering motivated by an "uninteresting background"
  • Linear model requires a low computational burden
• Exploit sparsity in the co-cluster membership vectors
  • Sparse information hidden in large data sets
  • Parsimonious models: more interpretable and informative
• Simultaneous cross-layer optimization
  • Compared with the greedy layer-by-layer strategy
• Our focus: the sparsity-cognizant overlapping co-clustering (SOC) algorithm

7. Modeling
• Matrix Y: n × p
  • Induced by two groups of interacting nodes, one indexing the rows and one indexing the columns
  • Yij measures the strength of the relationship between row node i and column node j
• Ex-1: Internet traffic — traffic activity graph (TAG)
  • Tracks the traffic flow between an inside host and an outside host
• Ex-2: Gene expression microarray data
  • Measures the level at which a gene is expressed in a given sample
(Illustrations taken from [Jin et al '09] and [Lazzeroni et al '02], respectively.)

8. Submatrices
• Hidden dense/uniform submatrices
  • Capture a subset of the row nodes exhibiting similar values over a subset of the column nodes
  • Reveal certain informative behavioral patterns
• Features
  • Distributed "sparsely" in Y compared to the data dimension np
  • May overlap, because some nodes serve multiple functions
• Goal: efficient co-clustering algorithms that extract the underlying submatrices by exploiting sparsity and accounting for possible overlap

9. Plaid Models
• Matrix Y: superposition of k submatrices (layers)
  Yij = θij0 + Σ_{l=1..k} θijl ρil κjl,  where  θijl = μl + αil + βjl
• μl: level of layer l (θij0 is the background, layer 0)
• ρil (κjl) = 1 if row i (column j) belongs to the l-th layer, and 0 otherwise
• Row/column-related effects αil and βjl
  • μl is common to all the nodes in the layer
  • αil and βjl express the node-related response
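A minimal numerical sketch of the plaid superposition above, assuming a constant background level and ordinary NumPy conventions; the function name plaid_matrix and the argument layout are illustrative choices, not taken from the slides.

```python
import numpy as np

def plaid_matrix(mu0, mu, alpha, beta, rho, kappa):
    """Assemble Y = theta_0 + sum_l theta_l * (rho_l kappa_l^T) from plaid parameters.

    mu0   : scalar background level (layer 0, simplified to a constant here)
    mu    : (k,) layer levels mu_l
    alpha : (n, k) row effects alpha_il
    beta  : (p, k) column effects beta_jl
    rho   : (n, k) binary row indicators rho_il
    kappa : (p, k) binary column indicators kappa_jl
    """
    n, k = rho.shape
    p = kappa.shape[0]
    Y = np.full((n, p), float(mu0))
    for l in range(k):
        theta_l = mu[l] + alpha[:, l][:, None] + beta[:, l][None, :]   # theta_ijl
        Y += theta_l * np.outer(rho[:, l], kappa[:, l])                # only inside layer l
    return Y
```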

10. Problem Statement
• Problem: given the plaid model, seek the optimal membership indicators
• Data-fitting error penalized by the L1 norm of the indicators (see the criterion below)
  • λ > 0 controls the sparsity enforced
  • Facilitates extraction of the more informative/interpretable submatrices out of Y
• Finding the optimal solution is NP-hard
  • Binary constraints on the membership indicators [Turner et al '05]
  • Products of different variables
• Sought: an efficient sub-optimal algorithm to identify the submatrices jointly
  • Recall that the submatrices are detected one at a time in [Lazzeroni et al '02]
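A plausible LaTeX rendering of the penalized criterion the slide describes — a least-squares fit of the layers to the background-free residue Z plus an L1 penalty on the binary indicators; the exact weighting used by the authors may differ.

```latex
\min_{\{\mu_l,\alpha_{il},\beta_{jl},\rho_{il},\kappa_{jl}\}}
\;\sum_{i=1}^{n}\sum_{j=1}^{p}
\Big( Z_{ij} - \sum_{l=1}^{k} (\mu_l + \alpha_{il} + \beta_{jl})\,\rho_{il}\,\kappa_{jl} \Big)^{2}
\;+\; \lambda \sum_{i,l} \rho_{il} \;+\; \lambda \sum_{j,l} \kappa_{jl},
\qquad \rho_{il},\,\kappa_{jl} \in \{0,1\}.
```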

11. Sparsity-Cognizant Overlapping Co-clustering (SOC)
• Work with the background-layer-free residue matrix Z
• Iterative cycling update of θ, ρ, and κ (a schematic of the outer loop follows)
  • Per iteration s, θ(s) collects all the θijl(s) values; likewise for ρ(s) and κ(s)
  • Update order: θ(s) given ρ(s-1), κ(s-1); then ρ(s) given θ(s), κ(s-1); then κ(s) given θ(s), ρ(s)
• Different from [Lazzeroni et al '02]
  • All k layers are updated jointly, which is less prone to error propagation across layers
  • Membership indicators are updated under the binary constraint, at the price of combinatorial complexity
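A schematic of the outer SOC cycle in Python, following the update order described above; update_theta, update_rows, and update_cols are placeholders sketched after the next two slides, and S, lam denote the number of outer iterations and the sparsity weight.

```python
def soc(Z, rho0, kappa0, lam, S=20):
    """One possible outer loop for SOC on the background-free residue Z.

    Per iteration s: refit theta given (rho, kappa), then detect the row
    indicators given (theta, kappa), then the column indicators given
    (theta, rho), matching the cyclic order on the slide.
    """
    rho, kappa = rho0.copy(), kappa0.copy()        # binary (n, k) and (p, k)
    theta = None
    for s in range(S):
        theta = update_theta(Z, rho, kappa, theta)     # coordinate descent (slide 12)
        rho = update_rows(Z, theta, kappa, lam)        # per-row binary detection (slide 13)
        kappa = update_cols(Z, theta, rho, lam)        # per-column detection, by symmetry
    return theta, rho, kappa
```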

12. Updating θ(s)
• Given ρ(s-1) and κ(s-1)
  • Unconstrained quadratic program with a closed-form solution
  • But inversion of a large matrix leads to numerical instability
• Coordinate descent algorithm alternating across all the layers (see the sketch below): for l = 1, ..., k
  • Define the residue matrix Zl by removing the contributions of all layers m ≠ l from Z
  • Reduce Zl to a submatrix by extracting the rows with ρil(s-1) = 1 and the columns with κjl(s-1) = 1
  • Update the layer-l parameters on this submatrix, for T cycles (T small)
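A minimal sketch of the per-layer coordinate-descent refit described above. Each layer's (μl, αil, βjl) is refit on the restricted residue with a two-way ANOVA-style least-squares fit (grand mean plus row/column mean deviations); the data structure (a list of per-layer dicts) and this particular refit formula are illustrative assumptions, not necessarily the authors' exact update.

```python
import numpy as np

def update_theta(Z, rho, kappa, theta=None, T=1):
    """Coordinate-descent refit of the layer parameters (mu_l, alpha_il, beta_jl)."""
    n, k = rho.shape
    p = kappa.shape[0]
    if theta is None:
        theta = [dict(mu=0.0, alpha=np.zeros(n), beta=np.zeros(p)) for _ in range(k)]

    def layer(l):
        # Reconstruction of layer l over the full matrix (zero outside its support).
        t = theta[l]
        full = t['mu'] + t['alpha'][:, None] + t['beta'][None, :]
        return full * np.outer(rho[:, l], kappa[:, l])

    for _ in range(T):                                   # T small, as on the slide
        for l in range(k):
            R = Z - sum(layer(m) for m in range(k) if m != l)   # residue w.r.t. other layers
            rows = rho[:, l].astype(bool)
            cols = kappa[:, l].astype(bool)
            if not rows.any() or not cols.any():
                continue
            sub = R[np.ix_(rows, cols)]                   # restrict to the layer's support
            mu = sub.mean()                               # ANOVA-style least-squares fit
            theta[l]['mu'] = mu
            theta[l]['alpha'][:] = 0.0
            theta[l]['beta'][:] = 0.0
            theta[l]['alpha'][rows] = sub.mean(axis=1) - mu
            theta[l]['beta'][cols] = sub.mean(axis=0) - mu
    return theta
```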

13. Updating ρ(s) and κ(s)
• Given θ(s) and κ(s-1), determine ρ(s) (a per-row sketch follows)
  • Obtain jointly the k membership indicators for the i-th row
  • Important for overlapping submatrices, to eliminate cross effects
  • The L1-norm penalty reduces to a linear term due to non-negativity
• Quadratic minimization subject to {0,1} binary constraints is NP-hard
  • Similar problems arise in MIMO/multiuser detection with a binary alphabet
  • (Near-)optimal sphere decoding algorithm (SDA)
  • Incurs polynomial (cubic) complexity in general
• The same technique detects κ(s) given θ(s) and ρ(s)
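A sketch of the per-row detection step. Exhaustive enumeration over {0,1}^k is used here as a simple, exact stand-in for the sphere decoding algorithm the slide mentions; it is only practical for small k. The column update (update_cols) follows by symmetry, applying the same routine to the transposed residue with the roles of ρ and κ exchanged.

```python
import itertools
import numpy as np

def update_rows(Z, theta, kappa, lam):
    """Jointly detect all k row-membership indicators for each row i."""
    n, p = Z.shape
    k = kappa.shape[1]
    # L[l][i, j] = theta_ijl * kappa_jl : contribution of layer l if row i joins it.
    L = [(theta[l]['mu'] + theta[l]['alpha'][:, None] + theta[l]['beta'][None, :])
         * kappa[:, l][None, :] for l in range(k)]
    rho = np.zeros((n, k), dtype=int)
    patterns = list(itertools.product([0, 1], repeat=k))
    for i in range(n):
        best_cost, best_b = np.inf, None
        for b in patterns:
            fit = Z[i] - sum(b[l] * L[l][i] for l in range(k))
            cost = np.sum(fit ** 2) + lam * sum(b)   # L1 penalty is linear for binary b
            if cost < best_cost:
                best_cost, best_b = cost, b
        rho[i] = best_b
    return rho
```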

14. Convergence and Implementation
• The SOC algorithm converges (at least) to a stationary point
  • The data-fitting error cost is bounded below and non-increasing per iteration
• Pruning steps as in [Lazzeroni et al '02], [Turner et al '05]
• Initialization
  • Background-level fitting to obtain the matrix Z (recall the submatrix parameter fitting)
  • Membership indicators ρ(0) and κ(0): K-means [Turner et al '05]
• Parameter choices
  • Number of layers k: explain a certain percentage of the variation
  • Sparse regularization parameter λ: trial-and-error / bi-cross-validation [Witten et al '09]
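A minimal sketch of one way to form ρ(0) and κ(0) via K-means, as the slide suggests: cluster the rows and the columns of Z into k groups and one-hot encode the labels. Using scikit-learn's KMeans here is an implementation choice, not something specified in the slides.

```python
import numpy as np
from sklearn.cluster import KMeans   # any K-means implementation works here

def init_indicators(Z, k, seed=0):
    """Initialize rho(0) and kappa(0) by K-means on the rows and columns of Z."""
    row_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z)
    col_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z.T)
    rho0 = np.eye(k, dtype=int)[row_labels]      # (n, k) one-hot row memberships
    kappa0 = np.eye(k, dtype=int)[col_labels]    # (p, k) one-hot column memberships
    return rho0, kappa0
```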

15. Uniqueness
• Plain plaid models: decomposition of Z into a product of unknown matrices (written out below)
  • Binary-valued matrices R = [ρil] (n × k) and K = [κjl] (p × k)
  • Diagonal matrix D = diag(μ1, ..., μk)
• Blind source separation (BSS) [Talwar et al '96], [van der Veen et al '96]
  • Product of two matrices under a finite alphabet (FA) / constant modulus (CM) constraint
  • (Generally) uniquely identifiable given a sufficient number of samples
• 3-way array (Candecomp/Parafac) [Kruskal '77], [Sidiropoulos et al '00]
  • Unique up to permutation and scaling
  • The identifiability condition fails to hold in the two-way case here, where the third factor has Kruskal-rank equal to 1
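In matrix form, the plain plaid decomposition referred to above (no row/column effects) reads as follows, using the binary matrices R, K and the diagonal matrix D defined on the slide:

```latex
Z \;=\; R\,D\,K^{\mathsf T},
\qquad
R = [\rho_{il}] \in \{0,1\}^{n \times k},\quad
K = [\kappa_{jl}] \in \{0,1\}^{p \times k},\quad
D = \operatorname{diag}(\mu_1,\ldots,\mu_k).
```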

16. Uniqueness (cont'd)
• Sparsity in blind identification
  • Sparse component analysis: very sparse representations [Georgiev et al '07]
  • Non-negative source separation using local dominance [Chan et al '08]
• Proposition: Consider Z = R D K^T, where the diagonal matrix D and the binary-valued matrices R, K are all of full rank. Suppose each column vector κl of K is locally sparse for every l, meaning there exists an (unknown) row index j such that the j-th entry of κl equals 1 while the j-th entry of κm equals 0 for all m ≠ l. Then, given Z, the matrices R, D, and K are unique up to permutations.
• The proof relies on convex analysis
  • The affine hull of the column vectors of K coincides with that of Z
  • Under local sparseness, its convex hull becomes the intersection of the affine hull and the positive orthant
  • The columns of K are the extreme points of this convex hull
• The result also holds when R is locally sparse (by symmetry)
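A small helper that checks the local-sparseness condition of the proposition on a candidate binary matrix K; the function name and the check are a straightforward reading of the condition above, offered only as an illustration.

```python
import numpy as np

def is_locally_sparse(K):
    """Return True if, for every column l of the binary matrix K, some row has
    its only nonzero entry in column l (the local-sparseness condition above)."""
    K = np.asarray(K, dtype=int)
    row_support = K.sum(axis=1)          # number of layers each row belongs to
    return all(np.any((K[:, l] == 1) & (row_support == 1)) for l in range(K.shape[1]))
```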

17. Preliminary Simulation
• Two uniform blocks + noise ~ Unif[0, 0.5]
• SOC parameters: k = 2, S = 20, T = 1, and λ = 0, 3
[Figure panels: Original, Permuted, Plaid, λ = 0, λ = 3]
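Toy data in the spirit of the preliminary simulation above: two constant-valued, overlapping blocks planted in Unif[0, 0.5] noise, then permuted to hide the structure. The block positions, sizes, and levels are illustrative choices, not the exact settings behind the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 40
Y = rng.uniform(0.0, 0.5, size=(n, p))   # background noise ~ Unif[0, 0.5]
Y[5:20, 5:18] += 2.0                      # first uniform block
Y[15:35, 14:30] += 3.0                    # second block, overlapping the first
Y_permuted = Y[rng.permutation(n)][:, rng.permutation(p)]   # shuffled input to the algorithm
```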

18. Real Data To Simulate
• Internet traffic flow data
  • Uncover different types of co-clusters: in-star, out-star, bi-mesh, ...
  • Examples for the email application: department servers, Gmail
  • Overlapping co-clusters may reveal server farms
• Gene expression microarray data
  • Co-clusters may exhibit some biological patterns
  • Need to check against the gene enrichment value

19. Concluding Summary
• Plaid models reveal overlapping co-clusters
  • Exploit sparsity for parsimonious recovery
  • Decide jointly among multiple layers
• The SOC algorithm iteratively updates the unknown parameters
  • Coordinate descent solver for the layer-level parameters
  • Sphere decoder detects the membership indicators jointly
• Local sparseness leads to a unique decomposition

20. Thank You! Future Directions
• Implementation issues with parameter choices
• Efficient initializations and membership vector detection
• Comprehensive numerical experiments on real data
