
EECS 800 Research Seminar Mining Biological Data


Presentation Transcript


  1. EECS 800 Research Seminar: Mining Biological Data Instructor: Luke Huan, Fall 2006

  2. Model-Based Clustering • What is model-based clustering? • An attempt to optimize the fit between the given data and some mathematical model • Based on the assumption that the data are generated by a mixture of underlying probability distributions • Typical methods • Statistical approach: EM (Expectation Maximization), AutoClass • Machine learning approach: COBWEB, CLASSIT • Neural network approach: SOM (Self-Organizing Feature Map)

  3. EM (Expectation Maximization) • EM: a popular iterative refinement algorithm • An extension of k-means: each object is assigned to a cluster according to a weight (probability distribution), and new means are computed from these weighted assignments • General idea • Start with an initial estimate of the parameter vector • Iteratively rescore the patterns against the mixture density produced by the current parameter vector • Use the rescored patterns to update the parameter estimates • Patterns belong to the same cluster if their scores place them in the same mixture component • The algorithm converges quickly but may not reach the global optimum • AutoClass (Cheeseman and Stutz, 1996)

  4. 1D Gaussian Mixture Model • Given a set of data distributed in a 1D space, how do we perform clustering on the data set? • General idea: factorize the p.d.f. into a mixture of simple models • Discrete values: Bernoulli distribution • Continuous values: Gaussian distribution

  5. The EM (Expectation Maximization) Algorithm • Initially, randomly assign k cluster centers • Iteratively refine the clusters in two steps • Expectation step: assign each data point Xi to cluster Ci with a probability computed from the current model • Maximization step: re-estimate the model parameters
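
  The slide's formulas are images and did not survive the transcript. For a mixture of k one-dimensional Gaussians with weights w_j, means μ_j, and variances σ_j², the two steps take the following standard form (a textbook reconstruction, not the slide's own notation):

    % E-step: responsibility of cluster C_j for data point x_i
    P(C_j \mid x_i) = \frac{w_j \,\mathcal{N}(x_i \mid \mu_j, \sigma_j^2)}
                           {\sum_{l=1}^{k} w_l \,\mathcal{N}(x_i \mid \mu_l, \sigma_l^2)}

    % M-step: re-estimate the parameters from the weighted assignments
    w_j = \frac{1}{n} \sum_{i=1}^{n} P(C_j \mid x_i), \quad
    \mu_j = \frac{\sum_i P(C_j \mid x_i)\, x_i}{\sum_i P(C_j \mid x_i)}, \quad
    \sigma_j^2 = \frac{\sum_i P(C_j \mid x_i)\,(x_i - \mu_j)^2}{\sum_i P(C_j \mid x_i)}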

  6. Another Way of Doing K-means? • Pros: • AutoClass can adapt to different (convex) shapes of clusters, whereas k-means assumes spheres • Solid statistical foundation • Cons: • Computationally expensive

  7. Model-Based Subspace Clustering • Microarray • Bi-clustering • δ-clustering • p-clustering • OP-clustering

  8. MicroArray Dataset

  9. Gene Expression Matrix (figure: a matrix with genes as rows and conditions as columns; conditions may be time points or cancer tissues)

  10. Data Mining: Clustering • K-means clustering minimizes E = Σ_k Σ_{x ∈ C_k} ||x − μ_k||², where μ_k is the mean (centroid) of cluster C_k

  11. Clustering by Pattern Similarity (p-Clustering) • The "raw" micro-array data shows 3 genes and their values in a multi-dimensional space • Parallel coordinates plots • Difficult to find their patterns • Calls for "non-traditional" clustering

  12. Clusters Are Clear After Projection

  13. Motivation • DNA microarray analysis

  14. Motivation

  15. Motivation • Strong coherence is exhibited by the selected objects on the selected attributes • They are not necessarily close to each other but rather bear a constant shift • Object/attribute bias • bi-cluster

  16. Challenges • The set of objects and the set of attributes are usually unknown. • Different objects/attributes may possess different biases and such biases • may be local to the set of selected objects/attributes • are usually unknown in advance • May have many unspecified entries

  17. Previous Work • Subspace clustering: identify a set of objects and a set of attributes such that the objects are physically close to each other on the subspace formed by those attributes • Collaborative filtering: Pearson R • Only considers the global offset of each object/attribute
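
  For reference, the Pearson correlation between two objects x and y is the standard textbook formula (not taken from the slide):

    R = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}
             {\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}

  Because each object is centered by a single mean, Pearson R captures only a global offset (and scale) per object, not biases local to a subset of attributes.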

  18. bi-cluster Terms • Consists of a (sub)set of objects and a (sub)set of attributes • Corresponds to a submatrix • Occupancy threshold: each object/attribute has to be filled to a certain percentage • Volume: the number of specified entries in the submatrix • Base: the average value of each object/attribute (in the bi-cluster) • Biclustering of Expression Data, Cheng & Church, ISMB'00
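
  A minimal sketch of these terms in code, assuming unspecified entries are stored as NaN in a numpy array (the function names are mine, not from the paper):

    import numpy as np

    def volume(sub):
        """Number of specified (non-NaN) entries in the submatrix."""
        return int(np.count_nonzero(~np.isnan(sub)))

    def object_bases(sub):
        """Base of each object: its mean over the specified entries of its row."""
        return np.nanmean(sub, axis=1)

    def attribute_bases(sub):
        """Base of each attribute: its mean over the specified entries of its column."""
        return np.nanmean(sub, axis=0)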

  19. bi-cluster (figure)

  20. (figure: expression profiles of 40 genes under 17 conditions)

  21. Motivation (figure)

  22. (figure: the same 40 genes and 17 conditions)

  23. Motivation • Co-regulated genes (figure)

  24. bi-cluster • Perfect δ-cluster • Imperfect δ-cluster • Residue of entry d_ij in bi-cluster (I, J): r(d_ij) = d_ij − d_iJ − d_Ij + d_IJ, where d_iJ is the mean of row i, d_Ij is the mean of column j, and d_IJ is the mean of the whole bi-cluster

  25. bi-cluster • The smaller the average residue, the stronger the coherence • Objective: identify δ-clusters with residue smaller than a given threshold δ
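
  Averaging the squared residues gives the Cheng-Church mean squared residue score; a direct translation for a fully specified numpy submatrix:

    def mean_squared_residue(sub):
        """Mean squared residue of a fully specified submatrix (rows I, columns J)."""
        row_means = sub.mean(axis=1, keepdims=True)   # d_iJ
        col_means = sub.mean(axis=0, keepdims=True)   # d_Ij
        overall = sub.mean()                          # d_IJ
        residues = sub - row_means - col_means + overall
        return float((residues ** 2).mean())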

  26. Cheng-Church Algorithm • Find one bi-cluster • Replace the data in the first bi-cluster with random data • Find the second bi-cluster, and so on • The quality of the bi-clusters degrades (smaller volume, higher residue) due to the insertion of random data
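
  The overall loop might look like the sketch below; find_one_bicluster stands in for Cheng and Church's greedy node-deletion search, which the slides do not detail, and the function name mask_and_repeat is mine:

    import numpy as np

    def mask_and_repeat(data, n_clusters, find_one_bicluster, seed=0):
        """Find bi-clusters one at a time, masking each with random data.

        find_one_bicluster is a placeholder: it should return the
        (row_indices, col_indices) of one low-residue submatrix.
        """
        rng = np.random.default_rng(seed)
        data = data.copy()
        lo, hi = data.min(), data.max()
        clusters = []
        for _ in range(n_clusters):
            rows, cols = find_one_bicluster(data)
            clusters.append((rows, cols))
            # Overwrite the found bi-cluster with uniform random values so it
            # is not rediscovered; this is the step that degrades later bi-clusters.
            data[np.ix_(rows, cols)] = rng.uniform(lo, hi, (len(rows), len(cols)))
        return clusters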

  27. The FLOC Algorithm • Generate initial clusters • Repeat: determine the best action for each row and each column, then perform the best action of each row and column sequentially • Stop when no improvement is achieved • Yang et al., delta-Clusters: Capturing Subspace Correlation in a Large Data Set, ICDE'02

  28. The FLOC Algorithm • Action: the change of membership of a row (or column) with respect to a cluster • With M columns and N rows (here M = 4, N = 3), M + N actions are performed at each iteration • (figure: an example membership matrix)

  29. The FLOC Algorithm • Gain of an action: the residue reduction incurred by performing the action • Order of actions: fixed order, random order, or weighted random order • Complexity: O((M+N)MNkp)
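
  A hedged sketch of the gain computation, reusing mean_squared_residue from the earlier sketch; reading "change of membership" as toggling one row or column in or out of the cluster is my interpretation:

    import numpy as np

    def action_gain(data, rows, cols, idx, is_row):
        """Residue reduction from toggling membership of one row or column."""
        before = mean_squared_residue(data[np.ix_(rows, cols)])
        if is_row:
            rows = [r for r in rows if r != idx] if idx in rows else rows + [idx]
        else:
            cols = [c for c in cols if c != idx] if idx in cols else cols + [idx]
        after = mean_squared_residue(data[np.ix_(rows, cols)])
        return before - after  # positive gain: the action lowers the residue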

  30. The FLOC Algorithm • Additional features • Maximum allowed overlap among clusters • Minimum coverage of clusters • Minimum volume of each cluster • These constraints can be enforced by "temporarily blocking" certain actions during the mining process if they would violate a constraint

  31. Performance • Microarray data: 2884 genes, 17 conditions • The 100 bi-clusters with the smallest residue were returned • Average residue = 10.34 • The average residue of clusters found via the state-of-the-art method in the computational biology field is 12.54 • The average volume is 25% bigger • The response time is an order of magnitude faster

  32. Concluding Remarks • The bi-cluster model is proposed to capture coherent objects in an incomplete data set • base • residue • Many additional features can be accommodated (nearly for free)

  33. p-Clustering: Clustering by Pattern Similarity • Given objects x, y in O and attributes a, b in T, the pScore of the 2×2 matrix is pScore = |(d_xa − d_xb) − (d_ya − d_yb)| • A pair (O, T) forms a δ-pCluster if for any 2×2 matrix X in (O, T), pScore(X) ≤ δ for some δ > 0 • For scaling patterns, taking the logarithm of the raw values turns a multiplicative factor into an additive shift, which leads to the pScore form • H. Wang, et al., Clustering by Pattern Similarity in Large Data Sets, SIGMOD'02
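
  A direct translation of the definition into code; the exhaustive 2×2 check is only meant to make the definition concrete, not to be an efficient mining algorithm:

    from itertools import combinations

    def pscore(d_xa, d_xb, d_ya, d_yb):
        """pScore of the 2x2 matrix [[d_xa, d_xb], [d_ya, d_yb]]."""
        return abs((d_xa - d_xb) - (d_ya - d_yb))

    def is_delta_pcluster(sub, delta):
        """(O, T) is a delta-pCluster iff every 2x2 submatrix has pScore <= delta."""
        n_obj, n_attr = sub.shape   # sub: 2D numpy array, objects x attributes
        for x, y in combinations(range(n_obj), 2):
            for a, b in combinations(range(n_attr), 2):
                if pscore(sub[x, a], sub[x, b], sub[y, a], sub[y, b]) > delta:
                    return False
        return True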

  34. Coherent Cluster • Want to accommodate noise but not outliers

  35. Coherent Cluster • Coherent cluster: subspace clustering by pair-wise disparity • For a 2×2 (sub)matrix consisting of objects {x, y} and attributes {a, b}, the disparity is D = |(d_xa − d_ya) − (d_xb − d_yb)|, where d_xa − d_ya is the mutual bias of attribute a and d_xb − d_yb is the mutual bias of attribute b • (figure: values d_xa, d_xb, d_ya, d_yb plotted over attributes a and b)

  36. Coherent Cluster • A 2×2 (sub)matrix is a δ-coherent cluster if its D value is less than or equal to δ • An m×n matrix X is a δ-coherent cluster if every 2×2 submatrix of X is a δ-coherent cluster • A δ-coherent cluster is a maximum coherent cluster if it is not a submatrix of any other δ-coherent cluster • Objective: given a data matrix and a threshold δ, find all maximum δ-coherent clusters

  37. Coherent Cluster • Challenges: • Finding subspace clustering based on distance itself is already a difficult task due to the curse of dimensionality. • The (sub)set of objects and the (sub)set of attributes that form a cluster are unknown in advance and may not be adjacent to each other in the data matrix. • The actual values of the objects in a coherent cluster may be far apart from each other. • Each object or attribute in a coherent cluster may bear some relative bias (that are unknown in advance) and such bias may be local to the coherent cluster.

  38. Coherent Cluster • Compute the maximum coherent attribute sets for each pair of objects • Apply two-way pruning • Construct the lexicographical tree • Post-order traverse the tree to find maximum coherent clusters

  39. Coherent Cluster • Observation: given a pair of objects {o1, o2} and a (sub)set of attributes {a1, a2, …, ak}, the 2×k submatrix is a δ-coherent cluster iff, for every attribute ai, the mutual biases (d_o1ai − d_o2ai) do not differ from each other by more than δ • Example: the mutual biases on a1–a5 are 3, 2, 3.5, 2, 2.5, all within the interval [2, 3.5]; if δ = 1.5, then {a1,a2,a3,a4,a5} is a coherent attribute set (CAS) of (o1,o2)
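
  One way to compute the maximal coherent attribute sets for a pair of objects is to sort the mutual biases and sweep a window of width δ over them; this sweep is my implementation choice, as the slides only state the observation:

    def max_coherent_attribute_sets(o1, o2, delta):
        """Maximal attribute sets whose mutual biases span at most delta."""
        idx = sorted(range(len(o1)), key=lambda a: o1[a] - o2[a])
        bias = [o1[a] - o2[a] for a in idx]
        n = len(idx)
        # starts[end]: smallest window start whose bias span stays within delta
        starts, start = [], 0
        for end in range(n):
            while bias[end] - bias[start] > delta:
                start += 1
            starts.append(start)
        # a window is maximal iff no later window contains it
        return [{idx[i] for i in range(starts[end], end + 1)}
                for end in range(n)
                if end == n - 1 or starts[end + 1] > starts[end]]

    # Slide example: mutual biases 3, 2, 3.5, 2, 2.5 with delta = 1.5
    # yield a single CAS containing all five attributes.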

  40. Coherent Cluster • Observation: given a subset of objects {o1, o2, …, ol} and a subset of attributes {a1, a2, …, ak}, the l×k submatrix is a δ-coherent cluster iff {a1, a2, …, ak} is a coherent attribute set for every pair of objects (oi, oj) where 1 ≤ i, j ≤ l

  41. Coherent Cluster • Strategy: find the maximum coherent attribute sets for each pair of objects with respect to the given threshold δ • Example with δ = 1 (figure: objects r1 and r2 over attributes a1–a5, with mutual biases 3, 2, 3.5, 2, 2.5) • The maximum coherent attribute sets define the search space for maximum coherent clusters

  42. Two-Way Pruning (delta = 1, nc = 3, nr = 3) • MCAS: (o0,o2) → (a0,a1,a2); (o1,o2) → (a0,a1,a2) • MCOS: (a0,a1) → (o0,o1,o2); (a0,a2) → (o1,o2,o3); (a1,a2) → (o1,o2,o4); (a1,a2) → (o0,o2,o4)

  43. Coherent Cluster • Strategy: group object pairs by their CAS and, for each group, find the maximum clique(s) • Implementation: use a lexicographical tree to organize the object pairs and generate all maximum coherent clusters with a single post-order traversal of the tree

  44. (assume δ = 1) • CAS per object pair: (o0,o1): {a0,a1}, {a2,a3}; (o0,o2): {a0,a1,a2,a3}; (o0,o4): {a1,a2}; (o1,o2): {a0,a1,a2}, {a2,a3}; (o1,o3): {a0,a2}; (o1,o4): {a1,a2}; (o2,o3): {a0,a2}; (o2,o4): {a1,a2} • Grouped by attribute set: {a0,a1}: (o0,o1), (o0,o2), (o1,o2); {a0,a2}: (o1,o3), (o2,o3), (o0,o2), (o1,o2); {a1,a2}: (o0,o4), (o1,o4), (o2,o4), (o0,o2), (o1,o2); {a2,a3}: (o0,o1), (o1,o2), (o0,o2); {a0,a1,a2}: (o1,o2), (o0,o2); {a0,a1,a2,a3}: (o0,o2) • (figure: the corresponding lexicographical tree)

  45. Coherent Cluster • High expressive power: the coherent cluster can capture many interesting and meaningful patterns overlooked by previous clustering methods • Efficient and highly scalable • Wide applications: gene expression analysis, collaborative filtering • (figure: subspace cluster vs. coherent cluster)

  46. Remark • Compared to the bi-cluster approach: • Separates noise and outliers well • No random data insertion and replacement • Produces the optimal solution

  47. Definition of OP-Cluster • Let I be a subset of genes in the database and let J be a subset of conditions. We say <I, J> forms an Order Preserving Cluster (OP-Cluster) if, for any pair of conditions in J, the same ordering relationship between their expression levels holds for every gene in I; that is, all genes in I rank the conditions in J in the same order • (figure: expression levels across conditions A1–A4)

  48. Problem Statement • Given a gene expression matrix, our goal is to find all statistically significant OP-Clusters. Significance is ensured by the minimum size thresholds nc (number of conditions) and nr (number of genes).

  49. Conversion to a Sequence Mining Problem • Each gene's expression row is converted into a sequence: the conditions sorted by expression level • (figure: expression levels across conditions A1–A4 and the induced condition sequence)
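
  A minimal sketch of the conversion (the example values are made up for illustration):

    def row_to_sequence(values, conditions):
        """Order the condition labels by ascending expression value."""
        return [c for _, c in sorted(zip(values, conditions))]

    # e.g. row_to_sequence([0.8, 0.2, 0.5, 0.9], ["A1", "A2", "A3", "A4"])
    # -> ["A2", "A3", "A1", "A4"]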

  50. Mining OP-Clusters: A Naïve Approach • Enumerate all possible subsequences in a prefix tree • For each subsequence, collect all genes that contain it • Challenge: the total number of distinct subsequences grows exponentially with the number of conditions • (figure: a complete prefix tree with 4 items {a, b, c, d})
