
Applying Electromagnetic Field Theory Concepts to Clustering with Constraints




Presentation Transcript


  1. Applying Electromagnetic Field Theory Concepts to Clustering with Constraints Huseyin Hakkoymaz, Georgios Chatzimilioudis, Dimitrios Gunopulos and Heikki Mannila

  2. Motivation
  • Well-known problem, the Curse of Dimensionality:
    • As the # of dimensions increases, distance metrics start losing their discriminative power
  • Relative distances
    • Unlike exact distances, relative distance-based metrics have some immunity to the curse
    • Shortest paths are calculated using the edges of a graph
  • Local distance adjustments
    • In many domains, local changes affect the whole system
    • Cancer cells in the body, sensor depletion in a network, etc.
    • The same idea holds for distance metrics
    • Relative distances supported by pairwise constraints perform much better
    • Constraints cause changes in local distances
  (Figures: (a) change of a unit shape as dimensionality increases; (b) the distance matrix becomes useless as the number of dimensions keeps increasing)

  3. Motivation (2)
  • The best environment to realize these objectives is a GRAPH
    • The graph is treated as an ElectroMagnetic Field (EMF)
    • Pairwise constraints are expressed naturally
    • Constraints act as EMF sources exerting force over edges
    • The force causes reduction or escalation of edge weights
  • No limitation on the reduction/escalation amount, thanks to the graph domain
    • Cartesian-space metrics are bounded by the triangle inequality
  (Figure: a negative constraint acting on the graph)

  4. Related Work
  • Distance metric learning [Xing et al.]:
    • Global linear transformation of the data points
    • Different weights for each dimension
    • Shortcomings:
      • May fail in some cases where the plain Euclidean distance performs better
  • Integrating constraints and metric learning in semi-supervised clustering [Bilenko et al.]:
    • Local weights for each cluster
    • Readjustment of weights at each iteration
    • Combines constraints and metric learning in the objective function
    • Shortcomings:
      • Sometimes fails to adjust weights locally
      • No guarantee of better accuracy with more constraints
  (Figure: K-Means vs. KMeans + distance metric (w1x = w1y, w2x = w2y) vs. MPCK-Means (w1x > w1y, w2x < w2y))

  5. Related Work
  • Semi-supervised Graph Clustering: A Kernel Approach [Kulis et al.]:
    • Maps data points into a new feature space
    • Similarity between the Kernel-KMeans and graph-clustering objectives
    • Works for both vector and graph data
    • Shortcomings:
      • An optimal kernel is required for good results
      • Time to compute the optimal kernel is high
      • Relies mostly on the min-cut objective, not on distance
  (Figure: correct clustering vs. the SS-Kernel-KMeans approach)

  6. Magnetically Affected Paths (MAP)
  • Two special edge types for constraints:
    • Positive edge: must-link constraints
    • Negative edge: cannot-link constraints
  • Definitions:
    • Reduction ratio: amount of decrement in an edge weight (+)
    • Escalation ratio: amount of increment in an edge weight (-)
  (Figure: positive edges and a negative edge in the constraint graph)

  7. Magnetically Affected Paths (MAP)
  • Each constraint edge affects regular edges based on:
    • Constraint type
    • Vertical distance (vd): distance to the constraint axis
    • Horizontal distance (hd): distance to the mid-point of the constraint axis
  • Vertical and horizontal effects follow a probabilistic model:
    • If vd increases, the effect decreases for both (+) and (-) constraints
    • If hd increases, the effect decreases for (-) constraints
    • hd has no effect on (+) constraints
  (Figure: constraint axis s-t with its midpoint, showing the vertical distance vd(u,v) and the horizontal distance hd(u,v) of edge e(u,v), and the effect as a function of vd and hd)

  8. Magnetically Affected Paths (MAP)
  • Compute the escalation/reduction ratio contributed by each constraint (a sketch follows this slide), where:
    • r = vertical distance effect
    • ∆ = horizontal distance effect
    • qe = weight of a cannot-link constraint
    • qr = weight of a must-link constraint
    • Typically, qe/qr ≈ 1.6
  (Figure: constraint s-t acting on edge w(u,v))
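
A minimal sketch of the per-constraint effect model described above, assuming an exponential decay for the vertical and horizontal effects; the exact formulas for r, ∆, qe and qr are defined in the paper, and the function name, decay shapes and sigma parameters below are illustrative assumptions only.

```python
import math

Q_E = 1.6   # assumed weight of a cannot-link constraint, so that Q_E / Q_R ≈ 1.6
Q_R = 1.0   # assumed weight of a must-link constraint

def constraint_effect(vd, hd, positive, sigma_v=1.0, sigma_h=1.0):
    """Effect of a single constraint on an edge e(u,v).

    vd: vertical distance of the edge to the constraint axis
    hd: horizontal distance to the midpoint of the constraint axis
    positive: True for a must-link (+) constraint, False for a cannot-link (-)
    """
    r = math.exp(-vd / sigma_v)        # vertical effect: decays with vd for both types
    if positive:
        return Q_R * r                 # hd has no effect on (+) constraints (reduction)
    delta = math.exp(-hd / sigma_h)    # horizontal effect: decays with hd for (-) constraints
    return Q_E * r * delta             # contribution to the escalation ratio
```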

  9. Magnetically Affected Paths (MAP)
  • Compute the overall escalation/reduction ratio on an edge (a sketch follows this slide)
    • The overall effect on an edge is quantified as the total effect of all constraints
  • Multiply the overall ratio α (1 < α < ∞) by the edge weight to assign the new edge weight
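
Continuing the sketch above, one way to aggregate the effects of all constraints on a single edge and rescale its weight; how the escalation and reduction ratios are combined into the factor α is specified in the paper, and the simple ratio used here is only an assumption.

```python
def reweight_edge(weight, constraints):
    """constraints: list of (vd, hd, is_positive) tuples for the constraints affecting this edge."""
    escalation = sum(constraint_effect(vd, hd, positive=False)
                     for vd, hd, pos in constraints if not pos)
    reduction = sum(constraint_effect(vd, hd, positive=True)
                    for vd, hd, pos in constraints if pos)
    alpha = (1.0 + escalation) / (1.0 + reduction)   # assumed combination rule
    return weight * alpha                            # new (escalated or reduced) edge weight
```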

  10. EMC (ElectroMagnetic Field Based Clustering) Framework
  • A 3-step clustering framework:
    • Graph construction
    • Readjustment of edge weights
    • Clustering process

  11. EMC (ElectroMagnetic Field Based Clustering) Framework
  • Graph construction (a sketch follows this slide)
    • Select the n nearest neighbors of each object
    • Connect each object to its neighborhood, using the Euclidean distance as the edge weight
    • If the graph is not connected, add new edges between the disconnected components
  • Readjustment of edge weights
    • Apply the MAP concept to the graph
    • All (+) and (-) edges are applied before the clustering step
    • Extract a new affinity matrix using the new edge weights
  • Employ the k-shortest-path distance as the distance metric
    • Better than a single shortest path
    • Can utilize MAP better
    • Very slow for large graphs
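
A minimal sketch of the graph-construction step, assuming the data points are rows of a NumPy array; the neighbor count and the exact rule for reconnecting disconnected components are assumptions, so the reconnection step is only noted in a comment.

```python
import numpy as np

def build_knn_graph(points, n_neighbors=5):
    """n-nearest-neighbor graph with Euclidean distances as edge weights."""
    m = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    weights = {}  # (u, v) with u < v  ->  Euclidean edge weight
    for u in range(m):
        for v in np.argsort(dist[u])[1:n_neighbors + 1]:   # skip u itself at index 0
            v = int(v)
            weights[(min(u, v), max(u, v))] = float(dist[u, v])
    # If the resulting graph is disconnected, the framework adds new edges
    # between the disconnected components (omitted here for brevity).
    return weights, dist
```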

  12. EMC (ElectroMagnetic Field Based Clustering) Framework
  • Clustering process (a sketch follows this slide)
    • Run a clustering algorithm on the new affinity matrix
    • Any clustering algorithm compatible with graphs can be used:
      • K-Means
      • Hierarchical
      • SS-Kernel-KMeans, etc.
  • We used the K-Medoids and hierarchical clustering algorithms
    • Since their results are similar, we report only the K-Medoids results
  • A small number of constraints improves accuracy significantly
    • Other algorithms need more constraints to achieve the same performance
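
A minimal K-Medoids sketch that works directly on the precomputed distance (affinity) matrix, since the framework only needs a graph-compatible algorithm; this PAM-style loop is illustrative and assumes `dist` is a symmetric NumPy array of pairwise k-shortest-path distances, not the authors' exact implementation.

```python
import numpy as np

def k_medoids(dist, k, n_iter=100, seed=0):
    """K-Medoids on a precomputed pairwise distance matrix."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(dist), size=k, replace=False)
    labels = np.argmin(dist[:, medoids], axis=1)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)   # assign each node to its nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members):
                # pick the member minimizing the total distance within the cluster
                costs = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):       # converged
            break
        medoids = new_medoids
    return labels, medoids
```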

  13. Two improvements for k-shortest paths
  • K-SD shortest path algorithm (a sketch follows this slide)
    • Based on Dijkstra's algorithm
    • Each vertex keeps k distance entries
    • Paths are distinct (two paths cannot share an edge)
    • Only k times slower than Dijkstra's algorithm
  • Divide-and-conquer approach (multilevel approach)
    • Partition the graph using multilevel graph partitioning
      • Kmetis: partitions large graphs into equal-sized subgraphs
      • Very fast (takes just a few seconds to partition very large graphs)
    • Identify hubs
      • Nodes residing on the boundary of a partition
      • Connected to at least two partitions
      • The only way from one partition to the next
  (Figure: hubs between two partitions)
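
An illustrative sketch of a k edge-disjoint shortest-path distance. The K-SD algorithm on this slide keeps k distance entries per vertex inside a single modified Dijkstra run; the simpler "find a path, remove its edges, repeat" loop below only approximates that idea, and averaging the k path lengths into one value is an assumption.

```python
import heapq

def dijkstra(graph, src):
    """Standard Dijkstra on graph = {u: {v: weight}}, assumed undirected."""
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    return dist, prev

def k_sd_distance(graph, src, dst, k=3):
    """Approximate k edge-disjoint shortest-path distance between src and dst."""
    g = {u: dict(nbrs) for u, nbrs in graph.items()}
    total, found = 0.0, 0
    for _ in range(k):
        dist, prev = dijkstra(g, src)
        if dst not in dist:
            break
        total += dist[dst]
        found += 1
        v = dst                              # remove the edges of this path so paths stay distinct
        while v != src:
            u = prev[v]
            g[u].pop(v, None); g[v].pop(u, None)
            v = u
    # combine the distances of the paths actually found (assumed: simple average)
    return total / found if found else float("inf")
```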

  14. Two improvements for k-shortest paths
  • Divide-and-conquer approach (cont.)
    • Extract a distance matrix for each partition
    • Merge the distance matrices using the hubs (a sketch follows this slide)
    • At least 20 times faster than the original K-SD shortest path algorithm
    • Applicable to very large graphs
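
A hedged sketch of how per-partition distance matrices can be merged through hubs, following the later slides where SHub provides the transition between the hubs of two partitions; the data layout (dicts keyed by node id) and the function name are assumptions for illustration, only the min-over-hubs composition follows the slides.

```python
def cross_partition_distance(u, v, part, d_intra, hubs, s_hub):
    """Distance between node u (in partition part[u]) and node v (in partition part[v]).

    part:    node id -> partition id
    d_intra: partition id -> local distance matrix, d_intra[p][a][b]
    hubs:    partition id -> list of hub node ids on that partition's boundary
    s_hub:   hub-to-hub distance matrix (the "SHub" matrix), s_hub[a][b]
    """
    pu, pv = part[u], part[v]
    if pu == pv:
        return d_intra[pu][u][v]             # both endpoints in the same partition
    best = float("inf")
    for a in hubs[pu]:                       # leave u's partition through hub a
        for b in hubs[pv]:                   # enter v's partition through hub b
            best = min(best, d_intra[pu][u][a] + s_hub[a][b] + d_intra[pv][b][v])
    return best
```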

  15. Divide-and-Conquer Approach

  16. Constructing Hub graph and extracting SHub matrix

  17. Constructing Hub graph and extracting SHub matrix (Figure: the SHub matrix)

  18. Computing the K-SD shortest path distance

  19. Computing the K-SD shortest path distance
  • Update distances from the first partition's node 1 to the second partition's hubs through the first hub
  • SHub is used for the transition from the first partition's hubs to the second partition's hubs

  20. Computing the K-SD shortest path distance
  • Update distances from the first partition's node 1 to the second partition's hubs through the second hub
  • SHub is used for the transition from the first partition's hubs to the second partition's hubs

  21. Computing the K-SD shortest path distance
  • Update distances from the first partition's node 1 to the second partition's hubs through the last hub
  • SHub is used for the transition from the first partition's hubs to the second partition's hubs

  22. Computing the K-SD shortest path distance
  • Update distances from the second partition's nodes to the first partition's node 1 through the second partition's hubs
  • At this moment, all of the second partition's hubs have their distances to the first partition's node 1
  • SHub is used for the transition from the first partition's hubs to the second partition's hubs

  23. Experiments • Implemented in Java and Matlab • Synthetic and real datasets • Datasets from UCI Machine Learning Repository: • Soybean, Iris, Wine, Ionosphere, Balance, Breast cancer, Satellite

  24. Experiments
  • EMCK-Means experiments:
    • Graph construction: varied the # of paths and the # of nearest neighbors
    • Readjustment phase: the number of constraints is increased in steps of 10%·|Dataset|
  • Algorithms compared:
    • MPCK-Means: unifies distance-based and metric-based approaches
    • Diagonal Metric: learns a distance metric with weighted dimensions
    • EMCK-Means: MAP implementation with K-Medoids
    • SS-Kernel-KMeans: performs graph clustering based on the min-cut objective
  • Experimental setup (a sketch of the constraint generation follows this slide):
    • The same constraint sets are used for each algorithm
    • Constraints are chosen at random: x%·N, where N is the dataset size
    • Each algorithm is run 200 times
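
A hedged sketch of drawing random pairwise constraints from labeled data for the experiments: a sampled pair becomes a must-link constraint if the two points share a class label and a cannot-link constraint otherwise; the authors' exact sampling procedure may differ, and the function name is an assumption.

```python
import random

def sample_constraints(labels, fraction, seed=0):
    """Draw fraction·N random pairwise constraints from class labels."""
    rng = random.Random(seed)
    n = len(labels)
    n_constraints = min(int(fraction * n), n * (n - 1) // 2)   # e.g. fraction=0.10 for 10%·N
    must_link, cannot_link = [], []
    seen = set()
    while len(seen) < n_constraints:
        i, j = rng.sample(range(n), 2)
        pair = (min(i, j), max(i, j))
        if pair in seen:
            continue
        seen.add(pair)
        # same label -> must-link (+), different labels -> cannot-link (-)
        (must_link if labels[i] == labels[j] else cannot_link).append(pair)
    return must_link, cannot_link
```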

  25. Experiments
  • Clustering results for EMCK-Means on:
    • the Wine, Balance, Breast Cancer, Ionosphere, Iris and Soybean datasets
  • The number of shortest paths is varied from 5 to 20

  26. Comparison of Algorithms
  • Comparison of EMC, MPCK-Means, KMeans + diagonal metric and SS-Kernel-KMeans
    • EMC outperforms the others on Iris, Balance and Ionosphere
    • Reasonable gains on Soybean and Breast Cancer
    • Almost no gain at all on Wine

  27. Conclusions
  • The EMC framework offers flexible and more accurate clustering in the graph domain
    • Other clustering algorithms can be integrated into the framework
    • A small number of constraints improves accuracy significantly
    • More constraints can be applied at any time
    • Running time decreases significantly as we increase the # of partitions, p
  • Future work
    • Multilevel EMC: coarsen the graph, perform clustering, then refine
    • Runs much faster than the other algorithms without any significant change in accuracy
    • No hubs or merge process needed

  28. Thank you!
