Grid-based Coresets for Clustering Problems

Christian Sohler, Universität Paderborn (joint work with Gereon Frahling).


Presentation Transcript


  1. Grid-based Coresets for Clustering Problems Christian Sohler Universität Paderborn (joint work with Gereon Frahling)

  2. Introduction: Clustering • Clustering • Partition the input into sets (clusters) such that - objects in the same cluster are similar - objects in different clusters are dissimilar • Goal • Simplification • Discovery of patterns • Procedure • Map objects to Euclidean space => point set P • Points in the same cluster are close • Points in different clusters are far away from each other

  3. Introduction: k-means clustering • Clustering with prototypes • One prototype (center) for each cluster • k-median clustering • k clusters $C_1, \dots, C_k$ • One center $c_i$ for each cluster $C_i$ • Minimize $\sum_{i=1}^{k} \sum_{p \in C_i} d(p, c_i)$

  6. Introduction: Simplification / Lossy Compression • [Figure; labels: (128,59,88), (218,181,163)]

  9. Introduction: Properties of k-means • Simple property of k-median • Point set P • Set of centers C • Best clustering: Assign each point to its nearest center

  13. Introduction: Properties of k-means • Simple property of k-median • Point set P • Set of centers C • Best clustering: Assign each point to its nearest center • Notation: cost(P,C) denotes the cost of the solution defined this way
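The notation cost(P,C) can be made concrete with a few lines of Python. This is a minimal sketch, assuming Euclidean distance and points given as coordinate tuples; the function name `cost` mirrors the slide, while the optional `weights` parameter is an addition for illustration (it anticipates weighted point sets).

```python
import math

def cost(points, centers, weights=None):
    """k-median cost: every point pays its (weighted) distance to the nearest center."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))

P = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
C = [(0.0, 1.0), (10.0, 2.0)]
print(cost(P, C))  # 1 + 1 + 2 + 2 = 6.0
```

Assigning each point to its nearest center is exactly the `min` over `centers`; no explicit cluster labels are needed to evaluate the objective.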

  14. Introduction: Coresets • Definition (Coreset for k-median) [HM04] • A weighted point set S is called an $\varepsilon$-coreset for P if for every set C of k centers we have • $(1-\varepsilon) \cdot \mathrm{cost}(P,C) \le \mathrm{cost}(S,C) \le (1+\varepsilon) \cdot \mathrm{cost}(P,C)$

  15. Introduction: Coresets • Definition (Coreset for k-median) [HM04] • A weighted point set S is called an $\varepsilon$-coreset for P if for every set C of k centers we have • $(1-\varepsilon) \cdot \mathrm{cost}(P,C) \le \mathrm{cost}(S,C) \le (1+\varepsilon) \cdot \mathrm{cost}(P,C)$ • Replace the point set by a few weighted points (red; weights 3, 4, 5, 5, 4 in the figure)
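The coreset inequality can be checked numerically for one fixed set of centers. The sketch below is illustrative only: the helper name `satisfies_coreset_bound` and the toy data are assumptions, and passing the check for a single C does not establish the coreset property, which must hold for every set of k centers.

```python
import math

def cost(points, centers, weights=None):
    """Weighted k-median cost with nearest-center assignment."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))

def satisfies_coreset_bound(P, S, S_weights, C, eps):
    """Check (1-eps)*cost(P,C) <= cost(S,C) <= (1+eps)*cost(P,C) for one center set C."""
    full = cost(P, C)
    approx = cost(S, C, S_weights)
    return (1 - eps) * full <= approx <= (1 + eps) * full

P = [(0.0, 0.0), (0.0, 0.2), (10.0, 0.0), (10.0, 0.2)]
S = [(0.0, 0.1), (10.0, 0.1)]          # one representative per tight pair
S_weights = [2, 2]

far_centers = [(0.0, 5.0), (10.0, 5.0)]
near_centers = [(0.0, 0.1), (10.0, 0.1)]
print(satisfies_coreset_bound(P, S, S_weights, far_centers, 0.1))   # True
print(satisfies_coreset_bound(P, S, S_weights, near_centers, 0.1))  # False
```

The second call fails because the centers sit exactly on the representatives, so cost(S,C) is 0 while cost(P,C) is not. This is why the definition quantifies over all center sets.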

  18. Introduction: Related work • Coresets for clustering problems • k-center, k-median [Badoiu, Indyk, Har-Peled, 2002]: existence of coresets, size independent of dimension • Projective clustering [Har-Peled, Varadarajan, 2002]: existence of coresets for projective clustering, faster algorithms • k-median, k-means [Har-Peled, Mazumdar, 2004]: faster algorithms, data streaming, different definition of coresets • k-median, k-means [Har-Peled, Kushal, 2004]: coresets of constant size • k-median [Chen, 2005]: coreset with size polynomial in dimension • k-median, k-means, MaxCut [Frahling, Sohler, 2005]: 'oblivious' coreset construction, dynamic data streams

  19. Coresets for clustering problems • k-means [Frahling, Sohler, 2006]: efficient implementation • k-line median [Fiat, Feldman, Sharir, 2006]: coresets for low dimensions • k-median, k-means [Feldman, Monemizadeh, Sohler, 2006]: weak coresets; size independent of n and d

  20. Coreset construction: First try • Our approach • Partition the input space into regions • For each region R • Count the number w(R) of points in R • Choose one representative point p from R • Assign weight w(R) to p • Remove all other points from R • Analysis • Moving a point by distance d changes cost(P,C) by at most d • Sum up movement for all regions • Show: overall movement is at most $\varepsilon \cdot \mathrm{cost}(P,C)$

  21. Coreset construction: First try • Our approach • Partition the input space into regions • For each region R • Count the number w(R) of points in R • Choose one representative point p from R • Assign weight w(R) to p • Remove all other points from R • Only question: How to find regions?

  22. Coreset construction: First try • Our approach • Partition the input space into regions • For each region R • Count the number w(R) of points in R • Choose one representative point p from R • Assign weight w(R) to p • Remove all other points from R • First try: Regular grid with width W

  24. Coreset construction: First try • Our approach • Partition the input space into regions • For each region R • Count the number w(R) of points in R • Choose one representative point p from R • Assign weight w(R) to p • Remove all other points from R • First try: Regular grid with width W • Error per cell: • O(W $\cdot$ #points in cell) • $W \le \varepsilon \cdot \mathrm{cost}(P,C)/n$ • Too many cells
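The regular-grid "first try" can be sketched as follows. This is a minimal illustration under assumptions not stated on the slide: the grid is anchored at the origin, the representative is simply the first point seen in a cell, and the function name `grid_coreset` is hypothetical.

```python
import math
from collections import defaultdict

def grid_coreset(points, width):
    """Return one (representative, weight) pair per non-empty cell of a regular grid."""
    cells = defaultdict(list)
    for p in points:
        # Cell index: integer coordinates of the grid cell containing p.
        key = tuple(math.floor(x / width) for x in p)
        cells[key].append(p)
    # Representative: first point seen in the cell; weight w(R): number of points in the cell.
    return [(members[0], len(members)) for members in cells.values()]

points = [(0.1, 0.1), (0.4, 0.3), (0.9, 0.8), (3.2, 3.1), (3.7, 3.9)]
print(grid_coreset(points, 1.0))
# [((0.1, 0.1), 3), ((3.2, 3.1), 2)]
```

Every point moves by at most the cell diameter, so the error per cell scales with W times the number of points in it, which is the O(W · #points in cell) term on the slide.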

  26. Coreset construction: First try • Our approach • Partition the input space into regions • For each region R • Count the number w(R) of points in R • Choose one representative point p from R • Assign weight w(R) to p • Remove all other points from R • Second try: Refine the grid until each cell contains at most R points

  27. Coreset construction: First try • Our approach • Partition the input space into regions • For each region R • Count the number w(R) of points in R • Choose one representative point p from R • Assign weight w(R) to p • Remove all other points from R • Second try: Refine the grid until each cell contains at most R points • Error per cell: • O(cell width $\cdot$ R) • There can be a point at distance Opt • $R \le \varepsilon$ • Too many cells

  28. Coreset construction: Some definitions • Assumptions • Cost Opt of the optimal k-median solution is known • Grid i has cell width $\mathrm{Opt}/2^i$ • O(log n) levels • Definition: • A cell in grid i is called heavy if it contains more than $\delta \cdot 2^i$ points. • A cell that is not heavy is light. • Observation: • "Movement cost" for light cells is $O(\delta \cdot \mathrm{Opt})$ • Construction: • Put a coreset point in every light cell whose parent cell is heavy
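The heavy/light rule can be sketched as a level-by-level refinement. This is a hedged illustration, not the authors' implementation: it assumes Opt is given, grids are nested and anchored at the origin, every cell of grid 0 is treated as having a heavy parent, and the representative is the first point in the cell. The name `heavy_light_coreset` is hypothetical.

```python
import math
from collections import defaultdict

def heavy_light_coreset(points, opt, delta):
    """One weighted point per light cell whose parent cell is heavy (sketch)."""
    coreset = []
    active = list(points)   # points whose enclosing cell in the previous grid was heavy
    level = 0
    while active:
        width = opt / 2 ** level            # grid i has cell width Opt / 2^i
        cells = defaultdict(list)
        for p in active:
            cells[tuple(math.floor(x / width) for x in p)].append(p)
        active = []
        for members in cells.values():
            if len(members) > delta * 2 ** level:   # heavy: refine in the next grid
                active.extend(members)
            else:                                   # light with heavy parent: emit
                coreset.append((members[0], len(members)))
        level += 1
    return coreset

pts = [(1.0, 1.0), (1.2, 1.1), (1.1, 1.3), (1.3, 1.2), (6.5, 6.5)]
print(heavy_light_coreset(pts, opt=8.0, delta=0.5))
# [((6.5, 6.5), 1), ((1.0, 1.0), 4)]
```

The loop terminates because the heaviness threshold δ·2^i doubles per level and eventually exceeds n, which matches the O(log n) levels on the slide. A light cell in grid i holds at most δ·2^i points, each moved by at most the cell diameter (proportional to Opt/2^i), so its movement cost is O(δ·Opt) for fixed dimension.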

  29. Coreset construction: The algorithm • Computation of coreset points • [Figure sequence, slides 29 to 36: illustration of the construction; only the label Opt and numeric cell labels survive in the transcript]

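  37. Coreset construction: Analysis • Coreset size $\le 2^d \cdot$ #heavy cells • [Figure label: $(1/\varepsilon) \cdot$ cell width]

  38. Coreset construction: Analysis • Coreset size $\le 2^d \cdot$ #heavy cells • [Figure label: $(1/\varepsilon) \cdot$ cell width] • Number of "inner" heavy cells per grid: • $k/\varepsilon^d$ (volume argument)

  39. Coreset construction: Analysis • Coreset size $\le 2^d \cdot$ #heavy cells • Contribution of an "outer" heavy cell $\ge (\delta/\varepsilon) \cdot \mathrm{cost}(P,C)$ • Number of outer heavy cells per grid: $\varepsilon/\delta$ • [Figure label: $(1/\varepsilon) \cdot$ cell width] • Number of "inner" heavy cells per grid: • $k/\varepsilon^d$ (volume argument)

  40. Coreset construction: Analysis • Coreset size $\le O(\log n \cdot (\varepsilon/\delta + k/\varepsilon^d))$ • Contribution of an "outer" heavy cell $\ge (\delta/\varepsilon) \cdot \mathrm{cost}(P,C)$ • Number of outer heavy cells per grid: $\varepsilon/\delta$ • [Figure label: $(1/\varepsilon) \cdot$ cell width] • Number of "inner" heavy cells: • $k/\varepsilon^d$ (volume argument)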
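The size bound can be assembled from the per-grid counts stated on these slides. The following is a sketch of that arithmetic, assuming the dimension $d$ is a constant (so $2^d$ is absorbed into the O-notation) and writing $H_i$ for the number of heavy cells in grid $i$, a symbol introduced here for illustration.

```latex
\begin{align*}
\#\text{outer heavy cells in grid } i \;\cdot\; \frac{\delta}{\varepsilon}\,\mathrm{cost}(P,C)
  &\le \mathrm{cost}(P,C)
  \quad\Longrightarrow\quad
  \#\text{outer heavy cells in grid } i \le \frac{\varepsilon}{\delta} \\
H_i &\le \frac{\varepsilon}{\delta} + O\!\left(\frac{k}{\varepsilon^{d}}\right) \\
|S| \;\le\; 2^{d} \sum_{i=0}^{O(\log n)} H_i
  &= O\!\left( \log n \cdot \left( \frac{\varepsilon}{\delta} + \frac{k}{\varepsilon^{d}} \right) \right)
\end{align*}
```

The first line uses that the cells of one grid are disjoint, so their contributions sum to at most cost(P,C). The factor $2^d$ accounts for the children of each heavy cell, since coreset points sit in light cells whose parent is heavy.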

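  41. Coreset construction: Analysis • Coreset size $\le O(\log n \cdot (\varepsilon/\delta + k/\varepsilon^d))$ • [Figure label: $(1/\varepsilon) \cdot$ cell width]

  42. Coreset construction: Analysis • Coreset size $\le O(\log n \cdot (\varepsilon/\delta + k/\varepsilon^d))$ • [Figure label: $(1/\varepsilon) \cdot$ cell width] • Outer cells: Movement can be charged to contribution $\Rightarrow$ overall cost $\varepsilon \cdot \mathrm{cost}(P,C)$

  43. Coreset construction: Analysis • Coreset size $\le O(\log n \cdot (\varepsilon/\delta + k/\varepsilon^d))$ • Inner cells: Cost per cell $\delta \cdot \mathrm{Opt}$ • [Figure label: $(1/\varepsilon) \cdot$ cell width] • Outer cells: Movement can be charged to contribution $\Rightarrow$ overall cost $\varepsilon \cdot \mathrm{cost}(P,C)$

  44. Coreset construction: Analysis • Coreset size $\le O(\log n \cdot (\varepsilon/\delta + k/\varepsilon^d))$ • Inner cells: Cost per cell $\delta \cdot \mathrm{Opt}$ • [Figure label: $(1/\varepsilon) \cdot$ cell width] • #inner cells $\le k/\varepsilon^d$ $\Rightarrow$ $\delta = \varepsilon^{d+1} / \log n$ • Outer cells: Movement can be charged to contribution $\Rightarrow$ overall cost $\varepsilon \cdot \mathrm{cost}(P,C)$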

  45. Coreset Summary • Theorem • Our construction gives a coreset of size $O(k \log n / \varepsilon^d)$ • Dynamic geometric data streams • Stream of Insert(p)/Delete(p) operations; $p \in \{1,\dots,\Delta\}^d$ • Stream consistent: no Delete(p) if p is not in the current set • Algorithm • Output: Set of k centers • Maintains coreset • Computes centers from the coreset using a $(1+\varepsilon)$-approximation algorithm

  46. Streaming: Coreset maintenance • How to maintain the coreset • A $(1+\varepsilon)$-approximation of the number of points in heavy cells is sufficient • For grids with cell width $\mathrm{Opt}/2^i$ we need an approximation for all cells with more than $\delta \cdot 2^i$ points • Solution • Uniform random sampling will do • Reason • Size of grid cell imposes restriction on distribution • Sample hits only few cells • So, small space suffices
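Sampling-based cell counting under insertions and deletions can be sketched as below. The slide only states that uniform random sampling suffices; deciding membership in the sample by hashing the point (so that Delete(p) sees the same decision as Insert(p)) is an assumption of this sketch, as are the class name `SampledCellCounter` and its parameters. Points are assumed to be integer tuples.

```python
import hashlib
from collections import Counter

class SampledCellCounter:
    """Estimate per-cell point counts of one grid from a hash-based sample (sketch)."""

    def __init__(self, width, rate):
        self.width = width      # cell width of this grid (integer)
        self.rate = rate        # sampling probability in (0, 1]
        self.counts = Counter() # sampled points per cell

    def _sampled(self, p):
        # Deterministic pseudo-random decision: the same point always gets the same answer.
        h = hashlib.sha256(repr(p).encode()).digest()
        return int.from_bytes(h[:8], "big") < self.rate * 2 ** 64

    def _cell(self, p):
        return tuple(x // self.width for x in p)

    def insert(self, p):
        if self._sampled(p):
            self.counts[self._cell(p)] += 1

    def delete(self, p):
        if self._sampled(p):
            self.counts[self._cell(p)] -= 1

    def estimate(self, cell):
        # Scale the sampled count back up by the sampling rate.
        return self.counts[cell] / self.rate

counter = SampledCellCounter(width=4, rate=1.0)
for p in [(1, 1), (2, 3), (9, 9)]:
    counter.insert(p)
counter.delete((2, 3))
print(counter.estimate((0, 0)))  # 1.0 (rate 1.0 keeps every point, so the count is exact)
```

With a rate below 1 the counter stores only the cells hit by the sample, which is where the space saving on the slide comes from; the accuracy of `estimate` then depends on the cell being heavy enough relative to the rate.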

  47. Conclusions • Summary • Streaming algorithm for insertions and deletions • Maintains coreset • Computes a $(1+\varepsilon)$-approximation from the coreset • Some more progress on… • High-dimensional dynamic data streams • Sliding window model (low-dimensional)

  48. Thank you! Christian Sohler Heinz Nixdorf Institut & Institut für Informatik Universität Paderborn Fürstenallee 11 33102 Paderborn, Germany Tel.: +49 (0) 52 51/60 64 27 Fax: +49 (0) 52 51/62 64 82 E-Mail: csohler@upb.de http://www.upb.de/cs/ag-madh
