
Algorithms for geometric data streams 

Christian Sohler, TU Dortmund

Presentation Transcript


  1. Algorithms for geometric data streams. Christian Sohler, TU Dortmund

  2. Introduction Data streams • Massive data set arriving sequentially • Different ways of "arriving" Examples • Network traffic • Query logs • … Approach • Find algorithms that make a single pass (or a few passes) and process the data sequentially

  3. Introduction Geometric data streams • Massive sets of geometric objects arriving sequentially • Objects are typically points • Different forms of arrival: a sequence of points, or a sequence of updates Questions • Find ways to analyze the geometric structure of the input data using small space

  4. Introduction Motivation • Many computational tasks can be interpreted geometrically • Geometric features may be useful in learning and classification • Geometry plays an important role in the application Examples • Learning • Clustering • How "clusterable" is a data set? • Road traffic prediction

  5. Introduction A basic learning problem • We have two classes of objects

  7. Introduction A basic learning problem • We have two classes of objects • We are given examples from both classes

  9. Introduction A basic learning problem • We have two classes of objects • We are given examples from both classes • Learn from the examples to which class future objects belong

  10. Introduction A basic learning problem • We have two classes of objects • We are given examples from both classes • Learn from the examples to which class future objects belong • Map each object's description to Euclidean space

  11. Introduction A basic learning problem • We have two classes of objects • We are given examples from both classes • Learn from the examples to which class future objects belong • Map each object's description to Euclidean space SVM approach • Compute the maximum margin hyperplane • Classify points according to the side of the hyperplane they lie on

  12. Introduction SVM and SEB (smallest enclosing balls) • The dual of a certain SVM formulation is an SEB problem [Tax, Duin, Pattern Recognition Letters, '99] • Geometric streaming SEB can be used as an SVM heuristic [Rai, Daume III, Venkatasubramanian, IJCAI'09] • Also: coresets have been used to construct CSVMs [Tsang, Kwok, Cheung, Journal of Machine Learning Research, '05]

  13. Introduction Outline • Merge & Reduce • Embeddings into tree metrics • Estimation of distribution of local neighborhoods • Balanced partitions • Approximating properties of balanced partitions

  14. Merge & Reduce Insertion-only streams • Sequence of points $p_1, \dots, p_n$ from $\mathbb{R}^d$

  15. Merge & Reduce Definition [k-median clustering] Given a weighted set $P$ of points in $\mathbb{R}^d$, the k-median problem is to find a set $C \subseteq \mathbb{R}^d$ of $k$ points (centers) such that $\mathrm{cost}(P,C) = \sum_{p \in P} w_p \min_{c \in C} \|p-c\|$ is minimized, where $w_p > 0$ is the weight of point $p$.
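The cost function in this definition can be evaluated directly. The following is a minimal sketch, assuming points and centers are given as coordinate tuples; the function name `kmedian_cost` and the example data are illustrative and not from the talk.

```python
import math

def kmedian_cost(points, weights, centers):
    """Weighted k-median cost: sum over p of w_p * distance to the nearest center."""
    total = 0.0
    for p, w in zip(points, weights):
        total += w * min(math.dist(p, c) for c in centers)
    return total

# Illustrative data (assumption): three weighted points, k = 2 centers.
P = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
w = [1.0, 1.0, 3.0]
C = [(1.0, 0.0), (10.0, 0.0)]
print(kmedian_cost(P, w, C))
```

Evaluating the cost for a fixed C is the easy part; the k-median problem asks for the C that minimizes it.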

  16. Merge & Reduce Coreset [Har-Peled, Mazumdar, STOC'04] A weighted point set $S$ is a $(k,\varepsilon)$-coreset of a weighted point set $P$ if for every set $C$ of $k$ centers $|\mathrm{cost}(P,C) - \mathrm{cost}(S,C)| \le \varepsilon \cdot \mathrm{cost}(P,C)$. [Figure: a point set and a weighted summary of it]
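The coreset inequality quantifies over every set of k centers, so code can only spot-check it on candidate center sets. The sketch below does exactly that; `coreset_holds_on` and the example data are assumptions made for illustration. The example collapses duplicate points into weighted representatives, which preserves the cost exactly up to floating-point rounding.

```python
import math
import random

def cost(points, weights, centers):
    """Weighted k-median cost of `centers` on the weighted point set."""
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))

def coreset_holds_on(P, wP, S, wS, candidate_center_sets, eps):
    """Check |cost(P,C) - cost(S,C)| <= eps * cost(P,C) for the given C only.

    Passing this check does not prove the coreset property, which must hold
    for every C; failing it disproves the property.
    """
    for C in candidate_center_sets:
        full = cost(P, wP, C)
        if abs(full - cost(S, wS, C)) > eps * full:
            return False
    return True

rng = random.Random(0)
P = [(0.0, 0.0)] * 3 + [(5.0, 5.0)] * 4      # 7 unit-weight points
wP = [1.0] * len(P)
S = [(0.0, 0.0), (5.0, 5.0)]                 # 2 weighted representatives
wS = [3.0, 4.0]
trials = [[(rng.uniform(0, 5), rng.uniform(0, 5)) for _ in range(2)]
          for _ in range(100)]
ok = coreset_holds_on(P, wP, S, wS, trials, eps=1e-9)
print(ok)
```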

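  17. Merge & Reduce Observation • The union of two $(k,\varepsilon)$-coresets (of two point sets) is a $(k,\varepsilon)$-coreset of the union of the point sets • Can compute a coreset of a coreset [Figure: the input stream split into consecutive blocks]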

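  18. Merge & Reduce Observation • The union of two $(k,\varepsilon)$-coresets (of two point sets) is a $(k,\varepsilon)$-coreset of the union of the point sets • Can compute a coreset of a coreset [Figure: a coreset computed for a block of the input stream]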

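  21. Merge & Reduce Observation • The union of two $(k,\varepsilon)$-coresets (of two point sets) is a $(k,\varepsilon)$-coreset of the union of the point sets • Can compute a coreset of a coreset [Figure: a coreset of the union of two coresets, above the input stream]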
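The two observations suggest the usual merge-and-reduce bookkeeping, which works like binary addition: each block of the stream is summarized, and two summaries on the same level are merged and reduced into one summary on the next level. The sketch below is an assumption-laden illustration: `MergeReduce` and `toy_reduce` are hypothetical names, and `toy_reduce` is a uniform subsample that only stands in for a real coreset construction. Each reduction step can add error, so the error parameter used per level typically has to shrink with the number of levels.

```python
import random

def toy_reduce(weighted_points, size, rng):
    """Placeholder reduction: uniform sample, reweighted to keep the total weight.
    NOT a real coreset construction; it only stands in for one."""
    if len(weighted_points) <= size:
        return list(weighted_points)
    total = sum(w for _, w in weighted_points)
    sample = rng.sample(weighted_points, size)
    return [(p, total / size) for p, _ in sample]

class MergeReduce:
    def __init__(self, block_size, reduce_fn):
        self.block_size = block_size
        self.reduce_fn = reduce_fn
        self.buffer = []   # raw points not yet summarized
        self.levels = {}   # level -> one summary, a list of (point, weight)

    def insert(self, p):
        self.buffer.append((p, 1.0))
        if len(self.buffer) == self.block_size:
            self._carry(self.reduce_fn(self.buffer), 0)
            self.buffer = []

    def _carry(self, summary, level):
        # Two summaries on one level are merged into one on the next level.
        while level in self.levels:
            merged = self.levels.pop(level) + summary   # union of two summaries
            summary = self.reduce_fn(merged)            # summary of the union
            level += 1
        self.levels[level] = summary

    def summary(self):
        out = list(self.buffer)
        for s in self.levels.values():
            out.extend(s)
        return out

rng = random.Random(1)
mr = MergeReduce(block_size=8, reduce_fn=lambda pts: toy_reduce(pts, 8, rng))
for _ in range(1000):
    mr.insert((rng.random(), rng.random()))
total_weight = sum(w for _, w in mr.summary())
print(len(mr.levels), len(mr.summary()), total_weight)
```

With n points and block size m, at most about log2(n/m) + 1 summaries are stored at any time.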

  28. Merge & Reduce Coresets by pre-clustering [Guha, Mishra, Motwani, O'Callaghan, FOCS'00; Har-Peled, Mazumdar, STOC'04; Frahling, S., STOC'05] • Compute a pre-clustering $S$ with more than $k$ centers and $\mathrm{cost}(P,S) \le \varepsilon \cdot \mathrm{Opt}_k$ • Size exponential in $d$ [Figure: a point set and a weighted summary of it]

  29. Merge & Reduce Coresets by sampling [Chen, SICOMP'09; Feldman, Monemizadeh, S., SoCG'07] • Compute a random non-uniform sample • Show that the sample approximates all solutions from a net • Size polynomial in $d$

  30. Merge & Reduce Coresets by reduction to 1D [Har-Peled, Kushal, DCG'07; Feldman, Fiat, Sharir, FOCS'06] • Uses geometric arguments to solve the 1D case • Combine with pre-clustering using line centers • For k-median: size independent of $n$ (but exponential in $d$)

  31. Merge & Reduce Open problems • Coresets for k-median of size independent of $n$ and $d$? (Partial result in [Feldman, Monemizadeh, S., SoCG'07]) • Coresets for k-median of size $O(d/\varepsilon^2)$ • Coresets for k-median of size $\mathrm{poly}(d, \log n)/\varepsilon^{2-c}$ for constant $c = c(d) > 0$ • Coresets for j-subspace 1-median of size $\mathrm{poly}(\varepsilon, d, j, \log n)$? • Same questions for the k-means objective function Remark: The open questions refer to the definition of coresets from this talk.

  32. Geometric update streams Insertion/deletion model • Stream consists of Insert(p), Delete(p) operations • Points are from $\{1, \dots, D\}^d$ • Stream is consistent, i.e. there is no Delete(p) if p is not present and no Insert(p) if p is already present in the current set
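The consistency condition can be made concrete by replaying a stream. This is a minimal sketch under assumptions: operations are encoded as `("insert", p)` and `("delete", p)` pairs, `delta` stands for the coordinate bound written D on the slide, and `apply_updates` is a hypothetical helper. It stores the whole current set, which a small-space streaming algorithm cannot afford; it only illustrates the model.

```python
def apply_updates(stream, delta, dim):
    """Replay Insert/Delete operations; raise ValueError on an inconsistent stream."""
    current = set()
    for op, p in stream:
        if len(p) != dim or not all(1 <= x <= delta for x in p):
            raise ValueError(f"point {p} is outside {{1..{delta}}}^{dim}")
        if op == "insert":
            if p in current:
                raise ValueError(f"Insert({p}) but the point is already present")
            current.add(p)
        elif op == "delete":
            if p not in current:
                raise ValueError(f"Delete({p}) but the point is not present")
            current.remove(p)
        else:
            raise ValueError(f"unknown operation {op!r}")
    return current

stream = [("insert", (1, 2)), ("insert", (3, 4)), ("delete", (1, 2))]
print(apply_updates(stream, delta=4, dim=2))
```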

  33-38. Streaming algorithms via embeddings into tree metrics Embeddings in tree metrics [Figure, built up over slides 33 to 38: points p, q, r, s, t in the plane and a tree over them whose levels are labeled $2^i$, $2^{i-1}$, $2^{i-2}$]

  39. Streaming algorithms via embeddings into tree metrics Embeddings in tree metrics $D(\cdot,\cdot)$ • $\|p-q\| \le D(p,q)$ • $E[D(p,q)] = O(\log D) \cdot \|p-q\|$ [Bartal, FOCS'96; Charikar, Chekuri, Goel, Guha, Plotkin, FOCS'98]
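One common way to obtain such a tree metric is a hierarchy of nested grids with a random shift. The sketch below is an illustration under assumptions, not necessarily the construction from the talk: points have non-negative integer coordinates, the level-j grid has cell side 2^j, and the edge between a level-j cell and its parent is given weight sqrt(d) * 2^j, chosen here so that the tree distance never falls below the Euclidean distance. The name `tree_distance` and the variable `delta` (the coordinate bound) are hypothetical.

```python
import math
import random

def tree_distance(p, q, shift):
    """Distance between integer points p and q in the tree of nested shifted grids."""
    if p == q:
        return 0.0
    d = len(p)
    level = 0
    # Find the first level whose grid (cell side 2^level) puts p and q in one cell.
    while any((a + s) >> level != (b + s) >> level
              for a, b, s in zip(p, q, shift)):
        level += 1
    # Assumed edge weights: sqrt(d) * 2^j between a level-j cell and its parent.
    # The tree path climbs from level 0 to `level` and descends again.
    return 2.0 * math.sqrt(d) * (2 ** level - 1)

rng = random.Random(0)
delta, dim = 1024, 2
shift = tuple(rng.randrange(delta) for _ in range(dim))
p, q = (3, 7), (4, 7)
print(tree_distance(p, q, shift), math.dist(p, q))
```

The lower bound holds for every shift; the upper bound on the slide holds only in expectation over the random shift, because two nearby points can be separated by a grid line at a high level.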

  40. Streaming algorithms via embeddings into tree metrics Estimator for the cost of the Euclidean minimum spanning tree (EMST) [Indyk, STOC'04] • Write $\mathrm{EMST}$ for the cost of the EMST • Write $\mathrm{MST}_D$ for the cost of the minimum spanning tree of the tree metric $D$ • $E[\mathrm{MST}_D] = O(\log D) \cdot \mathrm{EMST}$ (linearity of expectation) • Use the cost of the MST of $D$ as estimator

  41. Streaming algorithms via embeddings into tree metrics Observation [Indyk, STOC'04] • The MST of $D(\cdot,\cdot)$ is given by the tree defining the tree metric • #edges of length $2^i$ = #non-empty cells in the corresponding grid [Figure: points p, q, r, s, t in a grid with cells of side $2^i$ and the corresponding tree level]

  42. Streaming algorithms via embeddings into tree metrics Euclidean minimum spanning tree • Use $O(\log D)$ nested grids $G(i)$ with side length $2^i$ • For each grid, approximate $|G(i)|$ := #nonempty cells in $G(i)$ using an $F_0$ sketch • Return $\sum_i 2^i |G(i)|$ Theorem [Indyk, STOC'04] The above algorithm computes an $O(\log D)$-approximation to the cost of the minimum spanning tree.
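The structure of this estimator can be sketched as follows. To keep the example short, exact per-cell counters replace the $F_0$ sketch, so the code uses far more space than the streaming algorithm would; that substitution, the class name `GridEmstEstimator`, the choice of number of levels, and the use of `delta` for the coordinate bound are all assumptions for illustration.

```python
from collections import Counter

class GridEmstEstimator:
    def __init__(self, delta, shift):
        self.shift = shift
        # Levels 0..L with 2^L > delta (an assumed choice of O(log delta) grids).
        self.num_levels = delta.bit_length() + 1
        # Exact counters per grid; a real implementation would use an F_0 sketch.
        self.cells = [Counter() for _ in range(self.num_levels)]

    def _update(self, p, sign):
        for i, counts in enumerate(self.cells):
            # Cell of p in the shifted grid with side length 2^i.
            cell = tuple((x + s) >> i for x, s in zip(p, self.shift))
            counts[cell] += sign
            if counts[cell] == 0:
                del counts[cell]      # the cell became empty

    def insert(self, p):
        self._update(p, +1)

    def delete(self, p):
        self._update(p, -1)

    def estimate(self):
        # sum over i of 2^i * (number of nonempty cells in G(i))
        return sum((2 ** i) * len(counts) for i, counts in enumerate(self.cells))

est = GridEmstEstimator(delta=4, shift=(0, 0))
est.insert((1, 1))
est.insert((4, 4))
print(est.estimate())
```

Because every level only needs the number of nonempty cells, deletions are handled by the same update rule as insertions, which is what makes the approach suitable for update streams.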

  43. Streaming algorithms via embeddings into tree metrics Results using a similar approach [Indyk, STOC'04] [Table with columns "Problem" and "Approx. factor"; the rows are not preserved in this transcript]

  44. Streaming algorithms via estimating the distribution of local neighborhoods Distribution of neighborhoods • Grids $G(i)$ as before • R-neighborhood of $C$: cells within distance at most $R$ from $C$ • $m_{C,R}(i)$ is the number of points in the i-th cell of the R-neighborhood of $C$ [Figure: a cell and its 2-neighborhood]

  45. Streaming algorithms via estimating the distribution of local neighborhoods EMST estimator • Define $Z_{C,R}(i) = (m_{C,R}(i) > 0)$ • EMST can be approximated from the $Z_{C,R}(i)$ • Approximation ratio goes to 1 as $R$ goes to $\infty$

  46. Streaming algorithms via estimating the distribution of local neighborhoods EMST estimator • $K$: size of the R-neighborhood • The $Z_{C,R}$ are functions from $\{1, \dots, K\}$ to $\{0,1\}$ • A random (nonempty) cell $C$ defines a distribution over neighborhoods, i.e. over functions $Z: \{1, \dots, K\} \to \{0,1\}$ • Can still estimate EMST from this distribution

  47. Streaming algorithms via estimating the distribution of local neighborhoods Algorithm • Sample a certain number of nonempty grid cells and maintain the number of points for each cell in their neighborhood • The sample gives an estimate of the distribution of the $Z_{C,R}(\cdot)$ • Obtain an estimate for EMST from the estimated distribution Theorem [Frahling, Indyk, S., IJCGA'07] Let $\varepsilon > 0$ and $d$ be constants. The cost of a Euclidean minimum spanning tree of a point set in $\mathbb{R}^d$ given as an update stream can be estimated within a factor of $1 \pm \varepsilon$ using polylog(D) space.
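The first two steps of the algorithm can be illustrated offline. The sketch below assumes the whole point set is available (the streaming version has to sample nonempty cells from an update stream, which is not shown), takes the R-neighborhood to be all cells whose index differs by at most R in every coordinate, and uses hypothetical names `occupancy_pattern` and `estimate_pattern_distribution`. Turning the estimated distribution into an EMST estimate is not covered here.

```python
import itertools
import random
from collections import Counter

def occupancy_pattern(cell, nonempty, R):
    """Z_{C,R} as a 0/1 tuple over the K = (2R+1)^d cells around `cell`."""
    offsets = itertools.product(range(-R, R + 1), repeat=len(cell))
    return tuple(int(tuple(c + o for c, o in zip(cell, off)) in nonempty)
                 for off in offsets)

def estimate_pattern_distribution(points, side, R, samples, rng):
    """Empirical distribution of occupancy patterns over sampled nonempty cells."""
    nonempty = {tuple(x // side for x in p) for p in points}
    cells = sorted(nonempty)
    picks = [rng.choice(cells) for _ in range(samples)]   # uniform nonempty cells
    hist = Counter(occupancy_pattern(c, nonempty, R) for c in picks)
    return {pattern: n / samples for pattern, n in hist.items()}

rng = random.Random(0)
pts = [(rng.randrange(64), rng.randrange(64)) for _ in range(200)]
dist = estimate_pattern_distribution(pts, side=8, R=1, samples=500, rng=rng)
print(len(dist), "distinct patterns observed")
```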

  48. Streaming algorithms via estimating the distribution of local neighborhoods Open Problems • $(1+\varepsilon)$-approximation for matching and/or earth mover's distance • Other problems? The approach is not very well understood • General characterization of problems solvable via approximation of the distribution of local neighborhoods

  49. Streaming algorithms via balanced partitions Estimating the distribution [Frahling, S., STOC'05] • Divide space into regions • For each region maintain #points inside • Balance "error" among regions • Notion of error depends on the problem Example • 1-Median in 1D • Error $\le$ cell width $\cdot$ #points in cell
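The 1D example can be checked numerically. In the sketch below, each interval of a partition is replaced by its midpoint weighted by the number of points inside; every point moves by at most half the interval width, so the 1-median cost of any center changes by at most width times count per interval. The helper names and the fixed partition are assumptions for illustration; choosing the partition so that the error is balanced is the actual algorithmic task and is not shown.

```python
import random

def cost_1median(points, c):
    """Sum of distances from the points to the center c."""
    return sum(abs(x - c) for x in points)

def summarize(points, boundaries):
    """Map each interval [boundaries[j], boundaries[j+1]) to (midpoint, count, width)."""
    cells = []
    for lo, hi in zip(boundaries, boundaries[1:]):
        count = sum(1 for x in points if lo <= x < hi)
        cells.append(((lo + hi) / 2, count, hi - lo))
    return cells

def summary_cost(cells, c):
    """Cost of center c when every point sits at the midpoint of its interval."""
    return sum(count * abs(mid - c) for mid, count, _ in cells)

def error_bound(cells):
    """Sum over intervals of width * #points, an upper bound on the cost error."""
    return sum(width * count for _, count, width in cells)

rng = random.Random(0)
pts = [rng.randrange(100) for _ in range(500)]
cells = summarize(pts, list(range(0, 101, 10)))
c = 37.5
gap = abs(cost_1median(pts, c) - summary_cost(cells, c))
print(gap, error_bound(cells))
```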

  50. Streaming algorithms via balanced partitions Small space? • Problem dependent • Need to show that a decomposition into few regions with sufficiently small error exists
