Algorithms for geometric data streams Christian Sohler, TU Dortmund
Introduction Data streams • Massive data set arriving sequentially • Different ways of "arriving" Examples • Network traffic • Query logs • … Approach • Find algorithms that make a single pass (or a few passes) and process the data sequentially
Introduction Geometric data streams • Massive sets of geometric objects arriving sequentially • Objects are typically points • Different forms of arrival: a sequence of points, or a sequence of updates Questions • Find ways to analyze the geometric structure of the input data using small space
Introduction Motivation • Many computational tasks can be interpreted geometrically • Geometric features may be useful in learning and classification • Geometry plays an important role in the application Examples • Learning • Clustering • How "clusterable" is a data set? • Road traffic prediction
Introduction A basic learning problem • We have two classes of objects • We are given examples from both classes • Learn from the examples to which class future objects belong • Map each object's description to a point in Euclidean space SVM approach • Compute the maximum-margin hyperplane • Classify points according to the side of the hyperplane on which they lie
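The last step of the SVM approach, classifying a point by its side of a hyperplane, can be sketched as follows. This is a minimal illustration only: the hyperplane `(w, b)` is a made-up example and is assumed to be given, and computing the maximum-margin hyperplane itself is not shown.

```python
from typing import Sequence


def side_of_hyperplane(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 or -1 depending on which side of {z : <w, z> + b = 0} x lies."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical separator (an assumption for this example): the line x1 = 2.
w, b = (1.0, 0.0), -2.0
labels = [side_of_hyperplane(w, b, p) for p in [(0.0, 1.0), (3.0, -1.0)]]
```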
Introduction SVM and SEB (smallest enclosing balls) • The dual of a certain SVM formulation is an SEB problem [Tax, Duin, Pattern Recognition Letters, ’99] • Geometric streaming SEB can be used as an SVM heuristic [Rai, Daume III, Venkatasubramanian, IJCAI’09] • Also: Coresets have been used to construct CSVMs [Tsang, Kwok, Cheung, Journal of Machine Learning Research, ’05]
Introduction Outline • Merge & Reduce • Embeddings into tree metrics • Estimation of distribution of local neighborhoods • Balanced partitions • Approximating properties of balanced partitions
Merge & Reduce Insertion-only streams • Sequence of points $p_1, \dots, p_n$ from $\mathbb{R}^d$
Merge & Reduce Definition [k-median clustering] Given a weighted set $P$ of points in $\mathbb{R}^d$, the k-median problem is to find a set $C \subseteq \mathbb{R}^d$ of $k$ points (centers) such that $\mathrm{cost}(P,C) = \sum_{p \in P} w_p \min_{c \in C} \|p-c\|$ is minimized, where $w_p > 0$ is the weight of point $p$.
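The cost function of the definition can be evaluated directly. The sketch below does so for a small hand-made instance; the names `kmedian_cost`, `P`, `W` and `C` are illustrative and not from the talk.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def kmedian_cost(points: Sequence[Point], weights: Sequence[float],
                 centers: Sequence[Point]) -> float:
    """Weighted sum of distances from each point to its nearest center."""
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))


P = [(0.0, 0.0), (4.0, 0.0), (10.0, 0.0)]
W = [1.0, 2.0, 1.0]
C = [(0.0, 0.0), (10.0, 0.0)]
cost = kmedian_cost(P, W, C)  # 1*0 + 2*4 + 1*0 = 8
```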
Merge & Reduce Coreset [Har-Peled, Mazumdar, STOC’04] A weighted point set $S$ is a $(k,\varepsilon)$-coreset of a weighted point set $P$ if for every set $C$ of $k$ centers $|\mathrm{cost}(P,C) - \mathrm{cost}(S,C)| \le \varepsilon \cdot \mathrm{cost}(P,C)$.
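The quantifier "for every set C of k centers" is what makes the definition strong. The sketch below tests the coreset inequality for individual center sets on a toy 1D instance; the instance and the helper `violates_coreset` are assumptions made for illustration. Such spot checks can refute the coreset property but can never prove it.

```python
import math


def cost(points, weights, centers):
    """Weighted k-median cost of the points with respect to the centers."""
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))


def violates_coreset(P, wP, S, wS, centers, eps):
    """True if this particular center set breaks the coreset inequality."""
    full = cost(P, wP, centers)
    return abs(full - cost(S, wS, centers)) > eps * full


# Two tight groups, each replaced by one representative of weight 3.
P = [(0.0,), (1.0,), (2.0,), (100.0,), (101.0,), (102.0,)]
wP = [1.0] * 6
S = [(1.0,), (101.0,)]
wS = [3.0, 3.0]

far_ok = not violates_coreset(P, wP, S, wS, [(50.0,), (200.0,)], eps=0.1)
near_bad = violates_coreset(P, wP, S, wS, [(1.0,), (101.0,)], eps=0.1)
```

For the far-away centers both costs are 300, so the inequality holds. For centers placed on the representatives, the cost of P is 4 and the cost of S is 0, so S is not a $(2, 0.1)$-coreset of P.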
Merge & Reduce Observation • The union of $(k,\varepsilon)$-coresets of two disjoint point sets $P_1$ and $P_2$ is a $(k,\varepsilon)$-coreset of $P_1 \cup P_2$ • Can compute a coreset of a coreset; the errors compound, so a $(k,\varepsilon)$-coreset of a $(k,\delta)$-coreset of $P$ is a $(k, \varepsilon+\delta+\varepsilon\delta)$-coreset of $P$ [Figure: the input stream is cut into blocks, a coreset is computed for each block, and a coreset of the union of two coresets is computed repeatedly]
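The two observations suggest a binary-counter scheme, sketched below under stated assumptions. The class `MergeReduce` and the reduce step `reduce_grid` are illustrative names; `reduce_grid` simply snaps points to a grid and adds up weights, and is a placeholder for a real coreset construction without any proven $(k,\varepsilon)$ guarantee. Since every reduce step adds error, an actual algorithm would have to use a smaller error parameter per level.

```python
import random
from collections import defaultdict


def reduce_grid(weighted_points, cell=1.0):
    """Placeholder reduce step: snap points to a grid and add up the weights."""
    merged = defaultdict(float)
    for p, w in weighted_points:
        key = tuple(cell * round(x / cell) for x in p)
        merged[key] += w
    return list(merged.items())


class MergeReduce:
    def __init__(self, block_size, reduce_fn):
        self.block_size = block_size
        self.reduce_fn = reduce_fn
        self.buffer = []   # raw points of the current block
        self.levels = []   # levels[i]: None or a summary of 2**i blocks

    def insert(self, point):
        self.buffer.append((point, 1.0))
        if len(self.buffer) == self.block_size:
            self._carry(self.reduce_fn(self.buffer))
            self.buffer = []

    def _carry(self, summary):
        # Like adding 1 to a binary counter: merge equal-level summaries.
        level = 0
        while level < len(self.levels) and self.levels[level] is not None:
            summary = self.reduce_fn(self.levels[level] + summary)
            self.levels[level] = None
            level += 1
        if level == len(self.levels):
            self.levels.append(None)
        self.levels[level] = summary

    def summary(self):
        out = list(self.buffer)
        for s in self.levels:
            if s is not None:
                out.extend(s)
        return out


random.seed(0)
mr = MergeReduce(block_size=8, reduce_fn=reduce_grid)
for _ in range(100):
    mr.insert((random.uniform(0, 10), random.uniform(0, 10)))
total_weight = sum(w for _, w in mr.summary())
```

At any time only one summary per level is stored, so after n insertions there are at most about $\log_2(n / \text{block size}) + 1$ summaries plus the current block.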
Merge & Reduce Coresets by pre-clustering [Guha, Mishra, Motwani, O’Callaghan, FOCS’00; Har-Peled, Mazumdar, STOC’04; Frahling, S., STOC’05] • Compute a pre-clustering $S$ with more than $k$ centers and $\mathrm{cost}(P,S) \le \varepsilon \cdot \mathrm{Opt}_k$ • Size exponential in $d$
Merge & Reduce Coresets by sampling [Chen, SICOMP’09; Feldman, Monemizadeh, S., SoCG’07] • Compute a random non-uniform sample • Show that the sample approximates all solutions from a net • Size polynomial in $d$
Merge & Reduce Coresets by reduction to 1D [Har-Peled, Kushal, DCG’07; Feldman, Fiat, Sharir, FOCS’06] • Use geometric arguments to solve the 1D case • Combine with pre-clustering using line centers • For k-median: size independent of n (but exponential in d)
Merge & Reduce Open problems • Coresets for k-median of size independent of n and d? (Partial result in [Feldman, Monemizadeh, S., SoCG’07]) • Coresets for k-median of size $O(d/\varepsilon^2)$? • Coresets for k-median of size $\mathrm{poly}(d, \log n)/\varepsilon^{2-c}$ for a constant $c = c(d) > 0$? • Coresets for j-subspace 1-median of size $\mathrm{poly}(\varepsilon, d, j, \log n)$? • Same questions for the k-means objective function Remark: The open questions refer to the definition of coresets from this talk.
Geometric update streams Insertion/deletion model • Stream consists of Insert(p) and Delete(p) operations • Points are from $\{1, \dots, \Delta\}^d$ • Stream is consistent, i.e. there is no Delete(p) if p is not present and no Insert(p) if p is already present in the current set
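The consistency condition can be made concrete with a reference replay of the stream. This sketch stores the whole point set and therefore uses linear space; it only illustrates the model and is not a streaming algorithm. The function name and the operation encoding are assumptions.

```python
def apply_updates(updates):
    """Replay ("insert", p) / ("delete", p) operations; reject inconsistent streams."""
    current = set()
    for op, p in updates:
        if op == "insert":
            if p in current:
                raise ValueError(f"Insert of a point that is already present: {p}")
            current.add(p)
        elif op == "delete":
            if p not in current:
                raise ValueError(f"Delete of a point that is not present: {p}")
            current.remove(p)
        else:
            raise ValueError(f"unknown operation: {op}")
    return current


final = apply_updates([("insert", (1, 2)), ("insert", (3, 4)), ("delete", (1, 2))])
```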
Streaming algorithms via embeddings into tree metrics Embeddings in tree metrics [Figure: points p, q, r, s, t in nested grids of side lengths $2^i$, $2^{i-1}$, $2^{i-2}$ and the corresponding tree, whose edges have lengths $2^i$, $2^{i-1}$, $2^{i-2}$] • The tree defines a tree metric $D(\cdot,\cdot)$ • $\|p-q\| \le D(p,q)$ • $\mathbb{E}[D(p,q)] = O(\log \Delta)\,\|p-q\|$ [Bartal, FOCS’96; Charikar, Chekuri, Goel, Guha, Plotkin, FOCS’98]
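A tree metric of this kind can be obtained from randomly shifted nested grids. The sketch below is one possible instantiation, not necessarily the construction of the cited papers. It assumes integer points with coordinates in {1, …, delta}, delta a power of two, and it gives the edge from a level-i cell to its parent the length sqrt(d)·2^(i+1); this edge weighting is an assumption chosen so that the tree distance never falls below the Euclidean distance. The expectation bound is not checked here.

```python
import math
import random


def tree_distance(p, q, shift, levels):
    """Tree distance of integer points p and q under nested, shifted grids.

    Level-j cells have side 2**j. The edge from a level-i cell up to its
    level-(i+1) parent has length sqrt(d) * 2**(i+1) (assumed weighting).
    """
    if p == q:
        return 0.0
    d = len(p)
    for j in range(1, levels + 1):
        side = 2 ** j
        cell_p = tuple((x + s) // side for x, s in zip(p, shift))
        cell_q = tuple((x + s) // side for x, s in zip(q, shift))
        if cell_p == cell_q:
            # The tree path climbs from level 0 to level j and descends again.
            return 2 * sum(math.sqrt(d) * 2 ** (i + 1) for i in range(j))
    raise ValueError("points do not share a cell at the top level")


delta = 16                      # coordinate bound, a power of two
levels = delta.bit_length()     # top-level cells have side 2 * delta
shift = tuple(random.randrange(delta) for _ in range(2))
p, q = (3, 4), (10, 12)
d_tree = tree_distance(p, q, shift, levels)
```

Two points that first share a cell at level j are at Euclidean distance less than sqrt(d)·2^j, which is below the returned value 2·sqrt(d)·(2^(j+1) − 2).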
Streaming algorithms via embeddings into tree metrics Estimator for cost of Euclidean minimum spanning tree (EMST) [Indyk, STOC’04] • Write $\mathrm{EMST}$ for the cost of the EMST • Write $\mathrm{MST}_D$ for the cost of a minimum spanning tree under the tree metric $D$ • $\mathbb{E}[\mathrm{MST}_D] = O(\log \Delta) \cdot \mathrm{EMST}$ (linearity of expectation) • Use the cost of the MST of $D$ as estimator
Streaming algorithms via embeddings into tree metrics Observation [Indyk, STOC’04] • The MST of $D(\cdot,\cdot)$ is given by the tree defining the tree metric • #edges of length $2^i$ = #non-empty cells in the corresponding grid
Streaming algorithms via embeddings into tree metrics Euclidean minimum spanning tree 1. Use $O(\log \Delta)$ nested grids $G(i)$ with side length $2^i$ 2. For each grid, approximate $|G(i)|$ := #nonempty cells in $G(i)$ using an $F_0$ sketch 3. Return $\sum_i 2^i\,|G(i)|$ Theorem [Indyk, STOC’04] The above algorithm computes an $O(\log \Delta)$-approximation to the cost of the minimum spanning tree.
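The three steps can be sketched offline as follows. This is not the streaming algorithm: the $F_0$ sketch is replaced by an exact set of nonempty cells, which needs linear space and does not handle deletions. The function name, the `shift` parameter and the choice of levels 0, …, log2(delta) are assumptions made for this example.

```python
def emst_grid_estimate(points, delta, shift):
    """Evaluate sum_i 2**i * |G(i)| with exact counts of nonempty cells."""
    total = 0
    for i in range(delta.bit_length()):      # side lengths 1, 2, ..., delta
        side = 2 ** i
        cells = {tuple((x + s) // side for x, s in zip(p, shift))
                 for p in points}
        total += side * len(cells)
    return total


# Two points in {1, ..., 4}^2 with an unshifted grid:
# side 1 -> 2 cells, side 2 -> 2 cells, side 4 -> 1 cell.
estimate = emst_grid_estimate([(1, 1), (3, 1)], delta=4, shift=(0, 0))
```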
Streaming algorithms via embeddings into tree metrics Results using a similar approach [Indyk, STOC’04] [Table: Problem | Approx. factor]
Streaming algorithms via estimating the distribution of local neighborhoods Distribution of neighborhoods • Grids $G(i)$ as before • R-neighborhood of $C$: cells within distance at most $R$ from $C$ • $m_{C,R}(i)$ is the number of points in the i-th cell of the R-neighborhood of $C$ [Figure: a cell and its 2-neighborhood]
Streaming algorithms via estimating the distribution of local neighborhoods EMST estimator • Define $Z_{C,R}(i) = 1$ if $m_{C,R}(i) > 0$, and $Z_{C,R}(i) = 0$ otherwise • EMST can be approximated from the $Z_{C,R}(i)$ • Approx. ratio goes to 1 as $R$ goes to $\infty$
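The occupancy pattern of a single neighborhood can be computed as in the sketch below. It assumes that the R-neighborhood of a cell consists of all cells whose index differs by at most R in every coordinate, enumerated in lexicographic order of the offsets; the talk does not fix the distance measure or the enumeration, so both are assumptions, as is the function name.

```python
from collections import Counter
from itertools import product


def neighborhood_pattern(points, side, cell, R):
    """Return (m, Z): point counts and 0/1 occupancy of the R-neighborhood of cell."""
    counts = Counter(tuple(x // side for x in p) for p in points)
    d = len(cell)
    m = [counts[tuple(c + o for c, o in zip(cell, offset))]
         for offset in product(range(-R, R + 1), repeat=d)]
    Z = [1 if count > 0 else 0 for count in m]
    return m, Z


# Grid of side 2; the point (5, 5) lies outside the 1-neighborhood of cell (0, 0).
pts = [(0, 0), (1, 0), (2, 0), (5, 5)]
m, Z = neighborhood_pattern(pts, side=2, cell=(0, 0), R=1)
```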
Streaming algorithms via estimating the distribution of local neighborhoods EMST estimator • $K$: size of the R-neighborhood • The $Z_{C,R}$ are functions from $\{1,\dots,K\}$ to $\{0,1\}$ • A random (nonempty) cell $C$ defines a distribution over neighborhoods, i.e. over functions $Z: \{1,\dots,K\} \to \{0,1\}$ • Can still estimate EMST from this distribution
Streaming algorithms via estimating the distribution of local neighborhoods Algorithm • Sample a certain number of nonempty grid cells and maintain the number of points in each cell of their neighborhoods • The sample gives an estimate of the distribution of the $Z_{C,R}(\cdot)$ • Obtain an estimate of EMST from the estimated distribution Theorem [Frahling, Indyk, S., IJCGA’07] Let $\varepsilon > 0$ and $d$ be constants. The cost of a Euclidean minimum spanning tree of a point set in $\mathbb{R}^d$ given as an update stream can be estimated within a factor of $1+\varepsilon$ using $\mathrm{polylog}(\Delta)$ space.
Streaming algorithms via estimating the distribution of local neighborhoods Open Problems • $(1+\varepsilon)$-approximation for matching and/or earth mover's distance • Other problems? The approach is not very well understood • General characterization of problems solvable via approximation of the distribution of local neighborhoods
Streaming algorithms via balanced partitions Estimating the distribution [Frahling, S., STOC’05] • Divide space into regions • For each region maintain #points inside • Balance "error" among regions • Notion of error depends on the problem Example • 1-Median in 1D • Error $\le$ cell width $\cdot$ #points in cell
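For the 1D 1-median example, the error notion can be checked numerically. The sketch below keeps only a point count per cell, estimates the cost of a center by moving every point to the midpoint of its cell, and bounds the error by half the cell width per point, which matches the error notion above up to a constant factor. The partition is fixed by hand; choosing it so that the per-cell errors are balanced is not shown. All names are illustrative.

```python
import bisect


def one_median_cost(points, c):
    """Exact 1-median cost of center c."""
    return sum(abs(x - c) for x in points)


def partition_summary(points, boundaries):
    """Count the points in each cell [boundaries[j], boundaries[j+1])."""
    counts = [0] * (len(boundaries) - 1)
    for x in points:
        counts[bisect.bisect_right(boundaries, x) - 1] += 1
    return counts


def estimated_cost(boundaries, counts, c):
    """Cost of c if every point sits at the midpoint of its cell."""
    return sum(n * abs((lo + hi) / 2 - c)
               for lo, hi, n in zip(boundaries, boundaries[1:], counts))


def error_bound(boundaries, counts):
    """Each point moves by at most half the width of its cell."""
    return sum(n * (hi - lo) / 2
               for lo, hi, n in zip(boundaries, boundaries[1:], counts))


points = [1, 2, 2.5, 7, 9, 15]
boundaries = [0, 4, 8, 16]
counts = partition_summary(points, boundaries)
exact = one_median_cost(points, 5)
approx = estimated_cost(boundaries, counts, 5)
bound = error_bound(boundaries, counts)
```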
Streaming algorithms via balanced partitions Small space? • Problem dependent • Need to show that a decomposition into few regions with sufficiently small error exists