
Compact Data Representations and their Applications


Presentation Transcript


  1. Compact Data Representations and their Applications Moses Charikar Princeton University

  2. Lots and lots of data • AT&T • Information about who calls whom • What information can be gleaned from this data? • Network router • Sees a high-speed stream of packets • Detect DoS attacks? Fair resource allocation?

  3. Lots and lots of data • Google search engine • About 3 billion web pages • Many many queries every day • How to efficiently process data ? • Eliminate near duplicate web pages • Query log analysis

  4. Sketching Paradigm [figure: a complex object mapped to a short bit-string sketch] • Construct a compact representation (sketch) of the data such that interesting functions of the data can be estimated from the compact representation

  5. Why care about compact representations ? • Practical motivations • Algorithmic techniques for massive data sets • Compact representations lead to reduced space, time requirements • Make impractical tasks feasible • Theoretical Motivations • Interesting mathematical problems • Connections to many areas of research

  6. Questions • What is the data ? • What functions do we want to compute on the data ? • How do we estimate functions on the sketches ? • Different considerations arise from different combinations of answers • Compact representation schemes are functions of the requirements

  7. What is the data ? • Sets, vectors, points in Euclidean space, points in a metric space, vertices of a graph. • Mathematical representation of objects (e.g. documents, images, customer profiles, queries).

  8. What functions do we want to compute on the data? • Local functions: pairs of objects, e.g. distance between objects • Sketch of each object, such that the function can be estimated from pairs of sketches • Global functions: entire data set, e.g. statistical properties of the data • Sketch of the entire data set, with the ability to update and combine sketches

  9. Local functions: distance/similarity • Distance is a general metric, i.e., it satisfies the triangle inequality • Normed space: x = (x1, x2, …, xd), y = (y1, y2, …, yd), with ||x−y||p = (Σi |xi−yi|^p)^(1/p) • Other special metrics (e.g. Earth Mover Distance)

  10. Estimating distance from sketches • Arbitrary function of sketches • Information theory, communication complexity question • Sketches are points in a normed space • Embedding the original distance function in a normed space [Bourgain ’85] [Linial,London,Rabinovich ’94] • Original metric is the (same) normed space • Original data points are high dimensional, sketches are points in low dimension • Dimension reduction in normed spaces [Johnson Lindenstrauss ’84]

  11. Streaming algorithms • Perform computation in one pass (or a constant number of passes) over the data using a small amount of storage space • Availability of a sketch function facilitates a streaming algorithm • Additional requirements – the sketch should allow: • Update to incorporate new data items • Combination of sketches for different data sets

  12. Global functions • Statistical properties of entire data set • Frequency moments • Sortedness of data • Set membership • Size of join of relations • Histogram representation • Most frequent items in data set • Clustering of data

  13. Goals • Glimpse of compact representation techniques in the sketching and streaming domains. • Basic ideas, no messy details

  14. Talk Outline • Classical techniques: spectral methods • Dimension reduction • Similarity preserving hash functions • sketching vector norms • sketching Earth Mover Distance (EMD)

  15. Talk Outline • Dimension reduction • Similarity preserving hash functions • sketching vector norms • sketching Earth Mover Distance (EMD)

  16. Low Distortion Embeddings • Given metric spaces (X1,d1) and (X2,d2), an embedding f: X1 → X2 has distortion D if the ratio of distances changes by at most a factor D • “Dimension Reduction” – • Original space is high dimensional • Make the target space of “low” dimension, while maintaining small distortion

  17. Dimension Reduction in L2 • n points in Euclidean space (L2 norm) can be mapped down to O((log n)/ε^2) dimensions with distortion at most 1+ε [Johnson Lindenstrauss ‘84] • Two interesting properties: • Linear mapping • Oblivious – choice of linear mapping does not depend on the point set • Quite simple [JL84, FM88, IM98, DG99, Ach01]: even a random +1/−1 matrix works (see the sketch below)… • Many applications…
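A minimal Python sketch of the random ±1 projection mentioned above (the function name and parameters are illustrative, not from the talk):

```python
import numpy as np

def jl_sketch(points, k, seed=0):
    """Oblivious linear map: project n points in R^d down to k dimensions
    using a random +1/-1 matrix scaled by 1/sqrt(k). For
    k = O((log n)/eps^2), pairwise L2 distances are preserved to within
    a 1+eps factor with high probability."""
    rng = np.random.default_rng(seed)
    d = points.shape[1]
    R = rng.choice([-1.0, 1.0], size=(d, k))  # chosen independently of the data
    return (points @ R) / np.sqrt(k)

# Toy check: a pairwise distance before and after projection.
X = np.random.default_rng(1).normal(size=(100, 1000))
Y = jl_sketch(X, k=300)
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(Y[0] - Y[1]))  # close
```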

  18. Dimension reduction for L1 • [C,Sahai ‘02] Linear embeddings are not good for dimension reduction in L1 • There exist O(n) points in L1 in n dimensions, such that any linear mapping with distortion δ needs n/δ^2 dimensions

  19. Dimension reduction for L1 • [C, Brinkman ‘03] Strong lower bounds for dimension reduction in L1 • There exist n points in L1, such that any embedding with constant distortion needs n^(1/2) dimensions • Simpler proof by [Lee,Naor ’04] • Does not rule out other sketching techniques

  20. Talk Outline • Dimension reduction • Similarity preserving hash functions • sketching vector norms • sketching Earth Mover Distance (EMD)

  21. Similarity Preserving Hash Functions [figure: a complex object mapped to a short bit-string sketch] • Similarity function sim(x,y), distance d(x,y) • Family of hash functions F with a probability distribution such that Pr_{h∈F}[h(x) = h(y)] = sim(x,y)

  22. Applications • Compact representation scheme for estimating similarity • Approximate nearest neighbor search [Indyk,Motwani ’98] [Kushilevitz,Ostrovsky,Rabani ‘98]

  23. Sketching Set Similarity: Minwise Independent Permutations [Broder,Manasse,Glassman,Zweig ‘97] [Broder,C,Frieze,Mitzenmacher ‘98] • Pick a random permutation π of the universe; the sketch of a set A is min{π(a) : a ∈ A} • Pr[min π(A) = min π(B)] = |A∩B| / |A∪B|, the Jaccard similarity of A and B
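A minimal Python illustration of minwise hashing, using salted hash functions in place of random permutations (names are illustrative; this is a sketch of the idea, not the papers’ implementation):

```python
import random

def minhash_sketch(elements, num_hashes=100, seed=0):
    """Keep, for each of num_hashes random hash functions, the minimum
    hash value over the set; two sets' sketches agree in a coordinate
    with probability equal to their Jaccard similarity |A∩B|/|A∪B|."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash((salt, e)) for e in elements) for salt in salts]

def estimate_similarity(sketch_a, sketch_b):
    # Fraction of agreeing coordinates estimates the Jaccard similarity.
    return sum(a == b for a, b in zip(sketch_a, sketch_b)) / len(sketch_a)

A = {"the", "quick", "brown", "fox"}
B = {"the", "quick", "red", "fox"}
print(estimate_similarity(minhash_sketch(A), minhash_sketch(B)))  # ~3/5
```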

  24. Other similarity functions ? [C’02] • Necessary conditions for existence of similarity preserving hash functions. • SPH does not exist for Dice coefficient and Overlap coefficient. • SPH schemes from rounding algorithms • Hash function for vectors based on random hyperplane rounding.

  25. Existence of SPH schemes • sim(x,y) admits an SPH scheme if there exists a family of hash functions F with a probability distribution such that Pr_{h∈F}[h(x) = h(y)] = sim(x,y)

  26. Theorem: If sim(x,y) admits an SPH scheme, then 1−sim(x,y) satisfies the triangle inequality. Proof idea: 1−sim(x,y) = Pr[h(x) ≠ h(y)], and the event h(x) ≠ h(z) implies h(x) ≠ h(y) or h(y) ≠ h(z); the derivation is spelled out below.
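A reconstruction of the omitted derivation in LaTeX (the original slide showed it as an image):

```latex
% The event {h(x) != h(z)} is contained in {h(x) != h(y)} ∪ {h(y) != h(z)},
% so the union bound gives the triangle inequality for 1 - sim.
\begin{align*}
1 - \mathrm{sim}(x,z)
  &= \Pr_{h \in F}\bigl[h(x) \neq h(z)\bigr] \\
  &\le \Pr_{h \in F}\bigl[h(x) \neq h(y)\bigr]
     + \Pr_{h \in F}\bigl[h(y) \neq h(z)\bigr] \\
  &= \bigl(1 - \mathrm{sim}(x,y)\bigr) + \bigl(1 - \mathrm{sim}(y,z)\bigr).
\end{align*}
```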

  27. Non-existence of SPH • Example (Dice coefficient, sim(A,B) = 2|A∩B|/(|A|+|B|)): take A = {a}, B = {b}, C = {a,b}; then 1−sim(A,C) = 1−sim(B,C) = 1/3 but 1−sim(A,B) = 1 > 2/3, so 1−sim violates the triangle inequality and no SPH scheme exists • The same argument applies to the Overlap coefficient |A∩B|/min(|A|,|B|)

  28. Stronger Condition Theorem: If sim(x,y) admits an SPH scheme, then (1+sim(x,y))/2 has an SPH scheme with hash functions mapping objects to {0,1}. Theorem: If sim(x,y) admits an SPH scheme, then 1−sim(x,y) is isometrically embeddable in the Hamming cube.

  29. For n vectors, a random hyperplane can be chosen using O(log^2 n) random bits [Indyk], [Engebretsen,Indyk,O’Donnell] • Alternate similarity measure for sets

  30. Random Hyperplane Rounding based SPH • Collection of vectors • Pick a random hyperplane through the origin (normal vector r); hash hr(u) = sign(r·u) • Pr[hr(u) = hr(v)] = 1 − θ(u,v)/π [Goemans,Williamson]
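A minimal Python sketch of random-hyperplane hashing; function names are illustrative:

```python
import numpy as np

def random_hyperplane_hashes(vectors, num_hashes=64, seed=0):
    """Each hash bit is the sign of a dot product with a random normal
    vector r, i.e. h_r(u) = sign(r . u), so
    Pr[h_r(u) = h_r(v)] = 1 - theta(u,v)/pi (Goemans-Williamson)."""
    rng = np.random.default_rng(seed)
    d = vectors.shape[1]
    R = rng.normal(size=(d, num_hashes))  # each column is a hyperplane normal
    return (vectors @ R >= 0).astype(np.uint8)

def estimate_angle(bits_u, bits_v):
    # Fraction of disagreeing bits estimates theta(u,v)/pi.
    return np.mean(bits_u != bits_v) * np.pi

u = np.array([[1.0, 0.0]])
v = np.array([[1.0, 1.0]])  # true angle pi/4 ~ 0.785
hu = random_hyperplane_hashes(u, 10000)
hv = random_hyperplane_hashes(v, 10000)
print(estimate_angle(hu[0], hv[0]))  # roughly 0.785
```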

  31. Sketching L1 • Design a sketch for vectors to estimate the L1 norm • Hash function to distinguish between small and large distances [KOR ’98] • Map L1 to Hamming space • Bit vectors a=(a1,a2,…,an) and b=(b1,b2,…,bn) • Distinguish between distances ≤ (1−ε)n/k versus ≥ (1+ε)n/k • XOR a random set of ≈k bits • Pr[h(a)=h(b)] differs by a constant in the two cases
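A simplified Python illustration of the XOR-of-random-bits idea, assuming each position is sampled with probability k/n so each subset has ≈k bits (all names are illustrative, not from [KOR ’98]):

```python
import random

def make_xor_sketcher(n, k, num_hashes=256, seed=0):
    """Each sketch bit is the parity (XOR) of a random subset of the n
    input positions. For Hamming distance d, the sketch bits differ with
    probability (1 - (1 - 2k/n)^d)/2, which differs by a constant
    between d <= (1-eps)n/k and d >= (1+eps)n/k."""
    rng = random.Random(seed)
    subsets = [[i for i in range(n) if rng.random() < k / n]
               for _ in range(num_hashes)]

    def sketch(bits):
        return [sum(bits[i] for i in s) % 2 for s in subsets]

    return sketch

n, k = 1024, 32
sketcher = make_xor_sketcher(n, k)
rng = random.Random(1)
a = [rng.randint(0, 1) for _ in range(n)]
b = list(a)
for i in range(n // k):   # flip exactly n/k bits
    b[i] ^= 1
sa, sb = sketcher(a), sketcher(b)
print(sum(x != y for x, y in zip(sa, sb)) / len(sa))  # ~0.44 here
```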

  32. Sketching L1 via stable distributions • a=(a1,a2,…,an) and b=(b1,b2,…,bn) • Sketching L2: f(a) = Σi ai Xi, f(b) = Σi bi Xi, with Xi independent Gaussian • f(a)−f(b) has a Gaussian distribution scaled by ||a−b||2 • Form many coordinates; estimate ||a−b||2 by taking the L2 norm • Sketching L1: f(a) = Σi ai Xi, f(b) = Σi bi Xi, with Xi independent Cauchy • f(a)−f(b) has a Cauchy distribution scaled by ||a−b||1 • Form many coordinates; estimate ||a−b||1 by taking the median [Indyk ’00] – streaming applications
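A minimal Python sketch of the stable-distribution approach; `stable_sketch` and its parameters are illustrative:

```python
import numpy as np

def stable_sketch(v, k=200, p=1, seed=0):
    """Linear sketch via p-stable distributions [Indyk '00]: each sketch
    coordinate is a dot product with i.i.d. p-stable variables -- Cauchy
    for L1 (p=1), Gaussian for L2 (p=2). Linearity gives
    f(a) - f(b) = f(a - b)."""
    rng = np.random.default_rng(seed)  # same seed => same projection for a and b
    X = (rng.standard_cauchy(size=(len(v), k)) if p == 1
         else rng.standard_normal(size=(len(v), k)))
    return v @ X

a = np.random.default_rng(1).normal(size=500)
b = np.random.default_rng(2).normal(size=500)
diff = stable_sketch(a) - stable_sketch(b)
# Cauchy has no mean, so use the median of |coordinates|:
# median(|c * Cauchy|) = c, which estimates ||a - b||_1.
print(np.median(np.abs(diff)), np.sum(np.abs(a - b)))  # should be close
```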

  33. Compact Representations in Streaming • Statistical properties of data streams • Distinct elements • Frequency moments, norm estimation • Frequent items

  34. Frequency Moments [Alon, Matias, Szegedy ’99] • Stream consists of elements from {1,2,…,n} • mi = number of times i occurs • Frequency moment Fk = Σi mi^k • F0 = number of distinct elements • F1 = size of stream • F2 = Σi mi^2

  35. Overall Scheme • Design an estimator (i.e. a random variable) with the right expectation • If the estimator is tightly concentrated, maintain a number of independent copies E1, E2, …, Er • Obtain estimate E from E1, E2, …, Er • Within a factor (1±ε) with probability 1−δ
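The standard way to combine the copies E1, …, Er is median-of-means; a minimal Python sketch (names illustrative):

```python
import statistics

def median_of_means(estimates, num_groups):
    """Averaging within a group shrinks the variance (the 1/eps^2
    factor); taking the median across groups drives the failure
    probability down to delta (a log(1/delta) number of groups)."""
    group_size = len(estimates) // num_groups
    means = [sum(estimates[g * group_size:(g + 1) * group_size]) / group_size
             for g in range(num_groups)]
    return statistics.median(means)
```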

  36. Randomness • Design estimator assuming perfect hash functions, as much randomness as needed • Too much space required to explicitly store such a hash function • Fix later by showing that limited randomness suffices

  37. Distinct Elements • Estimate the number of distinct elements in a data stream • “Brute force” solution: maintain a list of distinct elements seen so far • Ω(n) storage • Can we do better?

  38. Distinct Elements [Flajolet, Martin ’83] • Pick a random hash function h:[n] → [0,1] • Say there are k distinct elements • Then the minimum value of h over the k distinct elements is around 1/k • Apply h() to every element of the data stream; maintain the minimum value • Estimator = 1/minimum
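A minimal Python version of this idealized estimator (class and method names are illustrative; a perfectly random hash is simulated by seeding a PRNG per item):

```python
import random

class DistinctElements:
    """Idealized [Flajolet, Martin '83] estimator: hash every item to
    [0,1) and track the minimum; with k distinct elements,
    E[min] = 1/(k+1), so 1/min - 1 estimates k."""
    def __init__(self, seed=0):
        self.salt = seed
        self.minimum = 1.0

    def add(self, item):
        # Deterministic pseudo-random value in [0,1) per (salt, item).
        h = random.Random(hash((self.salt, item))).random()
        self.minimum = min(self.minimum, h)

    def estimate(self):
        return 1.0 / self.minimum - 1.0

dc = DistinctElements()
for x in [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]:   # 7 distinct elements
    dc.add(x)
print(dc.estimate())  # noisy; average O(1/eps^2) copies (slide 39)
```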

  39. (Idealized) Analysis • Assume a perfectly random hash function h:[n] → [0,1] • S: set of k elements of [n] • X = min_{a∈S} { h(a) } • E[X] = 1/(k+1) • Var[X] = O(1/k^2) • Mean of O(1/ε^2) independent estimators is within (1±ε) of 1/k with constant probability

  40. Analysis • [Alon,Matias,Szegedy] Analysis goes through with a pairwise independent hash function h(x) = ax+b • Factor-2 approximation • O(log n) space • Many improvements [Bar-Yossef,Jayram,Kumar,Sivakumar,Trevisan]

  41. Estimating F2 • F2 = Σi mi^2 • “Brute force” solution: maintain counters for all distinct elements • Sampling? • Needs Ω(n^(1/2)) space

  42. Estimating F2 [Alon,Matias,Szegedy] • Pick a random hash function h:[n] → {+1,−1} • hi = h(i) • Z = Σi hi mi • Z initially 0; add hi every time you see i • Estimator X = Z^2
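A minimal Python version of the AMS estimator (names illustrative; a salted hash stands in for the paper’s 4-wise independent family):

```python
import random

class F2Sketch:
    """AMS estimator: maintain Z = sum_i h(i) * m_i for a random
    h: [n] -> {+1,-1}; then E[Z^2] = F2 = sum_i m_i^2."""
    def __init__(self, seed=0):
        self.salt = seed
        self.z = 0

    def _sign(self, item):
        return 1 if hash((self.salt, item)) & 1 else -1

    def update(self, item, count=1):
        # Z is linear in the frequency vector m.
        self.z += self._sign(item) * count

    def estimate(self):
        return self.z ** 2

sk = F2Sketch()
for x in [1, 1, 2, 3, 3, 3]:   # m = (2,1,3), F2 = 4 + 1 + 9 = 14
    sk.update(x)
print(sk.estimate())
```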

  43. Analyzing the F2 estimator • E[X] = E[Z^2] = Σi,j E[hi hj] mi mj = Σi mi^2 = F2, since E[hi hj] = 0 for i ≠ j • With 4-wise independent h, Var[X] ≤ 2 F2^2

  44. Analyzing the F2 estimator • A single copy has high variance; the median of means of independent copies gives a good estimator

  45. Properties of F2 estimator • “Sketch” of data stream that allows computation of F2 • Linear function of the mi • Can be added, subtracted • Given two streams with frequencies mi, ni: E[(Z1−Z2)^2] = Σi (mi−ni)^2 • Estimate the L2 norm of the difference (see the snippet below) • How about the L1 norm? Lp norm?
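Because Z is linear in the frequencies, two sketches built with the same hash function subtract coordinate-wise; continuing the hypothetical F2Sketch example from slide 42 above:

```python
# Two streams sketched with the SAME hash function (same seed):
s1, s2 = F2Sketch(seed=42), F2Sketch(seed=42)
for x in [1, 2, 2, 3]:   # m = (1, 2, 1)
    s1.update(x)
for x in [1, 2, 3, 3]:   # n = (1, 1, 2)
    s2.update(x)
# Z1 - Z2 is exactly the sketch of m - n, so (Z1 - Z2)^2 estimates
# ||m - n||_2^2 (here 0 + 1 + 1 = 2).
print((s1.z - s2.z) ** 2)
```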

  46. Talk Outline • Similarity preserving hash functions • Similarity estimation • Statistical properties of data streams • Distinct elements • Frequency moments, norm estimation • Frequent items

  47. Variants of F2 estimator [Alon, Gibbons, Matias, Szegedy] • Estimate the join size of two relations with frequency vectors (m1,m2,…) and (n1,n2,…): join size = Σi mi ni, estimated by Z1·Z2 • Variance may be too high

  48. In Conclusion • Simple powerful ideas at the heart of several algorithmic techniques for large data sets • “Sketches” of data tailored to applications • Many interesting research questions

  49. Content-Based Similarity Search Moses Charikar Princeton University Joint work with: Qin Lv, William Josephson, Zhe Wang, Perry Cook, Matthew Hoffman, Kai Li

  50. Motivation • Massive amounts of feature-rich digital data • Audio, video, digital photos, scientific sensor data • Noisy, high-dimensional • Traditional file systems/search tools inadequate • Exact match • Keyword-based search • Annotations • Need content-based similarity search
