Computing Sketches of Matrices Efficiently & (Privacy Preserving) Data Mining

##### Presentation Transcript

1. Computing Sketches of Matrices Efficiently & (Privacy Preserving) Data Mining. Petros Drineas, Rensselaer Polytechnic Institute, drinep@cs.rpi.edu (joint work with R. Kannan and M. Mahoney), at the DIMACS Workshop on Privacy Preserving Data Mining

2. Motivation (Data Mining) • In many applications large matrices appear (too large to store in RAM). • We can make a few “passes” (sequential READS) through the matrices. • We can create and store a small “sketch” of the matrices in RAM. • Computing the “sketch” should be a very fast process. • Discard the original matrix and work with the “sketch”.

3. Motivation (Privacy Preserving) • In many applications, instead of revealing a large matrix, we only reveal its “sketch”. • Intuition: The “sketch” is an approximation to the original matrix. • Instead of viewing the approximation as a “necessary evil”, we might be able to use it to achieve privacy preservation (similar ideas in Feigenbaum et al., ICALP 2001). • Goal: Formulate a technical definition of privacy that might be achievable by such “sketching” algorithms and that provides meaningful and quantifiable protection. • Achieving the goal is an open problem!

4. Our approach & our results Create an approximation to the original matrix which can be stored in much less space. • A “sketch” consisting of a few rows/columns of the matrix is adequate for efficient approximations. • [see D & Kannan ’03, and D, Kannan & Mahoney ’04] • We draw the rows/columns randomly, using adaptive sampling; e.g. rows/columns are picked with probability proportional to their lengths.

5. Overview • A Data Mining setup • Approximating a large matrix • Algorithm • Error bounds • Tightness of the results • An alternative approach (Achlioptas and McSherry ’01 and ’03) • Conclusions

6. Applications: Data Mining We are given $m$ ($> 10^6$) objects and $n$ ($> 10^5$) features describing the objects. Database: An m-by-n matrix A ($A_{ij}$ shows the “importance” of feature j for object i). Every row of A represents an object. Queries: Given a new object x, find similar objects in the database (nearest neighbors).

7. Applications (cont’d) Two objects are “close” if the angle between their corresponding vectors is small. So, assuming that the vectors are normalized, $x^T d = \cos(x,d)$ is high when the two objects are close. $A x$ computes the cosines of all the angles and answers the query. • Key observation: The exact value $x^T d$ might not be necessary. • The feature values in the vectors are set by coarse heuristics. • It is in general enough to see if $x^T d >$ Threshold.
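The thresholded query on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `normalize` and `similar_objects`, the toy matrix, and the threshold value are all made up here and do not come from the talk.

```python
import math

def normalize(v):
    """Scale v to unit Euclidean length (v must be non-zero)."""
    norm = math.sqrt(sum(t * t for t in v))
    return [t / norm for t in v]

def similar_objects(rows, x, threshold):
    """Return the indices i with x^T rows[i] > threshold.

    rows: unit-length object vectors (the rows of A).
    x:    unit-length query vector.
    """
    hits = []
    for i, d in enumerate(rows):
        cosine = sum(a * b for a, b in zip(x, d))  # x^T d = cos(x, d)
        if cosine > threshold:
            hits.append(i)
    return hits

# Toy database: three objects, two features (hypothetical values).
A = [normalize(r) for r in ([1.0, 0.0], [1.0, 1.0], [0.0, 1.0])]
x = normalize([2.0, 0.1])
print(similar_objects(A, x, threshold=0.7))
```

Because only the comparison against the threshold matters, a small error in each inner product is tolerable as long as it does not push the value across the threshold.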

8. Using an approximation to A Assume that A’ = CUR is an approximation to A, such that A’ is stored efficiently (e.g. in RAM). Given a query vector x, instead of computing A · x, compute A’ · x to identify its nearest neighbors. The CUR algorithm guarantees a bound on the worst case choice of x.

9. Approximating A efficiently • Given a large m-by-n matrix A (stored on disk), compute an approximation A’ to A such that: • A’ can be stored in O(m+n) space, after making two passes through the entire matrix A, and using O(m+n) additional space and time. • A’ satisfies (with high probability) $\|A - A'\|_2^2 < \epsilon \|A\|_F^2$ (and a similar bound with respect to the Frobenius norm).

10. Describing A’ = C · U · R • C consists of $c = \Theta(1/\epsilon^2)$ columns of A and R consists of $r = \Theta(1/\epsilon^2)$ rows of A (the “description length” of A’ is O(m+n)). • C and R are created using adaptive sampling.
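One way to use the factored form is to apply C, U and R to a query vector one after another, so that the m-by-n product is never formed. The sketch below assumes dense list-of-lists factors; `matvec` and `sketch_query` are hypothetical helper names, not part of the talk.

```python
def matvec(M, v):
    """Dense matrix-vector product for a list-of-rows matrix M."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def sketch_query(C, U, R, x):
    """Compute (C U R) x right to left.

    C is m-by-c, U is c-by-r, R is r-by-n, so the cost is
    O(r*n + c*r + m*c) arithmetic operations instead of O(m*n).
    """
    return matvec(C, matvec(U, matvec(R, x)))

# Tiny made-up factors, only to show the shapes involved.
C = [[1, 0], [0, 1], [1, 1]]   # 3-by-2
U = [[2, 0], [0, 3]]           # 2-by-2
R = [[1, 0, 0], [0, 1, 0]]     # 2-by-3
print(sketch_query(C, U, R, [1, 2, 3]))
```

With c and r constant, the three products together take O(m+n) time, which matches the O(m+n) storage figure on the slide.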

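11. Creating C and R • Create C (R) by performing c (r) i.i.d. trials. • In each trial, pick a column (row) of A with probability proportional to its length (in the cited papers, its squared Euclidean length: $p_i = |A^{(i)}|^2 / \|A\|_F^2$ for columns, and analogously for rows). • Include $A^{(i)}$ ($A_{(i)}$) as a column of C (row of R). • [$A^{(i)}$ ($A_{(i)}$) is the i-th column (row) of A]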
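The i.i.d. trials for C can be sketched as follows. The sketch assumes squared-length probabilities, as used in the cited Drineas–Kannan(–Mahoney) papers, and a small dense matrix held in memory; the function names are hypothetical, and any rescaling of the sampled columns is left out.

```python
import random

def column_probabilities(A):
    """p_j = |A^(j)|^2 / ||A||_F^2 for each column j of a dense matrix A."""
    n = len(A[0])
    sq = [sum(row[j] ** 2 for row in A) for j in range(n)]
    total = sum(sq)  # squared Frobenius norm of A
    return [s / total for s in sq]

def sample_columns(A, c, rng):
    """Run c i.i.d. trials (with replacement) and build C from the picks."""
    p = column_probabilities(A)
    picked = rng.choices(range(len(p)), weights=p, k=c)
    C = [[row[j] for j in picked] for row in A]  # m-by-c
    return picked, C

A = [[3.0, 0.0, 1.0],
     [4.0, 0.0, 1.0]]
picked, C = sample_columns(A, c=4, rng=random.Random(0))
print(picked, C)
```

Rows are sampled the same way after swapping the roles of rows and columns. A zero column has probability 0 and is never picked.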

12. Singular Value Decomposition (SVD) $A = U S V^T$. U (V): orthogonal matrix containing the left (right) singular vectors of A. S: diagonal matrix containing the singular values of A. • Exact computation of the SVD takes $O(\min(mn^2, m^2n))$ time. • The top few singular vectors/values can be approximated faster (Lanczos/Arnoldi methods).

13. Rank k approximations ($A_k$) $A_k = U_k S_k V_k^T$. $U_k$ ($V_k$): matrix with orthonormal columns containing the top k left (right) singular vectors of A. $S_k$: diagonal matrix containing the top k singular values of A. $A_k$ is a matrix of rank k such that $\|A - A_k\|_{2,F}$ is minimized over all rank k matrices!
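For reference, the optimality of the truncated SVD can be stated in terms of the singular values $\sigma_1 \ge \sigma_2 \ge \dots$ of A (a standard fact, the Eckart–Young theorem, stated here in the slide's notation):

$$
A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T, \qquad
\|A - A_k\|_2 = \sigma_{k+1}, \qquad
\|A - A_k\|_F^2 = \sum_{i > k} \sigma_i^2 ,
$$

where $u_i$ and $v_i$ are the i-th left and right singular vectors. No rank k matrix achieves a smaller error in either norm.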

14. The CUR algorithm • Input: • The matrix A in “sparse unordered representation”. • (e.g. non-zero entries of A are presented as triples $(i, j, A_{ij})$ in any order) • Positive integers c < n and r < m (number of columns/rows that we pick). • Positive integer k (the rank of A’ = CUR). Note: Since A’ is of rank k, $\|A - A'\|_{2,F} \ge \|A - A_k\|_{2,F}$. We choose a k such that $\|A - A_k\|_{2,F}$ is small. As k grows, for the Frobenius norm approximation, c and r grow as well.
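A sparse unordered input fits the two-pass model: a first pass over the triples accumulates the squared column lengths, and a second pass collects the picked columns. The sketch below is an assumption-laden illustration (0-based indices, squared-length sampling, a re-readable list standing in for the file on disk, and the hypothetical name `two_pass_column_sample`); it covers the columns only.

```python
import random
from collections import defaultdict

def two_pass_column_sample(triples, n, c, rng):
    """triples: re-readable sequence of (i, j, a_ij) in arbitrary order."""
    # Pass 1: accumulate squared column lengths in O(n) memory.
    sq = [0.0] * n
    for _, j, a in triples:
        sq[j] += a * a
    # c i.i.d. trials, with replacement, weighted by squared length.
    picked = rng.choices(range(n), weights=sq, k=c)
    # Pass 2: keep only the non-zeros that fall in a picked column.
    wanted = set(picked)
    cols = defaultdict(dict)
    for i, j, a in triples:
        if j in wanted:
            cols[j][i] = a
    return picked, {j: cols[j] for j in wanted}

triples = [(1, 2, 1.0), (0, 0, 3.0), (1, 0, 4.0), (0, 2, 1.0)]
picked, cols = two_pass_column_sample(triples, n=3, c=5, rng=random.Random(1))
print(picked, cols)
```

Rows can be handled in the same two passes by keeping a second array of squared row lengths.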

15. Computing U • Intuition: • The CUR algorithm essentially expresses every row of the matrix A as a linear combination of a small subset of the rows of A. • This small subset consists of the rows in R. • Given a row of A, say $A_{(i)}$, the algorithm computes the “best fit” for the row $A_{(i)}$ using the rows in R as the basis. Notice that only c = O(1) elements of the i-th row are given as input. However, a vector of coefficients u can still be computed.
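The “best fit” idea can be illustrated with an ordinary least-squares fit restricted to the c sampled column positions. This is a simplified stand-in, not the authors' construction of U: it skips any rescaling and rank-k truncation, assumes the small Gram matrix is non-singular, and uses hypothetical names (`solve`, `fit_row`).

```python
def solve(G, b):
    """Solve G u = b by Gaussian elimination with partial pivoting."""
    r = len(b)
    M = [list(G[i]) + [b[i]] for i in range(r)]  # augmented matrix
    for col in range(r):
        piv = max(range(col, r), key=lambda i: abs(M[i][col]))
        M[col], M[piv] = M[piv], M[col]
        for i in range(col + 1, r):
            f = M[i][col] / M[col][col]
            for k in range(col, r + 1):
                M[i][k] -= f * M[col][k]
    u = [0.0] * r
    for i in reversed(range(r)):
        s = sum(M[i][k] * u[k] for k in range(i + 1, r))
        u[i] = (M[i][r] - s) / M[i][i]
    return u

def fit_row(R, cols, row_entries):
    """Least-squares u with sum_t u[t] * R[t][j] ~ row_entries[j], j in cols.

    R:           the sampled rows (r lists of length n).
    cols:        the c sampled column indices.
    row_entries: dict j -> A[i][j], known only for j in cols.
    """
    RJ = [[row[j] for j in cols] for row in R]  # r-by-c restriction of R
    a = [row_entries[j] for j in cols]
    G = [[sum(x * y for x, y in zip(p, q)) for q in RJ] for p in RJ]
    b = [sum(x * y for x, y in zip(p, a)) for p in RJ]
    return solve(G, b)  # normal equations: (RJ RJ^T) u = RJ a

R = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0]]
# Entries of the row 2*R[0] - R[1], seen only at columns 0, 1 and 3.
u = fit_row(R, cols=[0, 1, 3], row_entries={0: 2.0, 1: -1.0, 3: -1.0})
print(u)
```

In this toy case the row lies exactly in the span of R, so the coefficients recovered from three entries also reproduce the unseen entry in column 2.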

16. Creating U. Running time: Computing the elements of U amounts to a pseudo-inverse computation. It can be done in $O(c^2 m + c^3 + r^3)$ time. Thus, for constant c and r, U can be computed in O(m) time. Note on the rank of U and CUR: The rank of U (by construction) is k. Thus, the rank of A’ = CUR is at most k.

17. Error bounds (Frobenius norm) Assume $A_k$ is the “best” rank k approximation to A (through SVD). Then We need to pick $O(k/\epsilon^2)$ rows and $O(k/\epsilon^2)$ columns.

18. Error bounds (2-norm) Assume $A_k$ is the “best” rank k approximation to A (through SVD). Then since $\|A - A_k\|_2^2 \le \|A\|_F^2/(k+1)$. We need to pick $O(1/\epsilon^2)$ rows and $O(1/\epsilon^2)$ columns.
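The inequality used on this slide follows from the ordering of the singular values $\sigma_1 \ge \sigma_2 \ge \dots$ of A together with $\|A - A_k\|_2 = \sigma_{k+1}$:

$$
(k+1)\,\sigma_{k+1}^2 \;\le\; \sum_{i=1}^{k+1} \sigma_i^2 \;\le\; \sum_{i} \sigma_i^2 \;=\; \|A\|_F^2
\quad\Longrightarrow\quad
\|A - A_k\|_2^2 = \sigma_{k+1}^2 \;\le\; \frac{\|A\|_F^2}{k+1}.
$$

So the best rank k error in the 2-norm is already at most $\epsilon \|A\|_F^2$ once $k + 1 \ge 1/\epsilon$.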

19. Can we do better? Lemma: For any $\epsilon < 1$, there is a set of $\Omega(\epsilon^{-n})$ n-by-n matrices, such that for two distinct matrices A, B in the set, $\|A - B\|_2^2 > (\epsilon/20)\|A\|_F^2$. Lower bound Theorem: Any algorithm which approximates these matrices must output a different “sketch” for each one, thus it must output at least $\Omega(n \log(1/\epsilon))$ bits. Tighter lower bounds, matching almost exactly with our upper bounds, have been obtained by Ziv Bar-Yossef, STOC ’03.
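The step from the lemma to the theorem is a counting argument. Writing $\epsilon$ for the slide's accuracy parameter and $N$ for the number of matrices in the set, so that $N \ge c_0\,\epsilon^{-n}$ for some constant $c_0 > 0$ (the name $c_0$ is introduced here only for the calculation), distinguishing $N$ different sketches needs at least

$$
\log_2 N \;\ge\; \log_2\!\left(c_0\,\epsilon^{-n}\right) \;=\; n \log_2\frac{1}{\epsilon} + \log_2 c_0 \;=\; \Omega\!\left(n \log\frac{1}{\epsilon}\right)
$$

bits in the worst case.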

20. A different technique • (D. Achlioptas and F. McSherry, ’01 and ’03) • The Algorithm in 2 lines: • To approximate a matrix A, keep a few elements of the matrix (instead of rows or columns) and zero out the remaining elements. • Compute a rank k approximation to this sparse matrix (using Lanczos methods). • Comparing the two techniques: • The error bound w.r.t. the 2-norm is better, while the error bound w.r.t. the Frobenius norm is the same. • (weighted sampling is used - heavier elements are kept with higher probabilities) • Running times are the same.
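The first line of the element-sampling algorithm can be sketched as below. The keep-probabilities (proportional to the squared entry, capped at 1) and the division of kept entries by their probability are assumptions chosen for illustration, not the exact scheme of the cited papers; `sparsify` and the budget parameter `s` are hypothetical names. The rank k step on the resulting sparse matrix is not shown.

```python
import random

def sparsify(A, s, rng):
    """Keep a_ij with probability p_ij = min(1, s * a_ij^2 / ||A||_F^2).

    Entries that are not kept become 0. Kept entries are divided by p_ij,
    so the result equals A in expectation (an assumption of this sketch).
    """
    fro2 = sum(a * a for row in A for a in row)  # squared Frobenius norm
    out = []
    for row in A:
        new_row = []
        for a in row:
            p = min(1.0, s * a * a / fro2)
            keep = p > 0 and rng.random() < p
            new_row.append(a / p if keep else 0.0)
        out.append(new_row)
    return out

A = [[10.0, 0.1],
     [0.0, 5.0]]
print(sparsify(A, s=1.0, rng=random.Random(0)))
```

Heavier entries get a larger p and so survive more often, which is the weighted sampling the slide refers to. With a very large `s` every non-zero entry is kept and the matrix is returned unchanged.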

21. Conclusions • Given the small “sketch” of a matrix A, a “friendly user” can • reconstruct a (provably accurate) approximation A’ to the original matrix A and employ any algorithms that he would use to process the original matrix A on A’, • use the Frobenius and spectral norm bounds for A-A’ to argue about the approximation error of his algorithms. • How do we ensure privacy for the object-vectors (rows) of A that are revealed as part of R? • Are such sketches offering some privacy preserving guarantees, under some (relaxed) definition of privacy?