
TUTORIAL: Randomized Algorithms for Matrices and Massive Data Sets


Presentation Transcript


  1. TUTORIAL: Randomized Algorithms for Matrices and Massive Data Sets. Michael W. Mahoney, Yale University, Dept. of Mathematics (now at Yahoo! Research). Petros Drineas, RPI, Dept. of Computer Science. Most recent copy available at: http://www.cs.yale.edu/homes/mmahoney and http://www.cs.rpi.edu/~drinep

  2. Randomized Linear Algebra Algorithms • Goal: To develop and analyze (fast) Monte Carlo algorithms for performing useful computations on large matrices and tensors. • Matrix Multiplication • Computation of the Singular Value Decomposition • Computation of the CUR Decomposition • Testing Feasibility of Linear Programs • Least Squares Approximation • Tensor computations: SVD generalizations • Tensor computations: CUR generalization • Such computations generally require time which is superlinear in the number of nonzero elements of the matrix/tensor, e.g., O(n^3) for n × n matrices.

  3. Randomized Linear Algebra Algorithms Motivation: • (Algorithmic) To speed up computations in applications where extremely large data sets are modeled by matrices/tensors and, e.g., O(n^3) time is not an option. • (Algorithmic) To reduce the memory requirements in applications where the data sets are modeled by matrices/tensors and storing the whole data is not an option, because the data sets are extremely large. • (Equivalent to the above) To provide some analysis of the accuracy of simple algorithms when only a small sample of the full data set is available. • (Structural) To reveal novel structural properties of the datasets, given sufficient computational time.

  4. Example: the CUR decomposition [diagram: A ≈ C·U·R, with C a few (O(1)) columns of A, R a few (O(1)) rows of A, and U a carefully chosen small matrix] Goal: make (some norm of) A - CUR small. Why? After making two passes over A, we can compute provably good C, U, and R and store them (“sketch”) instead of A: O(m+n) vs. O(n^2) space. Why? Given a sample consisting of a few columns (C) and a few rows (R) of A, we can compute U and “reconstruct” A as CUR; if the sampling probabilities are not “too bad”, we get provably good accuracy. Why? Given sufficient time, we can find C, U and R such that A – CUR is “very” small. This might lead to better understanding of the data.

  5. Applications of such algorithms • Matrices arise, e.g., since m objects (documents, genomes, images, web pages), each with n features, may be represented by an m x n matrix A. • Covariance Matrices • Latent Semantic Indexing • DNA Microarray Data • Eigenfaces and Image Recognition • Similarity Queries • Matrix Reconstruction • LOTS of other data applications!! • More generally, • Linear and Nonlinear Programming Applications • Design of Approximation Algorithms • Statistical Learning Theory Applications

  6. Overview (1/3) • Data Streaming Models and Random Sampling • Matrix Multiplication • Feasibility testing of Linear Programs • Singular Value Decomposition • Sampling rows/columns to approximate singular vectors/values • Sampling elements to approximate singular vectors/values • CUR Matrix Decomposition • Lower bounds on sampling complexities

  7. Overview (2/3) • Applications of Matrix CUR • Designing approximation algorithms • Data mining • Microarray data • Recommendation Systems • Tensor-based data sets • Tensor-CUR • Applications of Tensor-CUR • Hyperspectral data • Recommendation systems

  8. Overview (3/3) • Kernel-CUR and the Nystrom Method • Regression problems • Least squares problems • The CUR algorithm revisited • A general bound • Iterative sampling of rows/columns • “Optimal” choices of rows/columns • Application: finding ht-SNPs • Conclusions & open problems

  9. Computation on Massive Data Sets Data are too large to fit into main memory; they are either not stored or are stored in external memory. Algorithms that compute on data streams examine the stream, keep a small “sketch” of the data, and perform computations on the sketch. Algorithms are (typically) randomized and approximate. • Evaluate performance by measures such as: • the time to process an item • the number of passes over the data • the additional workspace and time • the quality of the approximation returned • Munro & Paterson ’78: studied “the relation between the amount of internal storage available and the number of passes required to select the k-th highest of n inputs.”

  10. The Pass Efficient Model • Motivation: Amount of disk/tape space has increased enormously; RAM and computing speeds have increased less rapidly. • Can store large amounts of data, but • Cannot process these data with traditional algorithms. • In the Pass-Efficient Model: • Data are assumed to be stored on disk/tape. • Algorithm has access to the data via a pass over the data. • Algorithm is allowed additional RAM space and additional computation time. • An algorithm is pass-efficient if it uses a small constant number of passes and sublinear additional time and space to compute a description of the solution. Note: If data are an m x n matrix A, then algorithms which require additional time and space that is O(m+n) or O(1) are pass-efficient.

  11. Random Sampling Random sampling is used to estimate some parameter defined over a very large set by looking at only a very small subset. Uniform Sampling: every piece of data is equally likely to be picked. • Advantages: • “Coins” can be tossed “blindly.” • Even if the number of data elements is not known in advance, can select one element u.a.r. in one pass over the data. • Much recent work on quantities that may be approximated with a small uniformly drawn sample. • Disadvantages: • Many quantities cannot be approximated well with a small random sample that is uniformly drawn. • E.g., estimate the sum of N numbers, where one number is very large and the others very small.
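To make that last disadvantage concrete, here is a minimal illustrative sketch (not part of the original slides; all names are made up) contrasting uniform sampling with the norm-weighted, nonuniform sampling discussed on the next slide, for estimating a sum dominated by one large element:

```python
# Estimating the sum of N numbers by sampling: uniform vs. importance sampling.
import numpy as np

rng = np.random.default_rng(0)
x = np.ones(10_000)          # many small values ...
x[0] = 1e6                   # ... and one very large value
true_sum = x.sum()
s = 100                      # sample size

# Uniform sampling: each index equally likely; estimator is N * mean(sample).
idx_u = rng.integers(0, x.size, size=s)
est_uniform = x.size * x[idx_u].mean()

# Nonuniform (importance) sampling: pick index i with probability p_i ~ |x_i|,
# and reweight each sampled value by 1/(s * p_i) so the estimator stays unbiased.
p = np.abs(x) / np.abs(x).sum()
idx_n = rng.choice(x.size, size=s, p=p)
est_nonuniform = np.sum(x[idx_n] / (s * p[idx_n]))

print(true_sum, est_uniform, est_nonuniform)
```

With high probability the uniform estimate misses the dominant element entirely, while the importance-sampled estimate is close to the true sum.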

  12. Random Sampling (cont’d) • Nonuniform Sampling: • Advantages: • Can obtain much more generality and big gains, e.g., can approximately solve problems in sparse as well as dense matrices. • Smaller sampling complexity for similar bounds. • Disadvantages: • Must determine the nonuniform probabilities; multiple passes over the data usually needed. Randomized Algorithms for Linear Algebra problems: A “sketch” consisting of a small number of judiciously chosen and randomly sampled rows and columns (or elements) is sufficient for provably rapid and efficient approximation of many matrix operations.

  13. Overview (1/3) • Data Streaming Models and Random Sampling • Matrix Multiplication • Feasibility testing of Linear Programs • Singular Value Decomposition • Sampling rows/columns to approximate singular vectors/values • Sampling elements to approximate singular vectors/values • CUR Matrix Decomposition • Lower bounds on sampling complexities

  14. Approximating Matrix Multiplication … (D. & Kannan FOCS ’01, and D., Kannan, & M. TR ’04, SICOMP ’05) Problem Statement Given an m-by-n matrix A and an n-by-p matrix B, approximate the product A·B, OR, equivalently, approximate the sum of n rank-one matrices: A·B = Σi=1..n A(i) B(i), where A(i) is the i-th column of A and B(i) is the i-th row of B. Each term in the summation is a rank-one matrix.

  15. …by random sampling • Algorithm • Fix a set of probabilities pi, i=1…n, summing up to 1. • For t=1 up to s, • set jt = i, where Pr(jt = i) = pi; • (Pick s terms of the sum, with replacement, with respect to the pi.) • Approximate AB by the sum of the s sampled terms, after rescaling: AB ≈ (1/s) Σt=1..s A(jt) B(jt) / pjt.

  16. Random sampling (cont’d) [figure: the rank-one sum with only the sampled terms j1, j2, …, js kept, each term being the jt-th column of A times the jt-th row of B]

  17. The algorithm (matrix notation) • Pick s columns of A to form an m-by-s matrix C and the corresponding s rows of B to form an s-by-p matrix R (then discard A and B). • Approximate A · B by C · R.

  18. The algorithm (matrix notation, cont’d) • Create C and R by performing s i.i.d. trials, with replacement. • For t=1 up to s, pick a column A(jt) and a row B(jt) with probability Pr(jt = i) = pi = |A(i)| |B(i)| / Σk |A(k)| |B(k)| (|·| denotes the Euclidean norm of the column/row). • Include A(jt)/(s pjt)^(1/2) as a column of C, and B(jt)/(s pjt)^(1/2) as a row of R.
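A minimal numpy sketch of slides 15–18 (illustrative code under the sampling probabilities above, not the authors' implementation):

```python
# Sampling-based approximate matrix multiplication: A @ B ~ C @ R.
import numpy as np

def approx_matmul(A, B, s, rng):
    """Approximate A @ B by C @ R, where C holds s rescaled columns of A
    and R holds the corresponding s rescaled rows of B."""
    n = A.shape[1]
    # p_i proportional to |A^(i)| * |B_(i)| -- the variance-minimizing choice.
    p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = p / p.sum()
    idx = rng.choice(n, size=s, p=p)            # s i.i.d. trials, with replacement
    scale = 1.0 / np.sqrt(s * p[idx])
    C = A[:, idx] * scale                       # columns A^(jt) / sqrt(s * p_jt)
    R = B[idx, :] * scale[:, None]              # rows    B_(jt) / sqrt(s * p_jt)
    return C @ R

rng = np.random.default_rng(0)
A = np.abs(rng.standard_normal((200, 500)))     # nonnegative entries => little cancellation
B = np.abs(rng.standard_normal((500, 100)))
err = np.linalg.norm(A @ B - approx_matmul(A, B, 100, rng)) / np.linalg.norm(A @ B)
print(f"relative Frobenius error: {err:.3f}")
```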

  19. Simple Lemmas • The input matrices are given in “sparse unordered representation”; e.g., their non-zero entries are presented as triples (i, j, Aij) in any order. • The expectation of CR (element-wise) is AB. • Our nonuniform sampling minimizes the variance of the estimator. • It is easy to implement the sampling in two passes. • If the matrices are dense, the algorithm runs in O(smp) time, instead of O(nmp) time. • It requires O(sm+sp) RAM space. • It does not tamper with the sparsity of the matrices.

  20. Error Bounds For the above algorithm, the error ||AB - CR||F can be bounded (a reconstruction of the bound is given below). If ||AB||F = Ω(||A||F ||B||F), then the above bound is a relative error bound. This happens if there is “not much cancellation” in the multiplication.
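The bound itself is missing from this transcript; a hedged reconstruction, following the cited D., Kannan, & M. analysis with s samples and the probabilities of slide 18, is:

```latex
\mathbb{E}\bigl[\,\|AB - CR\|_F\,\bigr] \;\le\; \frac{1}{\sqrt{s}}\;\|A\|_F\,\|B\|_F .
```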

  21. Error Bounds (tight concentration) For the above algorithm, with probability at least 1-δ, a corresponding bound holds without the expectation. Notice that removing the expectation (via a martingale argument) costs an extra log(1/δ) factor. (Markov’s inequality would also remove the expectation, introducing an extra 1/δ factor.)

  22. Special case: B = AT If B = AT, then the sampling probabilities are proportional to the squared Euclidean norms of the columns of A. Also, R = CT, and the error bounds take the corresponding form (see below).
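The formulas on this slide are also missing from the transcript; a hedged reconstruction, consistent with the squared-norm probabilities used on slide 37, is:

```latex
p_i \;=\; \frac{|A^{(i)}|^2}{\|A\|_F^2},
\qquad
\mathbb{E}\bigl[\,\|AA^{T} - CC^{T}\|_F\,\bigr] \;\le\; \frac{1}{\sqrt{s}}\;\|A\|_F^2 .
```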

  23. Special case: B = AT (cont’d) (Rudelson & Vershynin ’04, Vershynin ’04) An improvement of the spectral norm bound for the special case B = AT. • Uses a result of M. Rudelson for random vectors in isotropic position. • Tight concentration results can be proven using Talagrand’s theory. • The sampling procedure is slightly different; s columns/rows are kept in expectation, i.e., column i is picked independently with a probability proportional to its squared norm.

  24. Empirical evaluation: setup (Data from D. Lewis & E. Cohen, SODA ’97 & J. of Algorithms ’99) Database Dense document-concept matrix A with 5000 documents and 320 concepts. Experiment Our goal was to identify all document-document matches, e.g. compute AAT and identify all entries larger than some threshold τ (proximity problem). The algorithm • Approximate AAT (using our method) by CCT. • Find all entries of CCT that are larger than τ - ε, ε > 0 (“candidate entries”). • Deterministically compute the dot products for the “candidate entries”. • Return the ones that are larger than τ as matches.
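A short illustrative sketch of this pipeline (function and parameter names are assumptions, not the original experimental code):

```python
# Proximity problem: approximate AA^T by CC^T, screen with a relaxed threshold
# tau - eps, then verify the surviving candidate pairs exactly.
import numpy as np

def document_matches(A, tau, eps, s=100, rng=np.random.default_rng(0)):
    # Sample s rescaled columns of A (squared-norm probabilities) so that CC^T ~ AA^T.
    p = np.linalg.norm(A, axis=0) ** 2
    p = p / p.sum()
    idx = rng.choice(A.shape[1], size=s, p=p)
    C = A[:, idx] / np.sqrt(s * p[idx])
    approx = C @ C.T
    # Candidate document pairs: approximate dot product above the relaxed threshold.
    candidates = np.argwhere(np.triu(approx > tau - eps, k=1))
    # Deterministically compute the exact dot products for the candidates only.
    return [(i, j) for i, j in candidates if A[i] @ A[j] > tau]

A = np.random.default_rng(1).random((1000, 320))
A = A / np.linalg.norm(A, axis=1, keepdims=True)   # unit-norm rows => dot products in [0,1]
print(len(document_matches(A, tau=0.85, eps=0.05)))
```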

  25. Empirical evaluation: results (threshold τ = 0.85)

  26. Empirical evaluation: results (threshold τ = 0.75)

  27. Experiments: uniform sampling

  28. Random sampling of Linear Programs (D., Kannan, & M. TR ‘04 & STACS ’05) Let P(i) denote the i-th column of the r-by-n matrix P and suppose that a Linear Program with constraint matrix P is feasible. Form Q, a random non-uniform subset of {1…n}, with |Q|=q; Q is formed in q i.i.d. trials. With probability at least 1-δ, the LP restricted to the sampled variables in Q (with a suitably relaxed right-hand side) is feasible as well.

  29. The picture … Sample a few variables! If the original LP is feasible … then the sampled LP is feasible (with high probability).

  30. Another picture … The i-th constraint is feasible. Discarding variables “perturbs” the constraint; relaxing the constraint guarantees feasibility. The bound on δb emerges by applying the Matrix Multiplication result to the matrix-vector product Px.

  31. A converse LP sampling lemma Let P(i) denote the i-th column of the r-by-n matrix P and suppose that a Linear Program with constraint matrix P is infeasible. Form Q, a random non-uniform subset of {1…n}, with |Q|=q; Q is formed in q i.i.d. trials. With probability at least 1-δ, the LP restricted to the sampled variables in Q (with a suitably tightened right-hand side) is infeasible as well.

  32. The picture … Sample a few variables! If the original LP is infeasible … then the sampled LP is infeasible (with high probability).

  33. Another picture … The i-th constraint is infeasible. Discarding variables “perturbs” the constraint; tightening the constraint guarantees infeasibility. The bound on δb emerges by applying the Matrix Multiplication result to the matrix-vector product Px.

  34. Overview (1/3) • Data Streaming Models and Random Sampling • Matrix Multiplication • Feasibility testing of Linear Programs • Singular Value Decomposition • Sampling rows/columns to approximate singular vectors/values • Sampling elements to approximate singular vectors/values • CUR Matrix Decomposition • Lower bounds on sampling complexities

  35. Singular Value Decomposition (SVD) A = U S VT, where U (V) is an orthogonal matrix containing the left (right) singular vectors of A, and S is a diagonal matrix containing the singular values of A. • Exact computation of the SVD takes O(min{mn^2, m^2n}) time. • The top few singular vectors/values can be approximated faster (Lanczos/Arnoldi methods).

  36. Rank k approximations (Ak) Ak = Uk Sk VkT, where Uk (Vk) is an orthogonal matrix containing the top k left (right) singular vectors of A, and Sk is a diagonal matrix containing the top k singular values of A. Also, Ak = Uk UkT A. Ak is a matrix of rank k such that ||A-Ak||2,F is minimized over all rank k matrices! This property is very useful in the context of Principal Component Analysis.
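A short numpy illustration of Ak (not from the slides; the test matrix is arbitrary):

```python
# Rank-k approximation via the truncated SVD: A_k = U_k S_k V_k^T = U_k U_k^T A.
import numpy as np

A = np.random.randn(100, 80)
k = 10
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.allclose(Ak, U[:, :k] @ U[:, :k].T @ A))   # the two expressions agree
```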

  37. Approximating SVD in O(n) time • (Frieze, Kannan & Vempala FOCS ‘98, D., Frieze, Kannan, Vempala & Vinay SODA ’99, JML ’04, D., Kannan, & M. TR ’04, SICOMP ’05) • Given: m x n matrix A • Sample c columns from A and rescale to form the m x c matrix C. • Compute the m x k matrix Hk of the top k left singular vectors of C. Structural Theorem: For any probabilities and number of columns: ||A-HkHkTA||2,F2 ≤ ||A-Ak||2,F2 + 2√k ||AAT-CCT||F Algorithmic Theorem: If pi = |A(i)|2/||A||F2 and c ≥ 4η2k/ε2 (η a constant depending on the failure probability), then: ||A-HkHkTA||2,F2 ≤ ||A-Ak||2,F2 + ε||A||F2. Proof: via the matrix multiplication theorem and matrix perturbation theory.
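An illustrative numpy sketch of this column-sampling procedure (the function name and test sizes are assumptions, not the authors' code):

```python
# Approximate the top-k left singular vectors of A from a column sketch C,
# sampling columns with probability p_i = |A^(i)|^2 / ||A||_F^2.
import numpy as np

def approx_left_singular_vectors(A, k, c, rng=np.random.default_rng(0)):
    """Sample c rescaled columns of A into C and return H_k,
    the top-k left singular vectors of C."""
    p = np.linalg.norm(A, axis=0) ** 2 / np.linalg.norm(A) ** 2
    idx = rng.choice(A.shape[1], size=c, p=p)
    C = A[:, idx] / np.sqrt(c * p[idx])          # m-by-c sketch
    Hk, _, _ = np.linalg.svd(C, full_matrices=False)
    return Hk[:, :k]

A = np.random.randn(512, 1024)
Hk = approx_left_singular_vectors(A, k=20, c=200)
print(np.linalg.norm(A - Hk @ Hk.T @ A))         # compare to ||A - A_k||_F
```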

  38. Example of randomized SVD [figure: the “baboon” test image (baboondet.eps) as the original matrix A, and the matrix C obtained after sampling columns] Original matrix; after sampling columns. Compute the top k left singular vectors of the matrix C and store them in the 512-by-k matrix Hk.

  39. Example of randomized SVD (cont’d) [figure: the original image A next to its approximation HkHkTA] A and HkHkTA are close.

  40. Element-wise sampling (Achlioptas & McSherry, STOC ’01, JACM ’05) • The Algorithm in 2 lines: • To approximate a matrix A, keep a few elements of the matrix (instead of rows or columns) and zero out the remaining elements. • Compute a rank k approximation to this sparse matrix (using Lanczos methods). More details: Let pij ∈ [0,1] for all i,j. Create the matrix S from A such that: Sij = Aij/pij with probability pij, and Sij = 0 otherwise. ||A-S||2 is bounded ⇒ (i) the singular values of A and S are close, and (ii, under additional assumptions) the top k left (right) singular vectors of S span a subspace that is close to the subspace spanned by the top k left (right) singular vectors of A.

  41. How to use it • Approximating singular values fast: • Zero out (a large number of) elements of A, scale the remaining ones appropriately. • Compute the singular values of the resulting sparse matrix using iterative techniques. • (Good choice for pij: pij = s·Aij2/Σi,j Aij2, where s denotes the expected number of elements that we seek to keep in S.) • Note: Each element is kept or discarded independently of the others. • Similar ideas have been used to: • explain the success of Latent Semantic Indexing (LSI) (Papadimitriou, Raghavan, Tamaki, Vempala, PODS ’98 & Azar, Fiat, Karlin, McSherry, Saia STOC ’01) • design recommendation systems (Azar, Fiat, Karlin, McSherry, Saia STOC ’01) • speed up kernel computations (Achlioptas, McSherry, and Schölkopf, NIPS ’02)
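A rough illustrative sketch of element-wise sampling with the pij above (the cap of pij at 1 and the scipy svds call are implementation choices of this sketch, not prescribed by the slides):

```python
# Element-wise sparsification: keep entry (i,j) with probability p_ij, rescale
# kept entries by 1/p_ij (so E[S] = A), then run an iterative sparse SVD.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
A = np.random.randn(500, 400)
s = 20_000                                        # expected number of kept entries

P = np.minimum(1.0, s * A**2 / np.sum(A**2))      # keep probabilities, capped at 1
mask = rng.random(A.shape) < P
S = np.where(mask, A / np.maximum(P, 1e-12), 0.0) # rescaled, mostly-zero matrix

# Top-k singular values via iterative (Lanczos-type) methods on the sparse matrix.
k = 10
_, sigma_S, _ = svds(csr_matrix(S), k=k)
_, sigma_A, _ = svds(csr_matrix(A), k=k)
print(np.sort(sigma_S)[::-1])
print(np.sort(sigma_A)[::-1])
```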

  42. Element-wise vs. row/column sampling • Quite different techniques! • Row/column sampling preserves structural properties of the matrices. • Element-wise sampling explains how adding noise and/or quantizing the elements of a matrix perturbs its singular values/vectors. • The two techniques could be complementary! • Some similarities and differences: • Similar error bounds. • Element-wise sampling is doable in one pass! • Running time of element-wise sampling depends on the speed of, e.g., Arnoldi methods.

  43. Overview (1/3) • Data Streaming Models and Random Sampling • Matrix Multiplication • Feasibility testing of Linear Programs • Singular Value Decomposition • Sampling rows/columns to approximate singular vectors/values • Sampling elements to approximate singular vectors/values • CUR Matrix Decomposition • Lower bounds on sampling complexities

  44. A novel CUR matrix decomposition (D. & Kannan, SODA ’03, D., Kannan, & M. TR ’04, SICOMP ’05) Create an approximation to the original matrix of the following form: A ≈ C·U·R [diagram: C consists of O(1) columns of A, R consists of O(1) rows of A, and U is a carefully chosen small matrix] An application: Given a query vector x, instead of computing A · x, compute CUR · x to identify its nearest neighbors.

  45. The CUR decomposition Given a large m-by-n matrix A (stored on disk), compute a decomposition CUR of A such that: • C consists of c = O(k/ε2) columns of A. • R consists of r = O(k/ε2) rows of A. • C (R) is created using importance sampling, e.g. columns (rows) are picked in i.i.d. trials with probabilities proportional to their squared Euclidean norms.

  46. The CUR decomposition (cont’d) • Given a large m-by-n matrix A (stored on disk), compute a decomposition CUR of A such that: • C, U, R can be stored in O(m+n) space, after making two passes through the entire matrix A, using O(m+n) additional space and time. • The product CUR satisfies (with high probability) the error bounds given on slide 49.

  47. Computing U • Intuition (which can be formalized): • The CUR algorithm essentially expresses every row of the matrix A as a linear combination of a small subset of the rows of A. • This small subset consists of the rows in R. • Given a row of A – say A(i) – the algorithm computes a good fit for the row A(i) using the rows in R as the basis, by approximately solving a least-squares problem. Notice that only c = O(1) elements of the i-th row are given as input. However, a vector of coefficients u can still be computed.

  48. Computing U (cont’d) Given c elements of A(i), the algorithm computes a good fit for the row A(i) using the rows in R as the basis, by approximately solving a least-squares problem. However, our CUR decomposition approximates the vectors u instead of exactly computing them. Later in this talk: new error bounds using the optimal coefficients.
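An illustrative CUR sketch in the spirit of slides 45–48 (squared-norm sampling for C and R; here U is taken to be the pseudoinverse of the rescaled intersection block, which is a common simplification rather than the exact construction analyzed in the papers):

```python
# CUR-type sketch: sample columns (C) and rows (R) of A, compute a small U
# from the sampled intersection only, and approximate A by C @ U @ R.
import numpy as np

def cur_sketch(A, c, r, rng=np.random.default_rng(0)):
    # Sample and rescale c columns of A -> C  (p_i = |A^(i)|^2 / ||A||_F^2).
    pc = np.linalg.norm(A, axis=0) ** 2 / np.linalg.norm(A) ** 2
    cols = rng.choice(A.shape[1], size=c, p=pc)
    C = A[:, cols] / np.sqrt(c * pc[cols])
    # Sample and rescale r rows of A -> R.
    pr = np.linalg.norm(A, axis=1) ** 2 / np.linalg.norm(A) ** 2
    rows = rng.choice(A.shape[0], size=r, p=pr)
    R = A[rows, :] / np.sqrt(r * pr[rows])[:, None]
    # U from the rescaled intersection block only (uses O(rc) entries of A):
    # its pseudoinverse gives least-squares coefficients for fitting rows of A
    # in the basis of the sampled rows R.
    W = A[np.ix_(rows, cols)] / np.outer(np.sqrt(r * pr[rows]), np.sqrt(c * pc[cols]))
    U = np.linalg.pinv(W)
    return C, U, R

A = np.random.randn(300, 200)
C, U, R = cur_sketch(A, c=60, r=60)
print(np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A))
```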

  49. Error bounds for CUR Assume Ak is the “best” rank k approximation to A (through SVD). Then, if we pick O(k/ε2) rows and O(k/ε2) columns, a Frobenius-norm bound holds, and if we pick O(1/ε2) rows and O(1/ε2) columns, a spectral-norm bound holds; see below.
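The bounds themselves did not survive in this transcript; a hedged reconstruction, following the cited D., Kannan, & M. papers, is:

```latex
\|A - CUR\|_F \;\le\; \|A - A_k\|_F + \varepsilon\,\|A\|_F
\quad\text{(with $O(k/\varepsilon^2)$ rows and columns)},
\qquad
\|A - CUR\|_2 \;\le\; \|A - A_k\|_2 + \varepsilon\,\|A\|_F
\quad\text{(with $O(1/\varepsilon^2)$ rows and columns)}.
```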

  50. Other (randomized) CUR decompositions • Computing U in constant (O(1) instead of O(m+n)) space and time: • (D., Kannan, & M. TR ’04, SICOMP ’05) • samples O(poly(k, 1/ε)) rows and columns of A & needs an extra pass. • significantly improves the error bounds of Frieze, Kannan, and Vempala, FOCS ’98. • Computing U and R for any C: • (D., M., & Muthukrishnan ’05) • For any subset of the columns, denoted C (e.g., chosen by the practitioner) • Obtain bounds of the form: || A - CUR ||F ≤ (1 + ε) || A - CC+A ||F • Uses ideas from approximating L2 Regression problems by random sampling. • Can combine with recent algorithms/heuristics for choosing columns; more details later.
