
Information Retrieval & Data Mining: A Linear Algebraic Perspective


Presentation Transcript


  1. Information Retrieval & Data Mining: A Linear Algebraic Perspective Petros Drineas Rensselaer Polytechnic Institute Computer Science Department To access my web page: drineas

  2. Modern data Facts Computers make it easy to collect and store data. Costs of storage are very low and are dropping very fast. (most laptops have a storage capacity of more than 100 GB …) When it comes to storing data, the current policy typically is “store everything in case it is needed later” instead of deciding what could be deleted.

  3. Data mining Facts Computers make it easy to collect and store data. Costs of storage are very low and are dropping very fast. (most laptops have a storage capacity of more than 100 GB …) When it comes to storing data, the current policy typically is “store everything in case it is needed later” instead of deciding what could be deleted. Data Mining Extract useful information from the massive amount of available data.

  4. About the tutorial Tools Introduce matrix algorithms and matrix decompositions for data mining and information retrieval applications. Goal Learn a model for the underlying “physical” system generating the dataset.

  5. About the tutorial Tools Introduce matrix algorithms and matrix decompositions for data mining and information retrieval applications. Goal Learn a model for the underlying “physical” system generating the dataset. Mathematics is necessary to design and analyze principled algorithmic techniques to data-mine the massive datasets that have become ubiquitous in scientific research. [Diagram labels: data, mathematics, algorithms]

  6. Why linear (or multilinear) algebra? Data are represented by matrices Numerous modern datasets are in matrix form. Data are represented by tensors Data in the form of tensors (multi-mode arrays) have become very common in the data mining and information retrieval literature over the last few years.

  7. Why linear (or multilinear) algebra? Data are represented by matrices Numerous modern datasets are in matrix form. Data are represented by tensors Data in the form of tensors (multi-mode arrays) have become very common in the data mining and information retrieval literature over the last few years. Linear algebra (and numerical analysis) provide the fundamental mathematical and algorithmic tools to deal with matrix and tensor computations. (This tutorial will focus on matrices; pointers to some tensor decompositions will be provided.)

  8. Why matrix decompositions? • Matrix decompositions (e.g., SVD, QR, SDD, CX and CUR, NMF, MMMF, etc.) use the relationships between the available data in order to identify components of the underlying physical system generating the data. • Some assumptions on the relationships between the underlying components are necessary. • Very active area of research; some matrix decompositions are more than a century old, whereas others are very recent.

  9. Overview • Datasets in the form of matrices (and tensors) • Matrix Decompositions • Singular Value Decomposition (SVD) • Column-based Decompositions (CX, interpolative decomposition) • CUR-type decompositions • Non-negative matrix factorization • Semi-Discrete Decomposition (SDD) • Maximum-Margin Matrix Factorization (MMMF) • Tensor decompositions • Regression • Coreset constructions • Fast algorithms for least-squares regression

  10. Datasets in the form of matrices We are given m objects and n features describing the objects. (Each object has n numeric values describing it.) Dataset An m-by-n matrix A, Aij shows the “importance” of feature j for object i. Every row of A represents an object. Goal We seek to understand the structure of the data, e.g., the underlying process generating the data.

  11. Market basket matrices n products (e.g., milk, bread, wine, etc.) Common representation for association rule mining. • Data mining tasks • Find association rules • E.g., customers who buy product x buy product y with probability 89%. • Such rules are used to make item display decisions, advertising decisions, etc. m customers Aij = quantity of j-th product purchased by the i-th customer
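
As a minimal illustration of the association-rule task above, the sketch below estimates the confidence of a rule "customers who buy product x also buy product y" directly from a small market-basket matrix; the toy quantities and the thresholding to 0/1 purchases are assumptions for illustration, not part of the slides.

```python
import numpy as np

# Toy market-basket matrix: rows = customers, columns = products,
# A[i, j] = quantity of product j purchased by customer i.
A = np.array([
    [2, 1, 0, 3],
    [1, 1, 0, 0],
    [0, 0, 4, 1],
    [3, 2, 1, 0],
    [1, 1, 0, 2],
])

bought = A > 0          # binarize: did customer i buy product j at all?
x, y = 0, 1             # hypothetical products "x" and "y" (columns 0 and 1)

support_x = bought[:, x].sum()
support_xy = (bought[:, x] & bought[:, y]).sum()
confidence = support_xy / support_x   # estimate of P(buys y | buys x)

print(f"rule x -> y holds with confidence {confidence:.0%}")
```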

  12. Social networks (e-mail graph) n users Represents the email communications between groups of users. • Data mining tasks • cluster the users • identify “dense” networks of users (dense subgraphs) n users Aij = number of emails exchanged between users i and j during a certain time period

  13. Document-term matrices A collection of documents is represented by an m-by-n matrix (bag-of-words model). n terms (words) • Data mining tasks • Cluster or classify documents • Find “nearest neighbors” • Feature selection: find a subset of terms that (accurately) clusters or classifies documents. m documents Aij = frequency of j-th term in i-th document

  14. Document-term matrices A collection of documents is represented by an m-by-n matrix (bag-of-words model). n terms (words) • Data mining tasks • Cluster or classify documents • Find “nearest neighbors” • Feature selection: find a subset of terms that (accurately) clusters or classifies documents. m documents Aij = frequency of j-th term in i-th document Example later
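
A minimal sketch of building such a bag-of-words document-term matrix; scikit-learn's CountVectorizer and the toy documents are illustrative choices, not prescribed by the slides.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "information retrieval and data mining",
    "matrix decompositions for data mining",
    "singular value decomposition of a matrix",
]

vectorizer = CountVectorizer()
A = vectorizer.fit_transform(docs)          # sparse m-by-n document-term matrix

print(A.shape)                              # (3 documents, n distinct terms)
print(vectorizer.get_feature_names_out())   # the n terms (columns of A)
print(A.toarray())                          # A[i, j] = frequency of term j in document i
```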

  15. Recommendation systems The m-by-n matrix A represents m customers and n products. products Data mining task Given a few samples from A, recommend high utility products to customers. customers Aij = utility of j-th product to i-th customer

  16. Biology: microarray data tumour specimens Microarray Data Rows: genes (≈ 5,500) Columns: 46 soft-tissue tumour specimens (different types of cancer, e.g., LIPO, LEIO, GIST, MFH, etc.) Tasks: Pick a subset of genes (if it exists) that suffices in order to identify the “cancer type” of a patient genes Nielsen et al., Lancet, 2002

  17. Biology: microarray data tumour specimens Microarray Data Rows: genes (≈ 5,500) Columns: 46 soft-tissue tumour specimens (different types of cancer, e.g., LIPO, LEIO, GIST, MFH, etc.) Tasks: Pick a subset of genes (if it exists) that suffices in order to identify the “cancer type” of a patient genes Example later Nielsen et al., Lancet, 2002

  18. Human genetics Single Nucleotide Polymorphisms: the most common type of genetic variation in the genome across different individuals. They are known locations at the human genome where two alternate nucleotide bases (alleles) are observed (out of A, C, G, T). SNPs … AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA … … GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA … … GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG AA … individuals Matrices including hundreds of individuals and more than 300,000 SNPs are publicly available. Task :split the individuals in different clusters depending on their ancestry, and find a small subset of genetic markers that are “ancestry informative”.

  19. Human genetics Single Nucleotide Polymorphisms: the most common type of genetic variation in the genome across different individuals. They are known locations at the human genome where two alternate nucleotide bases (alleles) are observed (out of A, C, G, T). SNPs … AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA … … GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA … … GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG AA … individuals Matrices including hundreds of individuals and more than 300,000 SNPs are publicly available. Task :split the individuals in different clusters depending on their ancestry, and find a small subset of genetic markers that are “ancestry informative”. Example later
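
The slides keep the genotypes as letter pairs. One common way to turn such a table into a numeric matrix, suitable for the SVD-based analyses later in the tutorial, is to count how many copies of a chosen reference allele each individual carries at each SNP; the 0/1/2 encoding and the toy data below are assumptions for illustration, not part of the slides.

```python
import numpy as np

# Toy genotype table: rows = individuals, columns = SNPs (letter pairs, as in the slides).
genotypes = [
    ["AG", "CT", "GG"],
    ["GG", "TT", "GG"],
    ["AA", "CC", "AG"],
]

# Hypothetical reference allele chosen for each SNP.
reference = ["A", "C", "G"]

# Encode each genotype as the number of copies of the reference allele (0, 1, or 2).
A = np.array([[g.count(ref) for g, ref in zip(row, reference)]
              for row in genotypes], dtype=float)

print(A)   # numeric m-by-n matrix, ready for SVD / clustering
```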

  20. Tensors: recommendation systems (m customers, n products × n products) • Economics: • Utility is an ordinal, not a cardinal, concept. • Compare products; don’t assign utility values. • Recommendation Model Revisited: • Every customer has an n-by-n matrix (whose entries are +1/-1) representing pair-wise product comparisons. • There are m such matrices, forming an n-by-n-by-m 3-mode tensor A.

  21. Tensors: hyperspectral images Spectrally resolved images may be viewed as a tensor (ca. 500 pixels × ca. 500 pixels × 128 frequencies). Task: Identify and analyze regions of significance in the images.

  22. Overview x • Datasets in the form of matrices (and tensors) • Matrix Decompositions • Singular Value Decomposition (SVD) • Column-based Decompositions (CX, interpolative decomposition) • CUR-type decompositions • Non-negative matrix factorization • Semi-Discrete Decomposition (SDD) • Maximum-Margin Matrix Factorization (MMMF) • Tensor decompositions • Regression • Coreset constructions • Fast algorithms for least-squares regression

  23. The Singular Value Decomposition (SVD) Recall: data matrices have m rows (one for each object) and n columns (one for each feature). Matrix rows: points (vectors) in a Euclidean space, e.g., given 2 objects (x & d), each described with respect to two features, we get a 2-by-2 matrix. Two objects are “close” if the angle between their corresponding vectors is small.

  24. SVD, intuition Let the blue circles represent m data points in a 2-D Euclidean space. Then, the SVD of the m-by-2 matrix of the data will return … 1st (right) singular vector: direction of maximal variance. 2nd (right) singular vector: direction of maximal variance, after removing the projection of the data along the first singular vector.

  25. Singular values σ1: measures how much of the data variance is explained by the first singular vector. σ2: measures how much of the data variance is explained by the second singular vector. [Figure: the same data points with the 1st and 2nd (right) singular vectors and the singular values σ1, σ2.]

  26. SVD: formal definition A = U Σ V^T, where ρ is the rank of A, U (V) is an orthogonal matrix containing the left (right) singular vectors of A, and Σ is the diagonal matrix containing the singular values of A. Let σ1 ≥ σ2 ≥ … ≥ σρ be the entries of Σ. Exact computation of the SVD takes O(min{mn^2, m^2n}) time. The top k left/right singular vectors/values can be computed faster using Lanczos/Arnoldi methods.
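
A minimal numpy sketch of the definition above, on random data (purely illustrative):

```python
import numpy as np

m, n = 100, 20
A = np.random.randn(m, n)

# Thin SVD: U is m-by-rho, Vt is rho-by-n, and s holds the singular values
# sigma_1 >= sigma_2 >= ... >= sigma_rho.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# U and V have orthonormal columns, and A = U diag(s) V^T.
assert np.allclose(U.T @ U, np.eye(n))
assert np.allclose(Vt @ Vt.T, np.eye(n))
assert np.allclose(A, U @ np.diag(s) @ Vt)
```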

  27. Rank-k approximations via the SVD A = U Σ V^T [Figure: the m-by-n matrix A (objects × features) is decomposed into a “significant” part plus a “noise” part.]

  28. Rank-k approximations (Ak) Ak = Uk Σk Vk^T Uk (Vk): orthogonal matrix containing the top k left (right) singular vectors of A. Σk: diagonal matrix containing the top k singular values of A.
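
A sketch of forming Ak by truncating the SVD; the check below uses the (Eckart-Young) fact that the Frobenius error of the best rank-k approximation equals the root-sum-of-squares of the discarded singular values. The random matrix is illustrative.

```python
import numpy as np

A = np.random.randn(100, 20)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # best rank-k approximation of A

err = np.linalg.norm(A - Ak, "fro")
# ||A - Ak||_F^2 equals the sum of the squared discarded singular values.
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
print(f"rank-{k} relative error: {err / np.linalg.norm(A, 'fro'):.3f}")
```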

  29. PCA and SVD Principal Components Analysis (PCA) essentially amounts to the computation of the Singular Value Decomposition (SVD) of a covariance matrix. SVD is the algorithmic tool behind MultiDimensional Scaling (MDS) and Factor Analysis. SVD is “the Rolls-Royce and the Swiss Army Knife of Numerical Linear Algebra.”* *Dianne O’Leary, MMDS ’06
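
A small numerical check of the PCA/SVD connection stated above: on centered data, the eigenvectors of the sample covariance matrix coincide (up to sign) with the right singular vectors of the centered data matrix. The random data is illustrative.

```python
import numpy as np

X = np.random.randn(200, 5)
Xc = X - X.mean(axis=0)                  # center the data

# Route 1: eigenvectors of the sample covariance matrix.
cov = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues
pc_cov = eigvecs[:, ::-1]                # principal directions, largest variance first

# Route 2: right singular vectors of the centered data matrix.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc_svd = Vt.T

# The two sets of directions agree up to the sign of each vector.
for i in range(5):
    assert np.allclose(np.abs(pc_cov[:, i]), np.abs(pc_svd[:, i]), atol=1e-6)
```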

  30. Ak as an optimization problem Find the m-by-k matrix Y and the k-by-n matrix X minimizing ||A - YX||_F^2 (Frobenius norm). Given Y, it is easy to find X from standard least squares. However, the fact that we can find the optimal Y is intriguing!

  31. Ak as an optimization problem Find the m-by-k matrix Y and the k-by-n matrix X minimizing ||A - YX||_F^2 (Frobenius norm). Given Y, it is easy to find X from standard least squares. However, the fact that we can find the optimal Y is intriguing! Optimal Y = Uk, optimal X = Uk^T A.

  32. LSI: Ak for document-term matrices (Berry, Dumais, and O'Brien ’92) Latent Semantic Indexing (LSI) Replace A by Ak; apply clustering/classification algorithms on Ak. n terms (words) • Pros • Less storage for small k. • O(km+kn) vs. O(mn) • Improved performance. • Documents are represented in a “concept” space. m documents Aij = frequency of j-th term in i-th document

  33. LSI: Ak for document-term matrices (Berry, Dumais, and O'Brien ’92) Latent Semantic Indexing (LSI) Replace A by Ak; apply clustering/classification algorithms on Ak. n terms (words) • Pros • Less storage for small k. • O(km+kn) vs. O(mn) • Improved performance. • Documents are represented in a “concept” space. • Cons • Ak destroys sparsity. • Interpretation is difficult. • Choosing a good k is tough. m documents Aij = frequency of j-th term in i-th document
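
A minimal LSI-style sketch: build a sparse document-term matrix and replace it by a rank-k projection, so that each document lives in a k-dimensional "concept" space. The toy corpus and the use of scikit-learn's TruncatedSVD are illustrative assumptions, not prescribed by the slides.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "singular value decomposition of a matrix",
    "matrix decompositions for data mining",
    "data mining and information retrieval",
    "retrieval of documents with latent semantic indexing",
]

A = CountVectorizer().fit_transform(docs)   # sparse m-by-n document-term matrix

k = 2
svd = TruncatedSVD(n_components=k)
docs_k = svd.fit_transform(A)               # each document as a k-dimensional "concept" vector

print(docs_k.shape)   # (m documents, k concepts): O(km + kn) storage instead of O(mn)
```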

  34. Ak and k-means clustering (Drineas, Frieze, Kannan, Vempala, and Vinay ’99) k-means clustering A standard objective function that measures cluster quality. (Often denotes an iterative algorithm that attempts to optimize the k-means objective function.) k-means objective Input: set of m points in R^n, positive integer k Output: a partition of the m points into k clusters Partition the m points into k clusters in order to minimize the sum of the squared Euclidean distances from each point to its cluster centroid.

  35. k-means, cont’d We seek to split the input points into 5 clusters.

  36. k-means, cont’d We seek to split the input points into 5 clusters. The cluster centroid is the “average” of all the points in the cluster.

  37. k-means: a matrix formulation Let A be the m-by-n matrix representing m points in R^n. Then, we seek to minimize ||A - XX^T A||_F^2 over all cluster-membership matrices X. X is a special “cluster membership” matrix: Xij denotes whether the i-th point belongs to the j-th cluster.

  38. k-means: a matrix formulation Let A be the m-by-n matrix representing m points in R^n. Then, we seek to minimize ||A - XX^T A||_F^2 over all cluster-membership matrices X. X is a special “cluster membership” matrix: Xij denotes whether the i-th point belongs to the j-th cluster. (Rows of X correspond to points, columns to clusters.) • Columns of X are normalized to have unit length. • (We divide each column by the square root of the number of points in the cluster.) • Every row of X has at most one non-zero element. • (Each point belongs to at most one cluster.) • The columns of X are orthonormal, i.e., X^T X = I.
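
A small numerical check of the matrix formulation above: build the normalized cluster-membership matrix X for a given assignment and verify that ||A - XX^T A||_F^2 equals the usual k-means objective (sum of squared distances to cluster centroids). The toy points and the assignment are illustrative.

```python
import numpy as np

# Toy data: 6 points in R^2, and a candidate assignment to k = 2 clusters.
A = np.array([[0.0, 0.0], [0.1, 0.2], [0.0, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
m, k = len(A), 2

# Normalized membership matrix: X[i, c] = 1/sqrt(|cluster c|) if point i is in cluster c.
X = np.zeros((m, k))
for c in range(k):
    members = labels == c
    X[members, c] = 1.0 / np.sqrt(members.sum())

assert np.allclose(X.T @ X, np.eye(k))   # columns of X are orthonormal

# X X^T A replaces every row of A by its cluster centroid, so this is the k-means objective.
objective = np.linalg.norm(A - X @ X.T @ A, "fro") ** 2

# Sanity check against the direct sum of squared distances to the centroids.
direct = sum(np.sum((A[labels == c] - A[labels == c].mean(axis=0)) ** 2) for c in range(k))
assert np.isclose(objective, direct)
print(objective)
```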

  39. SVD and k-means If we only require that X has orthonormal columns (X^T X = I) and remove the condition on the number of non-zero entries per row of X, then ||A - XX^T A||_F^2 is easy to minimize! The solution is X = Uk.

  40. SVD and k-means If we only require that X has orthonormal columns (X^T X = I) and remove the condition on the number of non-zero entries per row of X, then ||A - XX^T A||_F^2 is easy to minimize! The solution is X = Uk. • Using SVD to solve k-means • We can get a 2-approximation algorithm for k-means. • (Drineas, Frieze, Kannan, Vempala, and Vinay ’99, ’04) • We can get heuristic schemes to assign points to clusters. • (Zha, He, Ding, Simon, and Gu ’01) • There exist PTAS (based on random projections) for the k-means problem. • (Ostrovsky and Rabani ’00, ’02) • Deeper connections between SVD and clustering in Kannan, Vempala, and Vetta ’00, ’04.
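
A sketch of one common way to use this relaxation in practice: compute Uk, treat its rows as k-dimensional representations of the points, and run ordinary k-means on them to recover a valid clustering. This is a heuristic in the spirit of the schemes cited above, not a reproduction of any particular one; the data is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: two well-separated clusters of points in R^10.
A = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(8, 1, (50, 10))])

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk = U[:, :k]                        # relaxed solution of the k-means objective

labels = KMeans(n_clusters=k, n_init=10).fit_predict(Uk)   # round Uk back to a real clustering
print(labels[:5], labels[-5:])       # the two blocks of points end up in different clusters
```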

  41. Ak and Kleinberg’s HITS algorithm (Kleinberg ’98, ’99) Hypertext Induced Topic Selection (HITS) A link analysis algorithm that rates Web pages for their authority and hub scores. Authority score: an estimate of the value of the content of the page. Hub score: an estimate of the value of the links from this page to other pages. These values can be used to rank Web search results.

  42. Ak and Kleinberg’s HITS algorithm Hypertext Induced Topic Selection (HITS) A link analysis algorithm that rates Web pages for their authority and hub scores. Authority score: an estimate of the value of the content of the page. Hub score: an estimate of the value of the links from this page to other pages. These values can be used to rank Web search results. Authority: a page that is pointed to by many pages with high hub scores. Hub: a page pointing to many pages that are good authorities. Recursive definition; notice that each node has two scores.

  43. Ak and Kleinberg’s HITS algorithm Phase 1: Given a query term (e.g., “jaguar”), find all pages containing the query term (root set). Expand the resulting graph by one move forward and backward (base set).

  44. Ak and Kleinberg’s HITS algorithm Phase 2: Let A be the adjacency matrix of the (directed) graph of the base set. Let h, a ∈ R^n be the vectors of hub and authority scores, respectively. Then, h = Aa and a = A^T h, so h = AA^T h and a = A^T A a.

  45. Ak and Kleinberg’s HITS algorithm Phase 2: Let A be the adjacency matrix of the (directed) graph of the base set. Let h, a ∈ R^n be the vectors of hub and authority scores, respectively. Then, h = Aa and a = A^T h, so h = AA^T h and a = A^T A a. Thus, the top left (right) singular vector of A corresponds to hub (authority) scores.

  46. Ak and Kleinberg’s HITS algorithm Phase 2: Let A be the adjacency matrix of the (directed) graph of the base set. Let h, a ∈ R^n be the vectors of hub and authority scores, respectively. Then, h = Aa and a = A^T h, so h = AA^T h and a = A^T A a. Thus, the top left (right) singular vector of A corresponds to hub (authority) scores. What about the rest? They provide a natural way to extract additional densely linked collections of hubs and authorities from the base set. See the “jaguar” example in Kleinberg ’99.
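
A minimal power-iteration sketch of the HITS computation just described, on a tiny made-up link graph (the adjacency matrix is purely illustrative):

```python
import numpy as np

# Adjacency matrix of a tiny directed graph: A[i, j] = 1 if page i links to page j.
A = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
], dtype=float)

n = A.shape[0]
h = np.ones(n)          # hub scores
a = np.ones(n)          # authority scores

for _ in range(100):
    a = A.T @ h         # a page is a good authority if good hubs point to it
    h = A @ a           # a page is a good hub if it points to good authorities
    a /= np.linalg.norm(a)
    h /= np.linalg.norm(h)

# Equivalent to the top right/left singular vectors of A.
print("authority scores:", np.round(a, 3))
print("hub scores:      ", np.round(h, 3))
```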

  47. SVD example: microarray data Microarray Data (Nielsen et al., Lancet, 2002) Columns: genes (≈ 5,500) Rows: 32 patients, three different cancer types (GIST, LEIO, SynSarc)

  48. SVD example: microarray data Microarray Data Applying k-means with k=3 in this 3D space results in 3 misclassifications. Applying k-means with k=3 but retaining 4 PCs results in one misclassification. Can we find actual genes (as opposed to eigengenes) that achieve similar results?

  49. SVD example: ancestry-informative SNPs Single Nucleotide Polymorphisms: the most common type of genetic variation in the genome across different individuals. They are known locations at the human genome where two alternate nucleotide bases (alleles) are observed (out of A, C, G, T). SNPs … AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA … … GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG … … GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA … … GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA … … GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG AA … individuals There are¼ 10 million SNPs in the human genome, so this table could have ~10 million columns.

  50. Focus on a specific locus and assay the observed nucleotide bases (alleles). SNP: exactly two alternate alleles appear. Two copies of a chromosome (father, mother)
