
Matrix Decomposition Methods in Information Retrieval


Presentation Transcript


  1. Matrix Decomposition Methods in Information Retrieval Thomas Hofmann Department of Computer Science Brown University www.cs.brown.edu/people/th (& Chief Scientist, RecomMind Inc.) In collaboration with: Jan Puzicha, UC Berkeley & RecomMind David Cohen, CMU & Burning Glass

  2. Overview • Introduction: A Brief History of Mechanical IR • Latent Semantic Analysis • Probabilistic Latent Semantic Analysis • Learning (from) Hyperlink Graphs • Collaborative Filtering • Future Work and Conclusion

  3. Introduction: A Brief History of Mechanical IR

  4. Memex – “As we may think.” Vannevar Bush (1945) • The idea of an easily accessible, individually configurable storehouse of knowledge, the beginning of the literature on mechanized information retrieval: • “Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and to coin one at random, ‘memex’ will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.” • “The world has arrived at an age of cheap complex devices of great reliability; and something is bound to come of it.”

  5. Memex – “As we may think.” Vannevar Bush (1945) • The civilizational challenge: • “The difficulty seems to be, not so much that we publish unduly in view of the extent and variety of present day interests, but rather that publication has been extended far beyond our present ability to make real use of the record. The summation of human experience is being expanded at a prodigious rate, and the means we use for threading through the consequent maze to the momentarily important item is the same as was used in the days of square-rigged ships.” V. Bush, “As we may think”, Atlantic Monthly, 176 (1945), pp.101-108

  6. The Thesaurus Approach Hans Peter Luhn (1957, 1961) • Words of similar or related meaning are grouped into "notional families". • Encoding of documents in terms of notional elements. • Matching by measuring the degree of notional similarity. • A common language for annotating documents; key word in context (KWIC) indexing. • "… the faculty of interpretation is beyond the talent of machines." • Statistical cues extracted by machines to assist the human indexer; a vocabulary method for detecting similarities. H.P. Luhn, "A statistical approach to mechanical literature searching", New York, IBM Research Center, 1957. H.P. Luhn, "The Automatic Derivation of Information Retrieval Encodements from Machine-Readable Text", Information Retrieval and Machine Translation, 3(2), pp.1021-1028, 1961.

  7. To Punch or not to punch … T. Joyce & R.M. Needham (1958) • Lattices & hierarchies of search terms • “As in other systems, the documents are represented by holes in punched cards which represent the various terms, and in addition, when a hole is punched in any term card, all the terms at higher levels of the lattice […] are also punched.” • The postcoordinate revolution: card sorting at search time! • “Investigations […] to lessen the physical work are continuing.” T. Joyce & R.M. Needham, “The Thesaurus Approach to Information Retrieval”, American Documentation, 9, pp. 192-197, 1958.

  8. Term Associations • Lauren B. Doyle (1962) • Unusual co-occurrences of pairs of words = associations of words in text • Statistical testing: Chi-square and Pearson correlation coefficient to determine pairwise correlations • Term association maps for interactive retrieval • Today: semantic maps L.B. Doyle, “Indexing and Abstracting by Association”, Unisys Corporation, 1962.

  9. Vector Space Model Gerard Salton (1960s/70s) • Instead of indexing documents by selected index terms, preserve (almost) all terms in automatic indexing. • Represent each document by a high-dimensional vector. • Each term can be associated with a weight. • Geometrical interpretation. G. Salton, "The SMART Retrieval System – Experiments in Automatic Document Processing", 1971.

  10. Term-Document Matrix • W = {terms in vocabulary}, D = {documents in database} • Each document d is represented by a column of (weighted) term counts. • Example: the document "Texas Instruments said it has developed the first 32-bit computer chip designed specifically for artificial intelligence applications [...]" yields nonzero counts for terms such as "intelligence" and "artificial", and zeros for unrelated terms such as "interest" or "artifact". [Figure: term-document matrix with one row per term and one column per document, entries given by term weighting]

  11. Documents in "Inner" Space • Retrieval method: rank documents according to their similarity with the query. • Similarity between document and query: cosine of the angle between the query and document vectors. • Term weighting schemes, for example TFIDF. • Used in the SMART system and many successor systems; highly popular.
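
A minimal sketch of this retrieval scheme (not from the slides): it builds the term-document count matrix of the previous slide, applies a simple TFIDF weighting, and ranks documents by cosine similarity to a query. The toy corpus, the particular IDF formula, and all variable names are illustrative assumptions.

```python
import numpy as np

# Toy corpus and query (illustrative only)
docs = ["saturn is a planet with rings",
        "the ford car company builds cars",
        "voyager visited the planet saturn"]
query = "saturn planet"

# Term-document count matrix: one row per term, one column per document
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        counts[index[w], j] += 1

# TFIDF weighting: term frequency times log(N / document frequency)
df = (counts > 0).sum(axis=1)
idf = np.log(len(docs) / df)
tfidf = counts * idf[:, None]

# Encode the query in the same space and rank documents by cosine similarity
q = np.zeros(len(vocab))
for w in query.split():
    if w in index:
        q[index[w]] += 1
q *= idf

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

ranking = sorted(range(len(docs)), key=lambda j: -cosine(tfidf[:, j], q))
print(ranking)  # documents ordered by decreasing similarity to the query
```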

  12. Advantages of the Vector Space Model • No subjective selection of index terms • Partial matching of queries and documents (dealing with the case where no document contains all search terms) • Ranking according to similarity score (dealing with large result sets) • Term weighting schemes (improves retrieval performance) • Various extensions • Document clustering • Relevance feedback (modifying query vector) • Geometric foundation

  13. 2. Latent Semantic Analysis

  14. Limitations of the Vector Space Model • Dimensionality: • The vector space representation is high-dimensional (typically tens to hundreds of thousands of terms). • Learning and estimation have to deal with the curse of dimensionality. • Sparseness: • Document vectors are typically very sparse. • Cosine similarity can be noisy and inaccurate. • Semantics: • The inner product can only match occurrences of exactly the same terms. • The vector representation does not capture semantic relations between words. • Independence: • The bag-of-words representation is unable to capture phrases and semantic/syntactic regularities.

  15. The Lost Meaning of Words … • Ambiguity and association in natural language • Polysemy: Words often have a multitude of meanings and different types of usage (more urgent for very heterogeneous collections). • The vector space model is unable to discriminate between different meanings of the same word. • Synonymy: Different terms may have an identical or a similar meaning (weaker: words indicating the same topic). • No associations between words are made in the vector space representation.

  16. Polysemy and Context • Document similarity on the single-word level suffers from polysemy and missing context. • Example: a word like "saturn" contributes to similarity when both documents use its first meaning (the planet, alongside "ring", "jupiter", "space", "voyager", "planet"), but should not when one document uses its second meaning (the car brand, alongside "car", "company", "dodge", "ford"). [Figure: the two meanings with their typical context words]

  17. Latent Semantic Analysis • General idea • Map documents (and terms) to a low-dimensional representation. • Design a mapping such that the low-dimensional space reflects semantic associations (latent semantic space). • Compute document similarity based on the inner product in the latent semantic space. • Goals • Similar terms map to similar location in low dimensional space. • Noise reduction by dimension reduction.

  18. LSA: Matrix Decomposition by SVD • Dimension reduction by singular value decomposition of the term-document matrix of (possibly transformed) word frequencies: • document length normalization • sublinear transformation (e.g., log) • global term weights • Thresholding the singular values yields term/document vectors in the latent space and a reconstructed term-document matrix that is the L2-optimal low-rank approximation of the original matrix. [Figure: original and reconstructed term-document matrices]
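
A minimal sketch of this decomposition step, assuming a small term-document count matrix and plain numpy; the choice of k, the log transform, and the toy values are illustrative, not prescribed by the slides.

```python
import numpy as np

# Toy term-document count matrix (terms x documents); illustrative values
A = np.array([[2., 0., 1., 0.],
              [1., 0., 2., 0.],
              [0., 3., 0., 1.],
              [0., 1., 0., 2.]])

# Optional sublinear transform of the raw counts, as mentioned on the slide
A = np.log1p(A)

# Singular value decomposition: A = U @ diag(s) @ Vt, s in decreasing order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values (latent semantic dimensions)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Rank-k reconstruction: the optimal approximation in the Frobenius norm
A_k = U_k @ np.diag(s_k) @ Vt_k

# Low-dimensional document representations: columns of diag(s_k) @ Vt_k
doc_vectors = np.diag(s_k) @ Vt_k
print(np.linalg.norm(A - A_k), doc_vectors.shape)
```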

  19. Background: SVD • Singular Value Decomposition: A = U Σ V^T, where A is the n x m term-document matrix, U (n x k) and V (m x k) have orthonormal columns, and Σ (k x k) is diagonal with the singular values in decreasing order. • Properties: • existence & uniqueness • thresholding small singular values yields an optimal low-rank approximation (in the sense of the Frobenius norm)

  20. SVD and PCA • If (!) the rows of the term-document matrix were shifted such that their mean is zero, the SVD would coincide with PCA: • one would essentially perform a projection onto the principal axes defined by the columns of the singular vector matrix. • Yet, this centering would destroy the sparseness of the term-document matrix (and consequently might hurt the performance of SVD methods).

  21. Canonical Analysis • Hirschfield 1935, Hotelling 1936, Fisher 1940: correlation analysis for contingency tables. [Figure: objective and constraints of the canonical analysis problem]

  22. Canonical & Correspondence Analysis • Correspondence Analysis (as a method of scaling): Guttman 1941, Torgerson 1958, Benzecri 1969, Hill 1974 • Whittaker 1967: "gradient analysis", "reciprocal averaging" • Solutions: unit vectors and scores of the canonical analysis • SVD of a rescaled matrix (not exactly what is done in LSA)

  23. Semantic Inner Product / Kernel • Similarity: inner product in the lower-dimensional (latent semantic) space. • For a given decomposition, additional documents or queries can be mapped into the semantic space ("folding-in"): a new document/query q is projected onto the latent dimensions to obtain its lower-dimensional representation.
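
A sketch of folding-in and of the semantic inner product. The slide's own formula is not in the transcript, so this uses the common LSA convention of projecting a new count vector via inv(diag(s_k)) @ U_k.T @ q, and it reuses the U_k, s_k, and doc_vectors of the SVD sketch above; all of that is an assumption for illustration.

```python
import numpy as np

def fold_in(q_counts, U_k, s_k):
    """Map a new document/query count vector into the k-dimensional latent space
    (common LSA folding-in: q_hat = inv(diag(s_k)) @ U_k.T @ q)."""
    return (U_k.T @ q_counts) / s_k

def semantic_similarity(q_counts, d_latent, U_k, s_k):
    """Cosine similarity between query and document in the latent semantic space."""
    q_latent = fold_in(q_counts, U_k, s_k)
    return q_latent @ d_latent / (
        np.linalg.norm(q_latent) * np.linalg.norm(d_latent) + 1e-12)

# Usage (with U_k, s_k, doc_vectors from the SVD sketch above):
#   sim = semantic_similarity(q_counts, doc_vectors[:, j], U_k, s_k)
```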

  24. Term Associations from LSA [Figure: terms and a concept plotted in the latent space; taken from a slide by S. Dumais]

  25. LSA: Discussion • pros: • Low-dimensional document representation is able to capture synonyms. • Noise removal and robustness by dimension reduction • Experimentally: advantages over naïve vector space model • cons: • “Formally”: L2 norm is inappropriate as a distance function for count vectors (reconstruction may contain negative entries) • “Conceptually”: • Problem of polysemy is not addressed; principle of linear superposition, no active disambiguation • Context of terms is not taken into account. • Directions in latent space are hard to interpret. • No probabilistic model of term occurrences. • [ad hoc selection of the number of dimensions, ...]

  26. Features of IR Methods

  27. 3. Probabilistic Latent Semantic Analysis

  28. Documents as Information Sources • D = {documents in database}, W = {words in vocabulary} • "Real" document: empirical probability distribution, i.e., the relative word frequencies of the observed sample. • "Ideal" document: a (memoryless) information source from which the observed document (and other documents) could be sampled. [Figure: sampling word occurrences from a document source]

  29. Information Source Models in IR • Bayes rule: the probability of relevance of a document w.r.t. a query combines the probability that the query is "generated" from the document with the prior probability of relevance (language model approach). • Query translation model: the probability that a query term is generated from the document is given by a translation model combined with the document's language model. J. Ponte & W.B. Croft, "A Language Model Approach to Information Retrieval", SIGIR 1998. A. Berger & J. Lafferty, "Information Retrieval as Statistical Translation", SIGIR 1999.
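
The slide's equations are not in the transcript; in the standard notation of the two cited papers (a reconstruction, with smoothing details omitted), the two models can be written as:

```latex
% Language-model retrieval (Ponte & Croft, 1998): rank documents by
\[
  P(d \mid q) \;\propto\; P(q \mid d)\,P(d),
  \qquad
  P(q \mid d) \;=\; \prod_{i} P(q_i \mid d).
\]
% Statistical-translation retrieval (Berger & Lafferty, 1999): each query
% term q_i may be "translated" from any document word w,
\[
  P(q \mid d) \;=\; \prod_{i} \sum_{w} t(q_i \mid w)\, P(w \mid d).
\]
```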

  30. Probabilistic Latent Semantic Analysis • How can we learn document-specific language models? Sparseness problem, even for unigrams. • Probabilistic dimension reduction techniques to overcome the data sparseness problem. • Factor analysis for count data: factors correspond to concepts. • Documents are mixtures of factor "sources" (topics) with document-specific mixing proportions; the latent variable z ranges over a "small" number of states. T. Hofmann, "Probabilistic Latent Semantic Analysis", UAI 1999.

  31.-34. PLSA: Graphical Model [Figure, built up over four slides: plate diagram of the document collection. For each of the N documents d in the collection, each of its c(d) word occurrences w is generated by drawing a latent topic z from P(z|d), which is shared by all words in the document, and then drawing the word from P(w|z), which is shared by all documents in the collection.]

  35. Probabilistic Latent Semantic Space • Documents are represented as points in a low-dimensional sub-simplex (dimensionality reduction for probability distributions). • The sub-simplex spanned by the factors is embedded in the simplex of all word distributions. • The projection onto the sub-simplex is a KL-divergence projection, not an orthogonal one. [Figure: embedding of the spanned sub-simplex in the word simplex]

  36. Positive Matrix Decomposition • mixture decomposition in matrix notation • constraints • Non-negativity of all matrices • Normalization according to L1-norm • (no orthogonality) D.D. Lee & H.S. Seung, “Learning the parts of objects by non-negative matrix factorization”, Nature, 1999.

  37. Positive Matrix Decomposition & SVD • The mixture decomposition in matrix notation, compared to the SVD: • probabilistic approach vs. linear algebra decomposition • conditional independence assumption "replaces" the outer product • class-conditional distributions "replace" left/right eigenvectors • maximum likelihood instead of minimum L2 norm criterion
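
The matrix form that slides 36-37 refer to is not in the transcript; in the notation of the PLSA paper it is usually written as follows (a reconstruction), which makes the contrast with the SVD explicit:

```latex
\[
  P \;=\; U\,\Sigma\,V^{\top},
  \qquad
  P_{ij} \;=\; P(d_i, w_j) \;=\; \sum_{k} P(z_k)\,P(d_i \mid z_k)\,P(w_j \mid z_k),
\]
% with U_{ik} = P(d_i | z_k), V_{jk} = P(w_j | z_k), and Sigma = diag(P(z_k)).
% All entries are non-negative and the columns of U and V sum to one (L1
% normalization), in contrast to the orthonormal factors of the SVD.
```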

  38. Expectation Maximization Algorithm • Maximizing the log-likelihood by (tempered) EM iterations. • E-step: posterior probabilities of the latent variables, i.e., the probability that a term occurrence w within d is explained by topic z. • M-step: maximization of the expected complete-data log-likelihood.
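
A minimal sketch of plain (untempered) EM for the PLSA model P(w|d) = sum_z P(w|z) P(z|d); the dense posterior array, the random initialization, and the fixed iteration count are illustrative simplifications, and a practical implementation would iterate only over nonzero counts and add tempering.

```python
import numpy as np

def plsa_em(counts, n_topics, n_iter=50, seed=0):
    """Plain EM for PLSA on a document-term count matrix.

    counts: (n_docs, n_words) array of term frequencies n(d, w).
    Returns P(z|d) of shape (n_docs, n_topics) and P(w|z) of shape (n_topics, n_words).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z_d = rng.random((n_docs, n_topics)); p_z_d /= p_z_d.sum(1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words)); p_w_z /= p_w_z.sum(1, keepdims=True)

    for _ in range(n_iter):
        # E-step: posterior P(z | d, w) for every (document, word) pair
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]          # shape (d, z, w)
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate P(w|z) and P(z|d) from the expected counts
        weighted = counts[:, None, :] * post                    # n(d, w) * P(z|d,w)
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z
```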

  39. Example: Science Magazine Papers • Dataset with approx. 12K papers from Science Magazine • Selected concepts from model with K=200

  40. Example: TDT1 news stories • TDT1 = document collection with approx. 16,000 news stories (Reuters, CNN, years 1994/95). • Results based on a decomposition with 128 concepts. • Two main factors each for "flight" and "love", characterized by their most probable words P(w|z): • "love", factor 1: home, family, like, just, kids, mother, life, happy, friends, cnn • "love", factor 2: film, movie, music, new, best, hollywood, love, actor, entertainment, star • "flight", factor 1: plane, airport, crash, flight, safety, aircraft, air, passenger, board, airline • "flight", factor 2: space, shuttle, mission, astronauts, launch, station, crew, nasa, satellite, earth

  41. Folding-in a Document/Query • TDT1 collection: approx. 16,000 news stories. • PLSA model with 128 dimensions. • Query keywords: "aid food medical people UN war". • The 4 most probable factors for the query are found by tracking the posteriors for every keyword. • The 4 selected factors with their most probable keywords: • iraq, iraqui, sanctions, kuwait, un, council, gulf, saddam, baghdad, hussein, resolution, border • refugees, aid, rwanda, relief, people, camps, zaire, camp, food, rwandan, un, goma • building, city, people, rescue, buildings, workers, kobe, victims, area, earthquake, disaster, missing • un, bosnian, serbs, bosnia, serb, sarajevo, nato, peacekeep., nations, peace, bihac, war

  42.-45. Folding-in a Document/Query (continued) [Figure, built up over four slides: bar charts of the posterior probabilities (0 to 1) of the four selected factors for each of the query keywords "aid food medical people un war", shown after folding-in iterations 1, 2, 5, and at convergence.]

  46. Experiments: Precision-Recall • 4 test collections (each with approx. 1000-3500 docs)

  47. Experimental Results: TFIDF Average Precision-Recall

  48. Experimental Results: TFIDF Relative Gain in Average PR

  49. From Probabilistic Models to Kernels: The Fisher Kernel • Use idea of a Fisher kernel: • Main idea: Derive a kernel or similarity function from a generative model • How do ML estimates of parameters change, around a point in sample space? • Derive Fisher scores from model • Kernel/similarity function T. Jaakkola & D. Haussler, “Exploiting Generative Models for Discriminative Training”, NIPS 1999.
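
For reference, the generic definitions behind the Fisher kernel; the slide's own formulas are not in the transcript, so this is the standard form from Jaakkola & Haussler rather than the slide itself:

```latex
\[
  U_x \;=\; \nabla_{\theta} \log P(x \mid \theta)
  \qquad\text{(Fisher score of example } x\text{)},
\]
\[
  K(x, y) \;=\; U_x^{\top}\, I(\theta)^{-1}\, U_y,
  \qquad
  I(\theta) \;=\; \mathbb{E}_x\!\left[\, U_x U_x^{\top} \,\right]
  \quad\text{(Fisher information matrix)}.
\]
% In practice the information matrix is often approximated by the identity.
```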

  50. Semantic Kernel from PLSA: Outline • Outline of the technical derivation: • Parameterize the multinomials by variance-stabilizing parameters (= square-root parameterization). • Assume information orthogonality of the parameters for different multinomials (approximation). • In each block, an isometric embedding with constant Fisher information is obtained (the inversion problem for the information matrix is circumvented). • … and the result …
