Information Retrieval in Text Part III

Presentation Transcript


  1. Information Retrieval in Text, Part III • Reference: Michael W. Berry and Murray Browne. Understanding Search Engines: Mathematical Modeling and Text Retrieval. SIAM, 1999. • Reading Assignment: Chapter 4.

  2. Outline • Matrix Decompositions • QR Factorization • Singular Value Decomposition • Updating Techniques

  3. Matrix Decomposition • To produce a reduced-rank approximation of the m × n term-by-document matrix A, one must first identify the dependence between the columns or rows of A. • For a rank-k matrix, k basis vectors of its column space can serve in place of its n column vectors to represent that column space.

  4. QR Factorization • The QR factorization of a matrix A is defined as A = QR, where Q is an m × m orthogonal matrix and R is an m × n upper triangular matrix. • A square matrix is orthogonal if its columns are orthonormal, i.e., if q_j denotes a column of the orthogonal matrix Q, then q_j has unit Euclidean norm (||q_j||_2 = 1 for j = 1, 2, …, m) and is orthogonal to all other columns of Q (q_j^T q_i = 0 for all i ≠ j). • The rows of Q are also orthonormal, i.e., Q^T Q = QQ^T = I. • Such a factorization exists for any matrix A. • There are many ways to compute the factorization.

  5. QR Factorization • Given A = QR, the columns of the matrix A are all linear combinations of the columns of Q. • Thus, a subset of k of the columns of Q forms a basis for the column space of A, where k = rank(A).
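
As a minimal sketch of slides 4-5 (assuming NumPy and a small made-up term-by-document matrix, not the book's example), the following computes a full QR factorization and checks that the first rank(A) columns of Q already reproduce A:

```python
import numpy as np

# A made-up 5 x 3 term-by-document matrix (rows = terms, columns = documents).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])

# Full QR factorization: Q is 5 x 5 orthogonal, R is 5 x 3 upper triangular.
Q, R = np.linalg.qr(A, mode="complete")

print(np.allclose(Q @ R, A))             # True: A = QR
print(np.allclose(Q.T @ Q, np.eye(5)))   # True: Q^T Q = I

# The first k = rank(A) columns of Q (Q1) already span the column space of A;
# the remaining columns (Q2) contribute nothing to A.
k = np.linalg.matrix_rank(A)
print(np.allclose(Q[:, :k] @ R[:k, :], A))   # True
```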

  6. QR Factorization: Example

  7. QR Factorization: Example

  8. QR Factorization: Example • The QR factorization of the previous example can be written so that the first 7 columns of Q, denoted Q1, are orthonormal and hence constitute a basis for the column space of A. • The bottom zero submatrix of R is not always generated automatically by the QR factorization, so column pivoting may need to be applied to guarantee the zero submatrix. • Q2, the remaining columns of Q, does not contribute to producing any nonzero value in A.

  9. QR Factorization • One motivation for using the QR factorization is that the basis vectors can be used to describe the semantic content of the corresponding text collection. • The cosines of the angles θ_j between a query vector q and the document vectors a_j are cos θ_j = (a_j^T q) / (||a_j||_2 ||q||_2) = (r_j^T (Q1^T q)) / (||r_j||_2 ||q||_2), where r_j denotes the j-th column of R. • Note that for the query “Child Proofing” this gives exactly the same cosines as computing them directly from A. Why?
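
A sketch of this cosine computation, assuming NumPy and randomly generated stand-ins for A and q; the comments note why the QR-based cosines match the direct ones, which is the answer to the slide's question:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((9, 7))     # stand-in term-by-document matrix
q = rng.random(9)          # stand-in query vector

# Direct cosines: cos(theta_j) = a_j^T q / (||a_j||_2 ||q||_2).
cos_direct = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))

# Via the reduced QR factorization A = Q1 R1: because Q1 has orthonormal
# columns it preserves inner products and norms (a_j^T q = r_j^T (Q1^T q)
# and ||a_j||_2 = ||r_j||_2), so the QR-based cosines are identical; this
# is why the query yields exactly the same cosines.
Q1, R1 = np.linalg.qr(A)   # reduced mode: Q1 is 9 x 7, R1 is 7 x 7
cos_qr = (R1.T @ (Q1.T @ q)) / (np.linalg.norm(R1, axis=0) * np.linalg.norm(q))

print(np.allclose(cos_direct, cos_qr))   # True
```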

  10. Frobenius Matrix Norm • Definition: The Frobenius matrix norm ||·||_F of an m × n matrix B = [b_ij] is defined by ||B||_F = sqrt( Σ_{i=1}^{m} Σ_{j=1}^{n} b_ij^2 ).
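
A quick numerical check of this definition (NumPy, arbitrary 2 × 2 example):

```python
import numpy as np

B = np.array([[1.0, -2.0],
              [3.0,  4.0]])

# ||B||_F = square root of the sum of squares of all entries.
manual = np.sqrt((B ** 2).sum())
print(np.isclose(manual, np.linalg.norm(B, "fro")))   # True: both are sqrt(30)
```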

  11. Low Rank Approximation for QR Factorization • Initially, the rank of A is not known. However, after performing the QR factorization, its rank is simply the rank of R (Q is orthogonal, hence nonsingular). • With column pivoting, we know that there exists a permutation matrix P such that AP = QR, where the larger entries of R are moved to the upper-left corner. Such an arrangement, when possible, partitions R so that the smallest entries are isolated in the bottom submatrix R22.

  12. Low Rank Approximation for QR Factorization

  13. Low Rank Approximation for QR Factorization • Computing the perturbation E for which A + E = QR', where R' is R with the submatrix R22 set to zero: • Redefining R22 to be the 4 × 2 zero matrix, the modified upper triangular matrix R' has rank 5 rather than 7. • Hence, the matrix A + E has rank 5. • Show that ||E||_F = ||R22||_F. • Show that ||E||_F / ||A||_F = ||R22||_F / ||R||_F = 0.3237. • Therefore, the relative change of 32.37% in R yields the same relative change in A. • With r = 4, the relative change is 76%.
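
The following sketch mirrors slides 11-13 under stated assumptions: it uses SciPy's pivoted QR on a random 9 × 7 stand-in matrix (so the numeric relative change will differ from the 0.3237 of the book's example), zeros the 4 × 2 block R22, and verifies ||E||_F = ||R22||_F:

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(1)
A = rng.random((9, 7))            # stand-in for the 9 x 7 example matrix

# QR with column pivoting: A[:, piv] = Q R, with the larger entries of R
# pushed toward the upper-left corner.
Q, R, piv = qr(A, pivoting=True)

# Truncate to rank r = 5 by zeroing the trailing 4 x 2 submatrix R22.
r = 5
R_trunc = R.copy()
R_trunc[r:, r:] = 0.0
A_r = Q @ R_trunc                 # rank-r approximation (columns permuted)

# Since Q is orthogonal, the perturbation E = A_r - A[:, piv] satisfies
# ||E||_F = ||R22||_F, and the relative change in A equals that in R.
E = A_r - A[:, piv]
R22 = R[r:, r:]
print(np.isclose(np.linalg.norm(E, "fro"), np.linalg.norm(R22, "fro")))  # True
print(np.linalg.norm(R22, "fro") / np.linalg.norm(R, "fro"))  # relative change
```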

  14. Low Rank Approximation for QR Factorization: Example

  15. Comparing Cosine Similarities for the Query: “Child Proofing”

  16. Comparing Cosine Similarities for the Query: “Child Home Safety”

  17. Singular Value Decomposition • While the QR factorization provides a reduced-rank basis for the column space of A, it provides no information about the row space of A. • The SVD can provide: • a reduced-rank approximation for both spaces • the rank-k approximation to A of minimal change, for any value of k.

  18. Singular Value Decomposition • A = UΣV^T, where U is an m × m orthogonal matrix whose columns define the left singular vectors of A, V is an n × n orthogonal matrix whose columns define the right singular vectors of A, and Σ is an m × n diagonal matrix containing the singular values σ_1 ≥ σ_2 ≥ … ≥ σ_min{m,n} ≥ 0. • Such a factorization exists for any matrix A.
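
A minimal SVD sketch in NumPy on a random 9 × 7 stand-in matrix, verifying the factorization and the ordering of the singular values:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((9, 7))            # stand-in term-by-document matrix

# Full SVD: U is 9 x 9 orthogonal, Vt = V^T with V 7 x 7 orthogonal,
# and s holds the singular values in non-increasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

Sigma = np.zeros_like(A)          # 9 x 7 "diagonal" matrix of singular values
np.fill_diagonal(Sigma, s)

print(np.allclose(U @ Sigma @ Vt, A))    # True: A = U Sigma V^T
print(np.all(np.diff(s) <= 0))           # True: sigma_1 >= sigma_2 >= ...
```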

  19. Component Matrices of the SVD

  20. SVD vs. QR • What is the relationship between the rank of A and the ranks of the matrices in both factorizations? • In QR, the first r_A = rank(A) columns of Q form a basis for the column space of A; so do the first r_A columns of U. The first r_A rows of V^T form a basis for the row space of A. • The low rank-k approximation in the SVD is obtained by setting all but the k largest singular values in Σ to zero.

  21. SVD • Theorem (Eckart and Young): The low rank-k approximation A_k = U_k Σ_k V_k^T of the SVD is the closest rank-k approximation to A. • The error in approximating A by A_k is given by ||A − A_k||_F = sqrt( σ_{k+1}^2 + σ_{k+2}^2 + … + σ_{r_A}^2 ), where r_A = rank(A). • Hence, the error in approximating the original matrix is determined by the discarded singular values σ_{k+1}, σ_{k+2}, …, σ_{r_A}.
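
A numerical check of the Eckart-Young error formula (NumPy, random stand-in matrix, hypothetical choice k = 3):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.random((9, 7))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # A_k = U_k Sigma_k V_k^T

# Eckart-Young: the Frobenius error equals the norm of the discarded
# singular values sigma_{k+1}, ..., sigma_{rank(A)}.
err = np.linalg.norm(A - A_k, "fro")
print(np.isclose(err, np.sqrt(np.sum(s[k:] ** 2))))   # True
```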

  22. SVD: Example

  23. SVD: Example

  24. SVD: Example • ||A − A_6||_F = … • Hence, the relative change in the matrix A is … • Therefore, a rank-5 approximation may be appropriate in our case. • Determining the best rank-k approximation for any database depends on empirical testing. • For very large databases, the number could be between 100 and 300. • Computational feasibility, rather than accuracy, determines the rank reduction.
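
A sketch of such empirical testing, assuming a random stand-in matrix: since ||A||_F^2 is the sum of all squared singular values, the relative change for every candidate rank k can be read off the tail of the spectrum:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.random((9, 7))                 # stand-in term-by-document matrix
s = np.linalg.svd(A, compute_uv=False)

# Relative change ||A - A_k||_F / ||A||_F for each candidate rank k;
# in practice one picks the smallest k giving an acceptable change.
total = np.sqrt(np.sum(s ** 2))        # equals ||A||_F
for k in range(1, s.size + 1):
    rel = np.sqrt(np.sum(s[k:] ** 2)) / total
    print(f"k = {k}: relative change = {rel:.4f}")
```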

  25. Low Rank Approximations • Visual comparison of rank-reduced approximations to A can be misleading. • Compare the rank-4 QR approximation with the more accurate rank-4 SVD approximation. • The rank-4 SVD approximation reveals associations with terms that do not appear in the document title, e.g., Term 4 (Health) and Term 8 (Safety) in Document 1 (Infant & Toddler First Aid).

  26. Query Matching • Given a query vector q, it is to be compared with the columns of the reduced-rank matrix A_k. • Let e_j denote the j-th canonical vector (the j-th column of the n × n identity matrix I_n). Then A_k e_j represents the j-th column of A_k. • It is easy to show that cos θ_j = (s_j^T (U_k^T q)) / (||s_j||_2 ||q||_2), where s_j = Σ_k V_k^T e_j.

  27. Query Matching • An alternate formula for the cosine computation is cos θ_j = (s_j^T (U_k^T q)) / (||s_j||_2 ||U_k^T q||_2). • Note that ||U_k^T q||_2 ≤ ||q||_2, so each cosine can only increase in magnitude, which means that the number of documents retrieved using this query-matching technique is larger.
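
A sketch of both cosine formulas (NumPy, random stand-ins for A and q, hypothetical k = 3); the absolute cosines from the alternate formula are never smaller, since only the denominator shrinks:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.random((9, 7))                 # stand-in term-by-document matrix
q = rng.random(9)                      # stand-in query vector

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
Uk, Vtk = U[:, :k], Vt[:k, :]
S = np.diag(s[:k]) @ Vtk               # column j of S is s_j = Sigma_k V_k^T e_j
Ukq = Uk.T @ q                         # U_k^T q, computed once per query

# cos(theta_j) = s_j^T (U_k^T q) / (||s_j||_2 ||q||_2); note A_k e_j = U_k s_j,
# so ||A_k e_j||_2 = ||s_j||_2 because U_k has orthonormal columns.
cos_std = (S.T @ Ukq) / (np.linalg.norm(S, axis=0) * np.linalg.norm(q))

# Alternate formula: replace ||q||_2 by ||U_k^T q||_2 <= ||q||_2, so each
# cosine grows in magnitude and more documents clear a fixed threshold.
cos_alt = (S.T @ Ukq) / (np.linalg.norm(S, axis=0) * np.linalg.norm(Ukq))
print(np.all(np.abs(cos_alt) >= np.abs(cos_std)))   # True
```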
