This paper discusses the limitations of traditional term-matching methods in information retrieval and introduces Latent Semantic Analysis (LSA) as a solution. LSA aims to identify concepts rather than relying solely on terms, addressing the ambiguity and synonymy inherent in language. The paper explains how Singular Value Decomposition (SVD) is used to uncover latent semantic structure in the data. By redefining term-document relationships, LSA improves retrieval quality, allowing for more effective information access.
Paper: Indexing by Latent Semantic Analysis • Course: cs630 • Presented by: Haiyan Qiao
Problem Introduction • The traditional term-matching method doesn't work well in information retrieval. • We want to capture concepts instead of words. Concepts are reflected in the words, however: • One term may have multiple meanings (polysemy). • Different terms may have the same meaning (synonymy).
LSI (Latent Semantic Indexing) • The LSI approach tries to overcome the deficiencies of term-matching retrieval by treating the unreliability of observed term-document association data as a statistical problem. • The goal is to find an effective model of the relationship between terms and documents, so that the raw set of terms, which is by itself incomplete and unreliable, is replaced by a set of entities that are more reliable indicators of meaning.
SVD (Singular Value Decomposition) • How do we learn the concepts from the data? • SVD is applied to derive the latent semantic structure model. • What is SVD? See: http://kwon3d.com/theory/jkinem/svd.html http://mathworld.wolfram.com/SingularValueDecomposition.html http://www.cs.ut.ee/~toomas_l/linalg/lin2/node13.html#SECTION00013200000000000000
SVD cont’ • SVD of the term-by-document matrix X: X = T0 S0 D0', where T0 and D0 have orthonormal columns and S0 is diagonal. • If the singular values in S0 are ordered by size, we keep only the k largest and obtain the reduced model X̂ = T S D'. • X̂ doesn't exactly match X, and it gets closer as more singular values are kept. • This is what we want: we don't want a perfect fit, since we believe some of the 0's in X should be 1's and vice versa. • The reduced model reflects the major associative patterns in the data and ignores the smaller, less important influences and noise.
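As a minimal sketch of the reduced model, MATLAB's built-in svd (which, unlike eig, returns the singular values already sorted largest-first) can produce the rank-k approximation directly; up to column signs, T, S, and D here match the matrices used later in the example:
>> [U, Sfull, V] = svd(X);   % full SVD: X = U*Sfull*V'
>> k = 2;                    % number of dimensions to keep
>> T = U(:, 1:k);            % reduced term matrix (12 x k)
>> S = Sfull(1:k, 1:k);      % top-k singular values (k x k diagonal)
>> D = V(:, 1:k);            % reduced document matrix (9 x k)
>> Xhat = T*S*D';            % best rank-k approximation of X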
Fundamental Comparison Quantities from the SVD Model • Comparing two terms: the dot product between two row vectors of X̂ = T S D' reflects the extent to which the two terms have a similar pattern of occurrence across the set of documents. • Comparing two documents: the dot product between two column vectors of X̂ reflects the extent to which the two documents contain similar terms. • Comparing a term and a document: the value of the corresponding individual cell of X̂.
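A minimal sketch of these three comparisons in MATLAB, assuming the T, S, and D computed above (D orthonormal makes Xhat*Xhat' collapse to (T*S)*(T*S)', and likewise for documents):
>> term_sim = (T*S) * (T*S)';            % 12x12: dot products between term rows of Xhat
>> doc_sim  = (D*S) * (D*S)';            % 9x9: dot products between document columns of Xhat
>> td_sim   = (T*sqrt(S)) * (D*sqrt(S))';  % 12x9: equals T*S*D' = Xhat, cell (i,j) compares term i and document j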
Example - Technical Memo
• Query: human-computer interaction
• Dataset:
c1 Human machine interface for Lab ABC computer application
c2 A survey of user opinion of computer system response time
c3 The EPS user interface management system
c4 System and human system engineering testing of EPS
c5 Relations of user-perceived response time to error measurement
m1 The generation of random, binary, unordered trees
m2 The intersection graph of paths in trees
m3 Graph minors IV: Widths of trees and well-quasi-ordering
m4 Graph minors: A survey
Example cont’
% 12-term by 9-document matrix X; rows are the index terms
% (human, interface, computer, user, system, response, time,
%  EPS, survey, trees, graph, minors), columns are c1-c5, m1-m4
>> X = [ 1 0 0 1 0 0 0 0 0;
         1 0 1 0 0 0 0 0 0;
         1 1 0 0 0 0 0 0 0;
         0 1 1 0 1 0 0 0 0;
         0 1 1 2 0 0 0 0 0;
         0 1 0 0 1 0 0 0 0;
         0 1 0 0 1 0 0 0 0;
         0 0 1 1 0 0 0 0 0;
         0 1 0 0 0 0 0 0 1;
         0 0 0 0 0 1 1 1 0;
         0 0 0 0 0 0 1 1 1;
         0 0 0 0 0 0 0 1 1 ];
Example cont’
% X = T0*S0*D0', where T0 and D0 have orthonormal columns and S0 is diagonal
% T0 is the matrix of eigenvectors of the square symmetric matrix X*X'
% D0 is the matrix of eigenvectors of X'*X
% In both cases the eigenvalues are the squares of the singular values in S0
% (note: eig returns eigenvalues in ascending order, so the columns for the
%  largest eigenvalues appear last)
>> [T0, S0] = eig(X*X');
>> T0
T0 =
 0.1561 -0.2700  0.1250 -0.4067 -0.0605 -0.5227 -0.3410 -0.1063 -0.4148  0.2890 -0.1132  0.2214
 0.1516  0.4921 -0.1586 -0.1089 -0.0099  0.0704  0.4959  0.2818 -0.5522  0.1350 -0.0721  0.1976
-0.3077 -0.2221  0.0336  0.4924  0.0623  0.3022 -0.2550 -0.1068 -0.5950 -0.1644  0.0432  0.2405
 0.3123 -0.5400  0.2500  0.0123 -0.0004 -0.0029  0.3848  0.3317  0.0991 -0.3378  0.0571  0.4036
 0.3077  0.2221 -0.0336  0.2707  0.0343  0.1658 -0.2065 -0.1590  0.3335  0.3611 -0.1673  0.6445
-0.2602  0.5134  0.5307 -0.0539 -0.0161 -0.2829 -0.1697  0.0803  0.0738 -0.4260  0.1072  0.2650
-0.0521  0.0266 -0.7807 -0.0539 -0.0161 -0.2829 -0.1697  0.0803  0.0738 -0.4260  0.1072  0.2650
-0.7716 -0.1742 -0.0578 -0.1653 -0.0190 -0.0330  0.2722  0.1148  0.1881  0.3303 -0.1413  0.3008
 0.0000  0.0000  0.0000 -0.5794 -0.0363  0.4669  0.0809 -0.5372 -0.0324 -0.1776  0.2736  0.2059
 0.0000  0.0000  0.0000 -0.2254  0.2546  0.2883 -0.3921  0.5942  0.0248  0.2311  0.4902  0.0127
-0.0000 -0.0000 -0.0000  0.2320 -0.6811 -0.1596  0.1149 -0.0683  0.0007  0.2231  0.6228  0.0361
 0.0000 -0.0000  0.0000  0.1825  0.6784 -0.3395  0.2773 -0.3005 -0.0087  0.1411  0.4505  0.0318
Example cont’
>> [D0, S0] = eig(X'*X);
>> D0
D0 =
 0.0637  0.0144 -0.1773  0.0766 -0.0457 -0.9498  0.1103 -0.0559  0.1974
-0.2428 -0.0493  0.4330  0.2565  0.2063 -0.0286 -0.4973  0.1656  0.6060
-0.0241 -0.0088  0.2369 -0.7244 -0.3783  0.0416  0.2076 -0.1273  0.4629
 0.0842  0.0195 -0.2648  0.3689  0.2056  0.2677  0.5699 -0.2318  0.5421
 0.2624  0.0583 -0.6723 -0.0348 -0.3272  0.1500 -0.5054  0.1068  0.2795
 0.6198 -0.4545  0.3408  0.3002 -0.3948  0.0151  0.0982  0.1928  0.0038
-0.0180  0.7615  0.1522  0.2122 -0.3495  0.0155  0.1930  0.4379  0.0146
-0.5199 -0.4496 -0.2491 -0.0001 -0.1498  0.0102  0.2529  0.6151  0.0241
 0.4535  0.0696 -0.0380 -0.3622  0.6020 -0.0246  0.0793  0.5299  0.0820
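As a quick sanity-check sketch: both eigenvector matrices should have orthonormal columns, and the nonzero eigenvalues of X*X' and X'*X should coincide:
>> norm(T0'*T0 - eye(12))        % ~0: columns of T0 are orthonormal
>> norm(D0'*D0 - eye(9))         % ~0: columns of D0 are orthonormal
>> ev_t = sort(eig(X*X'), 'descend');
>> ev_d = sort(eig(X'*X), 'descend');
>> norm(ev_t(1:9) - ev_d)        % ~0: the nonzero eigenvalues agree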
Example cont’
>> S0 = eig(X'*X);
>> S0 = S0.^0.5
S0 =
0.3637
0.5601
0.8459
1.3064
1.5048
1.6445
2.3539
2.5417
3.3409
% The singular values are the square roots of the eigenvalues.
% We only keep the two largest singular values
% and the corresponding columns of T0 and D0.
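Because eig returned the eigenvalues in ascending order, the two largest singular values correspond to the last two columns of T0 and D0 (eigenvector signs are arbitrary, so they may come out flipped). A sketch of the selection:
>> k = 2;
>> T = T0(:, end:-1:end-k+1);     % last two columns of T0, largest eigenvalue first (12 x 2)
>> D = D0(:, end:-1:end-k+1);     % last two columns of D0 (9 x 2)
>> S = diag(S0(end:-1:end-k+1));  % diag([3.3409 2.5417])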
Example cont’
>> T = [ 0.2214 -0.1132;
         0.1976 -0.0721;
         0.2405  0.0432;
         0.4036  0.0571;
         0.6445 -0.1673;
         0.2650  0.1072;
         0.2650  0.1072;
         0.3008 -0.1413;
         0.2059  0.2736;
         0.0127  0.4902;
         0.0361  0.6228;
         0.0318  0.4505 ];
>> S = [ 3.3409 0; 0 2.5417 ];
% Dt holds D' (D transposed); MATLAB does not allow assigning to D'
>> Dt = [ 0.1974 0.6060  0.4629  0.5421 0.2795 0.0038 0.0146 0.0241 0.0820;
         -0.0559 0.1656 -0.1273 -0.2318 0.1068 0.1928 0.4379 0.6151 0.5299 ];
>> T*S*Dt
ans =
 0.1621 0.4006  0.3790  0.4677 0.1760 -0.0527 -0.1152 -0.1592 -0.0918
 0.1406 0.3697  0.3289  0.4004 0.1649 -0.0328 -0.0706 -0.0968 -0.0430
 0.1525 0.5051  0.3580  0.4101 0.2363  0.0242  0.0598  0.0869  0.1241
 0.2581 0.8412  0.6057  0.6973 0.3924  0.0331  0.0832  0.1218  0.1875
 0.4488 1.2344  1.0509  1.2658 0.5564 -0.0738 -0.1548 -0.2097 -0.0488
 0.1595 0.5816  0.3751  0.4168 0.2766  0.0559  0.1322  0.1889  0.2170
 0.1595 0.5816  0.3751  0.4168 0.2766  0.0559  0.1322  0.1889  0.2170
 0.2185 0.5495  0.5109  0.6280 0.2425 -0.0654 -0.1426 -0.1967 -0.1079
 0.0969 0.5320  0.2299  0.2117 0.2665  0.1367  0.3146  0.4443  0.4249
-0.0613 0.2320 -0.1390 -0.2658 0.1449  0.2404  0.5462  0.7674  0.6637
-0.0647 0.3352 -0.1457 -0.3016 0.2028  0.3057  0.6949  0.9766  0.8487
-0.0430 0.2540 -0.0966 -0.2078 0.1520  0.2212  0.5030  0.7069  0.6155
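To actually run the query "human-computer interaction" from the earlier slide, a query can be folded in as a pseudo-document; a sketch, assuming rows 1 and 3 of X are the terms "human" and "computer" ("interaction" is not in the vocabulary), and using a plain cosine in the 2-D space as the ranking score:
% Build the query's term vector and project it into the 2-D LSI space
>> q = zeros(12, 1);
>> q([1 3]) = 1;                 % "human" (term 1) and "computer" (term 3)
>> dq = q' * T / S;              % pseudo-document coordinates, 1 x 2
% Rank the nine documents by cosine similarity to the query
>> D = Dt';                      % 9 x 2 document coordinates
>> sims = (D * dq') ./ (sqrt(sum(D.^2, 2)) * norm(dq));
>> [~, ranking] = sort(sims, 'descend')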
Summary • What do PCA and SVD have in common, and how do they differ? • Both are built on the standard eigenvalue-eigenvector decomposition and are used to remove noise or correlation and keep the most important information. • PCA operates on the covariance matrix, while SVD works on the original data matrix.
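A small illustrative sketch of that relationship, reusing the X above: the principal components (eigenvectors of the covariance matrix) match the right singular vectors of the mean-centered data matrix, up to sign, and the eigenvalues are the squared singular values scaled by 1/(n-1):
% PCA via the covariance matrix vs. SVD of the centered data
>> Xc = X - repmat(mean(X), 12, 1);   % mean-center each column
>> [Vpca, Epca] = eig(cov(X));        % PCA: eigenvectors of the 9x9 covariance
>> [~, Ssvd, Vsvd] = svd(Xc);         % SVD of the centered matrix
>> sort(diag(Epca), 'descend')'       % covariance eigenvalues ...
>> (diag(Ssvd).^2 / 11)'              % ... equal squared singular values / (n-1), n = 12
% The columns of Vpca and Vsvd agree up to sign and ordering.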