

  1. Comparison of information retrieval techniques: Latent semantic indexing (LSI) and Concept indexing (CI) Jasminka Dobša Faculty of organization and informatics, Varaždin

  2. Outline • Information retrieval in vector space model (VSM) or bag of words representation • Techniques for conceptual indexing • Latent semantic indexing • Concept indexing • Comparison: Academic example • Experiment • Further work

  3. Information retrieval in VSM 1/3 • Task of information retrieval: to extract the documents from a document collection that are relevant to a user query • In VSM documents are represented as vectors in a high-dimensional space • The dimension of the space depends on the number of indexing terms chosen as relevant for the collection (4000-5000 in my experiments) • VSM is implemented by forming the term-document matrix

  4. Information retrieval in VSM 2/3 • Term-document matrix is an m×n matrix, where m is the number of terms and n is the number of documents • Row of the term-document matrix = term • Column of the term-document matrix = document • Figure 1. Term-document matrix

  5. Information retrieval in VSM 3/3 • A query has the same shape as a document (an m-dimensional vector) • The measure of similarity between a query q and a document a_j is the cosine of the angle between those two vectors
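To make the similarity computation concrete, here is a minimal NumPy sketch of cosine scoring in the VSM; the small matrix A and the query q are made-up toy data, not the collections used later:

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents (made-up data).
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])

# Query vector in the same m-dimensional term space.
q = np.array([1.0, 0.0, 1.0])

# Cosine of the angle between the query q and every document column a_j.
cosines = (q @ A) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))

# Documents ranked by decreasing similarity to the query.
print(np.argsort(-cosines))
```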

  6. Retrieval performance evaluation • Measures for evaluation: recall, precision, average precision • Recall = r_i / r_n and Precision = r_i / i, where r_i is the number of relevant documents among the i highest-ranked documents and r_n is the total number of relevant documents in the collection • Average precision – the average of the precision values over distinct levels of recall
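A sketch of how these measures can be computed for a single query, assuming `ranking` is the list of document ids ordered by decreasing similarity and `relevant` is the set of relevant ids (both hypothetical), and taking average precision as the mean of the precision values at the ranks where relevant documents appear:

```python
def precision_recall_at_ranks(ranking, relevant):
    # Precision (r_i / i) and recall (r_i / r_n) after each of the i highest-ranked documents.
    hits, precision, recall = 0, [], []
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        precision.append(hits / i)
        recall.append(hits / len(relevant))
    return precision, recall

def average_precision(ranking, relevant):
    # Mean of the precision values at the distinct recall levels,
    # i.e. at the ranks where a relevant document is retrieved.
    precision, _ = precision_recall_at_ranks(ranking, relevant)
    at_hits = [p for p, doc in zip(precision, ranking) if doc in relevant]
    return sum(at_hits) / len(relevant)

# Hypothetical example: documents 2 and 5 are relevant; 5 is retrieved at rank 3.
print(average_precision([2, 7, 5, 1], {2, 5}))  # (1/1 + 2/3) / 2 ≈ 0.833
```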

  7. Techniques for conceptual indexing • In the term-matching method the similarity between the query and the document is tested lexically • Polysemy (words having multiple meanings) and synonymy (multiple words having the same meaning) are two fundamental obstacles to efficient information retrieval • Here we compare two techniques for conceptual indexing based on the projection of document vectors (in the least-squares sense) onto a lower-dimensional vector space • Latent semantic indexing (LSI) • Concept indexing (CI)

  8. Latent semantic indexing • Introduced in 1990; improved in 1995 • S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman: Indexing by latent semantic analysis, J. American Society for Information Science, 41, 1990, pp. 391-407 • M. W. Berry, S. T. Dumais, G. W. O'Brien: Using linear algebra for intelligent information retrieval, SIAM Review, 37, 1995, pp. 573-595 • Based on spectral analysis of the term-document matrix

  9. Latent semantic indexing • For every m×n matrix A there is a singular value decomposition (SVD) A = UΣV^T, where U is an orthogonal m×m matrix whose columns are the left singular vectors of A, Σ is a diagonal matrix whose diagonal holds the singular values of A in descending order, and V is an orthogonal n×n matrix whose columns are the right singular vectors of A

  10. Latent semantic indexing • For LSI the truncated SVD A_k = U_k Σ_k V_k^T is used, where U_k is the m×k matrix whose columns are the first k left singular vectors of A, Σ_k is the k×k diagonal matrix whose diagonal is formed by the k leading singular values of A, and V_k is the n×k matrix whose columns are the first k right singular vectors of A • Rows of U_k = terms • Rows of V_k = documents

  11. (Truncated) SVD

  12. Latent semantic indexing • Using the truncated SVD we include only the first k independent linear components of A (singular vectors and values) • Documents are projected in the least-squares sense onto the space spanned by the first k left singular vectors of A (the LSI space) • The first k components capture the major associational structure in the term-document matrix and throw out the noise • Minor differences in the terminology used in documents are ignored • Closeness of objects (queries and documents) is determined by the overall pattern of term usage, so it is context based • Documents which contain synonyms are closer in the LSI space than in the original space; documents which use a polysemous word in different contexts are farther apart in the LSI space than in the original space
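A hedged illustration of retrieval in the LSI space on random toy data; the assumed details (not stated on the slides) are that documents are represented by the rows of V_k Σ_k and that a query is folded in as q_hat = Σ_k^{-1} U_k^T q:

```python
import numpy as np

def lsi_space(A, k):
    # Truncated SVD: keep only the first k singular triplets of A.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def project_query(q, Uk, sk):
    # Project (fold in) a query into the k-dimensional LSI space: q_hat = Sigma_k^{-1} U_k^T q.
    return (Uk.T @ q) / sk

# Toy data (assumption): 4 terms x 5 documents.
rng = np.random.default_rng(0)
A = rng.random((4, 5))
Uk, sk, Vtk = lsi_space(A, k=2)

docs_k = (np.diag(sk) @ Vtk).T              # document coordinates: one row per document
q = rng.random(4)
q_k = project_query(q, Uk, sk)

# Cosine similarity between the projected query and every projected document.
cos = docs_k @ q_k / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k))
print(np.argsort(-cos))                     # documents ranked by similarity in the LSI space
```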

  13. Concept indexing (CI) • Indexing using concept decomposition (CD) instead of the SVD used in LSI • Concept decomposition was introduced in 2001 • I. S. Dhillon, D. S. Modha: Concept decompositions for large sparse text data using clustering, Machine Learning, 42:1, 2001, pp. 143-175

  14. Concept decomposition • First step: clustering of the documents of the term-document matrix A into k groups • Clustering algorithms: • Spherical k-means algorithm • Fuzzy k-means algorithm • The spherical k-means algorithm is a variant of the k-means algorithm which exploits the fact that the document vectors have unit norm • Centroids of the groups = concept vectors • The concept matrix C_k = [c_1, c_2, …, c_k] is the matrix whose columns are the centroids of the groups, where c_j is the centroid of the j-th group

  15. Concept decomposition • Second step: calculating the concept decomposition • The concept decomposition D_k = C_k Z of the term-document matrix A is the least-squares approximation of A on the space of the concept vectors, where Z is the solution of the least-squares problem min_Z ||A − C_k Z||_F • Rows of C_k = terms • Columns of Z = documents
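A rough sketch of both steps, assuming spherical k-means is implemented directly on unit-normalized document columns (a simplification, not necessarily the exact procedure used in the experiments); Z is computed from the least-squares problem min_Z ||A − C_k Z||_F:

```python
import numpy as np

def spherical_kmeans(A, k, iters=20, seed=0):
    # Tiny spherical k-means sketch: columns of A are unit-norm documents;
    # returns the m x k concept matrix C whose columns are the concept vectors.
    rng = np.random.default_rng(seed)
    A = A / np.linalg.norm(A, axis=0)                        # unit-normalize the document vectors
    C = A[:, rng.choice(A.shape[1], size=k, replace=False)]  # initial concepts = random documents
    for _ in range(iters):
        labels = np.argmax(C.T @ A, axis=0)                  # assign each document by cosine
        for j in range(k):
            members = A[:, labels == j]
            if members.shape[1]:
                c = members.sum(axis=1)
                C[:, j] = c / np.linalg.norm(c)              # renormalized centroid = concept vector
    return C

def concept_decomposition(A, C):
    # D_k = C Z with Z the solution of the least-squares problem min_Z ||A - C Z||_F.
    Z, *_ = np.linalg.lstsq(C, A, rcond=None)
    return C @ Z, Z

rng = np.random.default_rng(0)
A = rng.random((6, 10))                                      # toy term-document matrix (assumption)
C = spherical_kmeans(A, k=3)
Dk, Z = concept_decomposition(A, C)
print(np.linalg.norm(A - Dk))                                # Frobenius error of the approximation
```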

  16. Comparison: Academic example • Collection of 15 documents (titles of books) • 9 from the field of data mining • 5 from the field of linear algebra • 1 combining these fields (application of linear algebra to data mining) • The list of terms was formed • from words contained in at least two documents • Words on the stop list were removed • Stemming was performed • To the term-document matrix we apply • Truncated SVD (k=2) • Concept decomposition (k=2)
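A rough preprocessing sketch for building such a term list; the stop list and the suffix-stripping "stemmer" below are crude placeholders (assumptions), standing in for the real stop list and stemmer used for the example:

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "for", "and", "in", "to", "using"}  # placeholder stop list

def stem(word):
    # Trivial suffix stripping as a stand-in for a real stemmer (assumption).
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_term_list(documents):
    # Keep stemmed words that occur in at least two documents and are not stop words.
    doc_terms = []
    for text in documents:
        words = re.findall(r"[a-z]+", text.lower())
        doc_terms.append({stem(w) for w in words if w not in STOP_WORDS})
    document_frequency = Counter(t for terms in doc_terms for t in terms)
    return sorted(t for t, df in document_frequency.items() if df >= 2)

titles = ["Mining of massive data sets", "Data mining concepts", "Matrix computations"]
print(build_term_list(titles))   # ['data', 'min'] with this toy stemmer
```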

  17. Documents 1/2

  18. Documents 2/2

  19. Terms

  20. Projection of terms by SVD

  21. Projection of terms by CD

  22. Queries • Q1: Data mining • Relevant documents: all data mining documents • Q2: Using linear algebra for data mining • Relevant document: D6

  23. Projection of documents by SVD

  24. Projection of documents by CD

  25. Results of information retrieval (Q1)

  26. Results of information retrieval (Q2)

  27. Collections • MEDLINE • 1033 documents • 30 queries • Relevance judgements • CRANFIELD • 1400 documents • 225 queries • Relevance judgements

  28. Test A • Comparison of the errors of approximation of the term-document matrix by 1) rank-k SVD 2) rank-k CD
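A small self-contained sketch of this comparison, under the assumption that the approximation error is measured by the Frobenius norm ||A − A_k||_F (resp. ||A − D_k||_F), which the slides do not state explicitly; the concept matrix here is a crude stand-in built from random document columns:

```python
import numpy as np

def svd_error(A, k):
    # Error of the best rank-k approximation A_k = U_k Sigma_k V_k^T.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    return np.linalg.norm(A - Ak)        # np.linalg.norm of a matrix defaults to Frobenius

def cd_error(A, C):
    # Error of the concept decomposition D_k = C Z, Z = argmin ||A - C Z||_F.
    Z, *_ = np.linalg.lstsq(C, A, rcond=None)
    return np.linalg.norm(A - C @ Z)

rng = np.random.default_rng(0)
A = rng.random((50, 80))                            # toy term-document matrix (assumption)
k = 10
C = A[:, rng.choice(80, size=k, replace=False)]     # stand-in concept matrix: k random documents
print(svd_error(A, k), cd_error(A, C))              # the SVD error is never larger (Eckart-Young)
```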

  29. MEDLINE - errors of approximation

  30. CRANFIELD - errors of approximation

  31. Test B • Average inner product between the concept vectors c_j, j=1,2,…,k • Comparison of the average inner product for • Concept vectors obtained by the spherical k-means algorithm • Concept vectors obtained by the fuzzy k-means algorithm
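A minimal sketch of the quantity measured in Test B; the concept vectors are assumed to be the unit-norm columns of a concept matrix C, so values close to zero indicate near-orthogonality:

```python
import numpy as np

def average_inner_product(C):
    # Mean inner product over all distinct pairs of concept vectors (columns of C).
    C = C / np.linalg.norm(C, axis=0)    # unit-normalize, so inner product = cosine
    G = C.T @ C                          # Gram matrix of the concept vectors
    k = C.shape[1]
    return G[~np.eye(k, dtype=bool)].mean()

# Example with a random m x k concept matrix (toy data); values near 0 mean near-orthogonal.
rng = np.random.default_rng(0)
print(average_inner_product(rng.random((100, 5))))
```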

  32. MEDLINE – average inner product

  33. CRANFIELD – average inner product

  34. Test C • Comparison of the mean average precision of information retrieval and precision-recall plots • Mean average precision for the term-matching method: • MEDLINE: 43.54 • CRANFIELD: 20.89

  35. MEDLINE – mean average precision

  36. CRANFIELD – mean average precision

  37. MEDLINE – precision-recall plot

  38. CRANFIELD – precision-recall plot

  39. Test D • Correlation between mean average precision (MAP) and clustering quality • Measure of cluster quality – the generalized within-groups sum of squared errors function J_fuzz = Σ_{i=1..k} Σ_{j=1..n} μ_ij^b ||a_j − c_i||^2 • a_j, j=1,2,…,n are the document vectors • c_i, i=1,2,…,k are the concept vectors • μ_ij is the fuzzy membership degree of document a_j in the group whose concept vector is c_i • b > 1 is the weight exponent
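A small sketch of the reconstructed objective; the membership matrix `mu` is assumed to be given (e.g. from a fuzzy k-means run), with `mu[j, i]` holding the membership degree of document j in group i, and b = 2 is used as a common choice of the weight exponent b > 1:

```python
import numpy as np

def j_fuzz(A, C, mu, b=2.0):
    # J_fuzz = sum_i sum_j mu_ij**b * ||a_j - c_i||**2, with documents a_j as
    # columns of A, concept vectors c_i as columns of C, and mu[j, i] = mu_ij.
    d2 = ((A[:, :, None] - C[:, None, :]) ** 2).sum(axis=0)  # squared distances, shape (n, k)
    return float((mu ** b * d2).sum())

# Toy data: 4 terms, 6 documents, 2 concepts; memberships normalized per document.
rng = np.random.default_rng(0)
A, C = rng.random((4, 6)), rng.random((4, 2))
mu = rng.random((6, 2))
mu = mu / mu.sum(axis=1, keepdims=True)
print(j_fuzz(A, C, mu))
```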

  40. MEDLINE - Correlation (clustering quality and MAP) • 46 observations for rank of approximation k ∈ [1,100] • Correlation between mean average precision and J_fuzz is r = -0.968198 with significance p << 0.01 • Correlation between rank of approximation and mean average precision is r = 0.70247 (p << 0.01) • Correlation between rank of approximation and J_fuzz is r = -0.831071 (p << 0.01)

  41. CRANFIELD - Correlation (clustering quality and MAP) • 46 observations for rank of approximation k ∈ [1,100] • Correlation between mean average precision and J_fuzz is r = -0.988293 with significance p << 0.01 • Correlation between rank of approximation and mean average precision is r = 0.914489 (p << 0.01) • Correlation between rank of approximation and J_fuzz is r = -0.904415 (p << 0.01)

  42. Regression line: clustering quality and MAP (MEDLINE)

  43. Regression line: clustering quality and MAP (CRANFIELD)

  44. Conclusion 1/3 • By the SVD approximation the term-document matrix is projected onto the first k left singular vectors, which form an orthogonal basis for the LSI space • By the CD approximation the term-document matrix is projected onto the k centroids of the groups (concept vectors) • The concept vectors form the basis for the CI space; they tend to orthogonality as k increases • Concept vectors obtained by the fuzzy k-means algorithm tend to orthogonality faster than those obtained by the spherical k-means algorithm • CI using CD by the fuzzy k-means algorithm gives higher MAP of information retrieval than LSI on both collections we have used

  45. Conclusion 2/3 • CI using CD by the spherical k-means algorithm gives lower (but comparable) MAP of information retrieval than LSI on both collections we have used • According to the MAP results, k=75 for the MEDLINE collection and k=200 for the CRANFIELD collection are good choices for the rank of approximation • With LSI and CI documents are represented by smaller matrices: • For the MEDLINE collection the term-document matrix is stored in a 5940×1033 matrix – the approximations of the documents are stored in a 75×1033 matrix • For the CRANFIELD collection the term-document matrix is stored in a 4758×1400 matrix – the approximations of the documents are stored in a 200×1400 matrix

  46. Conclusion 3/3 • LSI and CI work better on the MEDLINE collection • When evaluated for different ranks of approximation, MAP is more stable for LSI than for CI • There is a high correlation between MAP and clustering quality

  47. Further work • To apply CI to the problem of classification in a supervised setting • To propose solutions to the problem of adding new documents to the collection for the CI method • Adding new documents to the collection requires recomputation of the SVD or CD • This is computationally inefficient • Two approximation methods have been developed for adding new documents to the collection for the LSI method
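One of the known approximation methods for LSI is folding-in (described by Berry, Dumais and O'Brien): a new document d is mapped into the existing k-dimensional space as d_hat = Σ_k^{-1} U_k^T d without recomputing the SVD. A minimal sketch on toy data:

```python
import numpy as np

def fold_in_document(d, Uk, sk):
    # Fold a new document d into the existing k-dimensional LSI space
    # without recomputing the SVD: d_hat = Sigma_k^{-1} U_k^T d.
    return (Uk.T @ d) / sk

# Existing LSI space from a toy collection (assumption), plus one new document.
rng = np.random.default_rng(0)
A = rng.random((8, 12))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
Uk, sk = U[:, :k], s[:k]

d_new = rng.random(8)
print(fold_in_document(d_new, Uk, sk))   # coordinates of the new document in the LSI space
```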
