
CS533 Information Retrieval

Learn about clustering methods for information retrieval, including heuristic clustering, one-pass assignments, and buckshot clustering. Explore the use of truncated document vectors and cluster-based retrieval. Discover the advantages and disadvantages of bottom-up and top-down search strategies. Also, understand the usage of thesauri in information retrieval and building word-based thesauri.



Presentation Transcript


  1. CS533 Information Retrieval Dr. Michal Cutler Lecture #19 April 11, 2000

  2. Clustering using existing clusters • Start with a set of clusters • Compute centroids • Compare every item to each centroid • Move each item to its most similar cluster • Update centroids • The process ends when every item is already in its best cluster
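The loop on this slide can be sketched in Python; cosine similarity and the dense-vector representation are assumptions, since the slide does not fix a similarity measure:

```python
import numpy as np

def reassign_clusters(items, centroids, max_iters=100):
    """Iteratively move each item to its most similar centroid and
    recompute centroids until assignments stabilize (a k-means-style
    sketch of the slide's steps; cosine similarity is assumed)."""
    def normalize(m):
        # Normalize rows so the dot product equals cosine similarity
        norms = np.linalg.norm(m, axis=1, keepdims=True)
        return m / np.where(norms == 0, 1, norms)

    items = np.asarray(items, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    assign = None
    for _ in range(max_iters):
        sims = normalize(items) @ normalize(centroids).T
        new_assign = sims.argmax(axis=1)
        if assign is not None and np.array_equal(assign, new_assign):
            break  # every item is already in its best cluster
        assign = new_assign
        for c in range(len(centroids)):
            members = items[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return assign, centroids
```

Unlike the heuristic methods on the next slides, this pass revisits every item each round, so the result does not depend on item order.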

  3. Example Before After

  4. Heuristic clustering methods • Similarity matrix is not generated • Cluster depends on order of items • Inferior clusters but much faster • Can be used for incremental clustering

  5. One pass assignments • Item 1 placed in cluster • Each subsequent item compared against all clusters (initially against item 1) • Placed into an existing cluster if similar enough • Otherwise into a new cluster
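A minimal sketch of the one-pass assignment, comparing each item against cluster centroids (as slide 6 describes) with an incremental centroid update; the cosine measure and the threshold value are assumptions:

```python
import numpy as np

def one_pass_cluster(docs, threshold=0.5):
    """One-pass assignment sketch: each document joins the most similar
    existing cluster if it is similar enough (cosine similarity to the
    centroid), otherwise it starts a new cluster."""
    def cos(a, b):
        na, nb = np.linalg.norm(a), np.linalg.norm(b)
        return 0.0 if na == 0 or nb == 0 else float(a @ b) / (na * nb)

    clusters = []  # each: {"members": [doc index...], "centroid": vector}
    for i, d in enumerate(map(np.asarray, docs)):
        best, best_sim = None, threshold
        for c in clusters:
            s = cos(d, c["centroid"])
            if s >= best_sim:
                best, best_sim = c, s
        if best is None:
            clusters.append({"members": [i], "centroid": d.astype(float)})
        else:
            best["members"].append(i)
            n = len(best["members"])
            # Incremental centroid update when an item is added
            best["centroid"] = best["centroid"] + (d - best["centroid"]) / n
    return clusters
```

Because each document is compared only against the clusters that exist so far, the result depends on item order, as the previous slide notes.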

  6. Heuristic clustering methods • Uses similarity between an item and centroid of existing clusters • When item added to cluster its centroid is updated

  7. Buckshot Clustering • Goal: reasonably good k clusters in O(kn) time (k a constant) • A random sample of about √(kn) documents is clustered into k clusters in O(kn) time • The rest of the documents are added to the “best” of the k clusters in O(kn) time
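A rough sketch of buckshot under these bounds; for brevity the sample is clustered with a simple nearest-centroid refinement rather than the group-average agglomerative step of the original algorithm, and dot-product similarity is assumed:

```python
import math
import random
import numpy as np

def buckshot(docs, k, seed=0):
    """Buckshot sketch: cluster a random sample of ~sqrt(k*n) documents
    to obtain k centroids, then assign every document to its most
    similar centroid in O(kn) time."""
    rng = random.Random(seed)
    docs = np.asarray(docs, dtype=float)
    n = len(docs)
    sample_size = max(k, int(math.sqrt(k * n)))
    sample_idx = rng.sample(range(n), min(sample_size, n))

    # Seed k centroids from the sample, then refine them on the sample
    # (a stand-in for the agglomerative clustering of the sample).
    centroids = docs[sample_idx[:k]].copy()
    for _ in range(10):
        sims = docs[sample_idx] @ centroids.T
        assign = sims.argmax(axis=1)
        for c in range(k):
            members = docs[sample_idx][assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)

    # O(kn): add every document to the best of the k clusters.
    final = (docs @ centroids.T).argmax(axis=1)
    return final, centroids
```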

  8. Clustering with truncated document vectors • Most expensive step in incremental clustering is computing distance between a new document and all clusters using cosine • Clustering can be done successfully with vectors that contain only a few terms • This will happen when latent semantic indexing is used • An alternative is to discard terms with weights below a threshold
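The threshold alternative in the last bullet can be sketched as follows; the sparse dict-of-weights representation is an assumption:

```python
def truncate_vector(weights, threshold):
    """Keep only the terms whose weight reaches the threshold, so that
    later cosine comparisons touch far fewer terms. `weights` is a
    sparse {term: weight} mapping (an assumed representation)."""
    return {t: w for t, w in weights.items() if w >= threshold}
```

For example, `truncate_vector({"court": 0.8, "the": 0.05, "judge": 0.4}, 0.1)` drops the low-weight term and keeps the two informative ones.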

  9. Cluster based retrieval • Given a cluster hierarchy for a collection • A cluster is selected for retrieval when a query is similar to its centroid • The search can be done either top down or bottom up

  10. Cluster Retrieval Issues • A cluster is selected for retrieval when: • The similarity between query and centroid is above a threshold • Who decides on the threshold? • What should it be? • Instead of threshold a user may limit number retrieved to n

  11. Cluster Retrieval Issues • Should all the documents in a selected cluster be retrieved? • If yes, how will they be ranked? • Should each document in a selected cluster be compared to the query?

  12. Cluster based retrieval • Advantage • Relevant documents which do not contain query terms may be retrieved • Retrieval may be fast if only centroids compared to query

  13. Cluster based retrieval • Disadvantage • When whole cluster is returned precision may be low • Clusters may contain relevant documents even when a centroid is not similar to query

  14. Cluster based retrieval • Disadvantages • When clusters are large too many documents may be retrieved • Comparing each document in a selected cluster to query is time consuming

  15. Bottom up search • Query compared to each of the low level centroids (i.e., those that contain documents as well as clusters) • First, the best n low level clusters are selected • Each document in these clusters is compared to the query and the best n documents are chosen
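The two-stage selection can be sketched as follows; the (centroid, documents) cluster representation and cosine similarity are assumptions:

```python
import numpy as np

def bottom_up_search(query, clusters, n):
    """Bottom-up search sketch: rank the low-level clusters by centroid
    similarity to the query, keep the best n, then rank the documents
    inside those clusters and return the best n documents. `clusters`
    is a list of (centroid, {doc_id: vector}) pairs."""
    def cos(a, b):
        a, b = np.asarray(a, float), np.asarray(b, float)
        na, nb = np.linalg.norm(a), np.linalg.norm(b)
        return 0.0 if na == 0 or nb == 0 else float(a @ b) / (na * nb)

    best_clusters = sorted(clusters, key=lambda c: cos(query, c[0]),
                           reverse=True)[:n]
    candidates = [(doc_id, cos(query, vec))
                  for _, docs in best_clusters
                  for doc_id, vec in docs.items()]
    candidates.sort(key=lambda p: p[1], reverse=True)
    return candidates[:n]
```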

  16. Bottom up search [Tree diagram: low-level clusters 3–9 and documents A–N, each labeled with its similarity to the query] • n=3 • The 3 best clusters are 8 (.8), 4 (.7) and 5 (.6) • The 3 best documents are I (.9), L (.8) and A (.8)

  17. Top down search • Best first search, clusters reached are put into a priority queue. • Search until a cluster with at most n documents is reached • All documents in the cluster are retrieved. • If more documents are needed the best first search continues
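The best-first search described above can be sketched with a priority queue; the node representation (dicts with optional "docs" and "children", plus a similarity function) is an assumption:

```python
import heapq

def count_docs(node):
    """Total number of documents anywhere under this cluster."""
    return (len(node.get("docs", []))
            + sum(count_docs(c) for c in node.get("children", [])))

def all_docs(node):
    """Collect every document under this cluster."""
    docs = list(node.get("docs", []))
    for c in node.get("children", []):
        docs.extend(all_docs(c))
    return docs

def top_down_search(root, sim, n):
    """Best-first search: clusters reached go into a priority queue
    ordered by query similarity; a cluster holding at most n documents
    is retrieved whole, and the search continues while more documents
    are needed."""
    retrieved = []
    counter = 0  # tie-breaker so equal similarities never compare dicts
    heap = [(-sim(root), counter, root)]
    while heap and len(retrieved) < n:
        _, _, node = heapq.heappop(heap)
        if count_docs(node) <= n:
            retrieved.extend(all_docs(node))
        else:
            for child in node.get("children", []):
                counter += 1
                heapq.heappush(heap, (-sim(child), counter, child))
    return retrieved
```

On a tree like the one on the next slide, with n=2 the search expands the root, then cluster 4, then retrieves cluster 8 whole.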

  18. Top down search [Tree diagram: root cluster 1 (.2) with subclusters 2 (.5), 3 (.0) and 4 (.7); cluster 4 contains clusters 8 (.8) and 9 (.3); documents A–N carry their query similarities] • Cluster 1 is too big; the queue becomes (4 (.7), 2 (.5), 3 (0)) • Cluster 4 is too big; the queue becomes (8 (.8), 2 (.5), 9 (.3), 3 (0)) • I and J are retrieved. Then A and B.

  19. Thesaurus • A general thesaurus • Domain specific thesauri • Usage in IR • Building a word-based thesaurus

  20. A general thesaurus • A general thesaurus - contains synonyms and antonyms • Different word senses • Sometimes broader terms • Many domain specific terms not included (C++, OS/2, RAM, etc.)

  21. A general thesaurus • Roget http://humanities.uchicago.edu/forms_unrest/ROGET.html • Word based • Provides related terms, and a broader term

  22. Roget search for “car” • Vehicle (broader term) • car, auto, jalopy, clunker, lemon, flivver, coupe, sedan, two-door sedan, four-door sedan, luxury sedan; wheels [coll.], sports car, roadster, gran turismo [It], jeep, four-wheel drive vehicle, electric car, ...

  23. A general thesaurus • http://dictionary.langenberg.com/ - Chicago thesaurus • WordNet - a lexical database for English • http://www.cogsci.princeton.edu/~wn • car - noun has 5 senses in WordNet • car - auto, automobile, machine, motorcar • car - railcar...

  24. WordNet Evaluation • "Natural language processing is essential for dealing efficiently with the large quantities of text now available online: fact extraction and summarization, automated indexing and text categorization, and machine translation.”

  25. WordNet Evaluation • “Another essential function is helping the user with query formulation through synonym relationships between words and hierarchical and other relationships between concepts. WordNet supports both of these functions and thus deserves careful study by the digital library community”

  26. Domain specific thesauri • Keywords are usually phrases • Medical thesaurus • http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed. • Law based thesaurus, etc.

  27. A thesaurus for IR • Contains relations between concepts such as broader, narrower, related, etc. • Usually concepts form hierarchies • Thesaurus can be on-line or off-line • Library of Congress, for example, provides a thesaurus to determine search subjects and keywords

  28. A thesaurus for IR - example • computer aided instruction • see also: education • UF (used for): teaching machines • BT (broader term): educational computing • NT (narrower term): • TT (top term): computer application • RT (related terms): education, teaching

  29. How it is used • Manual indexing - term selection • Automatic indexing - set of synonyms represented by one term • Query formulation - helps users select query terms (important in controlled vocabulary)

  30. How it is used • Query expansion - suggests terms to user • Broaden or narrow a query depending on retrieval results • Automatic query expansion

  31. Building a thesaurus • Manual • Automatic • word based • phrase based

  32. Manual generation • Domain experts • Information experts • Keywords selected • Hierarchy built • Expensive, time consuming and subjective

  33. Automatically built Word based thesauri • Salton • Pederson • Crouch

  34. Corpus-based word-based thesaurus (S) • (Salton 1971) • Main idea: When ti and tj often co-occur in the same documents, they are related. • The terms “trial”, “defendant”, “prosecution” and “judge” will tend to co-occur in the same documents

  35. Corpus-based word-based thesaurus (S) • Uses the term/document weight matrix W • wi,k is the weight assigned to term i in document k • Computes a matrix T • The element in the ith row and jth column of T is the “relation” of term i to term j

  36. Corpus-based word-based thesaurus (S) • N is the number of documents
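Only this note about N survives from the slide; the formula itself was an image. The following reconstruction is therefore an assumption, chosen to match slide 37's description (a numerator symmetric in i and j, and a denominator equal to term i's total weight); with binary weights it ranges from 0, when the terms never co-occur, to 1, when their co-occurrence vectors are equal:

```latex
T(t_i, t_j) = \frac{\sum_{k=1}^{N} w_{i,k}\, w_{j,k}}{\sum_{k=1}^{N} w_{i,k}}
```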

  37. Corpus-based word-based thesaurus (S) • Note that the numerator is the same for both T(ti, tj) and T(tj, ti) • The denominator, however, uses the total weight of term i in T(ti, tj), and the total weight of term j in T(tj, ti) • 0 - when the terms never co-occur • 1 - when the co-occurrence vectors are equal

  38. Corpus-based word-based thesaurus (S) • The values in the matrix are used to distinguish between, • broad, • narrow and • related terms

  39. Corpus-based word-based thesaurus (S) • The system considers ti and tj related when both T(ti, tj) and T(tj, ti) are at least K • K is a similarity threshold determined experimentally

  40. Corpus-based word-based thesaurus (S) • Broad terms occur in documents more often than narrower ones • ex. “house”, “cottage” • “language”, “French” • The system considers ti broader than tj when T(tj, ti) >= K and T(ti, tj) < K
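Slides 35–40 can be combined into one sketch. The exact formula appeared only as an image, so the normalization used below (each term's total weight, per slide 37's description) is an assumption:

```python
import numpy as np

def term_relations(W, K):
    """From a term-by-document weight matrix W, compute T[i][j], the
    co-occurrence-based relation of term i to term j, and classify
    pairs as related or broader/narrower using threshold K."""
    W = np.asarray(W, dtype=float)
    numer = W @ W.T              # same numerator for (i, j) and (j, i)
    totals = W.sum(axis=1)       # total weight of each term
    T = numer / totals[:, None]  # row i is normalized by term i's total

    related, broader = [], []
    t = len(W)
    for i in range(t):
        for j in range(t):
            if i == j:
                continue
            if T[i, j] >= K and T[j, i] >= K:
                related.append((i, j))
            elif T[j, i] >= K and T[i, j] < K:
                broader.append((i, j))  # term i is broader than term j
    return T, related, broader
```

With binary weights, a term occurring in many documents gets a large total and hence small T(ti, tj), reproducing the broad/narrow asymmetry of slide 40.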

  41. Example [The example's term/document weight matrix appeared only as an image and is not preserved in the transcript]

  42. Example • K=1/4 • t3 and t4 are related • t1 is broader than t2

  43. Corpus-based word-based thesaurus (S) • The formula indicates high co-occurrence, and a larger normalizing factor for ti

  44. Drawback (S) • Terms found related by this method may not be semantically related • Other term/term similarity functions were also used (for example, the inner product, to evaluate how related two terms are)

  45. Full text collections • Co-occurrence may not be meaningful in large documents which cover many topics • Use co-occurrence in document “window” instead of whole document

  46. Full text collections • The idea is that co-occurrences of terms should be in a small part of the text • A “window” may be a few paragraphs or a “chunk” of q terms
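A sketch of windowed co-occurrence counting; non-overlapping chunks of q tokens are assumed here (a sliding window would be an equally valid reading of the slide):

```python
from collections import Counter
from itertools import combinations

def window_cooccurrences(tokens, q):
    """Count term co-occurrences inside chunks of q consecutive tokens
    rather than across a whole (possibly long, multi-topic) document.
    Pairs are stored in sorted order so (a, b) and (b, a) coincide."""
    counts = Counter()
    for start in range(0, len(tokens), q):
        chunk = sorted(set(tokens[start:start + q]))
        for a, b in combinations(chunk, 2):
            counts[(a, b)] += 1
    return counts
```

Terms that appear in the same document but far apart never share a chunk, so they are not counted as co-occurring.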

  47. Co-occurrence-based Thesaurus (P) • (Schutze and Pedersen 1997) • Another idea: words with similar meanings co-occur with similar neighbors • "litigation" and "lawsuit" share neighbors such as "court", “judge", “witness” and "proceedings"

  48. Co-occurrence-based Thesaurus (P) • Matrix A is computed for terms that occur 2000-5000 times • Ai,j = number of times words i and j co-occur in collection in windows of size k=40 • These terms are clustered into 200 A-classes (average based clustering)

  49. Co-occurrence-based Thesaurus (P) • A matrix B (200 × 20,000) is generated for the 20,000 most frequent terms, based on their co-occurrence with the A-class clusters • Assume two A-classes, gA1={t1, t2, t3} and gA2={t4, t5} • If term j co-occurs with: • t1 10 times, t2 5 times and t4 6 times • then B[1, j]=15 and B[2, j]=6 • The 20,000 terms are now clustered into 200 B-classes (buckshot)
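The B-matrix construction can be sketched directly from the slide's example; the symmetric co-occurrence-dictionary representation is an assumption:

```python
def b_matrix(cooc, a_classes, terms):
    """B[c][j] (0-based here; the slide counts classes from 1) sums
    term j's co-occurrence counts with every member of A-class c.
    `cooc[(member, term)]` holds co-occurrence counts."""
    B = [[0] * len(terms) for _ in a_classes]
    for c, members in enumerate(a_classes):
        for j, term in enumerate(terms):
            B[c][j] = sum(cooc.get((m, term), 0) for m in members)
    return B
```

Replaying the slide's numbers (10 with t1, 5 with t2, 6 with t4) yields rows 15 and 6 for the two classes.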

  50. Co-occurrence-based Thesaurus (P) • Now C is formed for all terms • An entry C[i, j] indicates the number of times term j co-occurs with B-class i • Now SVD is applied to the 200*t matrix • A document is represented by a vector that is the sum of the context vectors of its terms (columns in the SVD)
