1 / 61

Lecture 8: Clustering

Prof. Ray Larson University of California, Berkeley School of Information. Lecture 8: Clustering. Principles of Information Retrieval. Some slides in this lecture were originally created by Prof. Marti Hearst. Mini-TREC. Need to make groups Today Systems SMART (not recommended…)

phoebe
Télécharger la présentation

Lecture 8: Clustering

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Prof. Ray Larson University of California, Berkeley School of Information Lecture 8: Clustering Principles of Information Retrieval Some slides in this lecture were originally created by Prof. Marti Hearst

  2. Mini-TREC • Need to make groups • Today • Systems • SMART (not recommended…) • ftp://ftp.cs.cornell.edu/pub/smart • MG (We have a special version if interested) • http://www.mds.rmit.edu.au/mg/welcome.html • Cheshire II & 3 • II = ftp://cheshire.berkeley.edu/pub/cheshire & http://cheshire.berkeley.edu • 3 = http://cheshire3.sourceforge.org • Zprise (Older search system from NIST) • http://www.itl.nist.gov/iaui/894.02/works/zp2/zp2.html • IRF (new Java-based IR framework from NIST) • http://www.itl.nist.gov/iaui/894.02/projects/irf/irf.html • Lemur • http://www-2.cs.cmu.edu/~lemur • Lucene (Java-based Text search engine) • http://jakarta.apache.org/lucene/docs/index.html • Others?? (See http://searchtools.com )

  3. Mini-TREC • Proposed Schedule • February 16 – Database and previous Queries • February 25 – report on system acquisition and setup • March 9, New Queries for testing… • April 20, Results due • April 27, Results and system rankings • May 6 Group reports and discussion

  4. Review: IR Models • Set Theoretic Models • Boolean • Fuzzy • Extended Boolean • Vector Models (Algebraic) • Probabilistic Models (probabilistic)

  5. Similarity Measures Simple matching (coordination level match) Dice’s Coefficient Jaccard’s Coefficient Cosine Coefficient Overlap Coefficient

  6. Documents in Vector Space t3 D1 D9 D11 D5 D3 D10 D4 D2 t1 D7 D6 D8 t2

  7. Vector Space with Term Weights and Cosine Matching Di=(di1,wdi1;di2, wdi2;…;dit, wdit) Q =(qi1,wqi1;qi2, wqi2;…;qit, wqit) Term B 1.0 Q = (0.4,0.8) D1=(0.8,0.3) D2=(0.2,0.7) Q D2 0.8 0.6 0.4 D1 0.2 0 0.2 0.4 0.6 0.8 1.0 Term A

  8. Term Weights in SMART • In SMART weights are decomposed into three factors:

  9. SMART Freq Components Binary maxnorm augmented log

  10. Collection Weighting in SMART Inverse squared probabilistic frequency

  11. Term Normalization in SMART sum cosine fourth max

  12. Problems with Vector Space • There is no real theoretical basis for the assumption of a term space • it is more for visualization than having any real basis • most similarity measures work about the same regardless of model • Terms are not really orthogonal dimensions • Terms are not independent of all other terms

  13. Today • Clustering • Automatic Classification • Cluster-enhanced search

  14. Overview • Introduction to Automatic Classification and Clustering • Classification of Classification Methods • Classification Clusters and Information Retrieval in Cheshire II • DARPA Unfamiliar Metadata Project

  15. Classification • The grouping together of items (including documents or their representations) which are then treated as a unit. The groupings may be predefined or generated algorithmically. The process itself may be manual or automated. • In document classification the items are grouped together because they are likely to be wanted together • For example, items about the same topic.

  16. Automatic Indexing and Classification • Automatic indexing is typically the simple deriving of keywords from a document and providing access to all of those words. • More complex Automatic Indexing Systems attempt to select controlled vocabulary terms based on terms in the document. • Automatic classification attempts to automatically group similar documents using either: • A fully automatic clustering method. • An established classification scheme and set of documents already indexed by that scheme.

  17. Background and Origins • Early suggestion by Fairthorne • “The Mathematics of Classification” • Early experiments by Maron (1961) and Borko and Bernick(1963) • Work in Numerical Taxonomy and its application to Information retrieval Jardine, Sibson, van Rijsbergen, Salton (1970’s). • Early IR clustering work more concerned with efficiency issues than semantic issues.

  18. Document Space has High Dimensionality • What happens beyond three dimensions? • Similarity still has to do with how many tokens are shared in common. • More terms -> harder to understand which subsets of words are shared among similar documents. • One approach to handling high dimensionality: Clustering

  19. Vector Space Visualization

  20. Cluster Hypothesis • The basic notion behind the use of classification and clustering methods: • “Closely associated documents tend to be relevant to the same requests.” • C.J. van Rijsbergen

  21. Classification of Classification Methods • Class Structure • Intellectually Formulated • Manual assignment (e.g. Library classification) • Automatic assignment (e.g. Cheshire Classification Mapping) • Automatically derived from collection of items • Hierarchic Clustering Methods (e.g. Single Link) • Agglomerative Clustering Methods (e.g. Dattola) • Hybrid Methods (e.g. Query Clustering)

  22. Classification of Classification Methods • Relationship between properties and classes • monothetic • polythetic • Relation between objects and classes • exclusive • overlapping • Relation between classes and classes • ordered • unordered Adapted from Sparck Jones

  23. Properties and Classes • Monothetic • Class defined by a set of properties that are both necessary and sufficient for membership in the class • Polythetic • Class defined by a set of properties such that to be a member of the class some individual must have some number (usually large) of those properties, and that a large number of individuals in the class possess some of those properties, and no individual possesses all of the properties.

  24. Monothetic vs. Polythetic A B C D E F G H 1 + + + 2 + + + 3 + + + 4 + + + 5 + + + 6 + + + 7 + + + 8 + + + Polythetic Monothetic Adapted from van Rijsbergen, ‘79

  25. Exclusive Vs. Overlapping • Item can either belong exclusively to a single class • Items can belong to many classes, sometimes with a “membership weight”

  26. Ordered Vs. Unordered • Ordered classes have some sort of structure imposed on them • Hierarchies are typical of ordered classes • Unordered classes have no imposed precedence or structure and each class is considered on the same “level” • Typical in agglomerative methods

  27. Text Clustering Clustering is “The art of finding groups in data.” -- Kaufmann and Rousseeu Term 1 Term 2

  28. Text Clustering Clustering is “The art of finding groups in data.” -- Kaufmann and Rousseeu Term 1 Term 2

  29. Text Clustering • Finds overall similarities among groups of documents • Finds overall similarities among groups of tokens • Picks out some themes, ignores others

  30. Simple Dice’s coefficient Jaccard’s coefficient Cosine coefficient Overlap coefficient Coefficients of Association

  31. Pair-wise Document Similarity How to compute document similarity?

  32. Pair-wise Document Similarity(no normalization for simplicity)

  33. Pair-wise Document Similarity cosine normalization

  34. Document/Document Matrix

  35. Clustering Methods • Hierarchical • Agglomerative • Hybrid • Automatic Class Assignment

  36. Hierarchic Agglomerative Clustering • Basic method: • 1) Calculate all of the interdocument similarity coefficients • 2) Assign each document to its own cluster • 3) Fuse the most similar pair of current clusters • 4) Update the similarity matrix by deleting the rows for the fused clusters and calculating entries for the row and column representing the new cluster (centroid) • 5) Return to step 3 if there is more than one cluster left

  37. Hierarchic Agglomerative Clustering A B C D E F G H I

  38. Hierarchic Agglomerative Clustering A B C D E F G H I

  39. Hierarchic Agglomerative Clustering A B C D E F G H I

  40. Hierarchical Methods 2 .4 3 .4 .2 4 .3 .3 .3 5 .1 .4 .4 .1 1 2 3 4 Single Link Dissimilarity Matrix Hierarchical methods: Polythetic, Usually Exclusive, Ordered Clusters are order-independent

  41. Threshold = .1 2 .4 3 .4 .2 4 .3 .3 .3 5 .1 .4 .4 .1 1 2 3 4 2 0 3 0 0 4 0 0 0 5 1 0 0 1 1 2 3 4 1 2 5 3 4 Single Link Dissimilarity Matrix

  42. Threshold = .2 2 .4 3 .4 .2 4 .3 .3 .3 5 .1 .4 .4 .1 1 2 3 4 2 0 3 0 1 4 0 0 0 5 1 0 0 1 1 2 3 4 1 2 5 3 4

  43. Threshold = .3 2 .4 3 .4 .2 4 .3 .3 .3 5 .1 .4 .4 .1 1 2 3 4 2 0 3 0 1 4 1 1 1 5 1 0 0 1 1 2 3 4 1 2 5 3 4

  44. K-means & Rocchio Clustering Doc Doc Doc Doc Doc Doc Doc Doc Agglomerative methods: Polythetic, Exclusive or Overlapping, Unordered clusters are order-dependent. Rocchio’s method 1. Select initial centers (I.e. seed the space) 2. Assign docs to highest matching centers and compute centroids 3. Reassign all documents to centroid(s)

  45. K-Means Clustering • 1 Create a pair-wise similarity measure • 2 Find K centers using agglomerative clustering • take a small sample • group bottom up until K groups found • 3 Assign each document to nearest center, forming new clusters • 4 Repeat 3 as necessary

  46. Scatter/Gather Cutting, Pedersen, Tukey & Karger 92, 93, Hearst & Pedersen 95 • Cluster sets of documents into general “themes”, like a table of contents • Display the contents of the clusters by showing topical terms and typical titles • User chooses subsets of the clusters and re-clusters the documents within • Resulting new groups have different “themes”

  47. S/G Example: query on “star” Encyclopedia text 14 sports 8 symbols 47 film, tv 68 film, tv (p) 7 music 97 astrophysics 67 astronomy(p) 12 stellar phenomena 10 flora/fauna 49 galaxies, stars 29 constellations 7 miscelleneous Clustering and re-clustering is entirely automated

More Related