Text Based Information Retrieval Document Clustering / Classification Lecture 3

Dr. Aboud Madlin, Associate Professor, Damascus University, Faculty of Information Technology

PLAN • Classification In Information Retrieval • What is clustering • Clustering and information access • User benefits from clustering • What is a good clustering? • The cluster hypothesis • Document clustering • Clustering methods • Practical clustering approaches • Adequacy of Clustering Methods • K-means • Hierarchical Agglomerative Clustering • Clustering terms • Labeling clusters • Evaluating clustering

Classification in IR • [Diagram labels: Faster retrieval, Clustering, Classification]

Classification in IR Two main areas of application of classification methods in IR • Keyword clustering • Document clustering

Term Associations • Counting word pairs • If two words appear together very often, they are likely to be a phrase • Counting document pairs • If two documents have many common words, they are likely related.
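The counting idea above can be sketched in a few lines of Python. This is only an illustration, assuming whitespace tokenization and counting, for each word pair, the number of documents in which both words occur; the function name `cooccurrence_counts` and the sample documents are made up for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(docs):
    """For each unordered word pair, count the documents containing both words."""
    counts = Counter()
    for text in docs:
        # Each word counts once per document; sorting gives a canonical pair order.
        words = sorted(set(text.lower().split()))
        counts.update(combinations(words, 2))
    return counts

# Hypothetical toy collection.
docs = [
    "information retrieval systems",
    "information retrieval and clustering",
    "document clustering methods",
]
pair_counts = cooccurrence_counts(docs)
print(pair_counts[("information", "retrieval")])
```

Pairs with high counts are candidates for phrases or related terms; a real system would normally also normalize by the individual word frequencies.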

More Counting • Counting citation pairs • If documents A and B both cite documents C and D, then A and B might be related. • If documents C and D are often cited together, they are likely related. • Counting link patterns • Get all pages that have links to my pages. • Get all pages that contain links similar to those on my pages

Google Search Engine • Link analysis • PageRank: the ranking of a web page is based on the number of links that refer to that page • If page A has a link to B, page A casts one vote for B. • The more votes a page gets, the more useful the page is considered. • If page A itself receives many votes, its vote for B counts more heavily • Combining link analysis with word matching.
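The voting idea can be illustrated with a small power-iteration sketch. This is a simplified textbook form, not Google's production algorithm; the damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the three-page link graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping a page to the list of pages it links to."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small base score, then receives votes.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page splits its (weighted) vote among the pages it links to.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Assumption: a page with no links spreads its vote evenly.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank

# Hypothetical graph: A and C both link to B, B links back to A.
links = {"A": ["B"], "C": ["B"], "B": ["A"]}
scores = pagerank(links)
print(sorted(scores, key=scores.get, reverse=True))
```

In this toy graph B collects two votes and ranks first; A ranks second because its single vote comes from the highly ranked B.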

Concept Link • Use terms’ co-occurrence frequencies • to predict semantic relationships • to build concept clusters • to suggest search terms • Visualization of term relationships • Link displays • Map displays • Drag-and-drop interface for searching

What is clustering? • Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects • “The art of finding groups in data.” • Form of unsupervised learning

What is clustering? • Clustering example: the same objects can be grouped by color, by size, or by transparency

Clustering and information access • Problem 1: query word could be ambiguous • Solution: Visualization • Clustering responses into “similar” groups • Problem 2: user not aware of interesting keywords • Solution: provide documents from the same cluster, even if keywords don’t match; the cluster hypothesis • Problem 3: Speeding up similarity search • Solution: restrict the search for documents similar to a query to the most representative clusters • Problem 4: Construction of topic hierarchies • Solution: offline clustering of large web samples

User benefits from clustering • Standard IR is like a book index • Document clusters are like a table of contents • People find having a table of contents useful • Table of Contents: • 1. Science of Cognition • a. Motivations • a.i. Intellectual Curiosity • a.ii. Practical Applications • b. History of Cognitive Psychology • 2. The Neural Basis of Cognition • a. The Nervous System • b. Organization of the Brain • c. The Visual System • 3. Perception and Attention • a. Sensory Memory • b. Attention and Sensory Information • Index: Aardvark, 15; Blueberry, 200; Capricorn, 1, 45-55; Dog, 79-99; Egypt, 65; Falafel, 78-90; Giraffes, 45-59

The cluster hypothesis • Relevant documents are more like one another than they are like non-relevant documents. • Closely associated documents tend to be relevant to the same request • Compute the association between all pairs of documents. • Summing over a set of requests gives the relative distribution of relevant-relevant (R-R) and relevant-non-relevant (R-N-R) associations of a collection. • Plotting the relative frequency against the strength of association for two hypothetical collections X and Y, we might get the distributions shown on the next slide

The cluster hypothesis • [Figure: two plots of relative frequency (0 to 100) against strength of association (0.8 down to 0.2), one for Collection X and one for Collection Y, each showing the R-R and R-N-R distributions] • From these it is apparent: (a) that the separation for collection X is good while for Y it is poor; and (b) that the strength of the association between relevant documents is greater for X than for Y. • It is on the basis of this separation that we can decide whether document clustering is likely to lead to more effective retrieval than a linear search

What is a good clustering? • Internal criterion: A good clustering will produce high quality clusters in which: • the intra-cluster similarity is high • the inter-cluster similarity is low • Measures depend on both the document representation and the similarity measure used • External criterion: The quality of a clustering is also measured by its ability to discover some or all of the hidden patterns or latent classes • Compare with some gold standard data (e.g., manual)

What is a good clustering? External evaluation • Assesses clustering with respect to ground truth • Assume: • there are C gold standard classes • our clustering algorithm produces k clusters, c_1, c_2, …, c_k, with n_1, n_2, …, n_k members respectively • Simple measure: purity, the ratio between the size of the dominant class in cluster c_i and the size of cluster c_i: $\mathrm{purity}(c_i) = \frac{\max_{j \in \{1, \dots, C\}} n_{ij}}{n_i}$, where n_{ij} is the number of members of class j in cluster c_i • Other measures: • Entropy of classes in clusters • Mutual information between classes and clusters

Purity - example Cluster I : Purity = (max(5, 1, 0)) / 6 = 5/6 Cluster II : Purity = (max(1, 4, 1)) / 6 = 4/6 Cluster III : Purity = (max(2, 0, 3)) / 5 = 3/5
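The three purity values above can be checked with a short script. The per-cluster class counts are taken from the example; the size-weighted overall purity at the end is a common extension that the slide itself does not state, so treat it as an added assumption.

```python
def purity(class_counts):
    """class_counts: number of members of each gold-standard class in one cluster."""
    return max(class_counts) / sum(class_counts)

# Class counts per cluster, as in the example above.
clusters = {"I": [5, 1, 0], "II": [1, 4, 1], "III": [2, 0, 3]}
per_cluster = {name: purity(counts) for name, counts in clusters.items()}

# Assumed extension: overall purity = dominant-class members / all members.
total = sum(sum(counts) for counts in clusters.values())
overall = sum(max(counts) for counts in clusters.values()) / total

for name, value in per_cluster.items():
    print(name, round(value, 3))
print("overall", round(overall, 3))
```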

Clustering methods • Naïve method: • Consider all possible cluster partitions • For each, calculate intra/inter class similarity • Select best partition • Complexity: n items, k clusters • How many possible clusterings? • Approximately $k^n / k!$ • 100 documents, 10 classes: $10^{100} / 10! \approx 2.76 \times 10^{93}$ • On the web: billions of docs, millions of classes
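The figure for 100 documents and 10 classes can be reproduced directly. As a sanity check, the sketch below also computes the exact number of ways to split n items into k non-empty, unlabeled groups (the Stirling number of the second kind), which the slide's formula approximates when n is much larger than k. The function names are illustrative.

```python
from math import comb, factorial

def approx_partitions(n, k):
    """The approximation used above: k^n / k!."""
    return k ** n / factorial(k)

def stirling2(n, k):
    """Exact number of partitions of n items into k non-empty unlabeled groups."""
    total = sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))
    return total // factorial(k)

print(f"{approx_partitions(100, 10):.3e}")
print(f"{float(stirling2(100, 10)):.3e}")
```

Either way the count is far too large to enumerate, which is why practical methods search heuristically instead.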

Document clustering • Two distinct approaches to clustering can be identified: (1) the clustering is based on a measure of similarity between the objects to be clustered; (2) the cluster method proceeds directly from the object descriptions. • The most obvious examples of the first approach are the graph-theoretic methods, which define clusters in terms of a graph derived from the measure of similarity.

Similarity Matrix • Pairwise coupling of similarities among a group of documents
S11 S12 S13 S14 S15 S16 S17 S18
S21 S22 S23 S24 S25 S26 S27 S28
S31 S32 S33 S34 S35 S36 S37 S38
S41 S42 S43 S44 S45 S46 S47 S48
S51 S52 S53 S54 S55 S56 S57 S58
S61 S62 S63 S64 S65 S66 S67 S68
S71 S72 S73 S74 S75 S76 S77 S78
S81 S82 S83 S84 S85 S86 S87 S88

Document clustering Example : Consider a set of objects to be clustered. • We compute a numerical value for each pair of objects indicating their similarity. • A graph corresponding to this set of similarity values is obtained as follows: • A threshold value is decided upon, and two objects are considered linked if their similarity value is above the threshold. • The cluster definition is simply made in terms of the graphical representation.

Document clustering • Example • Objects: {1,2,3,4,5,6} • Threshold = 0.89 • [Figure: similarity matrix for the six objects and the graph over nodes 1 to 6 obtained by thresholding it]

Document clustering • A string is a connected sequence of objects from some starting point. • A connected component is a set of objects such that each object is connected to at least one other member of the set and the set is maximal with respect to this property. • A maximal complete sub-graph is a sub-graph such that each node is connected to every other node in the sub-graph and the set is maximal with respect to this property, i.e. if one more node were included the completeness condition would be violated.

Document clustering • Objects: {1,2,3,4,5,6}, threshold = 0.89 • [Figure: the similarity matrix and, derived from its threshold graph, examples of a string, a connected component, and a maximal complete subgraph]
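The graph-theoretic definitions can be made concrete with a small sketch. The similarity values below are invented (the matrix from the original figure is not available in this transcript); only the object set {1,…,6} and the threshold 0.89 come from the example. The clique search is a brute-force version that is only sensible for a handful of nodes.

```python
from itertools import combinations

def threshold_graph(sim, threshold):
    """sim: dict {(a, b): similarity}. Link two objects if similarity > threshold."""
    adj = {}
    for (a, b), s in sim.items():
        adj.setdefault(a, set())
        adj.setdefault(b, set())
        if s > threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def connected_components(adj):
    """Depth-first search; each component becomes one cluster."""
    seen, components = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(adj[node] - comp)
        seen |= comp
        components.append(comp)
    return components

def maximal_cliques(adj):
    """Maximal complete subgraphs, found by brute force from largest to smallest."""
    nodes, cliques = sorted(adj), []
    for size in range(len(nodes), 0, -1):
        for group in combinations(nodes, size):
            complete = all(b in adj[a] for a, b in combinations(group, 2))
            if complete and not any(set(group) <= c for c in cliques):
                cliques.append(set(group))
    return cliques

# Hypothetical similarities: 0.5 everywhere except a few strongly similar pairs.
objects = [1, 2, 3, 4, 5, 6]
sim = {pair: 0.5 for pair in combinations(objects, 2)}
for pair in [(1, 2), (1, 3), (2, 3), (3, 4), (5, 6)]:
    sim[pair] = 0.95

adj = threshold_graph(sim, 0.89)
print(connected_components(adj))
print(maximal_cliques(adj))
```

With these made-up values, connected components give two disjoint clusters, while maximal complete subgraphs give three clusters that overlap at object 3, which shows how the choice of cluster definition changes the result.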

Document clustering Concepts • Cluster representative (cluster profile, centroid, classification vector): it is simply an object which summarizes and represents the objects in the cluster. • The similarity of the objects to the representative is measured by a matching function (sometimes called similarity or correlation function). • The algorithms also use a number of empirically determined parameters such as: (1) the number of clusters desired; (2) a minimum and maximum size for each cluster; (3) a threshold value on the matching function, below which an object will not be included in a cluster; (4) the control of overlap between clusters;
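As an illustration of these concepts, the sketch below uses the mean vector as the cluster representative and cosine similarity as the matching function. Both choices are assumptions made for the example (the slide does not fix a particular representative or matching function), and the vectors and the threshold of 0.5 are invented.

```python
import math

def centroid(vectors):
    """Cluster representative: component-wise mean of the member vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Matching function: cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical term-weight vectors for two documents in one cluster.
cluster = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
representative = centroid(cluster)

# Parameter (3) above: admit a new object only if it matches well enough.
threshold = 0.5
candidate = [1.0, 1.0, 1.0]
accepted = cosine(candidate, representative) >= threshold
print(representative, accepted)
```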

Document clustering • Almost all of the algorithms are iterative, i.e. the final classification is achieved by iteratively improving an intermediate classification. • Although most algorithms have been defined only for one-level classification, they can obviously be extended to multi-level classification by the simple device of considering the clusters at one level as the objects to be classified at the next level. • The most important algorithm of this kind is Rocchio’s clustering algorithm

Practical clustering approaches • Partitioning approaches (top-down) • K-means • Soft C-means • Self-organizing maps • Hierarchical approaches • Hierarchical agglomerative clustering (HAC) (bottom-up)

Adequacy of Clustering Methods • Produces a clustering which is unlikely to be altered drastically when further objects are incorporated • Stable under growth (robustness) • Stable in the sense that small errors in the description of objects lead only to small changes in the clustering • Independent of the initial ordering of the objects • Methods which satisfy these criteria may not, for other reasons, be the best for a particular application

K-means • Clusters based on centroids (the center of gravity or mean) of points in a cluster: • Need explicit internal representation (e.g., vectors) • Reassignment of instances to clusters is based on distance to the current cluster centroids • In practice, solves the least-squares problem • Limitation: K has to be chosen • Mostly problem-driven

K-means – algorithm • 1. Place K points into the space represented by the objects that are being clustered • Random initial group centroids • 2. For each object, assign it to the nearest centroid • 3. Replace each centroid with the center of gravity of all objects assigned to it • 4. Repeat Steps 2 and 3 until: • The centroids don’t move too much, OR • The document assignments don’t change too much, OR • Fixed number of iterations reached

K-means – algorithm
K-means-cluster (in S : set of vectors; k : integer)
{ let C[1] ... C[k] be a random partition of S into k parts;
  repeat {
    for i := 1 to k {
      X[i] := centroid of C[i];
      C[i] := empty }
    for j := 1 to N {          (N = number of vectors in S)
      q := index of the centroid among X[1] ... X[k] closest to S[j];
      add S[j] to C[q] }
  } until the change to C (or the change to X) is small enough.
}
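A runnable counterpart to the pseudocode might look like the sketch below. It is a minimal version under a few assumptions that are not in the pseudocode: it starts from k randomly chosen instances rather than a random partition, it uses squared Euclidean distance, it stops when no assignment changes (or after `max_iterations`), and it keeps the old centroid if a cluster becomes empty. The data points are invented.

```python
import random

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(vectors):
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))

def kmeans(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: k random instances as seeds
    assignment = None
    for _ in range(max_iterations):
        # Step 2: assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda q: squared_distance(p, centroids[q]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
        # Step 3: move each centroid to the mean of its members.
        for q in range(k):
            members = [p for p, a in zip(points, assignment) if a == q]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[q] = mean(members)
    return centroids, assignment

# Hypothetical 2-D data with two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, k=2)
print(assignment)
print(centroids)
```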

K-means • Does not necessarily find the optimal clustering • Highly dependent on initial centroids • Popular: choose K random instances • Solutions to dependence on initial selection • Multiple trials (and then pick the best clustering) • Try to spread out initial centroids: • Place first center on a randomly chosen object. • Place second center on the object that’s as far away as possible from the first center • … • Place j’th center on the datapoint that’s as far away as possible from the closest of centers 1 through j-1 • Initialize with results of another clustering method
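The "spread out the initial centroids" heuristic can be sketched as follows. The function name `farthest_first_centers`, the squared Euclidean distance, and the one-dimensional sample data are assumptions made for the illustration.

```python
import random

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def farthest_first_centers(points, k, seed=0):
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: a randomly chosen object
    while len(centers) < k:
        # Next center: the point whose closest existing center is farthest away.
        next_center = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centers),
        )
        centers.append(next_center)
    return centers

# Hypothetical data: two nearby points and one distant point.
points = [(0.0,), (1.0,), (10.0,)]
centers = farthest_first_centers(points, k=2)
print(centers)
```

Whichever point is drawn first here, the distant point always ends up among the two centers. The same property makes the heuristic sensitive to outliers, which is one reason to combine it with multiple trials.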

K-means K-means – optimal example (k=2)

K-means K-means – nonoptimal example (k=3)

K-means - complexity • Time complexity: • Assume computing distance between two instances is O(m) where m is the dimensionality of the vectors. • Reassigning clusters: O(kn) distance computations, or O(knm). • Computing centroids: Each instance vector gets added once to some centroid: O(nm). • Assume these two steps are each done once for i iterations: O(iknm). • Linear in all relevant factors, assuming a fixed number of iterations. • Typically – small number of iterations

Hierarchical Agglomerative Clustering • Assume a similarity function for determining the similarity of two instances • Start with all instances in a separate cluster • Then repeatedly join the two clusters that are most similar until there is only one cluster • The history of merging forms a binary tree or hierarchy

HAC - dendrograms • Dendrogram: arrange objects in several levels of nested partitioning (tree of clusters) • A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster

HAC - algorithm • 1. Start by assigning each item to a cluster • N objects, N singleton clusters • 2. Find the closest (most similar) pair of clusters and merge them into a single cluster • one cluster less • 3. Compute distances (similarities) between the new cluster and each of the old clusters • Cluster similarity? • 4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N

HAC - algorithm
{ put every point in a cluster by itself;
  for I := 1 to N-1 do {
    let C1, C2 be the most merge-able pair of clusters;
    create C, the parent of C1 and C2 }
}
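The loop above can be turned into a small runnable sketch. It is a naive version that recomputes every cluster-pair similarity at each step, and it interprets "most merge-able" as "most similar under the chosen linkage", using the single-link and complete-link definitions. The similarity function 1 / (1 + |x - y|) and the sample numbers are assumptions chosen only to make the example concrete.

```python
def single_link(sim, a, b):
    """Similarity of the two most similar members."""
    return max(sim(x, y) for x in a for y in b)

def complete_link(sim, a, b):
    """Similarity of the two least similar members."""
    return min(sim(x, y) for x in a for y in b)

def hac(items, sim, linkage=single_link):
    """Return the merge history as (cluster1, cluster2, similarity) triples."""
    clusters = [frozenset([item]) for item in items]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the most similar pair of current clusters.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = linkage(sim, clusters[i], clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        merges.append((clusters[i], clusters[j], s))
        merged = clusters[i] | clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical 1-D data and similarity function.
items = [1, 2, 4, 8, 9]
sim = lambda x, y: 1.0 / (1.0 + abs(x - y))
merges = hac(items, sim)
for left, right, s in merges:
    print(sorted(left), sorted(right), round(s, 3))
```

The merge history is exactly the binary tree (dendrogram); passing `linkage=complete_link` switches the cluster-similarity definition without changing the loop.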

HAC – cluster distance • Many variants for defining the closest pair of clusters • Centroid-based • Clusters whose centroids (centers of gravity) are the most similar • Average-link • Average distance between pairs of elements • Single-link • Similarity of the “closest” points, the most similar • Complete-link • Similarity of the “furthest” points, the least similar

HAC – single link • Use maximum similarity of pairs: • $\mathrm{sim}(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} \mathrm{sim}(x, y)$ • Can result in “straggly” (long and thin) clusters due to chaining effect • Appropriate in some domains, such as clustering islands • After merging C_i and C_j, the similarity of the resulting cluster to another cluster, C_k, is: • $\mathrm{sim}(C_i \cup C_j, C_k) = \max(\mathrm{sim}(C_i, C_k), \mathrm{sim}(C_j, C_k))$ • No need to re-calculate after merging!

Single link example

HAC – complete link • Use minimum similarity of pairs: • $\mathrm{sim}(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} \mathrm{sim}(x, y)$ • After merging C_i and C_j, the similarity of the resulting cluster to another cluster, C_k, is: • $\mathrm{sim}(C_i \cup C_j, C_k) = \min(\mathrm{sim}(C_i, C_k), \mathrm{sim}(C_j, C_k))$

HAC – average link • Use average similarity across all pairs within the merged cluster to measure the similarity of two clusters. • $\mathrm{sim}(C_i, C_j) = \frac{1}{|C_i \cup C_j| \, (|C_i \cup C_j| - 1)} \sum_{x \in C_i \cup C_j} \; \sum_{y \in C_i \cup C_j,\, y \neq x} \mathrm{sim}(x, y)$ • Compromise between single and complete link • Two options: • Averaged across all ordered pairs in the merged cluster • Averaged over all pairs between the two original clusters • No clear difference in performance
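The two averaging options can be compared side by side. The sketch below assumes clusters are given as lists of items and reuses an invented similarity function 1 / (1 + |x - y|); the function names are illustrative.

```python
def average_link_within(sim, a, b):
    """Option 1: average over all ordered pairs (x != y) in the merged cluster."""
    merged = list(a) + list(b)
    n = len(merged)
    total = sum(
        sim(x, y)
        for i, x in enumerate(merged)
        for j, y in enumerate(merged)
        if i != j
    )
    return total / (n * (n - 1))

def average_link_between(sim, a, b):
    """Option 2: average over pairs with one element from each original cluster."""
    return sum(sim(x, y) for x in a for y in b) / (len(a) * len(b))

# Hypothetical clusters and similarity function.
sim = lambda x, y: 1.0 / (1.0 + abs(x - y))
a, b = [0, 1], [3]
within = average_link_within(sim, a, b)
between = average_link_between(sim, a, b)
print(round(within, 4), round(between, 4))
```

Option 1 also rewards clusters that are already tight internally (here the close pair 0 and 1 raises the score), whereas option 2 looks only at how well the two clusters fit each other.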

HAC - complexity • In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n^2). • In each of the subsequent merging iterations, the method must compute the distance between the most recently created cluster and all other existing clusters; the similarities between unchanged clusters can simply be stored and reused. • In order to maintain an overall O(n^2) performance, computing the similarity to each other cluster must be done in constant time. • Otherwise O(n^2 log n), or O(n^3) if done naively

HAC – outliers • Outliers • Unlike K-means, HAC can handle data which is considerably dissimilar from the rest of the data • Real-life example: credit-card fraud

Clustering terms • So far, we clustered docs based on their similarities in term space • For some applications, e.g., topic analysis for inducing navigation structures, do the reverse: • use docs as axes • represent (some) terms as vectors • proximity based on co-occurrence of terms in docs • now clustering terms, not docs
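Swapping the roles of terms and documents can be sketched as follows: each term becomes a vector with one component per document, and term proximity is then measured on those vectors. Raw term counts, cosine similarity, and the four toy documents are assumptions made for the illustration.

```python
import math

def term_vectors(docs):
    """Represent each term by its counts across the documents (docs as axes)."""
    tokenized = [d.lower().split() for d in docs]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    return {t: [tokens.count(t) for tokens in tokenized] for t in vocabulary}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical collection with two topics.
docs = [
    "apple banana",
    "apple banana cherry",
    "car engine",
    "car engine wheel",
]
vectors = term_vectors(docs)
print(vectors["apple"], vectors["car"])
print(cosine(vectors["apple"], vectors["banana"]))
print(cosine(vectors["apple"], vectors["car"]))
```

Terms that co-occur in the same documents get similar vectors, so any of the document clustering methods can be applied to the term vectors unchanged.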

Labeling clusters • After clustering algorithm finds clusters – how can they be useful to the end user? • Need label for each cluster • Often done by hand, a posteriori. • Ideas • Show titles of typical documents • Show words/phrases prominent in cluster • Maybe only noun phrases
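One of the ideas above, showing words that are prominent in a cluster, can be approximated very simply by picking the most frequent non-stop-words. This is a rough sketch: the tiny stop list, the function name `label_cluster`, and the sample documents are invented, and raw frequency is only one possible notion of "prominent" (contrasting against the other clusters would usually give better labels).

```python
from collections import Counter

# Hypothetical, deliberately tiny stop list.
STOPWORDS = {"the", "a", "of", "and", "in", "to"}

def label_cluster(docs, top_n=3):
    """Return the top_n most frequent non-stop-words in the cluster's documents."""
    counts = Counter(
        word
        for doc in docs
        for word in doc.lower().split()
        if word not in STOPWORDS
    )
    return [word for word, _ in counts.most_common(top_n)]

# Hypothetical cluster of short documents.
cluster_docs = ["the jaguar car", "car engine", "a car dealer"]
label = label_cluster(cluster_docs, top_n=1)
print(label)
```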

Evaluating clustering • Perhaps the most substantive issue in data mining • Approaches: • User inspection • Ground truth comparison • Utility • Each approach has problems

Rocchio’s clustering algorithm • Developed on the SMART project. • It operates in three stages. • The first stage: • It selects (by some criterion) a number of objects as cluster centres. • The remaining objects are then assigned to the centres or to a 'rag-bag' cluster (for the misfits). • On the basis of the initial assignment the cluster representatives are computed and all objects are once more assigned to the clusters. • The assignment rules are explicitly defined in terms of thresholds on a matching function. • The final clusters may overlap (i.e. an object may be assigned to more than one cluster). • The second stage is essentially an iterative step that allows the various input parameters to be adjusted so that the resulting classification more nearly meets the prior specification of such things as cluster size. • The third stage is for 'tidying up': unassigned objects are forcibly assigned, and overlap between clusters is reduced.
