
Clustering Documents


Presentation Transcript


  1. Clustering Documents

  2. Overview • Clustering is the process of partitioning a set of data into meaningful subclasses, where every item in a subclass shares a common trait. • It helps a user understand the natural grouping or structure in a data set. • Unsupervised learning: cluster, category, group, class • There is no training data from which a classifier could learn how to group • Documents that share the same properties are categorized into the same clusters • Key parameters: cluster size, number of clusters, similarity measure • A common heuristic sets the number of clusters to the square root of n, where n is the number of documents • LSI (Latent Semantic Indexing)

  3. What is clustering? • A grouping of data objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups • Intra-cluster distances are minimized; inter-cluster distances are maximized
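
The following minimal sketch illustrates the intra- vs. inter-cluster distance idea on made-up 2-D points (the data and the helper function are illustrative assumptions, not from the slides):

    import numpy as np

    a = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3]])   # cluster 1
    b = np.array([[5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])   # cluster 2

    def mean_pairwise(x, y):
        # Average Euclidean distance over all pairs drawn from x and y.
        return float(np.mean([np.linalg.norm(p - q) for p in x for q in y]))

    print(mean_pairwise(a, a))   # small: intra-cluster distances
    print(mean_pairwise(a, b))   # large: inter-cluster distances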

  4. Outliers • Outliers are objects that do not belong to any cluster or form clusters of very small cardinality • In some applications we are interested in discovering outliers, not clusters (outlier analysis)

  5. Why do we cluster? • Clustering: given a collection of data objects, group them so that they are • similar to one another within the same cluster • dissimilar to the objects in other clusters • Clustering results are used: • As a stand-alone tool to get insight into the data distribution • Visualization of clusters may unveil important information • As a preprocessing step for other algorithms • Efficient indexing or compression often relies on clustering

  6. Applications of clustering • Image processing • cluster images based on their visual content • Web • cluster groups of users based on their access patterns on webpages • cluster webpages based on their content • Bioinformatics • cluster similar proteins together (similarity w.r.t. chemical structure and/or functionality, etc.) • Many more…

  7. The clustering task • Group observations into groups so that the observations belonging to the same group are similar, whereas observations in different groups are different • Basic questions: • What does “similar” mean? • What is a good partition of the objects? I.e., how is the quality of a solution measured? • How do we find a good partition of the observations?

  8. Observations to cluster • Real-valued attributes/variables • e.g., salary, height • Binary attributes • e.g., gender (M/F), has_cancer (T/F) • Nominal (categorical) attributes • e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.) • Ordinal/ranked attributes • e.g., military rank (soldier, sergeant, lieutenant, captain, etc.) • Variables of mixed types • multiple attributes with various types

  9. Aim of Clustering • Partition unlabeled examples into subsets of clusters, such that: • Examples within a cluster are very similar • Examples in different clusters are very different

  10. Clustering Example [figure: a scatter of points forming natural groups]

  11. Cluster Organization • For a “small” number of documents, simple/flat clustering is acceptable • Search a smaller set of clusters for relevancy • If a cluster is relevant, the documents in the cluster are also relevant • Problem: looking for broader or more specific documents • Hierarchical clustering has a tree-like structure

  12. Dendrogram A dendrogram presents the progressive, hierarchy-forming merging process pictorially.

  13. Visualization of a Dendrogram
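
As a hedged sketch of how such a visualization can be produced, the snippet below draws a dendrogram with SciPy; the six documents and their feature vectors are placeholders, not the slide's data:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    docs = ["D1", "D2", "D3", "D4", "D5", "D6"]
    X = np.random.RandomState(0).rand(6, 4)   # toy term-weight vectors

    Z = linkage(X, method="single")           # single-link agglomeration
    dendrogram(Z, labels=docs)                # pictures the progressive merges
    plt.ylabel("merge distance")
    plt.show()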

  14. Example • D1: Human machine interface for computer applications • D2: A survey of user opinion of computer system response time • D3: The EPS user interface management system • D4: System and human system engineering testing of the EPS system • D5: The generation of the random binary and ordered trees • D6: The intersection graphs of paths in a tree • D7: Graph minors: A survey

  15. [figure: dendrogram over D1-D7, ranging from broad clusters at the top to specific documents at the leaves]

  16. Cluster Parameters • A minimum and maximum size of clusters • A large cluster size means one cluster attracting many documents • Multi-topic themes • A matching threshold value for including documents in a cluster • The minimum degree of similarity • Affects the number of clusters • A high threshold • Fewer documents can join a cluster • Larger number of clusters • The degree of overlap between clusters • Some documents deal with more than one topic • A low degree of overlap • Greater separation of clusters • A maximum number of clusters

  17. Cluster-Based Search • Inverted file organization • Query keywords must exactly match word occurrences • A clustered file organization matches a keyword against a set of cluster representatives • Each cluster representative consists of popular words related to a common topic • In flat clustering, the query is compared against the centroids of the clusters • Centroid: the average representative of a group of documents, built from the composite text of all member documents
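
A minimal sketch of flat cluster-based search: compare the query against each cluster centroid (the average of its members' vectors) and search only the best-matching cluster. The vectors and cluster names below are assumptions for illustration:

    import numpy as np

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    clusters = {
        "c1": np.array([[1.0, 0.0, 0.2], [0.9, 0.1, 0.0]]),
        "c2": np.array([[0.0, 1.0, 0.8], [0.1, 0.9, 1.0]]),
    }
    # Centroid: average representative built from the member documents.
    centroids = {name: m.mean(axis=0) for name, m in clusters.items()}

    query = np.array([0.0, 0.8, 1.0])
    best = max(centroids, key=lambda name: cosine(query, centroids[name]))
    print(best)   # only this cluster's member documents are then ranked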

  18. Automatic Document Classification • Searching vs. Browsing • Disadvantages of using inverted index files • information pertaining to a document is scattered among many different inverted-term lists • information relating to different documents with similar term assignments is not in close proximity in the file system • Approaches • inverted-index files (for searching) + clustered document collection (for browsing) • clustered file organization (for searching and browsing)

  19. Typical Clustered File Organization [figure: a tree from the highest-level centroid through supercentroids and centroids down to documents; a typical search path descends through centroids to documents]

  20. Cluster Generation vs. Cluster Search • Cluster generation • Cluster structure is generated only once. • Cluster maintenance can be carried out at relatively infrequent intervals. • Cluster generation process may be slower and more expensive. • Cluster search • Cluster search operations may have to be performed continually. • Cluster search operations must be carried out efficiently.

  21. Hierarchical Cluster Generation • Two strategies • pairwise item similarities • heuristic methods • Models • Divisive Clustering (top down) • The complete collection is assumed to represent one complete cluster. • The collection is then subsequently broken down into smaller pieces. • Hierarchical Agglomerative Clustering (bottom up) • Individual item similarities are used as a starting point. • A gluing operation collects similar items, or groups, into larger groups.

  22. Searching with a taxonomy • Two ways to search a document collection organized in a taxonomy • Top-down search • Start at the root • Progressively compare the query with the cluster representatives • A single error at a higher level leads to a wrong path and an incorrect cluster • Bottom-up search • Compare the query with the most specific clusters at the lowest level • A high number of low-level clusters increases computation time • Use an inverted index for the low-level representatives
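
The top-down strategy can be sketched as a descent through the cluster tree, following at each level the child whose representative best matches the query; the Node structure and cosine scoring are illustrative assumptions:

    import numpy as np

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    class Node:
        def __init__(self, rep, children=(), docs=()):
            self.rep = np.asarray(rep)     # cluster representative vector
            self.children = list(children)
            self.docs = list(docs)         # member documents at the leaves

    def top_down_search(root, query):
        node = root
        while node.children:               # descend until a leaf cluster;
            node = max(node.children,      # one wrong turn near the root
                       key=lambda c: cosine(query, c.rep))
        return node.docs                   # leads to an incorrect cluster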

  23. Aim of Clustering again? • Partitioning data into classes with high intra-class similarity and low inter-class similarity • Is it well-defined?

  24. What is Similarity? • Clearly, it is a subjective measure, or problem-dependent

  25. How Similar Are Clusters? • Ex1: Two clusters or one cluster?

  26. How Similar Are Clusters? • Ex2: Clusters or outliers?

  27. Similarity Measures • Most cluster methods • use a matrix of similarity computations • compute similarities between documents Homework: What similarity measures are used in text mining? Discuss their advantages and disadvantages. Where appropriate, comment on the application areas for each similarity measure. List your references and use your own words.
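
As one concrete example of a similarity measure widely used in text mining, the sketch below computes cosine similarity on raw term-frequency vectors (the sample documents are toy inputs):

    import math
    from collections import Counter

    def cosine_sim(doc1, doc2):
        # Cosine of the angle between the two term-frequency vectors.
        tf1, tf2 = Counter(doc1.lower().split()), Counter(doc2.lower().split())
        dot = sum(tf1[t] * tf2[t] for t in tf1)
        n1 = math.sqrt(sum(c * c for c in tf1.values()))
        n2 = math.sqrt(sum(c * c for c in tf2.values()))
        return dot / (n1 * n2) if n1 and n2 else 0.0

    print(cosine_sim("human machine interface", "machine interface design"))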

  28. Linking Methods [figure: star, clique, and string link structures]

  29. Clustering Methods • Many methods to compute clusters • An NP-complete problem • Each solution can be evaluated quickly, but exhaustive evaluation of all solutions is not feasible • Each trial may produce a different cluster organization

  30. Stable Clustering • Results should be independent of the initial order of documents • Clusters should not be substantially different when new documents are added to the collection • Results from consecutive runs should not differ significantly

  31. K-Means • A heuristic with complexity O(n log n) • Matrix-based algorithms are O(n²) • Begins with an initial set of clusters • Pick the cluster centroids randomly • Use matrix-based similarity on a small subset • Use a density test to pick cluster centers from the sample data • Di is a cluster center if at least n other documents have similarity to Di greater than a threshold • A set of documents that are sufficiently dissimilar must exist in the collection

  32. K-Means Algorithm • Select k documents from the collection to form k initial singleton clusters • Repeat until the termination conditions are satisfied: • For every document d, find the cluster i whose centroid is most similar and assign d to cluster i • For every cluster i, recompute the centroid based on the current member documents • Check for termination: minimal or no changes in the assignment of documents to clusters • Return the list of clusters
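
A direct sketch of the loop above on dense document vectors, using Euclidean distance as the (dis)similarity; the initialization and termination details are illustrative choices:

    import numpy as np

    def k_means(X, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # Select k documents as the initial singleton-cluster centroids.
        centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        assign = np.full(len(X), -1)
        for _ in range(iters):
            # Assign every document to the cluster with the closest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_assign = dists.argmin(axis=1)
            if np.array_equal(new_assign, assign):
                break                      # termination: no assignment changed
            assign = new_assign
            # Recompute each centroid from its current member documents.
            for i in range(k):
                if (assign == i).any():
                    centroids[i] = X[assign == i].mean(axis=0)
        return assign, centroids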

  33. Simulated Annealing • Avoids local optima by randomly searching • Downhill move • A new solution with a higher (better) value than the previous solution • Uphill move • A worse solution is accepted to avoid local minima • The frequency of uphill moves decreases during the “life cycle” • An analogy to crystal formation

  34. Simulated Annealing Algorithm • Get an initial set of clusters and set the temperature to T • Repeat until the temperature is reduced to the minimum: • Run a loop x times: • Find a new set of clusters by altering the membership of some documents • Compare the difference between the values of the new and old sets of clusters. If there is an improvement, accept the new set of clusters; otherwise accept the new set of clusters with probability p • Reduce the temperature based on the cooling schedule • Return the final set of clusters
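
The algorithm maps to code roughly as follows; the move operator, the acceptance probability p = exp(Δ/T), and the geometric cooling schedule are common choices assumed here, not prescribed by the slides:

    import math
    import random

    def anneal(n_docs, k, score, T=1.0, T_min=0.01, alpha=0.9, x=50, seed=0):
        rng = random.Random(seed)
        clusters = [rng.randrange(k) for _ in range(n_docs)]  # initial set
        current = score(clusters)
        while T > T_min:                       # until minimum temperature
            for _ in range(x):                 # run a loop x times
                new = clusters[:]
                new[rng.randrange(n_docs)] = rng.randrange(k)  # alter membership
                delta = score(new) - current
                # Accept improvements outright; accept worse solutions with
                # probability p, which shrinks as the temperature drops.
                if delta > 0 or rng.random() < math.exp(delta / T):
                    clusters, current = new, current + delta
            T *= alpha                         # cooling schedule
        return clusters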

  35. Simulated Annealing • Simple to implement • Solutions are reasonably good and avoid local minima • Successful in other optimization tasks • The initial set is very important • Adjusting the size of clusters is difficult

  36. Genetic Algorithm • Uses a population of solutions • 1. Arrange the set of documents in a circle such that documents that are similar to one another are located close to each other • 2. Find key documents on the circle and build clusters from a neighborhood of these documents • Each arrangement of documents is a solution (chromosome) • Fitness

  37. Genetic Algorithm • Pick two parent solutions x and y from the set of all solutions, with preference for solutions with higher fitness scores. • Use a crossover operation to combine x and y to generate a new solution z. • Periodically mutate a solution by randomly exchanging two documents in it.
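
A minimal sketch of these operators on circular document arrangements; the order-crossover variant, the mutation rate, and the selection scheme are illustrative assumptions:

    import random

    def crossover(x, y, rng):
        # Keep a slice of parent x; fill the remaining positions from y.
        i, j = sorted(rng.sample(range(len(x)), 2))
        kept = x[i:j]
        rest = [d for d in y if d not in kept]
        return rest[:i] + kept + rest[i:]

    def mutate(z, rng):
        # Randomly exchange two documents in the arrangement.
        i, j = rng.sample(range(len(z)), 2)
        z[i], z[j] = z[j], z[i]

    def evolve(population, fitness, generations=100, seed=0):
        rng = random.Random(seed)
        for _ in range(generations):
            # Prefer fitter arrangements when picking the two parents.
            x, y = sorted(rng.sample(population, 4), key=fitness)[-2:]
            z = crossover(x, y, rng)
            if rng.random() < 0.1:             # periodic mutation
                mutate(z, rng)
            population.sort(key=fitness)
            population[0] = z                  # replace the least fit member
        return max(population, key=fitness)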

  38. Learn the Scatter/Gather algorithm

  39. Extra material

  40. Hierarchical Agglomerative Clustering • Basic procedure • 1. Place each of the N documents into a class of its own. • 2. Compute all pairwise document-document similarity coefficients (N(N-1)/2 coefficients). • 3. Form a new cluster by combining the most similar pair of current clusters i and j; update the similarity matrix by deleting the rows and columns corresponding to i and j; calculate the entries in the row corresponding to the new cluster i+j. • 4. Repeat step 3 while the number of clusters left is greater than 1.
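
The four steps translate directly into code over a precomputed similarity matrix (step 2 is the input here); the max() update makes this the single-link variant, one possible choice among the combining rules discussed next:

    import numpy as np

    def hac_single_link(sim, labels):
        sim = sim.astype(float).copy()
        np.fill_diagonal(sim, -np.inf)          # ignore self-similarity
        clusters = [[l] for l in labels]        # step 1: one class per document
        merges = []
        while len(clusters) > 1:                # step 4: repeat step 3
            # Step 3: combine the most similar pair of current clusters.
            i, j = np.unravel_index(sim.argmax(), sim.shape)
            i, j = min(i, j), max(i, j)
            merges.append((clusters[i][:], clusters[j][:], sim[i, j]))
            clusters[i] = clusters[i] + clusters[j]
            row = np.maximum(sim[i], sim[j])    # recompute row for cluster i+j
            sim[i, :], sim[:, i] = row, row
            sim[i, i] = -np.inf
            sim = np.delete(np.delete(sim, j, 0), j, 1)  # drop row/column j
            del clusters[j]
        return merges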

  41. How to Combine Clusters? • Intercluster similarity • Single-link • Complete-link • Group average link • Single-link clustering • Each document must have a similarity exceeding a stated threshold value with at least one other document in the same class. • similarity between a pair of clusters is taken to be the similarity between the most similar pair of items • each cluster member will be more similar to at least one member in that same cluster than to any member of another cluster

  42. How to Combine Clusters? (Continued) • Complete-link clustering • Each document has a similarity to all other documents in the same class that exceeds the threshold value. • The similarity between the least similar pair of items from the two clusters is used as the cluster similarity. • Each cluster member is more similar to the most dissimilar member of that cluster than to the most dissimilar member of any other cluster.

  43. How to Combine Clusters? (Continued) • Group-average link clustering • a compromise between the extremes of single-link and complete-link systems • each cluster member has a greater average similarity to the remaining members of that cluster than it does to all members of any other cluster
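
Viewed as functions of the cross-cluster pairwise similarities, the three rules differ only in how they aggregate; a minimal sketch:

    import numpy as np

    def single_link(S):   return S.max()    # most similar pair of items
    def complete_link(S): return S.min()    # least similar pair of items
    def group_average(S): return S.mean()   # average over all pairs

    # S[d, e] = sim(d, e) for each d in one cluster, e in the other.
    S = np.array([[0.9, 0.3],
                  [0.8, 0.2]])
    print(single_link(S), complete_link(S), group_average(S))  # 0.9 0.2 0.55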

  44. Example for Agglomerative Clustering • Documents A-F (6 items) yield 6(6-1)/2 = 15 pairwise similarities, which are processed in decreasing order.

  45. Single Link Clustering

  Initial similarity matrix:

         A    B    C    D    E    F
    A    .   .3   .5   .6   .8   .9
    B   .3    .   .4   .5   .7   .8
    C   .5   .4    .   .3   .5   .2
    D   .6   .5   .3    .   .4   .1
    E   .8   .7   .5   .4    .   .3
    F   .9   .8   .2   .1   .3    .

  Step 1: AF 0.9. Merge A and F at 0.9, with sim(AF,X) = max(sim(A,X), sim(F,X)):

         AF   B    C    D    E
    AF    .  .8   .5   .6   .8
    B    .8   .   .4   .5   .7
    C    .5  .4    .   .3   .5
    D    .6  .5   .3    .   .4
    E    .8  .7   .5   .4    .

  Step 2: AE 0.8. Merge E into AF at 0.8, with sim(AEF,X) = max(sim(AF,X), sim(E,X)). [figure: dendrogram joining A and F at 0.9, then E at 0.8]

  46. Single Link Clustering (Cont.)

         AEF  B    C    D
    AEF    .  .8   .5   .6
    B     .8   .   .4   .5
    C     .5  .4    .   .3
    D     .6  .5   .3    .

  Step 3: BF 0.8. Merge B into AEF at 0.8 (note that E and B sit on the same level of the dendrogram), with sim(ABEF,X) = max(sim(AEF,X), sim(B,X)):

         ABEF  C    D
    ABEF    .  .5   .6
    C      .5   .   .3
    D      .6  .3    .

  Step 4: BE 0.7. B and E are already in the same cluster, so nothing changes. The next merge uses sim(ABDEF,X) = max(sim(ABEF,X), sim(D,X)).

  47. Single Link Clustering (Cont.)

  Step 5: AD 0.6. Merge D at 0.6:

          ABDEF  C
    ABDEF    .  .5
    C       .5   .

  Step 6: AC 0.5. Merge C at 0.5; all documents are now in a single cluster. [figure: dendrogram with A and F joined at 0.9, E and B at 0.8, D at 0.6, C at 0.5]

  48. Single-Link Clusters • At similarity level 0.7 (i.e., similarity threshold): ABEF, C, D • At similarity level 0.5: ABEFCD (one cluster)
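
These threshold readings can be reproduced with a small union-find over the slides' pair list: under single link, documents fall in one cluster whenever a chain of similarities at or above the threshold connects them. A sketch:

    # Pairs at or above 0.5 from the slides' similarity matrix.
    pairs = {("A", "F"): .9, ("A", "E"): .8, ("B", "F"): .8,
             ("B", "E"): .7, ("A", "D"): .6, ("A", "C"): .5}

    def single_link_clusters(items, pairs, threshold):
        parent = {x: x for x in items}
        def find(x):
            while parent[x] != x:
                x = parent[x]
            return x
        for (a, b), s in pairs.items():
            if s >= threshold:
                parent[find(a)] = find(b)   # merge the two components
        groups = {}
        for x in items:
            groups.setdefault(find(x), []).append(x)
        return sorted(groups.values())

    print(single_link_clusters("ABCDEF", pairs, 0.7))  # ABEF, C, D
    print(single_link_clusters("ABCDEF", pairs, 0.5))  # ABEFCD (one cluster)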

  49. Complete-Link Cluster Generation

  Initial similarity matrix:

         A    B    C    D    E    F
    A    .   .3   .5   .6   .8   .9
    B   .3    .   .4   .5   .7   .8
    C   .5   .4    .   .3   .5   .2
    D   .6   .5   .3    .   .4   .1
    E   .8   .7   .5   .4    .   .3
    F   .9   .8   .2   .1   .3    .

  Here sim(AF,X) = min(sim(A,X), sim(F,X)), and two clusters may only merge once every cross-pair between them has been covered.

  Step 1: AF 0.9. New cluster AF.
  Step 2: AE 0.8. Check EF; pairs covered: (A,E) (A,F).
  Step 3: BF 0.8. Check AB; pairs covered: (A,E) (A,F) (B,F).

  50. Complete-Link Cluster Generation (Cont.)

  Similarity matrix after merging AF (minimum rule):

         AF   B    C    D    E
    AF    .  .3   .2   .1   .3
    B    .3   .   .4   .5   .7
    C    .2  .4    .   .3   .5
    D    .1  .5   .3    .   .4
    E    .3  .7   .5   .4    .

  Step 4: BE 0.7. New cluster BE.
  Step 5: AD 0.6. Check DF; pairs covered: (A,D) (A,E) (A,F) (B,E) (B,F).
  Step 6: AC 0.5. Check CF; pairs covered: (A,C) (A,D) (A,E) (A,F) (B,E) (B,F).
  Step 7: BD 0.5. Check DE; pairs covered: (A,C) (A,D) (A,E) (A,F) (B,D) (B,E) (B,F).
