
Web Document Clustering


Presentation Transcript


  1. Web Document Clustering Department of Computer Science and Engineering Southern Methodist University Wenyi Ni

  2. Why is web document clustering needed? • 3.3 billion web pages on the Internet • Every time you submit a query, the search engine returns thousands of records. • Did you efficiently find what you wanted? • Web document clustering is a good choice. • An example: www.metacrawler.com

  3. How to represent a web document in a general model? • TF-IDF • Each web document consists of words. • The more words two documents share, the more likely they are similar. • Each web document D can be represented in the following form: D = {d1, d2, …, dn}, where n is the total number of distinct words in the document collection. • di represents the appearance of the ith word in the document (1 means present, 0 means absent). • The order of the di is determined by the words' weights.

  4. How to calculate the weight? • The weight of word tj in web document Di combines term frequency and inverse document frequency: wij = tfij × idfj = tfij × log(n / dfj). • tfij is the number of occurrences of the word tj in the web document Di. • idfj is the inverse document frequency of word tj. • dfj is the number of web documents in which word tj occurs in the document collection. • n is the total number of web documents in the document collection. (A sketch of this computation follows.)
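The slide's weight formula translates directly into code. Below is a minimal Python sketch, assuming already-tokenized documents and the standard logarithmic idf; the function name tfidf_weights is illustrative, not from the slides:

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Compute tf-idf weights for a list of tokenized documents.

    docs: list of word lists, e.g. [["cat", "ate", "cheese"], ...]
    Returns one {word: weight} dict per document.
    """
    n = len(docs)                  # n: total number of documents
    df = Counter()                 # df_j: number of documents containing t_j
    for doc in docs:
        df.update(set(doc))

    weights = []
    for doc in docs:
        tf = Counter(doc)          # tf_ij: occurrences of t_j in D_i
        weights.append({
            w: tf[w] * math.log(n / df[w])   # w_ij = tf_ij * idf_j
            for w in tf
        })
    return weights
```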

  5. How to calculate the similarity between two web documents • Jaccard similarity measure: sim(D1, D2) = |D1 ∩ D2| / |D1 ∪ D2|, the ratio of shared words to all words appearing in either document. • Other common measures: Cosine, Dice, Overlap. (A sketch of Jaccard and Cosine follows.)
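A minimal Python sketch of the Jaccard measure on word sets, plus cosine similarity over sparse weight dictionaries like those produced by the tf-idf sketch above; both function names are illustrative:

```python
import math

def jaccard(a, b):
    """Jaccard similarity of two documents viewed as word sets:
    |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def cosine(a, b):
    """Cosine similarity of two sparse {word: weight} vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

For the suffix-tree example documents later in the talk, jaccard({"cat", "ate", "cheese"}, {"mouse", "ate", "cheese", "too"}) is 2/5 = 0.4.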

  6. Agglomerative hierarchical clustering (AHC) • Start by regarding each document as an individual cluster. • Merge the most similar pair of documents or document clusters (using the similarity measure), as in the sketch below. • Step 2 is executed iteratively until all objects are contained within a single cluster, which becomes the root of the tree.
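A naive Python sketch of the procedure, assuming a pairwise similarity function such as jaccard above. Group-average linkage is one common choice (the slide does not fix one), and the brute-force loop is written for clarity rather than efficiency:

```python
def ahc(docs, similarity):
    """Agglomerative hierarchical clustering sketch (group-average):
    start with singleton clusters and repeatedly merge the most similar
    pair until one cluster (the root of the tree) remains. The merge
    history is returned so the tree can be reconstructed."""
    clusters = [[d] for d in docs]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # average pairwise similarity between the two clusters
                sim = sum(similarity(a, b)
                          for a in clusters[i] for b in clusters[j])
                sim /= len(clusters[i]) * len(clusters[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        sim, i, j = best
        history.append((clusters[i][:], clusters[j][:], sim))
        clusters[i] += clusters.pop(j)   # merge the most similar pair
    return history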

  7. K-means clustering • Arbitrarily select K documents as seeds; they are the initial centroids of the clusters. • Assign all other documents to the closest centroid. • Recompute the centroid of each cluster to get new centroids. • Repeat steps 2 and 3 until the centroid of each cluster no longer changes. (A sketch follows.)
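A self-contained Python sketch of the four steps, assuming documents have already been mapped to fixed-length numeric vectors (e.g. tf-idf vectors); random seeding and the empty-cluster fallback are simplifications:

```python
import random

def kmeans(points, k, iters=100):
    """Basic k-means over equal-length numeric tuples.
    Step numbers refer to the slide above."""
    # step 1: arbitrarily select k points as the initial centroids
    centroids = [tuple(p) for p in random.sample(points, k)]
    for _ in range(iters):
        # step 2: assign every point to the closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        # step 3: recompute each centroid as the mean of its cluster
        # (keep the old centroid if a cluster happens to be empty)
        new = [tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:        # step 4: stop when centroids are stable
            break
        centroids = new
    return clusters
```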

  8. Some other refinement algorithms using the TF-IDF model • Bisecting K-means • Scatter/Gather

  9. Bisecting K-means 1. Select a cluster to split (there are several ways to choose, with no significant difference in clustering accuracy). We normally choose the largest cluster or the one with the lowest overall similarity. 2. Use the basic k-means algorithm to subdivide the chosen cluster. 3. Repeat step 2 a constant number of times, then perform the split that produces the clusters with the highest overall similarity. 4. Repeat steps 1-3 until the desired number of clusters is reached. (See the sketch below.)
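A sketch of bisecting k-means using scikit-learn's KMeans for the 2-way splits (an assumption; the slides do not name a library), with inertia standing in for "overall similarity" when keeping the best of several trial splits:

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, n_clusters, trials=5):
    """Bisecting k-means sketch over a (n_docs, n_features) array X:
    repeatedly split one cluster in two with basic k-means."""
    clusters = [np.arange(len(X))]                 # start with one cluster
    while len(clusters) < n_clusters:
        # step 1: pick the cluster to split (largest, per the slide)
        i = max(range(len(clusters)), key=lambda c: len(clusters[c]))
        idx = clusters.pop(i)
        # steps 2-3: run 2-means several times and keep the split with the
        # lowest inertia (a proxy for "highest overall similarity")
        best = min((KMeans(n_clusters=2, n_init=1).fit(X[idx])
                    for _ in range(trials)),
                   key=lambda km: km.inertia_)
        labels = best.labels_
        clusters += [idx[labels == 0], idx[labels == 1]]
    return clusters
```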

  10. How to represent a web document in the STC model • What is STC? Suffix Tree Clustering. • The whole web document is treated as a string. • Base clusters are identified by creating an inverted index of strings for the web document collection.

  11. A suffix tree example (courtesy of Zamir): • Three strings; each string is a document. • "cat ate cheese" • "mouse ate cheese too" • "cat ate mouse too"

  12. STC algorithm 1. Document cleaning. Strip word prefixes and suffixes and reduce plurals to singular (light stemming). Sentence boundaries are marked, and non-word tokens (such as numbers, HTML tags, and most punctuation) are stripped. 2. Identify base clusters. Create an inverted index of strings from the web document collection using a suffix tree. Each node of the suffix tree represents a group of documents and a string that is common to all of them; the label of the node is that common string. Each such node is a base cluster. (A dictionary-based sketch follows.)
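A Python sketch of base-cluster identification. For brevity it enumerates phrases from every suffix with a dictionary instead of building a real generalized suffix tree (which would yield the same index in linear time); the six-word cap mirrors the scoring function on the next slide:

```python
from collections import defaultdict

def base_clusters(docs, max_len=6):
    """Map every contiguous word phrase (up to max_len words) drawn from
    each document's suffixes to the set of documents containing it.
    docs: list of word lists, e.g. [["cat", "ate", "cheese"], ...]."""
    index = defaultdict(set)
    for doc_id, words in enumerate(docs):
        for start in range(len(words)):          # one pass per suffix
            stop = min(start + max_len, len(words))
            for end in range(start + 1, stop + 1):
                index[tuple(words[start:end])].add(doc_id)
    # a base cluster is a phrase shared by more than one document
    return {phrase: ids for phrase, ids in index.items() if len(ids) > 1}
```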

  13. STC algorithm (cont.) 3. Score base clusters. Each base cluster is assigned a score. • The score formula: S(B) = |B| × f(|P|) • |B| is the number of documents in base cluster B. • |P| is the number of words in string P that have a non-zero score. • The function f penalizes single-word strings, is linear for strings that are two to six words long, and becomes constant for longer strings. (A sketch follows.)
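A sketch of the scoring step. The slide specifies only the shape of f, so the single-word penalty value and the stopword handling below are illustrative assumptions:

```python
def score(cluster_docs, phrase, stopwords=frozenset()):
    """S(B) = |B| * f(|P|): f penalizes single-word phrases, is linear
    for 2-6 words, and constant beyond that (exact constants assumed)."""
    effective = [w for w in phrase if w not in stopwords]  # non-zero-score words
    p = len(effective)
    if p <= 1:
        f = 0.5            # penalty for single words (illustrative value)
    elif p <= 6:
        f = float(p)       # linear region
    else:
        f = 6.0            # constant for longer strings
    return len(cluster_docs) * f
```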

  14. STC algorithm (cont.) 4. Combine base clusters. The similarity measure used to combine base clusters is based on the overlap of their document sets. For base clusters Bx and By with sizes |Bx| and |By|, |Bx ∩ By| is the number of documents common to both. Define the similarity of Bx and By to be 1 if |Bx ∩ By|/|Bx| > 0.5 and |Bx ∩ By|/|By| > 0.5, and 0 otherwise. Two base clusters are connected if their similarity is 1. Using a single-link clustering algorithm, all connected base clusters are clustered together; all the documents in these base clusters constitute one web document cluster. (A union-find sketch follows.)
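A Python sketch of the merge step, using union-find to compute the single-link connected components; bases is assumed to be a list of document-ID sets, one per base cluster, such as the values returned by base_clusters above:

```python
def merge_base_clusters(bases, threshold=0.5):
    """Connect base clusters Bx, By when |Bx ∩ By|/|Bx| > threshold and
    |Bx ∩ By|/|By| > threshold, then return the connected components as
    final clusters (the union of their document sets)."""
    n = len(bases)
    parent = list(range(n))                  # union-find over base clusters

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            common = len(bases[i] & bases[j])
            if (common / len(bases[i]) > threshold and
                    common / len(bases[j]) > threshold):
                parent[find(i)] = find(j)    # similarity 1: connect them

    components = {}
    for i in range(n):
        components.setdefault(find(i), set()).update(bases[i])
    return list(components.values())
```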

  15. Link-based model • Idea: web pages that share common links with each other are very likely to be tightly related. • Each web document P is represented as two vectors: Pout (N-dimensional) and Pin (M-dimensional). • Pout,i indicates whether web document P has an out-link to the ith item of vector Pout. • Pin,j indicates whether web document P has an in-link from the jth item of vector Pin. For example: Pout = (link1, link2, …, linkN) covers all out-links in the web document collection; Pout,2 = 1 means the document has link2 as an out-link.

  16. Link-based algorithm 1. Filter irrelevant web documents. A document is regarded as irrelevant if the sum of its in-links and out-links is less than 2. 2. Use near-common links of a cluster to guarantee intra-cluster cohesiveness. Every cluster should have at least one 30% near-common link (a link shared by at least 30% of its documents). 3. Assign each web document to a cluster, generating base clusters, when:  the similarity between the document and the corresponding cluster is above the similarity threshold, and  the document has a link in common with the near-common links of the corresponding cluster. 4. Generate final clusters by merging base clusters. (A sketch of steps 1 and 2 follows.)
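A Python sketch of steps 1 and 2, assuming each document carries sets of in-link and out-link identifiers; the data layout and function names are illustrative assumptions, not the original implementation:

```python
from collections import Counter

def filter_irrelevant(docs_links):
    """Step 1: drop documents whose in-links plus out-links total < 2.
    docs_links maps doc_id -> (set_of_in_links, set_of_out_links)."""
    return {d: (ins, outs) for d, (ins, outs) in docs_links.items()
            if len(ins) + len(outs) >= 2}

def near_common_links(cluster_docs, docs_links, ratio=0.3):
    """Step 2 helper: links shared by at least `ratio` of the cluster's
    documents (the slide's 30% near-common links)."""
    counts = Counter()
    for d in cluster_docs:
        ins, outs = docs_links[d]
        counts.update(ins | outs)
    cutoff = ratio * len(cluster_docs)
    return {link for link, c in counts.items() if c >= cutoff}
```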

  17. How to evaluate the quality of the result clusters • Entropy 1) For each cluster, the class distribution of the data (we usually use the TREC-5 and TREC-6 document collections) is calculated first. 2) Using this class distribution, the entropy of each cluster j is calculated: Ej = -Σi pij log(pij), where pij is the fraction of cluster j's documents belonging to class i. 3) The best quality is when all the documents in a cluster fall into the same class, which is known before clustering. (A sketch follows.)
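A minimal Python sketch of the per-cluster entropy, where labels are the known classes of the documents assigned to one cluster; an entropy of 0 corresponds to the best case on the slide, all documents in one class:

```python
import math
from collections import Counter

def cluster_entropy(labels):
    """E_j = -sum_i p_ij * log(p_ij), where p_ij is the fraction of
    cluster j's documents that belong to known class i."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log(c / total)
                for c in counts.values())
```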

  18. How to evaluate the quality of the result clusters (cont.) • F-measure 1) For each given class, calculate the recall and precision of each cluster. 2) For cluster j and its corresponding class i: Recall(i, j) = nij / ni, Precision(i, j) = nij / nj, F(i, j) = (2 × Recall(i, j) × Precision(i, j)) / (Precision(i, j) + Recall(i, j)), where nij is the number of documents of class i in cluster j, ni the size of class i, and nj the size of cluster j. (See the sketch below.)
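The same formulas in a short Python sketch; nij, ni, and nj follow the slide's notation:

```python
def f_measure(n_ij, n_i, n_j):
    """F(i, j): harmonic mean of recall and precision for class i and
    cluster j. n_ij = documents of class i in cluster j, n_i = size of
    class i, n_j = size of cluster j."""
    recall = n_ij / n_i
    precision = n_ij / n_j
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)
```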

  19. Algorithm evaluation and comparison • TF-IDF-based AHC: good cluster quality; time complexity O(n²). • TF-IDF-based K-means: linear time complexity O(Kmn); sensitive to outliers. • STC: best for incremental clustering; linear time complexity O(n); has memory problems. • Link-based: linear time complexity O(mn); low dimensionality; good cluster quality.

  20. Future work • Each algorithm has its advantages and disadvantages. We need to refine these algorithms; sometimes we need trade-offs. • There is still room for improvement: 1. Improve the entropy and F-measure values of the result clusters (the F-measure value is under 0.6 for almost all algorithms, while the best possible is 1). 2. Decrease the response time (we often need to process a large document collection, so we need a fast algorithm).

  21. End
