
Clustering the Tagged Web


Presentation Transcript


  1. Clustering the Tagged Web D. Ramage, P. Heymann, C. Manning, & H. Garcia-Molina from Stanford InfoLab ACM Conference on Web Search and Data Mining (WSDM 2009) IDS Lab. Seminar Spring 2009 Mar. 20th, 2009 강 민 석 minsuk@europa.snu.ac.kr

  2. Contents • Introduction • Problem Statement • Main Topics • Experiments • Further Studies • Conclusion

  3. Contents • Introduction • Problem Statement • Main Topics • Experiments • Further Studies • Conclusion

  4. Introduction: Clustering the Web • One of the most promising approaches to handling the inherent ambiguity of user queries is automatic clustering of web pages. • Cluster hypothesis: “the associations between documents convey information about the relevance of documents to requests” [Figure: clustering a set of documents]

  5. Introduction: Social Bookmarking & Tags • Tags promise a uniquely well-suited source of information about the similarity between web documents. • This paper is the first to systematically evaluate how best to use tags for clustering web documents.

  6. Problem Statement • How can tagging data best be used to improve web document clustering?

  7. Contents • Introduction • Problem Statement • Main Topics • Clustering Algorithms • Combining Words & Tags • Evaluation Metric • Experiments • Further Studies • Conclusion

  8. Main Topics [Overview diagram, flattened in the transcript] • Title: “Clustering the Tagged Web” • Goal: document clustering for search • Topics and choices: Clustering Algorithm (K-means, MM-LDA); Modeling a Document (how to combine words & tags in the VSM); Evaluation Metric (use ODP, use the F-score)

  9. Clustering Algorithm • partitions a set of web documents into groups of similar documents • similar to the standard clustering task, except each document has tags as well as words • we look at two algorithms: K-means, based on the VSM, and an LDA-derived algorithm based on a probabilistic model [Notation: K = number of clusters; documents 1,2,…,D; each document is a bag of words plus a bag of tags]
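To fix the setting, here is a minimal Python sketch of the input representation both algorithms consume: each document carries two bags of observations. All example values are invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class TaggedDocument:
    """A web page as both algorithms see it: a bag (multiset) of
    page words plus a bag of social tags."""
    words: Counter  # term -> count in the page text
    tags: Counter   # tag -> count across the page's bookmarks

# Hypothetical example document.
doc = TaggedDocument(
    words=Counter({"python": 3, "tutorial": 2, "code": 5}),
    tags=Counter({"programming": 7, "python": 4}),
)
print(doc.words.most_common(2))  # [('code', 5), ('python', 3)]
```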

  10. K-means clustering • a simple and highly scalable clustering algorithm • based on the Vector Space Model (VSM) • assigns each document to one of K groups by iteratively re-assigning documents to the cluster with the nearest centroid • All documents are vectors, and the dimensionality is the size of the vocabulary. • Then, how should we model the documents? Images from http://www.cs.cmu.edu/~dpelleg/kmeans.html
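A minimal sketch of VSM K-means with scikit-learn; the corpus, the choice of K, and the tf weighting are illustrative, not the paper's setup.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# Hypothetical corpus: one string of page text per document.
docs = [
    "python tutorial code examples",
    "java programming language guide",
    "travel blog photos europe",
    "hiking trails national parks",
]

# tf weighting via raw counts, L2-normalized so Euclidean distance
# between unit vectors behaves like cosine distance.
X = normalize(CountVectorizer().fit_transform(docs))

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # cluster id per document, e.g. [0 0 1 1]
```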

  11. MM-LDA (Multi-Multinomial Latent Dirichlet Allocation) • A variation of LDA, a generative probabilistic topic model • LDA models each document as a mixture of hidden topic variables; each topic is associated with a distribution over words. • LDA adds fully generative probabilistic semantics to pLSI, which is itself a probabilistic version of LSI (Latent Semantic Indexing). • We extend LDA to jointly account for words and tags as distinct sets of observations. [Diagram: pLSI → LDA → MM-LDA]

  12. MM-LDA (Multi-Multinomial Latent Dirichlet Allocation) • A variation of LDA, a generative probabilistic topic model • LDA models each document as a mixture of hidden topic variables; each topic is associated with a distribution over words. Process generating a collection of tagged documents:
  Step 1: For each topic k, draw a multinomial distribution beta_k of size |W| from a Dirichlet distribution with parameter eta_w.
  Step 2: For each topic k, draw a multinomial distribution gamma_k of size |T| from a Dirichlet distribution with parameter eta_t.
  Step 3: For each document i, draw a multinomial distribution theta_i of size K from a Dirichlet distribution with parameter alpha.
  Step 4: For each word j in document i, draw a topic z_j from theta_i, then draw a word w_j from beta_{z_j}.
  Step 5: For each tag j in document i, draw a topic z_j from theta_i, then draw a tag t_j from gamma_{z_j}.
  [Figure: graphical representation of MM-LDA]
  • Steps 1, 3, and 4 are equivalent to standard LDA. • In step 2, we construct distributions of tags per topic. • In step 5, we sample a topic for each tag. • MM-LDA parameters are learned using Gibbs sampling.
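To make the generative story concrete, here is a forward-sampling sketch in NumPy. All sizes and hyperparameter values are toy choices, not the paper's; inference runs the other way, recovering theta, beta, and gamma from the observed words and tags via Gibbs sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

K, W, T, D = 4, 100, 30, 5              # topics, word vocab, tag vocab, docs (toy sizes)
alpha, eta_w, eta_t = 0.1, 0.01, 0.01   # symmetric Dirichlet hyperparameters

beta = rng.dirichlet(np.full(W, eta_w), size=K)   # step 1: word dist per topic
gamma = rng.dirichlet(np.full(T, eta_t), size=K)  # step 2: tag dist per topic

corpus = []
for i in range(D):
    theta = rng.dirichlet(np.full(K, alpha))      # step 3: topic mixture per doc
    words = [rng.choice(W, p=beta[rng.choice(K, p=theta)])
             for _ in range(50)]                  # step 4: topic, then word
    tags = [rng.choice(T, p=gamma[rng.choice(K, p=theta)])
            for _ in range(10)]                   # step 5: topic, then tag
    corpus.append((words, tags))
```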

  13. Combining Words & Tags • The key question: how should we model the documents in the VSM? • Five ways to model a document with a bag of words and a bag of tags as a vector V: Words Only; Tags Only; Words + Tags; Tags as Words Times n; Tags as New Words

  14. Combining Words & Tags • Example: the word vocabulary has 8 words and the tag vocabulary has 6 tags. [Figure: the resulting document vectors under Words Only, Tags Only, Words + Tags, Tags as Words Times 2, and Tags as New Words]
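One plausible reading of the five schemes, sketched in NumPy on the slide's 8-word / 6-tag example. The per-channel normalization and the word-tag vocabulary alignment below are assumptions for illustration, not taken from the paper.

```python
import numpy as np

# Toy counts matching the slide's example: 8-word vocab, 6-tag vocab.
w = np.array([2, 0, 1, 0, 0, 3, 0, 1], dtype=float)  # word counts
t = np.array([0, 4, 0, 1, 0, 2], dtype=float)        # tag counts

def unit(x):
    n = np.linalg.norm(x)
    return x / n if n else x

words_only = w                                        # |W| dimensions
tags_only = t                                         # |T| dimensions
tags_as_new_words = np.concatenate([w, t])            # one vector over |W|+|T| terms
words_plus_tags = np.concatenate([unit(w), unit(t)])  # two channels, normalized separately

# "Tags as Words Times n": fold tags into the word vocabulary with
# multiplicity n. The index map is a made-up alignment of tag strings
# to matching word dimensions, purely for illustration.
n = 2
tag_to_word = {1: 3, 3: 0, 5: 6}
tags_as_words_times_n = w.copy()
for ti, wi in tag_to_word.items():
    tags_as_words_times_n[wi] += n * t[ti]
```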

  15. Evaluation of Cluster Quality • It is difficult to evaluate clustering algorithms. • Several previous studies compared their output with a hierarchical web directory. • We derive gold-standard clusters from the ODP (Open Directory Project).

  16. Evaluation of Cluster Quality • We compare the generated clusters with the clustering derived from the ODP using the F1 measure. • The F1 cluster-evaluation measure is the harmonic mean of precision and recall.
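In symbols, using the standard set-matching form of the cluster F-score (the paper's exact per-class aggregation is not shown in the transcript), where C is a generated cluster and L a gold-standard ODP class:

```latex
P(C, L) = \frac{|C \cap L|}{|C|}, \qquad
R(C, L) = \frac{|C \cap L|}{|L|}, \qquad
F_1(C, L) = \frac{2\,P(C,L)\,R(C,L)}{P(C,L) + R(C,L)}
```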

  17. Contents • Introduction • Problem Statement • Main Topics • Experiments • Term Weighting in VSM • How to Combine Words and Tags • Compare MM-LDA and K-means • Further Studies • Conclusion

  18. Term Weighting in the VSM • A document vector V assigns a weight to each vocabulary term: V = (weight(t_1), …, weight(t_|V|)). • Then, how should the weights be assigned? • We consider two common functions: tf and tf-idf. [Figure: tf vs. tf-idf weighting on K-means, F1-score for 2,000 documents] • Conclusion • Words+Tags outperforms words alone under both weightings. • tf on Words+Tags outperforms tf-idf on Words+Tags. • tf-idf performs poorly because it over-emphasizes the rarest terms.
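A minimal sketch of the two weighting functions on a toy count matrix; the idf variant here is one common formulation, and the paper's exact formula is not shown in the transcript.

```python
import numpy as np

def tf(counts):
    """tf weighting: the raw term count itself."""
    return counts.astype(float)

def tf_idf(counts):
    """tf-idf weighting with a common idf formulation: log(D / df).
    Terms appearing in few documents get boosted, which can
    over-emphasize the rarest (often noisy) terms."""
    D = counts.shape[0]                 # documents x vocabulary matrix
    df = (counts > 0).sum(axis=0)       # document frequency per term
    idf = np.log(D / np.maximum(df, 1))
    return counts * idf

# Toy term-count matrix: 3 documents x 4 terms.
C = np.array([[3, 0, 1, 0],
              [2, 1, 0, 0],
              [0, 0, 0, 5]])
print(tf_idf(C))  # the term seen in only one document gets the largest boost
```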

  19. How to Combine Words and Tags with the VSM • Which of the five ways to model a document works best in the VSM? • Ten runs of tf weighting on 13,230 documents. [Figure: F1-score for K-means (tf) with several means of combining words and tags] • Conclusion • The Words+Tags model outperforms every other model. • Tags are a qualitatively different type of content than “just more words”. • K-means can incorporate tagging data as an independent information channel.

  20. How to Combine Words and Tags with MM-LDA • Then… how about MM-LDA? [Figure: F1-score for MM-LDA with several means of combining words and tags] • Conclusion • Here too, the Words+Tags model outperforms all other configurations. • Interestingly, performance decreases in some configurations when tags are added directly to the words, due in part to the very different distributional statistics observed for words vs. tags.

  21. Compare MM-LDA and K-means • Which model is better? [Figure: F-scores for K-means and MM-LDA on 13,320 documents] • Conclusion • The inclusion of tagging data improves performance. • MM-LDA's Words+Tags model is significantly better than all other models.

  22. Experiments • Highest-scoring tags & words from clusters generated by K-means & MM-LDA [table omitted from the transcript]

  23. Contents • Introduction • Problem Statement • Main Topics • Experiments • Further Studies • Tags vs. Anchor Text • More Specific Subtrees • Conclusion

  24. Tags vs. Anchor Text • Q: Do the advantages of tagging data hold up in the presence of anchor text? • A: Yes, the inclusion of tagging data still improves cluster quality. [Figure: F1 score in the presence of anchor text] • Tags are different from anchor text. • Performance is depressed for Anchors-as-Words in the VSM because of the VSM's sensitivity to the weights of the now-noisier terms. • Words+Anchors did not do well because of the difficulty of extracting quality anchor text. • Results might be improved by down-weighting anchor words or by more advanced weighting techniques (see the sketch below).
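A minimal sketch of what the down-weighting suggestion could look like in a two-channel VSM setup; the channel weight is an arbitrary illustrative value, not a figure from the paper.

```python
import numpy as np

def unit(x):
    n = np.linalg.norm(x)
    return x / n if n else x

def words_plus_anchors(word_vec, anchor_vec, anchor_weight=0.3):
    """Two-channel document vector: page words plus anchor text,
    with the noisier anchor channel scaled down. anchor_weight=0.3
    is an arbitrary illustrative value, not from the paper."""
    return np.concatenate([unit(word_vec), anchor_weight * unit(anchor_vec)])

# Hypothetical vectors over a 4-term word vocab and a 3-term anchor vocab.
v = words_plus_anchors(np.array([2., 0., 1., 3.]), np.array([5., 0., 1.]))
print(v)
```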

  25. More Specific Subtrees • Does the impact of tags depend on the specificity of the clustering? • We selected two representative ODP subtrees: Programming Languages (Top-Programming-Languages category) and Social Sciences (Top/Society/Social Sciences category). [Figures: F-scores for the Programming Languages category; F-scores for the Social Sciences category] • Tags > Words+Tags: clustering on tags alone outperforms alternatives that use word information. • A higher proportion of the remaining tags are direct indicators of sub-category membership.

  26. Contents • Introduction • Problem Statement • Main Topics • Experiments • Further Studies • Conclusion

  27. Conclusion • Social tagging data provides a useful source of information for web page clustering, a task at the core of several IR applications. • Tagging data improves performance compared to clustering on page text alone. • Extending K-means to treat tags as an independent channel enables it to better exploit tagging data. • A novel algorithm, MM-LDA, does even better.

  28. Clustering the Tagged Web Thank you~
