Clustering Web Search Results

Presentation Transcript


    1. Clustering Web Search Results Iwona Bialynicka-Birula

    2. Overview
        - What is clustering?
        - Applying clustering to web search results
        - Clustering algorithms
        - Case studies
        Related topics not covered
        - Clustering
            - Clustering in general
            - Document clustering in general
        - Other search and browsing aids
            - Classification
            - Visualization
            - Query expansion

    3. What is clustering?
        - Clustering: the act of grouping similar objects into sets
        - In the web search context: organizing web pages (search results) into groups, so that different groups correspond to different user needs
        - E.g. for the query "engine": search engine, car part, Engine Corp.

    4. Clustering vs. Classification
        - Classification assigns objects to predefined groups
        - Clustering infers groups based on the clustered objects

    5. Why cluster web search results?
        - Flat ranked list not enough
            - Documents pertaining to different topics cannot be compared
        - Relationships between the results
            - Cluster Hypothesis (van Rijsbergen 1979): closely related documents tend to be relevant to the same requests
        - Aids user-engine interaction
            - Browsing
            - Helps the user express their need

    6. Why not just document clustering?
        Web search results clustering is a version of document clustering, but:
        - Billions of pages
        - Constantly changing
        - Data mainly unstructured and heterogeneous
        - Additional information to consider (e.g. links, click-through data, etc.)

    7. Some requirements
        - Fast: immediate response to query
        - Flexible: web content changes constantly
        - User-oriented: main goal is to aid the user in finding sought information

    8. Main issues
        - Online or offline clustering?
        - What to use as input
            - Entire documents
            - Snippets
            - Structure information (links)
            - Other data (e.g. click-through)
            - Use stop word lists, stemming, etc.
        - How to define similarity?
            - Content (e.g. vector-space model)
            - Link analysis
            - Usage statistics
        - How to group similar documents?
        - How to label the groups?
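
As an illustration of content-based similarity in the vector-space model mentioned on this slide, here is a minimal Python sketch. It assumes plain term-frequency vectors over whitespace tokens and cosine similarity; the slides do not specify weighting, stop word handling, or stemming, so the helper names `term_vector` and `cosine_similarity` and the sample snippets are illustrative only.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (lowercased, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty snippet is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical result snippets for an ambiguous query
snippets = [
    "search engine ranking algorithms",
    "web search engine results",
    "car engine repair parts",
]
vectors = [term_vector(s) for s in snippets]

for i in range(len(snippets)):
    for j in range(i + 1, len(snippets)):
        print(i, j, round(cosine_similarity(vectors[i], vectors[j]), 2))
```

The two search-related snippets share more terms with each other than with the car-related one, so they score higher; a real system would typically add stop word removal, stemming, and term weighting on top of this.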

    9. Clustering algorithms
        - Flat or hierarchical?
        - Overlapping?
        - Hard or soft?
        - Incremental?
        - Predefined cluster number?
        - Requiring explicit similarity measure? Distance measure?

    10. Clustering algorithms
        - Distance-based
            - Hierarchical
                - Agglomerative Hierarchical Clustering (AHC)
            - Flat
                - K-means (can be fuzzy)
                - Single-pass (incremental)
        - Other
            - Suffix Tree Clustering (Grouper)
            - Self-organizing (Kohonen) maps (neural networks)
            - Latent Semantic Indexing (LSI) (reducing the dimensionality of the vector space)

    11. Agglomerative hierarchical clustering

    12. Clustering result: dendrogram

    13. AHC variants
        - Various ways of calculating cluster similarity
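
The AHC slides above can be illustrated with a small Python sketch. It assumes 2-D points with Euclidean distance and the three common linkage rules (single, complete, average) as the "various ways of calculating cluster similarity"; the slides do not say which variants were shown, so these choices and the function names are assumptions. The recorded merge list is the information a dendrogram would draw.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cluster_distance(c1, c2, points, linkage):
    """Distance between two clusters (lists of point indices)."""
    dists = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":      # closest pair
        return min(dists)
    if linkage == "complete":    # farthest pair
        return max(dists)
    if linkage == "average":     # mean over all pairs
        return sum(dists) / len(dists)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerative(points, k, linkage="single"):
    """Merge the two closest clusters until only k remain.

    Returns the final clusters and the merge history (the dendrogram data).
    This naive version recomputes all pairwise distances on every step.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], points, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters, merges


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = agglomerative(points, 3, linkage="average")
print(clusters)
```

Stopping at `k` clusters is only one way to cut the hierarchy; running down to a single cluster and keeping `merges` gives the full tree.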

    14. K-means clustering (k=3)
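
A minimal sketch of K-means with k=3, as named on this slide, might look like the following. It assumes 2-D points and squared Euclidean distance; the `initial` parameter (explicit starting centroids) is a hypothetical convenience added here so the example is reproducible, and the sample points are made up.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, initial=None, iterations=100, seed=0):
    """Plain K-means: alternate assignment and centroid update until stable."""
    rng = random.Random(seed)
    # Start from the given centroids, or from k randomly chosen points
    centroids = list(initial) if initial else rng.sample(points, k)
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centroids


points = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
]
labels, centers = kmeans(points, 3, initial=[(0, 0), (10, 10), (20, 0)])
print(labels)
```

With random initialisation the result depends on the starting centroids and can settle in a poor local optimum, which is one reason the number of clusters and the seeding strategy matter in practice.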

    15. Single-pass
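
Single-pass (incremental) clustering can be sketched as follows. This is one common formulation, assumed here since the slide gives only the name: each item is compared with the centroids of the clusters built so far and either joins the nearest one or starts a new cluster. The distance threshold and the sample points are illustrative.

```python
import math


def single_pass(points, threshold):
    """Assign each point, in arrival order, to the nearest existing cluster
    if it is within `threshold` of that cluster's centroid; otherwise
    open a new cluster."""
    clusters = []  # each cluster: {"members": [...], "centroid": (...)}
    for p in points:
        best, best_d = None, None
        for c in clusters:
            d = math.dist(p, c["centroid"])
            if best_d is None or d < best_d:
                best, best_d = c, d
        if best is not None and best_d <= threshold:
            best["members"].append(p)
            n = len(best["members"])
            # Recompute the centroid as the mean of all members
            best["centroid"] = tuple(sum(x) / n for x in zip(*best["members"]))
        else:
            clusters.append({"members": [p], "centroid": tuple(p)})
    return clusters


points = [(0, 0), (0, 1), (10, 10), (1, 0), (10, 11)]
for c in single_pass(points, threshold=3):
    print(c["members"])
```

Each point is seen only once, so the method suits streaming input, but the outcome depends on the order of arrival and on the threshold.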

    16. Selected systems
        - Scatter/Gather
        - Grouper
        - Carrot2
        - Vivisimo
        - Mapuccino
        - (Su et al. 2001)
        - SHOC

    17. Scatter/Gather (Cutting et al. 1992)
        - Designed for browsing
        - Based on two novel clustering algorithms
            - Buckshot: fast, for online clustering
            - Fractionation: accurate, for offline initial clustering of the entire set

    18. Grouper (Zamir and Etzioni 1997, 1999)
        - Online
        - Operates on query result snippets
        - Clusters together documents with large common subphrases
        - Suffix Tree Clustering (STC)
        - STC induces labeling

    19. Suffix Tree Clustering (STC)
        - Linear
        - Incremental
        - Overlapping
        - Can be extended to hierarchical

    20. STC algorithm
        - Step 1: Cleaning
            - Stemming
            - Sentence boundary identification
            - Punctuation elimination
        - Step 2: Suffix tree construction
            - Produces base clusters (internal nodes)
            - Base clusters are scored based on size and phrase score (which depends on length and word quality)
        - Step 3: Merging base clusters
            - Highly overlapping clusters are merged
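
The three STC steps above can be sketched in Python under some loud simplifications. This is not Grouper's implementation: instead of a real suffix tree it enumerates word n-grams in a dictionary (which yields the same shared phrases on small inputs but is not linear-time), cleaning skips stemming and sentence boundaries, the score is simply cluster size times phrase length, and the 0.5 mutual-overlap threshold for merging is an assumption. All function names and snippets are illustrative.

```python
import re
from collections import defaultdict


def clean(snippet):
    """Step 1 (simplified): lowercase and drop punctuation; no stemming."""
    return re.findall(r"[a-z0-9]+", snippet.lower())


def base_clusters(snippets, max_len=4):
    """Step 2 (stand-in for the suffix tree): map each phrase of up to
    max_len words to the set of snippets that contain it."""
    phrase_docs = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        words = clean(snippet)
        for start in range(len(words)):
            for end in range(start + 1, min(start + max_len, len(words)) + 1):
                phrase_docs[tuple(words[start:end])].add(doc_id)
    # A base cluster is a phrase shared by at least two snippets
    return {p: d for p, d in phrase_docs.items() if len(d) >= 2}


def score(phrase, docs):
    """Simplified score: more documents and longer phrases rank higher."""
    return len(docs) * len(phrase)


def merge(bases, overlap=0.5):
    """Step 3: join base clusters whose document sets mostly coincide."""
    phrases = list(bases)
    parent = list(range(len(phrases)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(phrases)):
        for j in range(i + 1, len(phrases)):
            a, b = bases[phrases[i]], bases[phrases[j]]
            common = len(a & b)
            if common / len(a) > overlap and common / len(b) > overlap:
                parent[find(i)] = find(j)

    groups = defaultdict(lambda: {"docs": set(), "phrases": []})
    for i, p in enumerate(phrases):
        group = groups[find(i)]
        group["docs"] |= bases[p]
        group["phrases"].append(" ".join(p))  # shared phrases double as labels
    return list(groups.values())


snippets = [
    "Cheap car engine parts online.",
    "Car engine parts and repair",
    "Web search engine ranking",
    "The best web search engine",
]
bases = base_clusters(snippets)
for cluster in merge(bases):
    print(sorted(cluster["docs"]), cluster["phrases"])
```

On these snippets the word "engine" forms its own cluster covering all four documents while the car and web-search phrases form two smaller ones, so a document can belong to several clusters, and the shared phrases serve directly as cluster labels.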

    21. Carrot2 (Stefanowski and Weiss 2003)
        http://www.cs.put.poznan.pl/dweiss/carrot/
        - Component framework
        - Allows substituting components for
            - Input (e.g. snippets from other search engines)
            - Filter
                - Stemming
                - Distance measure
                - Clustering
            - Output

    22. Vivisimo
        http://www.vivisimo.com/
        - Commercial
        - Online
        - Hierarchical
        - Conceptual

    23. Other
        - Mapuccino (IBM) (Maarek et al. 2000)
          http://www.alphaworks.ibm.com/tech/mapuccino
            - Relatively efficient AHC (O(n^2))
            - Similarity based on vector-space model
        - (Su et al. 2001)
            - Only usage statistics used as input
            - Recursive Density Based Clustering
        - SHOC (Zhang and Dong 2004)
            - Grouper-like
            - Key phrase discovery

    24. References
        - Douglass Cutting, David Karger, Jan Pedersen, and John W. Tukey. Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, 1992. In Proceedings of the 15th Annual International ACM/SIGIR Conference, Copenhagen.
        - O. Zamir and O. Etzioni. Grouper: a dynamic clustering interface to web search results, May 1999. In Proceedings of the Eighth International World Wide Web Conference, Toronto, Canada.
        - M. Steinbach, G.
        - Y.S. Maarek, R. Fagin, I.Z. Ben-Shaul, D. Pelleg. Ephemeral document clustering for web applications, 2000. Technical Report RJ 10186, IBM Research.
        - Zhong Su, Qiang Yang, HongJiang Zhang, Xiaowei Xu and Yuhen Hu. Correlation-based Document Clustering using Web Logs, 2001.
        - J. Stefanowski, D. Weiss. Carrot2 and Language Properties in Web Search Results Clustering, 2003. In Lecture Notes in Artificial Intelligence: Advances in Web Intelligence, Proceedings of the First International Atlantic Web Intelligence Conference, Madrid, Spain, vol. 2663, pp. 240-249.
        - Dell Zhang, Yisheng Dong. Semantic, Hierarchical, Online Clustering of Web Search Results, Apr 2004. In Proceedings of the 6th Asia Pacific Web Conference (APWEB), Hangzhou, China.

    25. Thank you
        Questions?
        http://www.di.unipi.it/~iwona/Clustering.ppt
