Web Document Clustering

Web Document Clustering, by Sang-Cheol Seok. 1. Introduction: What is web document clustering, and why use it? Two results for the same query ‘amazon’: Google, currently the most powerful search engine, and MetaCrawler, a search engine that clusters the retrieved web documents. 2. Approaches.

Presentation Transcript


  1. Web Document Clustering By Sang-Cheol Seok

  2. 1. Introduction: What is web document clustering, and why use it? Two results for the same query ‘amazon’ • Google: currently the most powerful search engine • MetaCrawler: a search engine that clusters the retrieved web documents.

  3. 2. Approaches • Using the contents of documents • Using users’ usage logs • Using current search engines • Using hyperlinks • Other classical methods

  4. (1) Using Contents of Documents • Clusters are created from the snippets returned by web search engines. • Clusters based on snippets are almost as good as clusters created from the full text of the web documents. • Suffix Tree Clustering (STC): an incremental algorithm that runs in O(n) time. • It has three logical steps: (1) document “cleaning”, (2) identifying base clusters using a suffix tree, and (3) combining these base clusters into final clusters.
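
The three STC steps can be sketched roughly as follows. This is a simplified illustration, not the actual algorithm: a dictionary of word n-grams stands in for the suffix tree (so it is not linear time), and the function names, the `max_len` limit, and the 0.5 overlap threshold are assumptions made for the example.

```python
import re
from collections import defaultdict

def clean(snippet):
    """Step 1: lowercase the snippet and keep only word tokens (no stemming here)."""
    return re.findall(r"[a-z0-9]+", snippet.lower())

def base_clusters(snippets, max_len=3):
    """Step 2: map each phrase (word n-gram) to the set of snippets containing it.
    A plain dict of n-grams stands in for the suffix tree (assumption)."""
    phrase_docs = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        words = clean(snippet)
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase_docs[" ".join(words[i:i + n])].add(doc_id)
    # A base cluster is a phrase shared by at least two snippets.
    return {p: docs for p, docs in phrase_docs.items() if len(docs) >= 2}

def merge_clusters(base, threshold=0.5):
    """Step 3: link base clusters whose snippet sets overlap heavily,
    then return the connected components as final clusters."""
    phrases = list(base)
    parent = {p: p for p in phrases}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for i, a in enumerate(phrases):
        for b in phrases[i + 1:]:
            common = len(base[a] & base[b])
            if common / len(base[a]) > threshold and common / len(base[b]) > threshold:
                parent[find(a)] = find(b)

    merged = defaultdict(set)
    for p in phrases:
        merged[find(p)] |= base[p]
    return sorted(sorted(docs) for docs in merged.values())

snippets = [
    "Amazon rainforest river",
    "Amazon river basin",
    "Amazon online bookstore",
    "online bookstore deals",
]
print(merge_clusters(base_clusters(snippets)))  # [[0, 1, 2], [2, 3]]
```

Snippet 2 ends up in both clusters, which shows how a phrase-based approach allows overlapping clusters.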

  5. (2) Using users’ usage logs • Advantage: relevance information is objectively reflected by the usage logs • An experimental result on www.nasa.gov/

  6. (3) Using current web search engines: MetaCrawler • Step 1: When MetaCrawler receives a query, it posts the query to multiple search engines in parallel. • Step 2: It performs sophisticated pruning on the returned responses (about 75% of them are pruned as irrelevant, outdated, or unavailable). • MetaCrawler was developed at the University of Washington.
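
The two steps above can be sketched as follows. This is a minimal illustration, not MetaCrawler's actual implementation: each engine is assumed to be a callable that returns `(url, score)` pairs, and `metasearch`, `is_available`, and the stub engines are hypothetical names introduced for the example. The pruning shown here only removes duplicates and unavailable pages.

```python
from concurrent.futures import ThreadPoolExecutor

def metasearch(query, engines, is_available=lambda url: True):
    """Post the query to every engine in parallel, then merge and prune the results."""
    # Step 1: query all engines concurrently (pool.map preserves engine order).
    with ThreadPoolExecutor(max_workers=max(1, len(engines))) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))

    # Step 2: prune unavailable pages and collapse duplicate URLs,
    # keeping the highest score seen for each URL.
    best = {}
    for results in result_lists:
        for url, score in results:
            if not is_available(url):
                continue
            key = url.rstrip("/")
            if key not in best or score > best[key][1]:
                best[key] = (url, score)
    return sorted(best.values(), key=lambda item: -item[1])

# Stub engines standing in for real search services (hypothetical data).
def engine_a(query):
    return [("http://a.example/", 0.9), ("http://b.example", 0.4)]

def engine_b(query):
    return [("http://a.example", 0.7), ("http://dead.example", 0.8)]

results = metasearch("amazon", [engine_a, engine_b],
                     is_available=lambda url: "dead" not in url)
print(results)
```

A thread pool suits this step because the work is dominated by waiting on network responses rather than computation.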

  7. (4) Using hyperlinks • Consider web documents as vertices and hyperlinks as directed edges in a directed graph. • Similarity-based clustering methods have been used successfully in image segmentation. • Kleinberg’s HITS algorithm • is based purely on hyperlink information. • finds authority and hub documents for a user query. • covers only the most popular topics and leaves out the less popular ones.
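
The hub and authority idea behind HITS can be sketched as a simple iteration over the link graph. This is a minimal version under stated assumptions: the graph is a plain dict from each page to the pages it links to, the iteration count is fixed rather than tested for convergence, and the page names are made up for the example.

```python
import math

def hits(graph, iterations=50):
    """graph maps each page to the list of pages it links to.
    Returns (hub, authority) score dicts, each normalized to unit length."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in graph.get(p, [])) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical link graph: h1 and h2 are link pages, a and b are content pages.
graph = {"h1": ["a", "b"], "h2": ["a"], "a": [], "b": []}
hub, auth = hits(graph)
print(max(hub, key=hub.get), max(auth, key=auth.get))  # h1 a
```

Because scores reinforce each other along links, densely linked (popular) topics dominate the result, which is consistent with the limitation noted in the last bullet.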

  8. (4) Using Hyperlinks: continued • Cluster web documents based on both textual and hyperlink information. • The hyperlink structure is used as the dominant factor in the similarity metric.
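
One way to picture a similarity metric in which hyperlinks dominate is a weighted sum of a link-based score and a text-based score. The sketch below is only an illustration of that idea: the slides do not give the actual formula, so the Jaccard overlap of out-links, the cosine similarity of term counts, the 0.7 weight, and the document fields `text` and `links` are all assumptions.

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine similarity between the term-count vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def jaccard(links_a, links_b):
    """Share of out-links that two documents have in common."""
    a, b = set(links_a), set(links_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def combined_similarity(doc_a, doc_b, link_weight=0.7):
    """Weighted sum in which the hyperlink term dominates (link_weight > 0.5)."""
    link_sim = jaccard(doc_a["links"], doc_b["links"])
    text_sim = cosine(doc_a["text"], doc_b["text"])
    return link_weight * link_sim + (1 - link_weight) * text_sim

# Hypothetical documents: same out-links, completely different wording.
doc_a = {"text": "amazon river basin", "links": ["x.example", "y.example"]}
doc_b = {"text": "online bookstore deals", "links": ["x.example", "y.example"]}
print(combined_similarity(doc_a, doc_b))
```

With these inputs the text term contributes nothing, so the score comes entirely from the shared links.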

  9. (5) Other classical clustering methods • K-means method • HAC (hierarchical agglomerative clustering) • DBSCAN (Density-Based Spatial Clustering of Applications with Noise) • Single-link, group-average, complete-link, and single-pass methods, as well as Buckshot and Fractionation, have also been used.
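
As an illustration of the first method in this list, here is a minimal K-means sketch. It assumes documents have already been turned into numeric vectors, uses the first `k` points as initial centroids so the result is deterministic (real implementations usually pick them randomly), and runs a fixed number of iterations; the sample points are made up.

```python
def kmeans(points, k, iterations=20):
    """Plain K-means on tuples of numbers. Returns one cluster label per point."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(kmeans(points, k=2))  # [0, 0, 1, 1]
```

Each point receives exactly one label, so this method cannot produce overlapping clusters.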

  10. 3. Key requirements and future challenges (1) Key requirements for web document clustering methods • Relevance • Browsable summaries • Overlap • Speed • Incrementality (for some methods)

  11. 3. Key requirements and future challenges: continued (2) Concerns about current methods • Each method has pros and cons. • Using hyperlinks: the best accuracy, with some room left to improve, but it does not produce overlapping clusters. • STC: best for browsing and for incrementality. • MetaCrawler: best at pruning.

  12. 3. Key requirements and future challenges: continued Future challenges • We cannot combine all the advantages of every method. • Some advantages work against others. • So we have to make trade-offs. • Moreover, we need to find improvements.
