
Document Clustering

Document Clustering. Sung-Chien Lin (林頌堅), Department of Library and Information Studies, Shih-Hsin University. Contents: Research on Document Clustering; Possible Applications of Document Clustering; Document Clustering in a Networked Environment; Conclusions.


Presentation Transcript


  1. Document Clustering • Sung-Chien Lin (林頌堅) • Department of Library and Information Studies, Shih-Hsin University

  2. Contents • Research on Document Clustering • Possible Applications of Document Clustering • Document Clustering in a Networked Environment • Conclusions

  3. Research on Document Clustering

  4. Document Clustering • Definition • Documents with similar properties are assigned to automatically created groups • Importance • To improve the efficiency and effectiveness of retrieval • Time • Space • Quality • To determine the structure of the literature of a field • Exploration of latent information in documents • Reduction of users’ cognitive load

  5. Block Diagram • Pipeline (figure): Document Set → Feature Extraction → Features → Clustering → Clustered Documents → Applications • Determining Clustering Parameters • Cluster Structure • Nonhierarchical • Hierarchical • Halting Criteria • Number of Desired Clusters • Number of Iterations

  6. Research on Document Clustering • Features to represent documents • Linguistic structure in documents • Co-occurrences of terms • Semantic structure • Metadata of documents • Authors • Citation • Co-citation: other documents cite the examined documents together • Bibliographic coupling: the examined documents cite the same documents

  7. Research on Document Clustering • Measures of relevance between documents • Highly dependent on the choice of features used to represent documents • Several relevance measures • Vector space model (VSM, Salton) • Measure of relevance between documents d_i and d_j, computed from w_ik and w_jk, the weights of the kth term in d_i and d_j • Each weight combines the frequency of the kth term in the document with the inverse document frequency of the kth term • L: vocabulary size • Latent semantic indexing (LSI, Schütze) • Based on the Singular Value Decomposition (SVD) algorithm • Reduces the dimensionality of the feature vectors in VSM • Exploits latent semantic features of documents
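
The vector space measure can be sketched as follows. The slide's formula is not reproduced in this transcript, so the sketch assumes the common tf-idf weighting w_ik = tf_ik * log(N / df_k) with cosine normalization; the function names and the toy documents are illustrative only.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: w_ik = tf_ik * log(N / df_k).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy collection (hypothetical data).
docs = [
    "document clustering groups similar documents".split(),
    "clustering algorithms group documents".split(),
    "zoology studies animals".split(),
]
vecs = tfidf_vectors(docs)
```

With this toy data the first two documents share weighted terms, so their similarity is positive, while the third shares none with either.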

  8. Research on Document Clustering • Clustering algorithms • Agglomerative hierarchical clustering algorithm (AHC) • Algorithm 1. Put each document in the collection into its own cluster 2. Identify the two closest clusters and combine them into a new cluster 3. Repeat Step 2 until the halting criterion is met • O(N^2) • K-Means algorithm • O(NK) • Buckshot algorithm • Fast, linear-time algorithm • A K-Means algorithm in which the initial cluster centroids are created by applying AHC to a sample of the documents in the collection
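
A minimal sketch of the three algorithms, assuming documents are already represented as numeric feature vectors. Euclidean distance between centroids, the fixed iteration count, and the sample size of sqrt(K * N) are assumptions made for this illustration, not details taken from the slides.

```python
import math
import random

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return tuple(sum(xs) / len(points) for xs in zip(*points))

def ahc(points, k):
    """Agglomerative clustering: start with singletons, then repeatedly
    merge the two clusters with the closest centroids until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters

def kmeans(points, centers, iterations=10):
    """K-Means from the given initial centers; returns the final clusters."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [centroid(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

def buckshot(points, k, seed=0):
    """Seed K-Means with centroids found by AHC on a random sample."""
    rng = random.Random(seed)
    sample_size = min(len(points), max(k, int(math.sqrt(k * len(points)))))
    sample = rng.sample(points, sample_size)
    centers = [centroid(c) for c in ahc(sample, k)]
    return kmeans(points, centers)

# Two well-separated groups of toy points (hypothetical data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
```

Running AHC only on the sample keeps its quadratic cost small relative to the full collection, which is the reason Buckshot can stay fast overall.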

  9. Possible Applications of Document Clustering

  10. Query Routing • Documents are distributed across several information servers • Relevant documents are clustered and placed on the same server or on nearby servers • A description is generated to represent all the documents in a cluster • When retrieval takes place • Identifying relevant clusters based on the relevance between the query and the cluster descriptions • Forwarding the query to the servers for those clusters • Merging the results • An example (figure): the query “document clustering” and the clusters Library Science, Computer Science, Zoology, and Geology
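
The routing steps can be sketched as below. The overlap-count scoring, the server names, the cluster descriptions, and the document titles are all hypothetical choices for this illustration; the slides do not specify how relevance to a cluster description is measured.

```python
def route_query(query_terms, cluster_descriptions, top_n=1):
    """Rank servers by how many query terms occur in their cluster
    description and return up to top_n servers with a non-zero score."""
    query = set(query_terms)
    scores = {server: len(query & set(desc))
              for server, desc in cluster_descriptions.items()}
    ranked = sorted(scores, key=lambda s: (-scores[s], s))
    return [s for s in ranked[:top_n] if scores[s] > 0]

def search(query_terms, servers, cluster_descriptions, top_n=1):
    """Forward the query to the selected servers and merge their results."""
    query = set(query_terms)
    results = []
    for server in route_query(query_terms, cluster_descriptions, top_n):
        results.extend(doc for doc in servers[server]
                       if query & set(doc.lower().split()))
    return results

# Hypothetical cluster descriptions and server contents.
descriptions = {
    "library-science": ["document", "classification", "catalog", "retrieval"],
    "computer-science": ["clustering", "algorithm", "document", "index"],
    "zoology": ["animal", "species", "habitat"],
    "geology": ["rock", "mineral", "fault"],
}
servers = {
    "library-science": ["Document classification in libraries"],
    "computer-science": ["Clustering algorithms for text", "Compiler design"],
    "zoology": ["Animal habitats"],
    "geology": ["Rock formation"],
}
```

Only the servers whose descriptions match the query are contacted, so the unrelated servers in this toy setup never see the request.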

  11. Cluster-based Browsing • The problem of expressing a vague information need as a formal query • Scatter/Gather (Cutting et al., SIGIR’92) • Clustering documents into topic-coherent groups • Presenting descriptive summaries of the clusters to users • Users browse the summaries and choose promising clusters in the hierarchy • Documents in the selected clusters are clustered again and new summaries are generated • Finally, documents are retrieved • An example (figure): clusters Library Science, Computer Science, Zoology, and Geology, with sub-clusters Library Automation and Information Retrieval

  12. Result Set Clustering • Users’ queries are often very short (about 1-3 words) • Result sets include both relevant and irrelevant documents • Clustering documents in the result set according to their degree of relevance • Helping users figure out their real information needs • Making relevant documents easy to retrieve • An example (figure): the query “Multimedia” with result clusters Video, Hypermedia, and Virtual Reality

  13. Result Set Expansion • Relevant documents may not match the input query well • Clustering relevant documents with sophisticated features and clustering algorithms during the data preparation phase • Retrieving a core set of documents that match the query • Expanding the results with documents that do not match the query but are clustered with documents in the core set • Figure: Query → Core Set → Expanded Result Set
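
A minimal sketch of the expansion step, assuming cluster labels were computed offline and that a document matches when it contains every query term. The function name, the matching rule, and the data are illustrative assumptions.

```python
def expand_results(query_terms, docs, cluster_of):
    """Return (core, expansion).

    docs:       {doc_id: list of terms}
    cluster_of: {doc_id: cluster label}, assumed to be precomputed
    """
    query = set(query_terms)
    # Core set: documents containing every query term.
    core = [d for d, terms in docs.items() if query <= set(terms)]
    core_clusters = {cluster_of[d] for d in core}
    # Expansion: non-matching documents sharing a cluster with the core set.
    expansion = [d for d in docs
                 if d not in core and cluster_of[d] in core_clusters]
    return core, expansion

# Hypothetical collection: d2 never mentions "clustering" but was
# grouped with d1 during data preparation.
docs = {
    "d1": ["document", "clustering"],
    "d2": ["text", "grouping"],
    "d3": ["rock", "mineral"],
}
cluster_of = {"d1": 0, "d2": 0, "d3": 1}
core, expansion = expand_results(["clustering"], docs, cluster_of)
```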

  14. Query Refinement • Terms in queries do not match the information needs of users • Dynamically computing and suggesting recall- and precision-enhancing terms for a given query • Term suggestion • Grouping retrieved documents into topic-cohesive clusters • Terms in centroid documents: general concepts • Terms in marginal documents: specific concepts

  15. Document Clustering in a Networked Environment

  16. Web Pages vs. Plain Texts • The lexical distributions of these two kinds of documents are significantly different • Web pages include more proper nouns and terms but fewer verbs • Information in web pages may be in multimedia form • Currently difficult to represent and retrieve • Web pages contain rich link information • More than 90% of web pages include <A> tags • Each web page contains 15 links on average • Term-based clustering techniques designed for plain texts are not directly applicable to web pages • Link structure provides useful information for determining relevance among web pages

  17. HTML Tags in Web Pages • Tags provide helpful information for understanding the meaning expressed by the pages • Tags for page composition • Bold <B>, Italic <I>, Underline <U>, Font <Font> • Tags for document structure • Title <Title> • Header <Head> • Headline <H1>, <H2>, <H3> • List item <Li> • Tags for link structure across pages • Anchor <A> • Tagged terms are information that the authors consider important • Tagged terms could be given higher weights to enhance retrieval effectiveness
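
Tag-based term weighting can be sketched with the standard library's `html.parser`. The weight values in `TAG_WEIGHTS` are hypothetical (the slides give no numbers), and the parser uses a deliberately simple rule: a term receives the largest weight among the tags currently open around it.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical boosts; untagged text counts 1.0 per occurrence.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "h3": 2.0, "a": 2.0,
               "b": 1.5, "i": 1.5, "u": 1.5, "li": 1.2}

class WeightedTermParser(HTMLParser):
    """Accumulate term weights, boosting terms inside emphasized tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.weights = Counter()

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)  # tag names arrive lowercased

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        boost = max((TAG_WEIGHTS.get(t, 1.0) for t in self.open_tags),
                    default=1.0)
        for term in data.lower().split():
            self.weights[term] += boost

parser = WeightedTermParser()
parser.feed("<html><head><title>Clustering</title></head>"
            "<body><p>plain <b>bold</b> clustering</p></body></html>")
parser.close()
```

In this sketch an unclosed tag such as a bare `<li>` stays open until an enclosing tag closes, so real pages would need more careful handling.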

  18. An Example of a Web Page • Figure annotations: List Item, Anchor Text, Tag <I>

  19. Connectivity Analysis • A link between two pages establishes a relation between them • The similarity between two pages could be estimated using • The length of the shortest path between the two pages • The path length from the two pages to their least common ancestor • The path length from the two pages to their greatest common descendant • An example (figure): a link graph over pages A through J, in which E is more similar to A than D
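
The first estimate, shortest path length, can be sketched with a breadth-first search. Treating links as undirected is an assumption of this sketch, and the page names are hypothetical rather than taken from the slide's figure.

```python
from collections import deque

def shortest_path_length(links, source, target):
    """Number of links on the shortest path between two pages, following
    links in either direction; None if the pages are not connected."""
    neighbors = {}
    for page, outlinks in links.items():
        for other in outlinks:
            neighbors.setdefault(page, set()).add(other)
            neighbors.setdefault(other, set()).add(page)
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        page, depth = queue.popleft()
        if page == target:
            return depth
        for nxt in neighbors.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return None

# Hypothetical site: home links to about and news; news links to story.
links = {"home": ["about", "news"], "news": ["story"]}
```

A shorter path suggests a closer relation, so under this measure `home` is closer to `story` (two links) than `about` is (three links).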

  20. Information of Link Structure • Authority page: a page that contains a lot of information about the topic • Authority: if page p has a link to page q, the author of p confers authority on q • Link popularity → page authority • Hub page: a page that has links to authority pages • Mutually reinforcing relationship • A good hub page points to many good authority pages • A good authority page is pointed to by many good hub pages • Figure: Hubs pointing to Authorities
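
The mutually reinforcing relationship can be sketched as an iterative computation in the spirit of the hub and authority scores described by Kleinberg. The fixed iteration count, the Euclidean normalization, and the toy graph are assumptions of this sketch.

```python
import math

def hits(links, iterations=20):
    """links: {page: [pages it links to]}.
    Returns (hub, authority) score dictionaries."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = dict.fromkeys(pages, 1.0)
    auth = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score: sum of the authority scores of pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize to unit length so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical graph: h1 and h2 act as hubs, a1 and a2 as authorities.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(links)
```

Here `a1` is pointed to by both hubs and `h1` points to both authorities, so they end up with the highest authority and hub scores respectively.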

  21. Information of Anchor Text • The text around links pointing to a page is often a description of that page • Anchor text can be used to determine the relevance of the link • An example (figure): distribution of “Yahoo” in the anchor texts of 5000 web pages pointing to Yahoo! • From: http://decweb.ethz.ch/WWW7/1898/com1898.htm

  22. Conclusions

  23. Conclusions • Document clustering is an important technique for improving efficiency and effectiveness in information retrieval • The range of possible applications is wide • Technologies of document clustering • Extraction of features to represent documents • Relevance functions between documents • Clustering algorithms • Retrieval of web information relies more and more on information about the web structure

  24. Important References • P. Willett, “Recent Trends in Hierarchic Document Clustering: A Critical Review,” Information Processing and Management, 24(5), 577-597. • E. Rasmussen, “Clustering Algorithms,” in Information Retrieval: Data Structures and Algorithms, ed. by W. B. Frakes and R. Baeza-Yates, Chap. 16, 419-442. • D. R. Cutting, D. Karger and J. O. Pedersen, “A Cluster-based Approach to Browsing Large Document Collections,” Proceedings of SIGIR’92, 318-329. • J. Kleinberg, “Authoritative Sources in a Hyperlinked Environment,” IBM Research Report RJ 10076, May 1997.
