Document Clustering

Document Clustering • Sung-Chien Lin • Department of Library and Information Studies, Shih-Hsin University

Contents • Research on Document Clustering • Possible Applications of Document Clustering • Document Clustering in a Networked Environment • Conclusions

Research on Document Clustering

Document Clustering • Definition • Documents with similar properties are assigned to automatically created groups • Importance • To improve the efficiency and effectiveness of retrieval • Time • Space • Quality • To determine the structure of the literature of a field • Exploration of latent information in documents • Reduction of users' cognitive load

Block Diagram • [Figure: block diagram with the boxes Document Set, Feature Extraction, Features, Clustering, Clustered Documents, Applications, and Determining Clustering Parameters] • Clustering parameters • Cluster Structure • Nonhierarchical • Hierarchical • Halting Criteria • Number of Desired Clusters • Number of Iterations

Research on Document Clustering • Features to represent documents • Linguistic structure in documents • Co-occurrences of terms • Semantic structure • Meta-data of documents • Authors • Citation • Co-citation: two documents are related when the same documents cite both of them • Bibliographic coupling: two documents are related when both cite the same documents
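The two citation-based features can be illustrated with a small sketch. The citation data and the names `cites`, `bibliographic_coupling`, and `co_citation` are hypothetical and chosen only for illustration; the counts are one simple way to turn citation links into pairwise features, not necessarily the weighting used in the work surveyed here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical citation data: each key cites the documents in its set.
cites = {
    "p1": {"a", "b"},
    "p2": {"a", "b", "c"},
    "p3": {"b", "c"},
}

def bibliographic_coupling(cites):
    """Count the references shared by each pair of citing documents."""
    strength = {}
    for x, y in combinations(sorted(cites), 2):
        shared = len(cites[x] & cites[y])
        if shared:
            strength[(x, y)] = shared
    return strength

def co_citation(cites):
    """Count how many documents cite both members of each cited pair."""
    counts = Counter()
    for refs in cites.values():
        for pair in combinations(sorted(refs), 2):
            counts[pair] += 1
    return dict(counts)

print(bibliographic_coupling(cites))
print(co_citation(cites))
```

Either count can then serve as a pairwise relevance value in place of, or alongside, term-based features.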

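Research on Document Clustering • Measures of relevance between documents • Highly dependent on the choice of features used to represent documents • Several relevance measures • Vector space model (VSM, Salton) • Latent semantic indexing (LSI, Schütze) • Based on the Singular Value Decomposition (SVD) algorithm • Reduces the dimensionality of the feature vectors in VSM • Exploits latent semantic features of documents • Measure of relevance between documents $d_i$ and $d_j$ (the cosine form commonly used with the VSM): $\mathrm{sim}(d_i, d_j) = \frac{\sum_{k=1}^{L} w_{ik} w_{jk}}{\sqrt{\sum_{k=1}^{L} w_{ik}^2}\,\sqrt{\sum_{k=1}^{L} w_{jk}^2}}$ • $w_{ik}$ and $w_{jk}$: weights of the $k$th term in $d_i$ and $d_j$ • $w_{ik} = tf_{ik} \cdot idf_k$, where $tf_{ik}$ is the frequency of the $k$th term in $d_i$ and $idf_k$ is the inverse document frequency of the $k$th term • $L$: vocabulary size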
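A minimal sketch of a VSM relevance measure follows, assuming term weights of the form term frequency times inverse document frequency and the cosine between weight vectors as the measure. The exact idf variant, log(N / df), and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document (weight = tf * idf)."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed idf variant: log(N / df); terms found in every document get weight 0.
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    "document clustering groups similar documents".split(),
    "clustering algorithms group documents".split(),
    "zoology studies animals".split(),
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The first two documents share weighted terms and score above zero, while the third shares none with the first and scores zero.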

Research on Document Clustering • Clustering algorithms • Agglomerative hierarchical clustering algorithm (AHC) • Algorithm • 1. Put each document in the collection into its own cluster • 2. Identify the two closest clusters and merge them into a new cluster • 3. Repeat Step 2 until the halting criterion is met • O(N^2) • K-Means algorithm • O(NK) • Buckshot algorithm • Fast, linear-time algorithm • A K-Means algorithm whose initial cluster centroids are created by applying AHC to a sample of the documents in the collection
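The three AHC steps can be sketched as below. This is a naive illustration, not an optimized implementation: it uses average-link distance between clusters (an assumption, since no linkage rule is specified here), halts on a desired number of clusters, and recomputes distances on every pass, so it is slower than the O(N^2) figure quoted above. The names `ahc`, `points`, and `distance` are hypothetical.

```python
def ahc(points, num_clusters, distance):
    """Agglomerative clustering; returns clusters as lists of point indices."""
    # Step 1: every document starts in its own cluster.
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        # Average-link: mean pairwise distance between members (an assumption).
        total = sum(distance(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    # Step 3: repeat until the halting criterion (desired cluster count) is met.
    while len(clusters) > max(num_clusters, 1):
        # Step 2: find the two closest clusters and merge them.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

points = [0, 1, 2, 10, 11]
print(ahc(points, 2, lambda a, b: abs(a - b)))
```

In a Buckshot-style setup, the same routine would be run on a sample of the collection and the resulting clusters used to seed K-Means.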

Possible Applications of Document Clustering

Query Routing • Documents are distributed across several information servers • Relevant documents are clustered and placed on one server or on nearby servers • A description is generated to represent all the documents in a cluster • When retrieval takes place • Identify relevant clusters from the relevance between the query and the cluster descriptions • Forward the query to the servers holding those clusters • Merge the results • An example • Query: "document clustering" • Clusters: Library Science, Computer Science, Zoology, Geology
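The routing steps can be sketched as follows. The cluster descriptions, server names, and the use of Jaccard overlap between query terms and description terms are all assumptions for illustration; any relevance measure between a query and a cluster description could take its place.

```python
def route_query(query_terms, cluster_descriptions, servers, top_n=2):
    """Return the servers of the clusters whose descriptions best match the query."""
    query = set(query_terms)
    scored = []
    for cluster, description in cluster_descriptions.items():
        desc = set(description)
        union = query | desc
        if not union:
            continue
        # Jaccard overlap between query and description (an assumed measure).
        overlap = len(query & desc) / len(union)
        if overlap > 0:
            scored.append((overlap, cluster))
    scored.sort(reverse=True)
    return [servers[cluster] for _, cluster in scored[:top_n]]

def merge_results(result_lists):
    """Merge per-server (doc_id, score) lists into one ranking, best score first."""
    best = {}
    for results in result_lists:
        for doc_id, score in results:
            best[doc_id] = max(score, best.get(doc_id, score))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

# Hypothetical descriptions and server names.
descriptions = {
    "Library Science": ["document", "classification", "catalog"],
    "Computer Science": ["document", "clustering", "algorithm"],
    "Zoology": ["animal", "species"],
    "Geology": ["rock", "mineral"],
}
servers = {name: "server-" + name.split()[0].lower() for name in descriptions}
print(route_query(["document", "clustering"], descriptions, servers))
```

With these made-up descriptions, the query reaches only the Computer Science and Library Science servers; `merge_results` then combines whatever ranked lists those servers return.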

Cluster-based Browsing • The problem of expressing a vague information need as a formal query • Scatter/Gather (Cutting et al., SIGIR'92) • Clustering documents into topic-coherent groups • Presenting descriptive summaries of the clusters to users • Users browse the summaries and select promising clusters • Documents in the selected clusters are clustered again and new summaries are generated • Finally, documents are retrieved • [Figure: clusters Library Science, Computer Science, Zoology, Geology; finer clusters Library Automation, Information Retrieval]

Result Set Clustering • Users' queries are often very short (about 1-3 words) • The result set includes both relevant and irrelevant documents • Clustering documents in the result set according to their degree of relevance • Helping users figure out their real information needs • Easily retrieving relevant documents • An example • Query: "Multimedia" • Clusters: Video, Hypermedia, Virtual Reality

Result Set Expansion • Relevant documents may not match the input query well • Clustering relevant documents with sophisticated features and clustering algorithms in the data preparation phase • Retrieving a core set of documents that match the query • Expanding the results with documents that do not match the query but are clustered with documents in the core set • [Figure: Query, Core Set, Expanded Result Set]
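The expansion step can be sketched as below, assuming cluster assignments were computed beforehand and that a document "matches" when it contains every query term. Both the matching rule and the names `expand_results`, `documents`, and `cluster_of` are hypothetical.

```python
def expand_results(query_terms, documents, cluster_of):
    """Return (core_set, expanded_set) of document ids for a query."""
    query = set(query_terms)
    # Core set: documents that contain every query term (assumed matching rule).
    core = {doc_id for doc_id, terms in documents.items() if query <= set(terms)}
    # Expansion: every document that shares a cluster with a core document.
    core_clusters = {cluster_of[doc_id] for doc_id in core}
    expanded = {doc_id for doc_id in documents if cluster_of[doc_id] in core_clusters}
    return core, expanded

# Hypothetical collection; cluster ids come from an offline clustering run.
documents = {
    "d1": ["document", "clustering"],
    "d2": ["text", "grouping"],
    "d3": ["zoology"],
}
cluster_of = {"d1": 0, "d2": 0, "d3": 1}
print(expand_results(["clustering"], documents, cluster_of))
```

Here `d2` never mentions the query term but is returned because it was clustered with `d1`.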

Query Refinement • Terms in queries do not match the information needs of users • Dynamically computing and suggesting recall- and precision-enhancing terms for a given query • Term suggestion • Grouping retrieved documents into topic-cohesive clusters • Terms in centroid documents: general concepts • Terms in marginal documents: specific concepts

Document Clustering in a Networked Environment

Web Pages vs. Plain Texts • Lexical distributions of these two kinds of documents are significantly different • Web pages include more proper nouns and terms but fewer verbs • Information in web pages may be in multimedia form • Difficult to represent and retrieve at present • Web pages contain rich link information • More than 90% of web pages include <A> tags • Each web page contains 15 links on average • Term-based clustering techniques designed for plain texts are ill-suited to clustering web pages • Link structure provides useful information for determining relevance among web pages

HTML Tags in Web Pages • Tags provide helpful information for understanding the meaning expressed by the pages • Tags for presentation • Bold <B>, Italic <I>, Underline <U>, Font <Font> • Tags for document structure • Title <Title> • Head <Head> • Headline <H1>, <H2>, <H3> • List item <Li> • Tags for link structure across pages • Anchor <A> • Tagged terms carry information that the authors consider important • Tagged terms could be given higher weights to enhance the effectiveness of retrieval
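Tag-based term weighting can be sketched with Python's standard `html.parser` module. The weight values in `TAG_WEIGHTS` are made up for illustration, and taking the largest weight among the enclosing tags is only one possible design choice.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights; untagged text counts 1.0.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "h3": 2.0,
               "a": 2.0, "b": 1.5, "i": 1.5, "u": 1.5, "li": 1.2}

class WeightedTermParser(HTMLParser):
    """Accumulate term weights according to the tags enclosing each term."""

    def __init__(self):
        super().__init__()
        self.open_tags = []       # weighted tags currently open
        self.weights = Counter()  # term -> accumulated weight

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the innermost open occurrence of this tag, if any.
        # Unclosed tags (e.g. <li> without </li>) stay open in this simple sketch.
        for idx in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[idx] == tag:
                del self.open_tags[idx]
                break

    def handle_data(self, data):
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags), default=1.0)
        for term in data.lower().split():
            self.weights[term] += weight

def weighted_terms(html):
    parser = WeightedTermParser()
    parser.feed(html)
    parser.close()
    return parser.weights

page = "<h1>Clustering</h1><p>plain clustering <b>important</b> <a href='x.html'>links</a></p>"
print(weighted_terms(page))
```

The resulting weights could replace raw term frequencies when building feature vectors for web pages.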

An Example of a Web Page • [Figure: a web page annotated with a list item, anchor text, and an <I> tag]

Connectivity Analysis • A link between two pages establishes a relation between them • The similarity between two pages could be estimated using • The length of the shortest path between the two pages • The path lengths from the two pages to their least common ancestor • The path lengths from the two pages to their greatest common descendant • [Figure: link graph over pages A to J; in the example, E is more similar to A than D is]
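The first estimate, shortest-path length, can be sketched with a breadth-first search. The link graph below is hypothetical and is not the one in the figure; treating links as undirected is also an assumption.

```python
from collections import deque

def shortest_path_length(links, source, target):
    """Breadth-first search over links treated as undirected; None if unreachable."""
    neighbors = {}
    for page, targets in links.items():
        for t in targets:
            neighbors.setdefault(page, set()).add(t)
            neighbors.setdefault(t, set()).add(page)
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        page, dist = queue.popleft()
        if page == target:
            return dist
        for nxt in neighbors.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

# Hypothetical link structure: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"]}
print(shortest_path_length(links, "D", "E"), shortest_path_length(links, "D", "F"))
```

A shorter path suggests more closely related pages; a similarity score could be derived as, for instance, 1 / (1 + length).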

Information of Link Structure • Authority page: a page that contains a lot of information about the topic • Authority: if a page p has a link to page q, the author of page p confers authority on q • Link popularity → page authority • Hub page: a page that has links to authority pages • Mutually reinforcing relationship • A good hub page points to many good authority pages • A good authority page is pointed to by many good hub pages • [Figure: hubs pointing to authorities]
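The mutually reinforcing relationship can be sketched as an iterative computation in the spirit of Kleinberg's hub and authority scores. The fixed iteration count, the Euclidean normalization, and the example graph are assumptions for illustration.

```python
import math

def hits(links, iterations=20):
    """Return (hub, authority) score dicts for a page -> linked-pages mapping."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # A page's hub score is the sum of the authority scores of pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical graph: three hub-like pages pointing at two candidate authorities.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this toy graph the page with the most incoming links from good hubs gets the top authority score, and hubs that point to more authorities score higher than those pointing to fewer.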

Information of Anchor Text • The text around links pointing to a page is often a description of the page • The information in anchor text could be used to determine the relevance of the link • [Figure: distribution of "Yahoo" in the anchor texts of 5000 web pages pointing to Yahoo!] • From: http://decweb.ethz.ch/WWW7/1898/com1898.htm

Conclusions

Conclusions • Document clustering is an important technique for improving efficiency and effectiveness in information retrieval • It has a wide range of possible applications • Technologies of document clustering • Extraction of features to represent documents • Relevance functions between documents • Clustering algorithms • Retrieval of web information relies more and more on information about the web structure

Important References • P. Willett, "Recent Trends in Hierarchic Document Clustering: A Critical Review," Information Processing and Management, 24(5), 577-597. • E. Rasmussen, "Clustering Algorithms," in Information Retrieval: Data Structures and Algorithms, ed. by W. B. Frakes and R. Baeza-Yates, Chap. 16, 419-442. • D. R. Cutting, D. Karger and J. O. Pedersen, "A Cluster-based Approach to Browsing Large Document Collections," Proceedings of SIGIR'92, 318-329. • J. Kleinberg, "Authoritative Sources in a Hyperlinked Environment," IBM Research Report RJ 10076, May 1997.
