Online Clustering of Web Search results. Shixian Chu. Two papers: O. Zamir and O. Etzioni. Web Document Clustering: A Feasibility Demonstration. In Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 1998.
Online Clustering of Web Search results Shixian Chu
Two papers: • O. Zamir and O. Etzioni. Web Document Clustering: A Feasibility Demonstration. In Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 1998. • Dell Zhang and Yisheng Dong. Semantic, Hierarchical, Online Clustering of Web Search Results. In Proceedings of the 6th Asia Pacific Web Conference (APWEB), Hangzhou, China, Apr 2004.
Introduction… The current state of information retrieval is far from satisfactory, for several possible reasons: • Many returned pages are useless or irrelevant; • A search engine may return thousands of pages when the user is interested in only a small part of that information; • Different users have different requirements and expectations for search results;
• Sometimes a search request cannot be expressed clearly in just a few keywords; • Synonymy (several words may correspond to the same concept) and polysemy (one word may have several different meanings) complicate matters further; • ......
Search results clustering can help to solve some of these problems: • Search results can be viewed as a database composed of thousands of documents. • All the results are clustered into hierarchical groups, with “key phrases” as the cluster names. • With hierarchical clusters, users can get an overview of the whole topic, or select only the clusters of interest to browse and ignore the irrelevant groups.
Example… • Clustered Search results of query “Jaguar”
“Web Document Clustering: A Feasibility Demonstration” O. Zamir and O. Etzioni.
What’s new? This paper introduces a linear-time (in the size of the document collection) algorithm called Suffix Tree Clustering (STC), which creates clusters based on phrases shared between documents. The authors report that STC is faster and more precise than standard clustering methods such as K-means and Buckshot.
Key requirements for Web document clustering methods: • Relevance: documents relevant to the query are grouped separately from irrelevant ones • Browsable summaries: key phrases that summarize each cluster • Overlap: one document may belong to several clusters • Snippet tolerance: produce high-quality clusters even with access only to the snippets returned by the search engine • Speed: clustering must be fast enough for online use
STC has three logical steps: • (1) document “cleaning”, • (2) identifying base clusters using a suffix tree, • (3) combining these base clusters into clusters.
Step 1 - Document "Cleaning" • Stemming: deleting word prefixes and suffixes and reducing plurals to singular • Marking sentence boundaries • Stripping non-word tokens (such as numbers, HTML tags and most punctuation)
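The cleaning step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `light_stem` is a crude, hypothetical stand-in for a real stemmer (it handles a couple of plural endings and nothing else), and the function names and regular expressions are not taken from the paper.

```python
import re

def light_stem(word):
    """Very rough plural reduction (assumption: a stand-in for a real stemmer)."""
    if len(word) > 3 and word.endswith("ies"):
        return word[:-3] + "y"
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def clean(snippet):
    """Return a list of sentences, each a list of lowercase, stemmed words."""
    text = re.sub(r"<[^>]+>", " ", snippet)      # strip HTML tags
    sentences = re.split(r"[.!?;]+", text)       # mark sentence boundaries
    cleaned = []
    for sentence in sentences:
        # keep alphabetic tokens only, dropping numbers and punctuation
        words = [light_stem(w) for w in re.findall(r"[a-z]+", sentence.lower())]
        if words:
            cleaned.append(words)
    return cleaned

print(clean("<b>Cats</b> ate 3 cheeses. Mice too!"))
```

Keeping sentences as separate word lists preserves the boundaries, so that later phrase matching does not run across the end of a sentence.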
Step 2 - Identifying Base Clusters We treat documents as strings of words, not characters, so each suffix contains one or more whole words. In more precise terms, for the suffix tree of a string S: • 1. A suffix tree is a rooted, directed tree. • 2. Each internal node has at least 2 children. • 3. Each edge is labeled with a non-empty sub-string of S.
Step 2 - Identifying Base Clusters • 4. No two edges out of the same node can have edge-labels that begin with the same word (hence it is compact). • 5. For each suffix s of the string S, there exists a suffix-node whose label (the concatenation of the edge-labels on the path from the root to that node) equals s.
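To make "suffixes of whole words" concrete, the short sketch below lists the word-level suffixes of a string. The function name `word_suffixes` is hypothetical and used only for illustration; it does not build a tree, it just enumerates the suffixes that such a tree would have to contain.

```python
def word_suffixes(text):
    """List the suffixes of a string when it is read as a sequence of words."""
    words = text.split()
    return [" ".join(words[i:]) for i in range(len(words))]

print(word_suffixes("cat ate cheese"))
```

A string of n words has n such suffixes, compared with one suffix per character in a character-level suffix tree.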
Step 2 - Identifying Base Clusters The following may be the snippets of three search result docs: • "cat ate cheese" (document 1) • "mouse ate cheese too" (document 2) • "cat ate mouse too" (document 3)
Step 2 - Identifying Base Clusters [Figure: the suffix tree of the strings "cat ate cheese", "mouse ate cheese too" and "cat ate mouse too"]
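Step 2 - Identifying Base Clusters • Every parent (internal) node whose phrase is shared by two or more documents is a base cluster; the phrase is the node's label, and the cluster contains the documents in which it occurs.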
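The base clusters for the three example snippets can be reproduced with the sketch below. It is an assumption-laden illustration, not the linear-time construction from the paper: instead of building a suffix tree, it enumerates every word n-gram (quadratic in snippet length) and keeps a phrase only if it would be an internal node, meaning it is followed by at least two different continuations, where the end of each document counts as its own unique terminator. The name `base_clusters` is hypothetical.

```python
from collections import defaultdict

def base_clusters(docs):
    """Map each shared phrase to the sorted ids of the documents containing it."""
    docs_of = defaultdict(set)   # phrase -> ids of documents containing it
    nexts = defaultdict(set)     # phrase -> distinct tokens that follow it
    for doc_id, text in enumerate(docs, start=1):
        words = text.split()
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                phrase = tuple(words[i:j])
                docs_of[phrase].add(doc_id)
                # the end of a document acts as a terminator unique to that document
                follower = words[j] if j < len(words) else ("$", doc_id)
                nexts[phrase].add(follower)
    return {
        " ".join(phrase): sorted(ids)
        for phrase, ids in docs_of.items()
        # internal node (branches at least two ways) shared by two or more documents
        if len(ids) >= 2 and len(nexts[phrase]) >= 2
    }

snippets = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
for phrase, ids in base_clusters(snippets).items():
    print(phrase, ids)
```

Note that "cat" alone is not reported: it is always followed by "ate", so in the suffix tree it lies in the middle of the edge leading to the "cat ate" node.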
Step 2 - Identifying Base Clusters • Each base cluster is assigned a score s(B) = |B| · f(|P|), • where |B| is the number of documents in base cluster B, • P is the phrase of cluster B, • |P| is the number of words in P that have a non-zero score, and • f penalizes single-word phrases, is linear for phrases of two to six words, and is constant for longer phrases. • We maintain a stoplist that is supplemented with Internet-specific words (e.g., “previous”, “java”, “frames” and “mail”). Words that appear in the stoplist, or that appear in too few (3 or fewer) or too many (more than 80% of the collection) documents, receive a score of zero.
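A minimal sketch of the scoring rule follows. The thresholds (3 documents, 80% of the collection) come from the slide, but the exact values of `f` are an assumption made here for illustration: 0.5 for a single effective word, the word count for two to six words, and 6 beyond that. All function and parameter names are hypothetical.

```python
def word_scores_zero(word, doc_freq, n_docs, stoplist, min_docs=3, max_fraction=0.8):
    """True if the word is a stopword, or appears in too few or too many documents."""
    df = doc_freq.get(word, 0)
    return word in stoplist or df <= min_docs or df > max_fraction * n_docs

def f(effective_length):
    """Assumed shape: penalize one word, linear for 2 to 6 words, constant after."""
    if effective_length == 0:
        return 0.0
    if effective_length == 1:
        return 0.5
    return float(min(effective_length, 6))

def base_cluster_score(doc_ids, phrase, doc_freq, n_docs, stoplist):
    """Score = |B| * f(|P|), where |P| counts only words with a non-zero score."""
    effective_length = sum(
        not word_scores_zero(w, doc_freq, n_docs, stoplist) for w in phrase.split()
    )
    return len(doc_ids) * f(effective_length)

# Made-up document frequencies for a collection of 100 snippets.
doc_freq = {"suffix": 20, "tree": 30, "the": 95, "java": 40, "rare": 2}
print(base_cluster_score(range(10), "the suffix tree", doc_freq, 100, {"java"}))
```

In this made-up example "the" occurs in more than 80% of the snippets, so only "suffix" and "tree" count toward the effective phrase length.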
Step 3 - Combining Base Clusters • Given two base clusters Bm and Bn, with sizes |Bm| and |Bn|, • and |Bm∩Bn| representing the number of documents common to both base clusters: • similarity(Bm, Bn) = 1 if |Bm∩Bn|/|Bm| > 0.5 and |Bm∩Bn|/|Bn| > 0.5; • similarity(Bm, Bn) = 0 otherwise.
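In the Zamir and Etzioni paper, base clusters with similarity 1 are linked in a graph, and each connected component of that graph becomes one final cluster. The sketch below illustrates this with a small union-find structure; the function names and the input layout (a dictionary from phrase to document-id set) are assumptions made for this example.

```python
def similar(bm, bn):
    """Binary similarity: the overlap must exceed half of both base clusters."""
    common = len(bm & bn)
    return common / len(bm) > 0.5 and common / len(bn) > 0.5

def merge_base_clusters(base):
    """Group base clusters into connected components of the similarity graph."""
    names = list(base)
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for i, m in enumerate(names):
        for n in names[i + 1:]:
            if similar(base[m], base[n]):
                parent[find(m)] = find(n)   # link the two components

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    # each final cluster: (its phrases, the union of their documents)
    return [
        (sorted(g), sorted(set().union(*(base[n] for n in g))))
        for g in groups.values()
    ]

base = {
    "cat ate": {1, 3}, "ate": {1, 2, 3}, "cheese": {1, 2},
    "mouse": {2, 3}, "too": {2, 3}, "ate cheese": {1, 2},
}
print(merge_base_clusters(base))
```

For the three-snippet example, the base cluster "ate" overlaps every other base cluster by more than half, so all six collapse into a single cluster containing documents 1, 2 and 3.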
“Semantic, Hierarchical, Online Clustering of Web Search Results” Dell Zhang and Yisheng Dong.
What’s new? • A document or snippet is treated as a string of characters, not as a string of words • Web search results are grouped semantically • Works not only for English but also for East Asian languages such as Chinese.
Step 1 - Document "Cleaning" • Stemming: deleting word prefixes and suffixes and reducing plurals to singular • Marking sentence boundaries • Stripping non-word tokens (such as numbers, HTML tags and most punctuation)
Step 2 – Key phrase extraction Extract phrases with high 1. “completeness”, 2. “stability”, and 3. “significance” as key phrases.
DEFINITION: Completeness • Suppose phrase S occurs at k distinct positions p_1, p_2, …, p_k in document D. S is “complete” if and only if the (p_i − 1)th token in D differs from the (p_j − 1)th token for at least one pair (i, j), 1 ≤ i < j ≤ k (called “left-complete”), and the (p_i + |S|)th token differs from the (p_j + |S|)th token for at least one pair (i, j), 1 ≤ i < j ≤ k (called “right-complete”). In other words, S is preceded by at least two different tokens and followed by at least two different tokens.
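The definition can be checked directly on a token sequence, as in the sketch below. Two points are assumptions made for this illustration rather than part of the definition: a neighbor that falls outside the document is represented by `None` and treated as a distinct token, and the function names are hypothetical. The tokens here are words for readability, but the same code works on a list of characters.

```python
def occurrences(tokens, phrase):
    """Start positions of every occurrence of phrase in tokens."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

def completeness(tokens, phrase):
    """Return (left_complete, right_complete) for phrase within tokens."""
    tokens, phrase = list(tokens), list(phrase)
    positions = occurrences(tokens, phrase)
    # tokens just before and just after each occurrence (None at a boundary)
    left = {tokens[p - 1] if p > 0 else None for p in positions}
    end = len(phrase)
    right = {tokens[p + end] if p + end < len(tokens) else None for p in positions}
    # "differs for at least one pair" means at least two distinct neighbors
    return len(left) >= 2, len(right) >= 2

doc = "new york city and new york state".split()
print(completeness(doc, ["new", "york"]))
print(completeness(doc, ["york"]))
```

In this example "new york" is complete on both sides, while "york" is not left-complete because it is always preceded by "new", which suggests that the longer phrase is the better key phrase candidate.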