Creating Concept Hierarchies in a Customer Self-Help System Bob Wall CS 535 04/29/05
Outline • Introduction / motivation • Background • Algorithm • Feature selection / feature vector generation • Hierarchical agglomerative clustering (HAC) • Tree partitioning • Results / conclusions
Introduction • Application – customer self-help (FAQ) system • RightNow Technologies’ Customer Service module • Need ways to organize Knowledge Base (KB) • System already organizes documents (answers) using clustering • Desirable to also organize user queries
Goals • Create concept hierarchy from user queries • Domain-specific • Self-guided (no human intervention / guidance required) • Present hierarchy to help guide users in navigating KB • Demonstrate the types of queries that can be answered by system • Automatically augment searches with related terms
Background • Problem – clustering short text segments • Queries contain too little information to provide context for clustering • Need some external source of context • Possible solution – use the Web as that source • Cilibrasi and Vitanyi proposed a mechanism for extracting the meaning of words using Google searches • Chuang and Chien presented a more detailed algorithm for clustering short segments using the text snippets returned by a search engine
Algorithm • Use each text segment as an input query to a search engine • Process the resulting text snippets using stemming and stop word lists to extract related terms (keywords) • Select a set of keywords and build feature vectors • Cluster using Hierarchical Agglomerative Clustering (HAC) • Compact the tree using min-max partitioning
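The snippet-processing step can be sketched in Python. The stop-word list and the crude suffix-stripping stemmer below are toy stand-ins for the real stop lists and stemmer (e.g. Porter) a production system would use:

```python
import re

# Toy stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "how", "do", "i"}

def stem(word):
    # Crude suffix stripping as a stand-in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def extract_keywords(snippet):
    # Tokenize, drop stop words, and stem what remains.
    tokens = re.findall(r"[a-z]+", snippet.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

keywords = extract_keywords("Resetting passwords and changing account settings")
```

Each query's snippets would be run through `extract_keywords`, and the resulting terms pooled as that query's candidate features.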
KB-Specific Version – HAC-KB • Choose set of user queries, corresponding answers • Find list of keywords corresponding to those answers • Trim down list to reasonable length • Generate feature vectors • HAC clustering • Min-max partitioning
Available Data • Answers • Documents forming the KB – actually question and answer, plus keywords and other information such as product and category associations • Ans_phrases • Extracted from answers using stop word lists and stemming • One-, two-, and three-word phrases • Counts of occurrences in different parts of the answer • Keyword_searches • List of user queries – also filtered by stop word lists and stemmed • List of answers matching each query
Feature Selection • Select the N most frequent user queries • Select the set of all answers matching those queries • Select the set of all keywords found in those answers • Reduce to a list of K keywords • Avoid removing all keywords associated with a query (would generate an empty feature vector) • Try to eliminate keywords that provide little discrimination (ones associated with many queries) • Also eliminate keywords that map to only a single query
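A minimal sketch of this selection step, assuming maps from queries to matching answers and from answers to keywords. Keywords kept are those shared by at least two queries but by no more than a fraction of all queries; the safeguard against emptying a query's feature vector is omitted for brevity:

```python
from collections import defaultdict

def select_keywords(query_answers, answer_keywords, max_query_fraction=0.5):
    # Map each query to the set of keywords found in its matching answers.
    query_keywords = {
        q: {kw for a in answers for kw in answer_keywords[a]}
        for q, answers in query_answers.items()
    }
    # Invert: keyword -> set of queries it is associated with.
    kw_queries = defaultdict(set)
    for q, kws in query_keywords.items():
        for kw in kws:
            kw_queries[kw].add(q)
    n = len(query_answers)
    # Drop keywords tied to a single query (no clustering value) and
    # keywords tied to too many queries (little discrimination).
    return {kw for kw, qs in kw_queries.items()
            if 2 <= len(qs) <= max_query_fraction * n}
```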
Feature Vector Generation • Generate a map from queries to keywords, and an inverse map from keywords to queries • Use the TF-IDF (term frequency / inverse document frequency) metric for weighting: vi,j = tfi,j × log(N / nj) • vi,j is the weight of the jth keyword for the ith query • tfi,j is the number of times keyword j occurs in the list of answers associated with query i • nj is the number of queries associated with keyword j • The result is an N × K feature matrix
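The weighting above, sketched directly from the definitions (standard TF-IDF); `query_keyword_counts` maps each query to the per-keyword occurrence counts in its matching answers:

```python
import math

def tfidf_matrix(query_keyword_counts, keywords):
    """Build an N x K TF-IDF feature matrix: v[i][j] = tf(i,j) * log(N / n(j))."""
    n_queries = len(query_keyword_counts)
    # n(j): number of queries associated with keyword j.
    df = {kw: sum(1 for counts in query_keyword_counts.values() if kw in counts)
          for kw in keywords}
    matrix = []
    for counts in query_keyword_counts.values():
        row = [counts.get(kw, 0) * math.log(n_queries / df[kw]) if df[kw] else 0.0
               for kw in keywords]
        matrix.append(row)
    return matrix
```

Note that a keyword appearing in every query gets weight zero (log(N/N) = 0), which is exactly the low-discrimination case the feature selection step tries to filter out.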
Standard HAC Algorithm • Initialize clusters – one cluster per query • Initialize the similarity matrix • Uses the average linkage metric with cosine similarity between feature vectors • Matrix is upper-triangular (similarity is symmetric)
HAC (cont.) • For N – 1 iterations • Pick two root-node clusters with largest similarity • Combine into new root-node cluster • Add new cluster to similarity matrix – compute similarity with all other root-level clusters • Generates tall binary tree of clusters • 2N – 1 nodes • Not particularly usable by humans
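The HAC loop on these two slides can be sketched as follows. This naive version recomputes pairwise similarities on every merge rather than maintaining the upper-triangular similarity matrix described above, so it is for illustration only:

```python
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hac(vectors):
    """Average-linkage HAC; returns a binary tree of nested 2-tuples
    whose leaves are indices into `vectors`."""
    # Each root cluster: (subtree, list of member vector indices).
    clusters = [(i, [i]) for i in range(len(vectors))]
    while len(clusters) > 1:  # N - 1 merge iterations
        # Pick the two root clusters with the largest average-linkage similarity.
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sims = [cosine(vectors[i], vectors[j])
                        for i in clusters[a][1] for j in clusters[b][1]]
                avg = sum(sims) / len(sims)
                if avg > best:
                    best, pair = avg, (a, b)
        # Combine them into a new root cluster.
        a, b = pair
        merged = ((clusters[a][0], clusters[b][0]),
                  clusters[a][1] + clusters[b][1])
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (a, b)] + [merged]
    return clusters[0][0]
```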
Min-Max Partitioning • Need to combine nodes in cluster tree, produce a shallow, bushy multi-way tree • Recursive partitioning algorithm • MinMaxPartition(Cluster sub-tree) • For each possible cut level in tree, compute quality of cut • Choose best-quality cut level • For each subtree cut off, recursively process • Stop at max depth or max cluster size
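A simplified sketch of the recursive partitioning, assuming the binary-tree representation from the HAC sketch (nested 2-tuples with query ids as leaves). The `quality` function is left abstract here; the actual score, combining intra/inter-cluster similarity with a size preference, is described on the next slide:

```python
def subtrees_at_depth(tree, d):
    # Collect the subtrees obtained by cutting a binary tree at depth d.
    if d == 0 or not isinstance(tree, tuple):
        return [tree]
    left, right = tree
    return subtrees_at_depth(left, d - 1) + subtrees_at_depth(right, d - 1)

def min_max_partition(tree, quality, max_depth=3, depth=0):
    """Flatten a binary cluster tree into a shallow multi-way tree.
    `quality` scores a candidate cut (a list of subtrees); higher is better."""
    if not isinstance(tree, tuple) or depth >= max_depth:
        return tree
    # Enumerate candidate cuts: the subtree lists produced by each cut level.
    candidates, d = [], 1
    while True:
        cut = subtrees_at_depth(tree, d)
        candidates.append(cut)
        if all(not isinstance(t, tuple) for t in cut):
            break  # cut reached the leaves; no deeper cuts possible
        d += 1
    # Choose the best-quality cut and recursively process each subtree.
    best = max(candidates, key=quality)
    return [min_max_partition(t, quality, max_depth, depth + 1) for t in best]
```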
Choosing the Best Cut • Goal: maximize intra-cluster similarity and minimize inter-cluster similarity • Score for a cut: Quality = Q(C) / N(C) • Q(C) – cluster set quality (smaller is better) • N(C) – cluster size preference (modeled with a gamma distribution)
Issues / Further Work • Resolve issues with data / implementation • Outstanding problem – generating meaningful labels for the clusters in the hierarchy • Need a means of measuring performance • Incorporate other KB data, such as relevance scores of search results and product/category associations • Better feature selection • Fuzzy clustering – a query can belong to multiple clusters (Frigui & Nasraoui)
References • S.-L. Chuang and L.-F. Chien, “Towards Automatic Generation of Query Taxonomy: A Hierarchical Query Clustering Approach,” Proceedings of ICDM ’02, Maebashi City, Japan, Dec. 9–12, 2002, pp. 75–82. • S.-L. Chuang and L.-F. Chien, “A Practical Web-based Approach to Generating Topic Hierarchy for Text Segments,” Proceedings of CIKM ’04, Washington, DC, Nov. 2004, pp. 127–136. • R. Cilibrasi and P. Vitanyi, “Automatic Meaning Discovery Using Google,” published on the Web, available at http://arxiv.org/abs/cs/0412098. • H. Frigui and O. Nasraoui, “Simultaneous Clustering and Dynamic Keyword Weighting for Text Documents,” in Survey of Text Mining: Clustering, Classification, and Retrieval, Michael W. Berry, ed., Springer-Verlag, New York, 2004, pp. 45–72.