This paper explores the challenges and methodologies of clustering web queries, focusing on evaluating clustering results without relying on ground truth labelings. It presents an experimental setup using a dataset from Microsoft adCenter and various clustering algorithms, such as K-means and Spectral clustering. The study compares clusterings against manual labelings and introduces a classification quality metric to assess clustering quality. It concludes that objects should be clustered using multiple representations and algorithms, and suggests extending the metric to select the number of clusters as future work.
Clustering Web Queries John S. Whissell, Charles L.A. Clarke, Azin Ashkan CIKM’09 Speaker: Hsin-Lan, Wang Date: 2010/08/31
Outline • Introduction • Experimental Setup • Similarity to Manual Labelings • Classification Quality Metric • Split Discoveries • Clickthrough Analysis Based on Detected Query Categories • General Web Query Clustering • Concluding Discussion
Introduction • Clustering methods suffer from notable problems, including the evaluation of results. • ground truth labelings • objective functions • Goal: evaluate the quality of clustering results in a way that • does not require comparison to ground truth • does not use a specific clustering algorithm’s objective function
Introduction • Clustering Web Queries: • navigational/informational queries • commercial/non-commercial queries
Experimental Setup • Data Set • Weighting Methods • Clustering Algorithms
Data Set • Microsoft adCenter • Includes a record of queries entered, ads displayed and ads clicked. • Personally identifying information was removed. • Commercially oriented: 1700 queries with an ad click frequency above 10 were selected.
Data Set • For each query, two types of features available: • search engine result page (SERP) • query-specific features
Clustering Algorithms • K-means clustering using Lloyd’s method (kmeans) • Normalized-Cut Spectral clustering (spect) • UPGMA clustering (upgma) • Single Link clustering (slink) • Complete Link clustering (clink) • Document clustering algorithms from Zhao and Karypis: e1, i1, i2, g1, g1p, and h1 objective functions
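As an illustration of the first algorithm on the list, the following is a minimal sketch of K-means using Lloyd's method: alternate between assigning each point to its nearest centroid and moving each centroid to the mean of its members. It is not the implementation used in the paper. The function names, the tuple-based feature vectors, the random initialization and the iteration cap are all assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def lloyd_kmeans(points, k, iterations=100, seed=0):
    """Cluster points into k groups with Lloyd's method.

    Returns (assignment, centroids), where assignment[i] is the index
    of the centroid that points[i] belongs to.
    """
    rng = random.Random(seed)
    # Assumption: initialize centroids with k distinct input points.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return assignment, centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    labels, centers = lloyd_kmeans(data, k=2)
    print(labels, centers)
```

In the setting of these slides, each point would be a weighted feature vector for one query rather than a 2-D toy point.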
Classification Quality Metric • Train a classifier to recognize clusters in a clustering. • Classification accuracy (accc): estimated using cross-fold validation
Classification Quality Metric • Illustrate a correlation between Na using a linear SVM and internal similarity.
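The idea behind the metric can be sketched as follows: treat the cluster labels as class labels, train a classifier on part of the data, and measure how well it predicts the cluster of held-out objects. The slides mention a linear SVM; to keep this example dependency-free, a nearest-centroid classifier is swapped in as a stand-in, so the numbers it produces are not those of the paper. The function names and the simple index-based fold split are assumptions for this sketch.

```python
def fit_nearest_centroid(features, labels):
    """Return {label: centroid} computed from the training data."""
    grouped = {}
    for x, y in zip(features, labels):
        grouped.setdefault(y, []).append(x)
    return {
        y: tuple(sum(c) / len(xs) for c in zip(*xs))
        for y, xs in grouped.items()
    }


def predict_nearest_centroid(model, x):
    """Predict the label whose centroid is closest to x."""
    return min(
        model,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(x, model[y])),
    )


def cross_validated_accuracy(features, cluster_labels, folds=5):
    """Fraction of objects whose cluster is predicted correctly
    when that object is held out of training."""
    n = len(features)
    correct = 0
    for fold in range(folds):
        # Assumption: object i goes to test fold i % folds.
        test_idx = [i for i in range(n) if i % folds == fold]
        train_idx = [i for i in range(n) if i % folds != fold]
        model = fit_nearest_centroid(
            [features[i] for i in train_idx],
            [cluster_labels[i] for i in train_idx],
        )
        correct += sum(
            predict_nearest_centroid(model, features[i]) == cluster_labels[i]
            for i in test_idx
        )
    return correct / n


if __name__ == "__main__":
    feats = [(0, 0), (0, 1), (1, 0), (1, 1),
             (10, 10), (10, 11), (11, 10), (11, 11)]
    coherent = [0, 0, 0, 0, 1, 1, 1, 1]   # follows the data's structure
    arbitrary = [0, 1, 0, 1, 0, 1, 0, 1]  # ignores the data's structure
    print(cross_validated_accuracy(feats, coherent, folds=4))
    print(cross_validated_accuracy(feats, arbitrary, folds=4))
```

A clustering that follows real structure in the features is easy for the classifier to learn and scores high, while an arbitrary labeling scores lower. No ground truth labeling and no clustering objective function are involved.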
Clickthrough Analysis Based on Detected Query Categories • Clustering+SVM • Clickthrough rate: percentage of queries in that set that had an ad click
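The clickthrough rate defined above can be computed per detected category as in the sketch below. The record layout, a list of (category, had_ad_click) pairs with one entry per query, and the function name are assumptions for illustration; the category names in the usage example are hypothetical.

```python
from collections import defaultdict


def clickthrough_rate_by_category(records):
    """Percentage of queries in each category that had an ad click.

    records: iterable of (category, had_ad_click) pairs, one per query.
    """
    totals = defaultdict(int)
    clicked = defaultdict(int)
    for category, had_ad_click in records:
        totals[category] += 1
        if had_ad_click:
            clicked[category] += 1
    return {c: 100.0 * clicked[c] / totals[c] for c in totals}


if __name__ == "__main__":
    # Hypothetical categories and click flags.
    sample = [
        ("category_a", True),
        ("category_a", False),
        ("category_b", True),
    ]
    print(clickthrough_rate_by_category(sample))
```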
Concluding Discussion • Cluster objects using multiple representations and algorithms. • Classification accuracy is used to measure the quality of a clustering. • Future work: extend metric to select the number of clusters