This presentation explores a query-dependent ranking approach for information retrieval based on K-Nearest Neighbor (KNN) methods. Recognizing the limitations of a single ranking model, the proposed method uses KNN to find similar training queries for each test query, acting as a "soft" classification that avoids rigid query categories. A visualization of the query feature space, reduced to two dimensions with Principal Component Analysis, motivates the approach, and experiments with Rank-SVM on two datasets show promising ranking results. The presentation also discusses the complexity of the offline variants and directions for future research.
Query Dependent Ranking Using K-Nearest Neighbor SIGIR 2008 Presenter: Hsuan-Yu Lin
Outline • Introduction • Motivation • Method • Experiments • Conclusion
Introduction • Machine learning techniques have been proposed for ranking in information retrieval • In web search, queries vary widely in semantics and in users' intentions • Goal • Propose a query-dependent ranking approach • Use K-Nearest Neighbor (KNN) to find similar training queries for model training
Motivation • A single ranking model alone cannot handle all kinds of queries properly • A hard query classification approach is not used either • It is hard to draw clear boundaries between queries in different categories
Motivation • Reduce the query feature space from 27 to 2 dimensions using Principal Component Analysis in order to visualize the queries (a sketch of this step follows)
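As a rough sketch of this dimensionality-reduction step (the feature matrix below is random toy data; only the 27-to-2 PCA projection comes from the slide):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for the training-query feature matrix:
# one 27-dimensional feature vector per query (random values here).
rng = np.random.default_rng(0)
query_features = rng.normal(size=(1500, 27))  # 1,500 training queries

# Project the 27-dimensional query features onto the first two
# principal components, as done for visualization in the slides.
pca = PCA(n_components=2)
query_features_2d = pca.fit_transform(query_features)

print(query_features_2d.shape)  # (1500, 2)
```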
Motivation • Why use the KNN approach? • With high probability, a query belongs to the same category as its neighbors • KNN can be viewed as an algorithm performing a "soft" classification in the query feature space
KNN Online • Define the query feature: • For each query q, use a reference model (BM25) to find its top T ranked documents, and take the mean of the feature values of those T documents as the feature of the query (a sketch follows)
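A minimal sketch of this query-feature construction, assuming per-document feature vectors and BM25 scores are already available (the function and argument names are placeholders, not from the paper):

```python
import numpy as np

def query_feature(doc_features: np.ndarray, bm25_scores: np.ndarray, T: int = 50) -> np.ndarray:
    """Mean feature vector of the top-T documents ranked by a reference BM25 model.

    doc_features: (n_docs, n_features) feature matrix for the query's documents.
    bm25_scores:  (n_docs,) scores assigned by the reference BM25 model.
    """
    top_t = np.argsort(-bm25_scores)[:T]     # indices of the T highest-scoring docs
    return doc_features[top_t].mean(axis=0)  # mean over the top-T feature vectors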
KNN Online • Steps (b) and (c) of the algorithm are too costly to run online for every test query
Time Complexity • n: number of documents to be ranked for the test query • k: number of nearest neighbors • m: number of queries in the training data • Most of the online time is spent on training the local ranking model (a neighbor-search sketch follows)
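To make the per-query online cost concrete, a brute-force search over the m training-query features looks like this (a hypothetical helper; Euclidean distance is an assumption):

```python
import numpy as np

def k_nearest_training_queries(test_q: np.ndarray,
                               train_qs: np.ndarray,
                               k: int) -> np.ndarray:
    """Return indices of the k training queries closest to test_q.

    Brute force: O(m) distance computations per test query, where
    m = train_qs.shape[0] -- and this is before the cost of training
    a local ranking model on the selected neighbors.
    """
    dists = np.linalg.norm(train_qs - test_q, axis=1)  # Euclidean distances
    return np.argsort(dists)[:k]
```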
Experiments • Datasets • Dataset 1: 1,500 training queries, 400 test queries • Dataset 2: 3,000 training queries, 800 test queries • Labels: 5 grades (perfect, excellent, good, fair, bad) • Features: 200 • Learning approach • Rank-SVM • Parameters • λ: 0.01 • T (top-T documents per query): 50 • K: 400 (dataset 1), 800 (dataset 2) • Evaluation measure • NDCG (a reference sketch follows)
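For reference, a sketch of NDCG in its standard gain/discount form (the paper's exact NDCG variant may differ in details such as the discount function):

```python
import numpy as np

def ndcg_at_n(relevance: list[int], n: int) -> float:
    """NDCG@n with gain 2^rel - 1 and discount log2(rank + 1), ranks 1-indexed."""
    rel = np.asarray(relevance[:n], dtype=float)
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(2), log2(3), ...
    dcg = np.sum((2 ** rel - 1) / discounts)
    # Ideal DCG: the same gains under the best possible ordering.
    ideal = np.sort(np.asarray(relevance, dtype=float))[::-1][:n]
    idcg = np.sum((2 ** ideal - 1) / np.log2(np.arange(2, ideal.size + 2)))
    return dcg / idcg if idcg > 0 else 0.0
```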
Experiments • Baselines • Single: single-model approach • QC: query-classification-based approach • Classifies queries into three categories (topic distillation, named page finding, homepage finding)
Experiments • Results on Dataset 1 (NDCG comparison figure omitted)
Experiments • Y-axis: change ratio between the online and offline methods • A small change ratio means the two neighbor sets have large overlap (see the sketch below)
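Assuming "change ratio" measures the fraction of the k neighbors that differ between the two sets (an interpretation of the slide, not a definition taken from the paper), a sketch:

```python
def change_ratio(online_nn: set[int], offline_nn: set[int], k: int) -> float:
    """Fraction of the k neighbors that differ between the online and
    offline neighbor sets; 0.0 means the two sets overlap completely."""
    return 1.0 - len(online_nn & offline_nn) / k
```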
Conclusion • Use different ranking models for queries with different properties • Propose a K-Nearest Neighbor approach for selecting training data • Future work • The complexity of offline processing is still high • Use KD-trees or other advanced structures for nearest neighbor search (a sketch follows) • Improve the query feature definition
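As one possible realization of the KD-tree idea (an illustrative assumption using SciPy, not the authors' implementation):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
train_qs = rng.normal(size=(3000, 2))   # training-query features (e.g., after PCA)

tree = cKDTree(train_qs)                # one-time O(m log m) build
test_q = rng.normal(size=2)
dists, idx = tree.query(test_q, k=400)  # sublinear lookup vs. brute-force O(m)
```

KD-trees pay off when the query features are low-dimensional; in high dimensions the search degrades toward brute force, which is presumably why the feature definition itself is also listed as future work.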